Example Script 4: One-To-Many Item Reports
Output Records

|   | Field 1 | Field 2 | Field 3 | Field 4 |
| 1 | NEWS File:Dateline Curr.Glob.News | 13552000 | January 3, 1995 | Dutch company Eriks Holding has expanded its German activities via the acquisition of AW Schultze. The Hamburg-based company specializes in the fields of sealants, technical plastics and industrial hoses. Schultze employs a workforce of 50 and generates annual turnover equivalent to DFl 19m. Original article approx 70 words. |

Output Records

|   | Field 5 | Field 6 | Field 7 |
| 1 | Eriks Holding_(ERKS) | Germany (GFR) Western Europe (WEUR) Europe (EUR) European Community (EEC) North Atlantic Treaty Organization (NATO) GFIVE) Group of Seven (GSEVEN) Group of Ten (GTEN) | Germany (GFR) Western Europe (WEUR) Europe (EUR) European Community (EEC) North Atlantic Treaty Organization (NATO) GFIVE) Group of Seven (GSEVEN) Group of Ten (GTEN) |
Some reports contain information that varies greatly from record to record. To compare such data meaningfully, we may want to list every possible value as a Map Designer column, even though most of those columns are rarely populated. The input source file in this example contains company profile information, and each record has four sections. A record starts with a line of asterisks (*) followed by a fixed-format section that contains the company name, address, and main phone numbers. Next comes a section of tagged data, where each tag is a phrase terminated by a colon (for example, Direct sales:). The third section begins with the word Category and contains a nested list that describes the wide range of products the company produces. The final section consists of the string Record# followed by a numeric ID.
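The four-part record layout can be pictured with a short Python sketch. It illustrates the line-classification idea only; it is not CXL and not part of the product. The function name classify_sections and the decision to drop the marker lines (asterisks, Category, Record#) are assumptions made for this example.

```python
import re

def classify_sections(lines):
    """Yield (section_number, text) for each data line of a profile report."""
    section = 0  # 0 means "not inside a record yet"
    for raw in lines:
        line = raw.rstrip()
        if line.startswith("*"):
            section = 1  # a row of asterisks opens a new record
            continue
        if re.match(r" *Category", line):
            section = 3  # marker line only, nothing to keep
            continue
        if re.match(r" *Record#", line):
            section = 4  # marker line only, nothing to keep
            continue
        if ":" in line:
            section = 2  # the first tagged line starts section 2
        if section == 0 or not line:
            continue  # skip text before the first record and blank lines
        yield section, line
```

Each yielded pair tells a later step which handler to apply to the line, which is the same role the inrecord variable plays in the script.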
The CXL script writes the information from the three common sections of each record, together with a list of all possible product categories, into one row of a Map Designer data file. The script first identifies the section that each line belongs to, rejecting empty lines, and then processes each section in turn. It takes advantage of associative array indexing: all data is stored in the result array, indexed by the name of the field or tag from which it was extracted.
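As a rough illustration of indexing by tag name, the following Python sketch stores tagged lines in a dictionary. It is not CXL; the function name collect_tagged is hypothetical, and unlike the script it splits only on the first colon of a line. A line without a colon is treated as a continuation of the previous tag and appended with "; " as the separator.

```python
def collect_tagged(lines):
    """Store 'Tag: value' lines in a dict keyed by the tag text."""
    result = {}
    last_tag = None
    for line in lines:
        tag, sep, value = line.partition(":")
        if sep:
            last_tag = tag.strip()  # a new tag starts here
        else:
            value = line            # continuation of the previous tag
        value = value.strip()
        if last_tag is None or not value:
            continue
        if result.get(last_tag):
            result[last_tag] += "; " + value
        else:
            result[last_tag] = value
    return result
```

Because the keys are the tag names themselves, a later step can look up result["Year established"] without knowing where in the record that tag appeared.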
When the end of the record is encountered, the values of all the array elements are written to a Map Designer data file and displayed in the Source Browser. Column labels are defined in the accept statement.
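The idea of a fixed set of columns that is filled only where a record has data can be sketched in Python. This is an analogy under stated assumptions, not the product's behavior: a CSV file stands in for the Map Designer data file, the function name write_rows is hypothetical, and the column list is copied from the labels in the accept statement of the script.

```python
import csv
import io

# Column labels taken from the accept statement in the CXL script.
COLUMNS = [
    "Company Name", "Parent Co. Name", "Address", "City", "State", "Zip",
    "Year established", "No. of employees", "Gross annual sales",
    "Ownership", "CEO", "Chairman", "President",
    "Computers", "Graphics Equipment", "LAN/WAN Products",
    "Software, Applications", "Software, Communications", "Record#",
]

def write_rows(records, out):
    """Write one row per record; columns a record lacks are left empty."""
    writer = csv.DictWriter(
        out, fieldnames=COLUMNS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)

# Usage: a record with only two known values still produces a full row.
buffer = io.StringIO()
write_rows([{"Company Name": "IBM", "Record#": "207 195"}], buffer)
```

Every row has the same columns in the same order, so records with very different contents can still be compared side by side.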
Input Source File
******************************
IBM (International Business Machines)
Old Orchard Rd.
Armonk, NY 10504
800-426-3333; 914-765-1900
Direct sales: 800-426-7695 (IBM PC Direct)
Tech support: 800-237-5511
Tech support BBS: 919-517-0001; 800-847-7211 (OS2)
Year established: 1914
No. of employees: 250,000
Gross annual sales: $62,700,000,000
Ownership: Publicly traded NYSE (IBM)
Chairman: Louis V. Gerstner, Jr.
Category
Computers
   Desktop Systems
   Mainframes/Supercomputers
   Notebooks
Graphics Equipment
   Generators/Controllers
   Graphics Systems
Record#
207 195
...
CXL Script
#!djrr
# One-To-Many Item Reports
# This BEGIN statement initializes global variables
BEGIN { inrecord = 0; haverecord = 0; needclear = 0; lineno = 0;}
# New records begin with the line consisting of all asterisks. This starts
# the first section of the record that contains the company name & address.
# We set inrecord = 1 to initiate handling of section 1 later in the script.
/^\*.*$/ { clearvars; inrecord = 1; BREAK;}
# Section 2 of each record begins with the first tagged line which will
# contain a colon somewhere in the line. We set inrecord = 2 to initiate
# appropriate handling of section 2 later in the script.
$0 ~ /:/ { inrecord = 2; tag = ""; lasttag = ""; }
# Section 3 of each record begins after all tagged lines, and starts with a line
# containing only the word Category. We set inrecord = 3 to initiate
# appropriate handling of section 3 later in the script. This line is only a
# marker and no further processing is needed.
/^ *Category/ { inrecord = 3; REJECT; }
# Section 4 of each record begins with a line containing only the string
# Record#. Set inrecord = 4 to initiate appropriate handling of section 4 later
# in the script. This line is only a marker, no further processing is needed.
/^ *Record#/ { inrecord = 4; REJECT; }
# Place this here, so we actually allow getting into a record
{ if (inrecord == 0) REJECT; }
# Strip out blank lines
{ $0 = rtrim($0);
  if (length($0) == 0)
    REJECT;
}
# Handle Section 1 appropriately. Line 1 holds the company name, with the
# parent company name in parentheses; line 2 holds the street address; line 3
# holds the city, state and zip. Any further lines in this section (such as
# the phone numbers) are not stored.
{ if (inrecord == 1) {
    lineno = lineno + 1;
    if (lineno == 1 ) {
      split($0,tmp,"(");
      result["Company Name"] = rtrim(tmp[0]);
      if (length(tmp[1]) <= 0) BREAK;
      split(tmp[1],tmp,")");
      result["Parent Co. Name"] = trim(tmp[0]);
    }
    if (lineno == 2 ) result["Address"] = $0;
    if (lineno == 3 ) {
      split($0,tmp,",");
      result["City"] = tmp[0];
      if (length(tmp[1]) <= 0) BREAK;
      split(tmp[1],tmp," ");
      result["State"] = trim(tmp[0]);
      result["Zip"] = trim(tmp[2]);
    }
  }
}
# Process section 2 lines separating the tag from the value. Successive values
# from multiple lines under the tag are concatenated using ";" as a separator.
{ if (inrecord == 2) {
    split($0, line, ":");
    if (length(line[1]) == 0) {
      tag = lasttag;
      result[lasttag] += line[0];
    }
    else {
      tag = line[0];
      lasttag = tag;
      if (length(result[lasttag]) > 0)
        result[lasttag] += "; ";
      result[lasttag] += line[1];
    }
  }
}
# Process section 3 lines by capturing the tags that begin in column 1 and the
# values that begin in column 4 under the tags. Successive values from multiple
# lines under the tag are concatenated using ";" as a separator.
{ if (inrecord == 3) {
    if (length(trim($0(1 3))) == 0) {
      hdr = lasthdr;
      if (length(result[lasthdr]) > 0)
        result[lasthdr] += "; ";
      result[lasthdr] += ltrim($0);
    }
    else {
      hdr = $0;
      lasthdr = hdr;
    }
  }
}
# Process section 4 value and write the results as the Map Designer columns.
# Reset the inrecord value for the next record.
{ if (inrecord == 4) {
    result["Record#"] = $0;
    inrecord = 0;
    accept <"Company Name"> result["Company Name"],
      <"Parent Co. Name"> result["Parent Co. Name"],
      <"Address"> result["Address"], <"City"> result["City"],
      <"State"> result["State"], <"Zip"> result["Zip"],
      <"Year established"> result["Year established"],
      <"No. of employees"> result["No. of employees"],
      <"Gross annual sales"> result["Gross annual sales"],
      <"Ownership"> result["Ownership"],
      <"CEO"> result["CEO"],
      <"Chairman"> result["Chairman"],
      <"President"> result["President"],
      <"Computers"> <100> result["Computers"],
      <"Graphics Equipment"> <100> result["Graphics Equipment"],
      <"LAN/WAN Products"> <100> result["LAN/WAN Products"],
      <"Software, Applications"> result["Software, Applications"],
      <"Software, Communications"> result["Software, Communications"],
      <"Record#"> result["Record#"];
  }
}
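The one-to-many step for the Category section, where a heading that starts in column 1 collects the indented values beneath it, can be sketched in Python as follows. This is an illustration of the grouping idea rather than a translation of the CXL; the function name collect_categories is hypothetical, and the sketch assumes values are indented by at least three spaces, as the script comment about columns 1 and 4 suggests.

```python
def collect_categories(lines):
    """Group indented values under the most recent column-1 heading."""
    result = {}
    header = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line[:3].strip():
            header = line.strip()  # text in columns 1-3 marks a heading
        elif header is not None:
            value = line.strip()
            if result.get(header):
                result[header] += "; " + value
            else:
                result[header] = value
    return result
```

Each heading becomes one key, so the many values listed under it end up in a single cell of the matching output column.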
Output Record

|   | Company Name | Parent Co. Name | Address | City |
| 1 | IBM | International Business Machines | Old Orchard Rd. | Armonk |

|   | State | Zip | Year established | No. of employees | Gross annual sales |
| 1 | NY | 10504 | 1914 | 250,000 | $62,700,000,000 |

|   | Ownership | CEO | Chairman | President |
| 1 | Publicly traded N... |  | Louis V. Gerstner, Jr. |  |

|   | Computers |
| 1 | Desktop Systems;Mainframes/Supercomputers;Notebooks |

|   | Graphics Equipment |
| 1 | Generators/Controllers;Graphics Systems |

|   | LAN/WAN Products |
| 1 |  |

|   | Software, Applications | Software, Communications | Record# |
| 1 |  |  | 207 195 |