Example Script 3: Tagged Reports
 
 
     Name                         Add1                  CSZ
1    WAYMOLENE ABEL               P.O. BOX 1055         PAULS VALLEY, OK
2    RODNEY O. ABSHER             7705 N.W. 116 ST.     OKLAHOMA CITY, OK
3    BOBBY ALDERSON (DECEASED)
4    SARAH ALDERSON               3609 OAK HAVEN DR.    FT WORTH, TEXAS
5    ...                          ...                   ...
Tag-style reports use predefined text strings to identify fields. Tags can appear in any sequence or assume their meaning from other tags nested around them. They may appear in fixed locations, such as the beginning of a line, or be dispersed throughout the file. The tagging rules need to be understood before the script is written. The input source file in this example contains records that span multiple lines, begin with an FN- tag, and end with a pair of vertical bars (||). Each tagged element (field) begins with a tag in positions 1-3 of the input line and ends with a single vertical bar (|). Text associated with a tag may continue onto the lines that follow the line containing the tag.
The script extracts text associated with each tag in a record and writes it as a single field in the memory buffer. Tag text that spans multiple lines is appended to previous text for that tag and stored until the record terminates.
When a pair of vertical bars is encountered, the bars are discarded and all stored text is accepted as a record.
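The rules described above can be sketched in Python as a rough illustration. This is not CXL and is not the script from this guide. The names `parse_tagged`, `TAG_LINE`, and `SAMPLE` are hypothetical, and the sample lines are abridged from the input source file in this example. The sketch assumes that every tag is two uppercase letters followed by a hyphen and that continuation lines do not begin with a tag.

```python
import re

# Assumption: a tag is two uppercase letters and a hyphen in positions 1-3.
TAG_LINE = re.compile(r"^[A-Z][A-Z]-")


def parse_tagged(lines):
    """Yield one dict per record, mapping each two-letter tag to its text."""
    record = {}
    last_tag = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines carry no data
        if TAG_LINE.match(line):
            last_tag = line[:2]        # e.g. "AB" from "AB- ..."
            text = line[3:].strip()    # everything after the tag
        elif last_tag is not None:
            text = line.strip()        # continuation of the previous tag
        else:
            continue                   # text before any tag is ignored
        previous = record.get(last_tag, "")
        record[last_tag] = (previous + " " + text).strip()
        if record[last_tag].endswith("||"):
            # A pair of bars closes the record: drop trailing bars and
            # replace '^' with ';' before handing the record back.
            yield {
                tag: value.rstrip("|").replace("^", ";")
                for tag, value in record.items()
            }
            record, last_tag = {}, None


# Abridged sample; the continuation lines are indented.
SAMPLE = [
    "FN- NEWS File:Dateline Curr.Glob.News|",
    "AN- 13552000|",
    "AB- Dutch company Eriks Holding has expanded its German",
    "    annual turnover equivalent to DFl 19m.^ Original",
    "    article approx 70 words|",
    "LA- English||",
]

for rec in parse_tagged(SAMPLE):
    for tag, value in rec.items():
        print(tag, "=>", value)
```

Keeping the most recent tag in `last_tag` is what lets a continuation line be attached to the right field, which is the same role the previous-category variable plays in a tag-driven script.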
Input Source File
FN- NEWS File:Dateline Curr.Glob.News|
AN- 13552000|
PD- January 3, 1995|
AB- Dutch company Eriks Holding has expanded its German
    activities via the acquisition of AW Schultze. The
    Hamburg-based company specializes in the fields of
    sealants, technical plastics and industrial hoses.
    Schultze employs a workforce of 50 and generates
    annual turnover equivalent to DFl 19m.^ Original
    article approx 70 words|
CO- Eriks Holding_(ERKS)|
RG- Germany (GFR) Western Europe (WEUR) Europe (EUR)
    European Community (EEC) North Atlantic Treaty
    Organization (NATO) GFIVE) Group of Seven (GSEVEN)
    Group of Ten (GTEN)|
LA- English||
CXL Scripts
#!djrr
# Tagged Reports
# Clear variables before processing another record.
{ if (inrecord == 0) {
      inrecord = 1;
      for (ndx in result)
          result[ndx] = "";
  }
}
# Strip out blank lines, then get the categorization. We maintain both
# a current category and a previous category. This way, if the new
# category is empty, we can restore it to the one previously set by
# assigning from lastcat. Also, replace '^' characters with semicolons.
{ if (length(trim($0)) == 0)
      REJECT;
  cat = trim($0(1 2));
  if (length(cat) != 0)
      lastcat = cat;
  else
      cat = lastcat;
  rest = trim($0(4 80));
  gsub(/\^/, ";", rest);
}
# Grab everything else that starts with a tag and store it off into the
# result array. This probably slows things down a bit, and uses more
# memory. But by doing this, the only place that needs to be changed
# (to control which tags are accepted) is the accept statement below.
/^[A-Z][A-Z]-/ { result[cat] += rest; }
/^ / { if (length(result[lastcat]) > 0) {
           if (result[lastcat] !~ /\|$/)
               result[lastcat] += " ";
           result[lastcat] += rest;
       }
}
# If the line associated with the current category ends with double
# vertical bars, it's also the end of this record.
result[lastcat] ~ /\|\|$/ {
    inrecord = 0;
    # Go through and replace all "|" at the end of a string with nothing
    for (idx in result)
        gsub(/\|*$/, "", result[idx]);
    accept result["FN"], result["AN"], result["PD"], result["AB"],
           result["CO"], result["RG"], result["LA"];
}
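The final rule of the script strips trailing vertical bars from every stored value and then emits the fields in a fixed order. A minimal Python sketch of that step follows. It is an analogue for illustration only, not CXL, and the names `FIELD_ORDER` and `finish_record` are hypothetical. The sketch assumes that a tag missing from a record should come out as an empty field.

```python
import re

# Output order, mirroring the tags listed in the accept statement.
FIELD_ORDER = ["FN", "AN", "PD", "AB", "CO", "RG", "LA"]


def finish_record(result):
    """Strip trailing bars and return the field values in output order."""
    # The pattern \|*$ removes any run of bars at the end of a value;
    # bars elsewhere in the text are left alone.
    cleaned = {tag: re.sub(r"\|*$", "", text) for tag, text in result.items()}
    # Assumption: a tag that never appeared yields an empty string.
    return [cleaned.get(tag, "") for tag in FIELD_ORDER]


print(finish_record({"FN": "NEWS File:Dateline Curr.Glob.News|", "LA": "English||"}))
```

Because the field list lives in one place, changing which tags are written out means editing only `FIELD_ORDER`, which is the same design choice the script's comments describe for the accept statement.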