DC 12.3 | Example Script 3: Tagged Reports


Example Script 3: Tagged Reports

	Name	Add1	CSZ
1	WAYMOLENE ABEL	P.O. BOX 1055	PAULS VALLEY, OK
2	RODNEY O. ABSHER	7705 N.W. 116 ST.	OKLAHOMA CITY, OK
3	BOBBY ALDERSON (DECEASED)
4	SARAH ALDERSON	3609 OAK HAVEN DR.	FT WORTH, TEXAS
5	...	...	...

Tag-style reports use predefined text strings (tags) to identify fields. Tags can appear in any sequence, or take their meaning from other tags nested around them. They may appear in fixed locations, such as the beginning of a line, or be dispersed throughout the file. The tagging rules must be understood before the script is written.

The input source file in this example contains records that span multiple lines. Each record begins with an FN- tag and ends with a pair of vertical bars (||). Each tagged element (field) begins with a tag in positions 1-3 of the input line and ends with a single vertical bar (|). Text associated with a tag may continue onto the lines that follow the line containing the tag.
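The tagging rules for this file can be sketched outside CXL to make them concrete. The Python fragment below is illustrative only and is not part of the product. The function names classify and terminator are hypothetical, and the fragment assumes that every tag is two uppercase letters followed by a hyphen, as in the sample input.

```python
import re

# Assumption: a tag is two uppercase letters plus a hyphen in positions 1-3.
TAG_LINE = re.compile(r"^[A-Z][A-Z]-")


def classify(line: str) -> str:
    """Return 'blank', 'tag', or 'continuation' for one input line."""
    if not line.strip():
        return "blank"
    if TAG_LINE.match(line):
        return "tag"
    return "continuation"


def terminator(line: str) -> str:
    """Return 'record' for a trailing '||', 'field' for a trailing '|', else ''."""
    text = line.rstrip()
    if text.endswith("||"):
        return "record"
    if text.endswith("|"):
        return "field"
    return ""
```

The double-bar check runs first because a line that ends a record also ends with a single bar.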

The script extracts text associated with each tag in a record and writes it as a single field in the memory buffer. Tag text that spans multiple lines is appended to previous text for that tag and stored until the record terminates.

When a pair of vertical bars is encountered, they are discarded and all stored text is accepted as a record.
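The accumulate-then-emit behavior described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the CXL script itself. The function name parse_records is hypothetical, tags are assumed to be two uppercase letters plus a hyphen, and any line that does not start with a tag is treated as a continuation of the most recent tag.

```python
import re

# Assumption: a tag is two uppercase letters followed by a hyphen.
TAG = re.compile(r"^([A-Z]{2})-(.*)$")


def parse_records(lines):
    """Yield one dict per record, mapping each tag to its joined text."""
    fields = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines carry no data
        match = TAG.match(line)
        if match:
            current = match.group(1)
            text = match.group(2).strip()
            fields[current] = fields.get(current, "") + text
        elif current is not None:
            # Continuation text is appended to the tag that is still open.
            fields[current] += " " + line.strip()
        else:
            continue  # text before the first tag is ignored
        if fields[current].endswith("||"):
            # End of record: drop trailing bars and hand back the stored text.
            yield {tag: value.rstrip("|") for tag, value in fields.items()}
            fields, current = {}, None


sample = [
    "FN- NEWS File:Dateline Curr.Glob.News|",
    "",
    "AN- 13552000|",
    "AB- Dutch company Eriks Holding has expanded its German",
    "    activities via the acquisition of AW Schultze.|",
    "LA- English||",
]
records = list(parse_records(sample))
```

A continuation line that happens to begin with two uppercase letters and a hyphen would be misread as a new tag under this assumption, which is one reason the tagging rules need to be known in advance.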

Input Source File

FN- NEWS File:Dateline Curr.Glob.News|
AN- 13552000|
PD- January 3, 1995|
AB- Dutch company Eriks Holding has expanded its German
    activities via the acquisition of AW Schultze. The
    Hamburg-based company specializes in the fields of
    sealants, technical plastics and industrial hoses.
    Schultze employs a workforce of 50 and generates
    annual turnover equivalent to DFl 19m.^ Original
    article approx 70 words|
CO- Eriks Holding_(ERKS)|
RG- Germany (GFR) Western Europe (WEUR) Europe (EUR)
    European Community (EEC) North Atlantic Treaty
    Organization (NATO) GFIVE) Group of Seven (GSEVEN)
    Group of Ten (GTEN)|
LA- English||

CXL Script

#!djrr

#Tagged Reports

# Clear variables before processing another record.

{ if (inrecord == 0) {
    inrecord = 1;
    for (ndx in result)
      result[ndx] = "";
  }
}

# Strip out blank lines, then get the categorization. We maintain both a
# current category and a previous category. This way, if the new category
# is empty, we can restore it to the one previously set by assigning from
# lastcat. Also, replace '^' characters with semicolons.
{ if (length(trim($0)) == 0)
    REJECT;
  cat = trim($0(1 2));
  if (length(cat) != 0)
    lastcat = cat;
  else
    cat = lastcat;
  rest = trim($0(4 80));
  gsub(/\^/, ";", rest);
}

# Grab everything else that starts with a tag and store it in the result
# array. This probably slows things down a bit and uses more memory. But by
# doing this, the only place that needs to be changed (to control which tags
# are accepted) is the accept statement below.
/^[A-Z][A-Z]-/ { result[cat] += rest; }

/^ / { if (length(result[lastcat]) > 0) {
    if (result[lastcat] !~ /\|$/)
      result[lastcat] += " ";
    result[lastcat] += rest;
  }
}

# If the line associated with the current category ends with double
# vertical bars, it's also the end of this record.
result[lastcat] ~ /\|\|$/ {
  inrecord = 0;
  # Go through and replace all "|" at the end of a string with nothing
  for (idx in result)
    gsub(/\|*$/, "", result[idx]);
  accept result["FN"], result["AN"], result["PD"], result["AB"],
         result["CO"], result["RG"], result["LA"];
}
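The final rule strips the trailing bars from every stored value and then emits the fields in a fixed order. A rough Python equivalent of that step is shown below as a sketch only. The names FIELD_ORDER and to_row are hypothetical, and treating a tag that never appeared in the record as an empty field is an assumption about the intended behavior.

```python
import re

# Same order as the accept statement in the script above.
FIELD_ORDER = ["FN", "AN", "PD", "AB", "CO", "RG", "LA"]


def to_row(result: dict) -> list:
    """Strip trailing '|' characters and return the values in FIELD_ORDER."""
    return [re.sub(r"\|+$", "", result.get(tag, "")) for tag in FIELD_ORDER]


row = to_row({"FN": "NEWS File:Dateline Curr.Glob.News|", "LA": "English||"})
```

Because the output order lives in one list, changing which tags are emitted only requires editing that list, which mirrors the reason the script keeps all tags in the result array and selects them in the accept statement.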

Last modified date: 08/04/2024