Legacy Systems
The following are the legacy systems:
Unknown Application or File Format Connections
There are several options available in the integration platform for connecting to a file when the original application or file format is unknown.
Unknown File Format
What if you know the application type but not the format of the file, and the company that wrote the application is no longer in business or does not provide the file format? Three options for this situation follow:
Intermediate File Format
Examine the customer application and its documentation. Determine whether there is an Export or Save As option. If so, use that option to move data to an intermediate file format that the integration platform can read and transform.
Report File
Examine the customer application and its documentation. Determine whether or not the customer can generate a report file that contains the data they want to transform. If so, generate the report and print the report to a file. This creates what is often called a print image or spool file. After this, use Content Extractor Editor to extract the data from the report and transform it to a more usable format.
Neither of the Above
Use the procedure outlined in the section "Unknown Application." All the same rules apply.
Binary Data Types
Unknown file formats are often labeled binary, and you may be tempted to label anything that is not text as binary data. But if the data is binary, it is useful to understand what kind of binary data it is. The integration platform supports some basic types:
Numerics
Graphics and BLOBs
Proprietary Application Data
Compressed Data
Numerics
Numeric data is the classic case of binary data, which is usually embedded as fields within a schema (often interspersed with non-binary Text Fields). Three common types are listed below:
Packed Decimal: Useful for real numbers, commonly used in COBOL (called COMP-3). Sizes start at a 1-byte field.
Binary Integers: Useful for integer numbers, often used in the C language (also called shorts and longs, or COMP in COBOL). Lengths of 1, 2, and 4 bytes are common, and 8-byte (64-bit) fields are sometimes used for very large numbers.
Floating Points: Useful for integer and real numbers, especially if high precision is required (often called floats). Size of 4 or 8 bytes.
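To make the three numeric types concrete, the following Python sketch decodes one value of each kind. It is an illustration only, not the integration platform's own decoding logic: the function name unpack_comp3, the sample bytes, and the big-endian byte order are all assumptions chosen for the example.

```python
import struct
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COMP-3) field.

    Each byte holds two decimal digits, except the last byte, which holds
    one digit and a sign nibble (0xD means negative).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)      # high nibble
        digits.append(byte & 0x0F)    # low nibble
    digits.append(raw[-1] >> 4)       # last digit
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    # scale is the number of implied decimal places (the V in a PIC clause)
    return Decimal(value).scaleb(-scale)


# Packed decimal: bytes 12 34 5C hold digits 1,2,3,4,5 with a positive sign
amount = unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2)

# Binary integers: a 2-byte short followed by a 4-byte long, big-endian
short_value, long_value = struct.unpack(">hi", b"\x01\x00\x00\x00\x00\x2a")

# Floating point: an 8-byte double, round-tripped through its binary form
(real_value,) = struct.unpack(">d", struct.pack(">d", 2.5))
```

The byte order matters for binary integers and floats: data from a mainframe is typically big-endian, while data written on Intel-based PCs is typically little-endian ("<" instead of ">" in the struct format string).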
Graphics and BLOBs
Graphics and BLOBs (binary large objects) include many file types, including JPG, TIFF, PNG, BMP, fax, and images. These types are typically found as stand-alone files, but can also be embedded inside files.
Often such binary data needs to be transmitted with non-binary systems (for example, email, ASCII, XML). A common technique for handling such data is to first encode the binary stream into a larger text equivalent stream and then decode it to original binary values at the other end. This method is similar to MIME (Multipurpose Internet Mail Extensions) for email and Simple Object Access Protocol (SOAP).
B64 Encode/Decode Technique
You can do this with Base64 encoding, which the integration platform supports with the B64 Encode and B64 Decode functions. B64 Encode encodes any binary stream, even if you do not know what the binary format is, since it is treated as a stream of bytes. Then you can use B64 Decode to convert in the other direction.
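The same round trip can be sketched outside the integration platform with the Python standard library. This is not the B64 Encode or B64 Decode function itself, only an illustration of the technique; the sample payload is arbitrary.

```python
import base64

# Any byte stream will do; the encoder does not need to know its format.
binary_payload = bytes(range(256))

# Encode to a text-safe stream (about one third larger than the input).
encoded_text = base64.b64encode(binary_payload).decode("ascii")

# Decode back to the original binary values at the other end.
decoded = base64.b64decode(encoded_text)
```

The encoded text contains only ASCII letters, digits, "+", "/", and "=", which is why it can travel safely through email, XML, and other text-only channels.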
The integration platform does not transform binary graphical formats from one to another, but many commercial, free, open-source, and shareware tools can. You can call such tools from the integration platform through map logic and EZscript, which can invoke external code modules packaged as Java, ActiveX, or DLLs.
Proprietary Application Data
Often programmers create a proprietary file format for storing their application data rather than use a standard DBMS (for example, Oracle) or a file format (for example, XML) for their data storage needs. If their proprietary file format is all text, then the integration platform provides many methods to read and write the data. However, if the file includes binary data, sometimes there is little that the integration platform can do directly with the data.
On the Target Side
You can write the file format directly with the Binary connector if it is a pure fixed-length sequential file (uncommon).
You can create any kind of intermediate flat file (such as COBOL, ASCII, XML, Excel) for the target application to import.
On the Source Side
Try the Binary connector on the live data. This works if the file contains fixed-length records that need to be extracted. This is possible regardless of the underlying programming language used (such as C, COBOL, RPG, Pascal, Basic), since the integration platform supports virtually all underlying data types used by any language on any platform (for example, AS400, VAX, Mainframe).
If the source application is still working, then try to export the data to an intermediate file format that the integration platform can read.
If the source application is still working, then export the data by printing to a file and use Content Extractor Editor to extract the data.
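The idea behind reading fixed-length records from a proprietary file can be sketched in Python as follows. This is not the Binary connector; the 16-byte layout (a 10-byte name, a 4-byte integer, and a 2-byte integer, big-endian) is a hypothetical one invented for the example. In practice you would derive the layout by inspecting the file.

```python
import struct

# Hypothetical record layout: 10-byte text name, 4-byte int, 2-byte short
RECORD = struct.Struct(">10sih")


def read_records(data: bytes):
    """Yield one tuple per complete fixed-length record in the data."""
    for offset in range(0, len(data) - RECORD.size + 1, RECORD.size):
        name, ident, quantity = RECORD.unpack_from(data, offset)
        yield name.decode("ascii").rstrip(), ident, quantity


# Build two sample records so the sketch runs without an input file.
sample = (
    RECORD.pack(b"WIDGET".ljust(10), 1001, 5)
    + RECORD.pack(b"GADGET".ljust(10), 1002, 12)
)
rows = list(read_records(sample))
```

Because every record has the same length, the reader never needs delimiters; it only needs the record size and the offset of each field.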
Compressed Data
Often in binary data, larger (typically text) streams are compressed to save space; the result is intended to be a smaller binary stream. While compression is often done at the file level (for example, ZIP) it can also be done at the field level to save space (although complex issues of fixed-length fields arise).
Compressed data is not technically "private" (it is only compacted to save space), but the integration platform cannot expand compressed binary data unless the compression algorithm is known. Although some standards exist in the compression area, the integration platform does not supply any standard compression and decompression functions in EZscript. EZscript does, however, allow you to invoke external compression and decompression libraries, even though the integration platform does not work with them directly.
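As an illustration of what an external compression library does, the following Python sketch compresses and expands a repetitive text field with zlib. The choice of zlib is an assumption made for the example; a real file may use a different algorithm, and expansion only works when that algorithm is known.

```python
import zlib

# A repetitive text field compresses well.
text_field = b"ACME CORPORATION " * 20

# Compress to a smaller binary stream.
compressed = zlib.compress(text_field)

# Expand again; this succeeds only because the algorithm is known.
restored = zlib.decompress(compressed)
```

Note that the compressed length varies with the content, which is the source of the fixed-length field issues mentioned above.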
Intermediate Methods to Connect to Data Formats
The following topics provide information about the intermediate methods to connect to the data formats.
Using an Intermediate File Format
If the integration platform cannot read the native file format of your application, you may need to create an intermediate file format that the integration platform can read. This is done in the application from which you want to obtain some data.
To do this, the application must have the ability to "export" the data or "save as" to a common file format such as Delimited ASCII, dBASE, Lotus 1-2-3, etc.
Locate the documentation for your application and search the Index or documentation for "Export" or "Save As". Then determine which common file formats your application can export data to and follow those instructions.
Best Practice — If preserving the field data types is important to you, select an export file format such as dBASE, if possible.
After the intermediate file is created, you can use the integration platform to transform that data to the desired destination with the ability to manipulate and clean the data as needed.
For example, if you exported the data to a dBASE file format, select one of the versions of dBASE as the source connector.
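If the only available export format is Delimited ASCII, field data types are lost and have to be restored during cleaning. The following Python sketch shows the idea; the column names and sample rows are hypothetical, and the integration platform would do this work in a map rather than in Python.

```python
import csv
import io

# Hypothetical export: every value arrives as text, including the numbers.
exported = 'id,name,balance\n1,"Smith, Ann",120.50\n2,Jones,0.00\n'


def clean_rows(stream):
    """Read delimited rows and restore the intended data types."""
    for row in csv.DictReader(stream):
        yield {
            "id": int(row["id"]),
            "name": row["name"].strip(),
            "balance": float(row["balance"]),
        }


rows = list(clean_rows(io.StringIO(exported)))
```

A typed export format such as dBASE avoids this step because the field types travel with the file.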
Alternative Connection Methods
When you cannot connect to a data file or table using direct methods, there is often an alternative method. Refer to the following sections for suggestions.
Accessing Data on Tape Devices
Windows and Linux
Use a utility. Dump the data from the tape device to a disk. Then use Binary as the source connector to access the data on the disk.
Best Practice — The tape unload utility often comes with "deblocking" logic, so that the data is "cleaner" on the disk than it was on the tape. This is because variable length and blocking schemes can make it harder for the Binary connector to read the data.
Linux
Directly read the data. If the tape drive is on a Linux platform, the integration platform can read the data directly from the tape drive with the Tape Drive Sequential connector.
Windows
1. Make sure your tape device has file management capability. If so, you should be able to use the Tape Drive Sequential connector as a source connector.
2. Create a named pipe. Copy a sample of the file from the tape drive to design the map. Then run the map from the command line and use source overrides to connect to the named pipe.
Data Warehouse File Access
Most popular database applications provide a loader utility designed for feeding detail data into data warehouses and data marts. For instance, Oracle SQL*Loader reads flat file data and inserts that data into one or more database tables in a data warehouse.
The integration platform provides connectors that can read and write those loader files. The following list includes some of the available connectors but is not all-inclusive:
Multiple Table Loading Connectors
Connecting Through Content Extractor Editor/CXL Scripts
Two integration tools, Content Extractor Editor and CXL, can help users extract structured data from text files and reports.
If you need to transfer data from one application to another, you are often challenged with getting useful data out of the source application. Methods to extract data through native connections and intermediate import/export formats are sometimes not available or not appropriate. In such cases, there is a fallback technique. Almost all applications have the ability to send information to a printer. To get data out of an application, redirect the printout to a text file.
Now the problem becomes a more classic data transformation, where the source file is the printout or report captured as a raw text file, but containing an inherent structure. Employing the pattern-recognition capability in the CXL scripting language, users can create scripts to extract data fields from the raw text lines and assemble those fields into clean records.
Two ways are available to create the scripts. One is to use the Content Extractor Editor user interface for visual markup of the extraction rules by direct manipulation of the text file on the screen. The other is to author the scripts directly using the CXL SDK. The latter option is useful when the rules to extract the data are too complex to be expressed in a GUI.
The added value of the Content Extractor Editor/CXL solution is that once the script is saved, the integration platform can use it. In this case, the script is used for preprocessing an incoming source text file to flatten it into a more normal row and column structure. Then you can map and manipulate the data in a transformation.
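The flattening step can be illustrated in Python with regular expressions. This is not CXL and does not use CXL syntax; it is only a sketch of the same pattern-recognition idea. The report layout, the field names, and the two patterns are hypothetical.

```python
import re

# Hypothetical print-image report: header lines followed by detail lines.
REPORT = """\
ACME SALES REPORT                 PAGE 1

Customer: 1001  Name: Smith Hardware
  Invoice 5001   2024-01-15     120.50
  Invoice 5002   2024-02-03      75.00
Customer: 1002  Name: Jones Supply
  Invoice 5003   2024-02-10     310.25
"""

HEADER = re.compile(r"^Customer:\s+(\d+)\s+Name:\s+(.+?)\s*$")
DETAIL = re.compile(r"^\s+Invoice\s+(\d+)\s+(\d{4}-\d{2}-\d{2})\s+([\d.]+)\s*$")


def flatten(report_text):
    """Combine each detail line with its most recent header into one record."""
    records = []
    customer = None
    for line in report_text.splitlines():
        match = HEADER.match(line)
        if match:
            customer = (match.group(1), match.group(2))
            continue
        match = DETAIL.match(line)
        if match and customer:
            records.append(
                (*customer, match.group(1), match.group(2), float(match.group(3)))
            )
    return records


flat_records = flatten(REPORT)
```

Each output record carries the customer fields from the header line together with the invoice fields from the detail line, which is the row and column structure a transformation expects. Lines that match neither pattern, such as page titles, are ignored.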
Custom Report Reading Scripts
The integration platform is a universal data transformation tool allowing the translation of structured, field and record-oriented data from one format to another. You can transform data that resides in databases, spreadsheets, flat files, ASCII, Binary/EBCDIC, SQL, Reports, ODBC, Accounting systems, Legacy COBOL, Math/Stats packages, Text and many other file formats.
Implemented with a powerful GUI, the integration platform allows users to intelligently intervene in the transformation to filter and edit data into the exact output format required. Specifications governing the transformations can be set up interactively, or stored for subsequent automated use.
A major advance in data transformation technology is the ability to read complex text files (for example, reports, printouts, tagged downloads, etc.), extract the desired data fields from various lines in the text file and assemble those fields into a flat record of data. Thus whole records of structured data can be extracted and presented in the conventional row and column tabular format. This flattening of complex text files into clean fields and records occurs as a pre-transformation process by the CXL report reading engine. CXL requires the integration platform for downstream processing and transformation.
The flattening of the Source text file is accomplished by using special "scripts". To author such scripts, you can use the CXL (Content eXtraction Language) SDK files that describe the scripting language. This scripting language can be used to create or customize complex scripts.
Comments about Custom Report Reading Scripts
We have developed and compiled many custom scripts to flatten text files from popular data sources, thus enabling users to transform the text data to a more structured target format. These CXL scripts are available for purchase in the form of Custom Packs.
Custom Packs of CXL scripts are currently available, in packed and Source code format, for the following:
Dodge Reports (Data Line)
EDI (now available as a target connector)
Mailing Labels
Other than the copyright on the CXL language, there are no licensing controls on the scripts created by developers for specific text reading and transformation scenarios. Developers are free to exploit the possibilities in the CXL language for commercial or corporate use.
Creating a Report File Format
If the integration platform cannot read the native file format of your application, you may need to create a report file.
To do this, the application must have the ability to generate reports. The reports can then be printed to a file, rather than to a printer.
And while the integration platform cannot read a report file without some help, you can use another integration product, Content Extractor Editor, to extract the data out of the report and assemble it into a format that the integration platform can read and transform into the desired target format.
To read a report into Content Extractor Editor, it must exist as a disk file on your PC or network. The following are suggestions for accomplishing this.
Accessing Mainframe Data
How you bring mainframe data to the integration platform depends on the format of your source files. To access mainframe data, do one of the following depending on the source:
SQL Applications: You can have direct access to your backend SQL database by using your native application for your source connector, such as DB2. Of course, you may also use ODBC 3.x to get to SQL-based application data. This may require that a special ODBC driver be installed.
Sequential mainframe data: Produce a flat file and use Binary as the source connector. To make the process easier, use a COBOL copybook to assist you in creating your structured schema, or
Sequential mainframe data: Use the print-to-file method to create a print spooler file. Then open the file in Content Extractor Editor and create a script file. Next, create a data set and select Extractor as the source connector and the script file as the source file.
ODBC direct access for non-SQL formats: VSAM data, for instance, can be accessed by using a special ODBC driver. For more information, see ODBC 3.x.
Download of any Internal File Format (dumped from mainframe database or application) – even EBCDIC.
These application-specific solutions require that special software be installed:
MS HostServices/OLEDB (must be installed)
WAN (for example, Sun NFS, or any that makes mainframe a fileserver)
Recent Internet-related developments are also emerging as new options:
Through FTP directly
Through TCP directly
Through HTTP directly (Internet mainframes)
Use a COM interface. (For example, IBM’s CICS, a general-purpose online transaction processing application.)
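For the sequential flat-file option above, the text fields in a downloaded mainframe record are usually EBCDIC and must be decoded field by field. The following Python sketch shows the idea. The 12-byte layout (CUST-NAME PIC X(8) followed by CUST-ID PIC 9(4)) is hypothetical, and code page cp037 (US/Canada EBCDIC) is an assumption; your data may use a different EBCDIC code page.

```python
def decode_record(raw: bytes) -> dict:
    """Decode one record of the hypothetical layout, field by field."""
    return {
        # PIC X(8): EBCDIC text, space padded
        "name": raw[0:8].decode("cp037").rstrip(),
        # PIC 9(4): unsigned zoned digits, which decode as the characters 0-9
        "id": int(raw[8:12].decode("cp037")),
    }


# "SMITH   " followed by "1001", both in EBCDIC
raw_record = bytes(
    [0xE2, 0xD4, 0xC9, 0xE3, 0xC8, 0x40, 0x40, 0x40, 0xF1, 0xF0, 0xF0, 0xF1]
)
record = decode_record(raw_record)
```

Decoding by field rather than translating the whole record matters when the record also contains packed decimal or binary fields, because a blanket EBCDIC-to-ASCII translation during download would corrupt them.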
Accessing PC Data
If the report was generated in a PC application, you may either:
Print the report to a regular paper printer, then scan and OCR it to a disk file, or
Print the report directly to a disk file from the application by following the steps below to set up a special "print to file" driver on your PC.
To set up the generic/text printer
1. In Windows operating systems, open the Printers folder.
2. Double-click the Add Printer option to start the Add Printer Wizard.
3. In the Add Printer Wizard dialog, click Next.
4. Then select Local Printer and click Next.
5. From the list of manufacturers, select Generic. The Generic/Text Only entry displays in the Printers box on the right and is highlighted. Click Next.
6. From the list of available ports, select File. Click Next.
7. In the next dialog, the Printer Name displays as Generic/Text. You can select whether or not you want this printer to be the default printer in your Windows applications. Click Next.
8. The next dialog asks if you want to print a test page. Select No and click Finish. Windows prompts you for the Windows installation CDs from which it builds the printer driver.
9. Follow the instructions during the setup of the printer driver.
10. You are now able to print to a disk file from any Windows application by selecting the Generic/Text printer.
To create the report in the application
1. Select a fixed font, such as Courier, so the information in the report is positioned consistently.
2. Use field tags (field names), or create a columnar report with headings, to identify the data more easily in the report.
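Field tags make a report straightforward to parse, as the following Python sketch suggests. The tag names, the colon separator, and the blank line between records are assumptions made for this example; Content Extractor Editor would express the same rules in its own script format.

```python
# Hypothetical tagged report: one "Tag: value" pair per line,
# with a blank line between records.
TAGGED_REPORT = "Name: Ann Smith\nCity: Austin\n\nName: Bob Jones\nCity: Dallas\n"


def parse_tagged(text):
    """Collect tagged lines into one dictionary per record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        tag, _, value = line.partition(":")
        current[tag.strip()] = value.strip()
    if current:
        records.append(current)
    return records


people = parse_tagged(TAGGED_REPORT)
```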
Extracting Text from Binary Print Files
The integration platform does not directly support data extraction from binary print file formats such as Microsoft Word .doc, Hewlett Packard .pcl, Adobe .pdf, or Postscript .ps files. However, you can extract data through Content Extractor using one of the following methods:
While in that native application, print to a text file.
Use an external program to transform the source file to a text file, then import the text file into Content Extractor and extract the data.
Content Extractor can extract data to various popular file formats, so you may find it adequate by itself. Or you can use the integration platform to connect to your Content Extractor script and manipulate that data. You can extract data from .doc, .pcl, .pdf and .ps files using the instructions below.
Microsoft Word (.doc) Files
The easiest way to extract data from .doc files is to use Microsoft Word itself.
To extract data using Microsoft Word
1. Go to File and select Save As.
2. Click the down arrow to the right of the Save As Type box and select MS-DOS Text with Line Breaks.
If you have .doc files to transform and do not have Microsoft Word, go to http://support.microsoft.com/support/downloads. Search for "Word Viewer" to find the download page. Choose the Word Viewer for your operating system and download and install it for use in transformations.
To transform a .doc to a text file
From Word Viewer, you may do one of two things to transform your .doc to a text file:
1. From the Start menu, select Settings > Printers > Add Printer. A wizard guides you through the process of adding a generic text printer driver.
2. Select the Print to File box.
3. Select the Generic/Text printer option from the list next to the Print Name box.
To transform your .doc to a text file (alternative method)
1. Within Microsoft Word, go to the File menu and select Print. Select the Print to File box.
2. Select the Generic/Text printer option from the list next to the Print Name box.
3. Copy and paste the document from the Word Viewer to a Text Editor program such as Notepad.
4. Once your document has been saved as a text file, open it in the Content Extractor and extract the data.
Hewlett-Packard (.pcl) Files
While HP does not offer any way to export their .pcl files to text, there are many options available through third-party software. To find an external viewer, go to the Hewlett-Packard web site and download their trial software.
Once you have transformed your .pcl file to text, or extracted your data to a text file, open it in Content Extractor Editor and extract the data to the format you desire.
Adobe PDF (.pdf) Files
The best way to transfer your .pdf file to a text file is to mail the file to pdf2txt@adobe.com. They mail your file back to you in text format. However, if you would rather do the conversion yourself, you have the following options:
Go to the Adobe web site and download a free plug-in for Adobe Acrobat Reader. After you have installed the plug-in, in Acrobat Reader go to File and select Export to Text.
Go to pdfzone.com for a list of third-party tools that assist in the extraction of text from .pdf files.
Go to iceni.com for commercial-grade .pdf-to-text technology.
Once you have transformed your .pdf file to text or extracted your data to a text file, open it in Content Extractor Editor and extract the data to the format you need.
Postscript (.ps and .eps) Files
Third-party software can offer external viewers with options for exporting Postscript text content.
Go to http://www.cs.wisc.edu/~ghost/. This is the homepage for the cross-platform Ghostscript family of freeware viewer, conversion, and extraction tools. In addition to .ps formats, Ghostscript also claims to serve .pdf files. GSView is the Windows viewer.
Go to http://www.research.digital.com/SRC/virtualpaper/pstotext.html and download the Windows version of Ghostscript for running GSView.
Go to http://www.cs.wisc.edu/~ghost/gsview/ to download the Windows version of GSView.
To extract data from GSView, go to the Edit menu and select Text Extract from the list.
Once you have transformed your Postscript file to text, or extracted your data to a text file, open it in the Content Extractor Editor and extract the data to the format needed.
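If Ghostscript is installed, the text conversion can also be scripted rather than done through the GSView menu. The Python sketch below builds a command line that uses Ghostscript's txtwrite output device. Several things here are assumptions: that your Ghostscript version includes the txtwrite device, that the executable is named gs (on Windows it is typically gswin64c), and the file names report.ps and report.txt, which are placeholders.

```python
import shutil
import subprocess


def txtwrite_command(source_path, text_path, executable="gs"):
    """Build a Ghostscript command that writes the text content of a file."""
    return [
        executable,
        "-q",                     # suppress startup messages
        "-dNOPAUSE",              # do not pause between pages
        "-dBATCH",                # exit when processing is finished
        "-sDEVICE=txtwrite",      # text extraction output device
        f"-sOutputFile={text_path}",
        source_path,
    ]


def extract_text(source_path, text_path, executable="gs"):
    """Run the conversion, failing early if Ghostscript is not installed."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} is not on PATH")
    subprocess.run(txtwrite_command(source_path, text_path, executable), check=True)


command = txtwrite_command("report.ps", "report.txt")
```

The resulting text file can then be opened in Content Extractor Editor like any other print-to-file report.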
Tip...  Many additional native print streams for a large array of APA printers are available (such as IBM, AFP, and Xerox). The techniques for extracting data from the sources above should apply to most of these print streams. The Web is an excellent resource for discovering text extraction or conversion tools for many print stream formats.
Last modified date: 12/03/2024