Intermediate Methods to Connect to Data Formats
This section describes intermediate methods for connecting to data formats.
Using an Intermediate File Format
If the integration platform cannot read the native file format of your application, you may need to create an intermediate file format that the integration platform can read. This is done in the application from which you want to obtain some data.
To do this, the application must have the ability to "export" the data or "save as" to a common file format such as Delimited ASCII, dBASE, Lotus 1-2-3, etc.
Locate the documentation for your application and search its index for "Export" or "Save As". Then determine which common file formats your application can export data to and follow those instructions.
Best Practice — If preserving the field data types is important to you, select an export file format such as dBASE, if possible.
After the intermediate file is created, you can use the integration platform to transform that data to the desired destination with the ability to manipulate and clean the data as needed.
For example, if you exported the data to a dBASE file format, select one of the versions of dBASE as the source connector.
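To illustrate the kind of row-and-column data such an intermediate file carries, the sketch below parses a delimited ASCII export in plain Python (the field names and sample values are hypothetical):

```python
import csv
import io

def read_delimited_export(text, delimiter=","):
    """Parse a delimited ASCII export (for example, a CSV saved from a
    spreadsheet) into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

# Hypothetical sample of an exported intermediate file
sample = "id,name,amount\n1,Acme,100.50\n2,Globex,75.00\n"
records = read_delimited_export(sample)
```

Note that a delimited export carries every value as text; this is why a format such as dBASE, which preserves field data types, can be the better choice.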
Alternative Connection Methods
When you cannot connect to a data file or table using direct methods, there is often an alternative method. Refer to the following sections for suggestions.
Accessing Data on Tape Devices
Windows and Linux
Use a utility to dump the data from the tape device to disk, then use Binary as the source connector to access the data on the disk.
Best Practice — The tape unload utility often includes "deblocking" logic, so the data on disk is "cleaner" than it was on tape; variable-length records and blocking schemes can otherwise make it difficult for the Binary connector to read the data.
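The effect of deblocking can be sketched in a few lines of Python. This is a simplified stand-in for what a tape unload utility does, assuming fixed-length records with no block headers:

```python
def deblock(data: bytes, record_length: int) -> list:
    """Split a raw byte dump into fixed-length records, discarding any
    trailing partial record (a simplified stand-in for the deblocking
    a tape unload utility performs)."""
    return [data[i:i + record_length]
            for i in range(0, len(data) - record_length + 1, record_length)]

# Two complete 3-byte records plus a partial trailer
records = deblock(b"AAABBBCC", 3)
```

Once the records are uniform like this, a Binary source connector with a fixed record length can read them cleanly.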
Linux
Directly read the data. If the tape drive is on a Linux platform, the integration platform can read the data directly off the tape drive with the Tape Drive Sequential connector.
Windows
1. Make sure your tape device has file management capability. If so, you should be able to use the Tape Drive Sequential connector as a source connector.
2. Create a named pipe. Copy a sample of the file from the tape drive to design the map. Then run the map from the command line and use source overrides to connect to the named pipe.
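The named-pipe technique can be illustrated with a POSIX FIFO, the Linux analogue of a Windows named pipe (Windows pipes use the `\\.\pipe\` namespace instead). In this hedged sketch, one thread plays the role of the process feeding data off the tape, and the read stands in for the map's source override:

```python
import os
import tempfile
import threading

# Create a FIFO; on Windows the map would instead be pointed at a
# \\.\pipe\... path via a source override.
fifo = os.path.join(tempfile.mkdtemp(), "tape_feed")
os.mkfifo(fifo)

def writer():
    # Blocks until a reader opens the pipe, then streams the records.
    with open(fifo, "w") as f:
        f.write("record 1\nrecord 2\n")

t = threading.Thread(target=writer)
t.start()
with open(fifo) as f:          # the "map" reads the pipe like a file
    lines = f.read().splitlines()
t.join()
```

The key point is that the map never sees the tape device: it reads an ordinary file path that happens to be a pipe.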
Data Warehouse File Access
Most popular database applications provide a loader utility designed for feeding detail data into data warehouses and data marts. For instance, Oracle SQL Loader reads flat file data and inserts that data into one or more database tables in a data warehouse.
The integration platform provides connectors that can read and write those loader files. The following list includes some of the available connectors but is not all-inclusive:
Connectors
IBM DB2 Loader
Oracle Direct Path 10g, 11g, and 12c
Oracle SQL Loader
SQL Server 2000 Mass Insert
Teradata (Fastload)
Multiple Table Loading Connectors
SQL Script
SQL Server 7 Multimode (OLEDB)
SQL Server 2000 Multimode
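Loader utilities typically consume delimited or positional flat files. As an illustration only (the exact layout is defined by each loader's own documentation), a positional flat file can be generated like this:

```python
def to_fixed_width(rows, layout):
    """Render rows as a positional flat file of the kind loader
    utilities consume. layout is a list of (field, width) pairs;
    values are left-justified and truncated to the given width."""
    lines = []
    for row in rows:
        lines.append("".join(str(row[f]).ljust(w)[:w] for f, w in layout))
    return "\n".join(lines) + "\n"

# Hypothetical two-field layout: a 4-column id and an 8-column name
flat = to_fixed_width([{"id": 1, "name": "Acme"}],
                      [("id", 4), ("name", 8)])
```

A loader control file would then describe these same positions to the database's load utility.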
Connecting Through Content Extractor Editor/CXL Scripts
Two integration tools, Content Extractor Editor and CXL, can help users extract structured data from text files and reports.
If you need to transfer data from one application to another, you are often challenged with getting useful data out of the source application. Methods to extract data through native connections and intermediate import/export formats are sometimes not available or not appropriate. In such cases, there is a fallback technique. Almost all applications have the ability to send information to a printer. To get data out of an application, redirect the printout to a text file.
Now the problem becomes a more classic data transformation, where the source file is the printout or report captured as a raw text file, but containing an inherent structure. Employing the pattern-recognition capability in the CXL scripting language, users can create scripts to extract data fields from the raw text lines and assemble those fields into clean records.
Two ways are available to create the scripts. One is to use the Content Extractor Editor user interface for visual markup of the extraction rules by direct manipulation of the text file on the screen. The other is to author the scripts directly using the CXL SDK. The latter option is useful when the rules to extract the data are too complex to be expressed in a GUI.
The added value of the Content Extractor Editor/CXL solution is that once the script is saved, the integration platform can use it. In this case, the script is used for preprocessing an incoming source text file, flattening it into a regular row and column structure. Then you can map and manipulate the data in a transformation.
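As a rough illustration of the pattern-recognition approach (in Python rather than CXL, with an entirely hypothetical report layout), fields scattered across report lines can be assembled into flat records like this:

```python
import re

# A captured report fragment (hypothetical layout): each order spans
# two print lines, and the pair is flattened into one record.
report = """\
Order: 1001   Date: 2024-01-15
  Customer: Acme Corp      Total: 250.00
Order: 1002   Date: 2024-01-16
  Customer: Globex         Total: 75.50
"""

order_re = re.compile(r"Order:\s+(\d+)\s+Date:\s+(\S+)")
detail_re = re.compile(r"Customer:\s+(.+?)\s+Total:\s+([\d.]+)")

records, current = [], None
for line in report.splitlines():
    m = order_re.search(line)
    if m:
        current = {"order": m.group(1), "date": m.group(2)}
        continue
    m = detail_re.search(line)
    if m and current:
        current["customer"] = m.group(1)
        current["total"] = float(m.group(2))
        records.append(current)
```

A CXL script expresses the same idea declaratively: recognize a line pattern, capture fields, and emit a record once all fields for a row are collected.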
See Also
Creating a Report File Format
Actian Extract
Custom Report Reading Scripts
The integration platform is a universal data transformation tool allowing the translation of structured, field and record-oriented data from one format to another. You can transform data that resides in databases, spreadsheets, flat files, ASCII, Binary/EBCDIC, SQL, Reports, ODBC, Accounting systems, Legacy COBOL, Math/Stats packages, Text and many other file formats.
Implemented with a powerful GUI, the integration platform allows users to intelligently intervene in the transformation to filter and edit data into the exact output format required. Specifications governing the transformations can be set up interactively, or stored for subsequent automated use.
A major advance in data transformation technology is the ability to read complex text files (for example, reports, printouts, tagged downloads, etc.), extract the desired data fields from various lines in the text file and assemble those fields into a flat record of data. Thus whole records of structured data can be extracted and presented in the conventional row and column tabular format. This flattening of complex text files into clean fields and records occurs as a pre-transformation process by the CXL report reading engine. CXL requires the integration platform for downstream processing and transformation.
The flattening of the Source text file is accomplished by using special "scripts". To author such scripts, you can use the CXL (Content eXtraction Language) SDK files that describe the scripting language. This scripting language can be used to create or customize complex scripts.
Comments about Custom Report Reading Scripts
We have developed and compiled many custom scripts to flatten text files from popular data sources, thus enabling users to transform the text data to a more structured target format. These CXL scripts are available for purchase in the form of Custom Packs.
Custom Packs of CXL scripts are currently available, in packaged and source code formats, for the following:
Dodge Reports (Data Line)
EDI (now available as a target connector)
Mailing Labels
Other than the copyright on the CXL language, there are no licensing controls on the scripts created by developers for specific text reading and transformation scenarios. Developers are free to exploit the possibilities in the CXL language for commercial or corporate use.
See Also
Creating a Report File Format
Actian Extract
Creating a Report File Format
If the integration platform cannot read the native file format of your application, you may need to create a report file.
To do this, the application must have the ability to generate reports. The reports can then be printed to a file, rather than to a printer.
While the integration platform cannot read a report file without some help, you can use another integration product, Content Extractor Editor, to extract the data from the report and assemble it into a format that the integration platform can read and transform into the desired target format.
To read a report into Content Extractor Editor, it must exist as a disk file on your PC or network. The following are suggestions for accomplishing this.
Accessing Mainframe Data
How you bring mainframe data to the integration platform depends on the format of your source files. To access mainframe data, do one of the following depending on the source:
SQL Applications: You can access your backend SQL database directly by using the native connector for your application, such as DB2, as the source connector. You may also use ODBC 3.x to access SQL-based application data; this may require that a special ODBC driver be installed.
Sequential mainframe data: Produce a flat file and use Binary as the source connector. To make the process easier, use a COBOL copybook to assist you in creating your structured schema, or
Sequential mainframe data: Use the print-to-file method to create a print spooler file. Then open the file in Content Extractor Editor and create a script file. Next, create a data set and select Extractor as the source connector and the script file as the source file.
ODBC direct access for non-SQL formats: VSAM data, for instance, can be accessed by using a special ODBC driver. For more information, see ODBC 3.x.
Download of any internal file format (dumped from a mainframe database or application), even EBCDIC.
These application-specific solutions require that special software be installed:
MS HostServices/OLEDB (must be installed)
WAN (for example, Sun NFS, or any that makes mainframe a fileserver)
Recent Internet-related developments are also emerging as new options:
Through FTP directly
Through TCP directly
Through HTTP directly (Internet mainframes)
Use a COM interface. (For example, IBM’s CICS, a general-purpose online transaction processing application.)
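As a small illustration of the EBCDIC download case, a fixed-length mainframe record can be decoded in Python. The field offsets here are hypothetical stand-ins for what a COBOL copybook would supply (for example, PIC X(6) account followed by PIC X(10) name):

```python
# Simulate a 16-byte EBCDIC record as it might arrive in a download.
record = "000123JOHN DOE  ".encode("cp037")

def parse_record(raw: bytes) -> dict:
    """Decode an EBCDIC (code page 037) fixed-length record using
    offsets that would normally come from a COBOL copybook."""
    text = raw.decode("cp037")
    return {"account": text[0:6], "name": text[6:16].rstrip()}

parsed = parse_record(record)
```

This is essentially what a Binary source connector does when given a schema derived from the copybook: slice the record at known offsets and translate the EBCDIC code page.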
Accessing PC Data
If the report was generated in a PC application, you may either:
Print the report to a regular paper printer, then scan and OCR it to a disk file, or
Print the report directly to a disk file from the application by following the steps below to set up a special "print to file" driver on your PC.
To set up the generic/text printer
1. In Windows operating systems, open the Printers folder.
2. Double-click the Add Printer option to start the Add Printer Wizard.
3. In the first Add Printer Wizard dialog box, click Next.
4. At the next dialog box, select Local Printer and click Next.
5. From the list of manufacturers, select Generic. The Generic/Text Only entry displays in the Printers box on the right and is highlighted. Click Next.
6. From the list of available ports, select File. Click Next.
7. In the next dialog box, the Printer Name displays as Generic/Text. You can select whether or not you want this printer to be the default printer in your Windows applications. Click Next.
8. The next dialog asks if you want to print a test page. Select No and click Finish. Windows prompts you for the Windows installation CDs from which it builds the printer driver.
9. Follow the instructions during the setup of the printer driver.
You are now able to print to a disk file from any Windows application by selecting the Generic/Text printer.
To create the report in the application
1. Select a fixed font, such as Courier, so the information in the report is positioned consistently.
2. Use field tags (field names), or create a columnar report with headings, to identify the data more easily in the report.
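Because a fixed font keeps the columns aligned, the heading positions themselves can drive the extraction. The following hedged Python sketch, using a hypothetical columnar report, shows why both suggestions above matter:

```python
# Parse a columnar report whose headings mark the field positions
# (reliable only because the report was printed in a fixed font).
report = (
    "NAME        QTY   PRICE\n"
    "Widget      10    19.99\n"
    "Gadget      3     5.25\n"
)
lines = report.splitlines()
header = lines[0]
# Column start positions come straight from the heading row.
starts = [header.index(h) for h in ("NAME", "QTY", "PRICE")]
ends = starts[1:] + [None]

rows = []
for line in lines[1:]:
    name, qty, price = (line[s:e].strip() for s, e in zip(starts, ends))
    rows.append({"name": name, "qty": int(qty), "price": float(price)})
```

With a proportional font, the same values would drift out of their columns and position-based slicing like this would fail.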
Extracting Text from Binary Print Files
The integration platform does not directly support data extraction from binary print file formats such as Microsoft Word .doc, Hewlett Packard .pcl, Adobe .pdf, or Postscript .ps files. However, you can extract data through Content Extractor using one of the following methods:
While in that native application, print to a text file.
Use an external program to transform the source file to a text file, then import the text file into Content Extractor and extract the data.
Content Extractor can extract data to various popular file formats, so you may find it adequate by itself. Or you can use the integration platform to connect to your Content Extractor script and manipulate that data. You can extract data from .doc, .pcl, .pdf and .ps files using the instructions below.
Microsoft Word (.doc) Files
The easiest way to extract data from .doc files is to use Microsoft Word itself.
To extract data using Microsoft Word
1. Go to File and select Save As.
2. Click the down arrow to the right of the Save As Type box and select MS-DOS Text with Line Breaks.
If you have .doc files to transform and do not have Microsoft Word, go to http://support.microsoft.com/support/downloads. Search for "Word Viewer" to find the download page. Choose the Word Viewer for your operating system and download and install it for use in transformations.
To transform a .doc to a text file
From Word Viewer, print the document to a file:
1. From the Start menu, select Settings > Printers > Add Printer. A wizard guides you through the process of adding a generic text printer driver.
2. Go to the File menu, select Print, and select the Print to File box.
3. Select the Generic/Text printer option from the list next to the Printer Name box.
To transform your .doc to a text file (alternative methods)
1. Within Microsoft Word, go to the File menu and select Print. Select the Print to File box, then select the Generic/Text printer option from the list next to the Printer Name box.
2. Copy and paste the document from the Word Viewer to a text editor program such as Notepad and save it as a text file.
Once your document has been saved as a text file, open it in Content Extractor and extract the data.
Hewlett-Packard (.pcl) Files
While HP does not offer a way to export its .pcl files to text, many options are available through third-party software. To find an external viewer, go to the Hewlett-Packard web site and download trial software.
Once you have transformed your .pcl file to text, or extracted your data to a text file, open it in Content Extractor Editor and extract the data to the format you desire.
Adobe PDF (.pdf) Files
The best way to convert your .pdf file to a text file is to mail the file to pdf2txt@adobe.com. The file is mailed back to you in text format. However, if you would rather do the conversion yourself, you have the following options:
Go to the Adobe web site and download a free plug-in for Adobe Acrobat Reader. After you have installed the plug-in, in Acrobat Reader go to File and select Export to Text.
Go to pdfzone.com for a list of third-party tools that assist in the extraction of text from .pdf files.
Go to iceni.com for commercial-grade .pdf-to-text technology.
Once you have transformed your .pdf file to text or extracted your data to a text file, open it in Content Extractor Editor and extract the data to the format you need.
Postscript (.ps and .eps) Files
Third-party software can offer external viewers with options for exporting Postscript text content.
Go to http://www.cs.wisc.edu/~ghost/. This is the homepage for the cross-platform Ghostscript family of freeware viewer, conversion, and extraction tools. In addition to .ps formats, Ghostscript also claims to serve .pdf files. GSView is the Windows viewer.
Go to http://www.research.digital.com/SRC/virtualpaper/pstotext.html and download the Windows version of Ghostscript for running GSView.
Go to http://www.cs.wisc.edu/~ghost/gsview/ to download the Windows version of GSView.
To extract data from GSView, go to the Edit menu and select Text Extract from the list.
Once you have transformed your Postscript file to text, or extracted your data to a text file, open it in the Content Extractor Editor and extract the data to the format needed.
Tip:  Many additional native print streams for a large array of APA printers are available (such as IBM AFP and Xerox). The techniques for extracting data from the sources above should apply to most of these print streams. The Web is an excellent resource for discovering text extraction or conversion tools for many print stream formats.