Data File Formats

This section explains how the transformation tools categorize the various types of data file formats. Although most users see data as it appears on the integration platform user interface, the integration platform sees the data as it is stored in the data file.

Below are the most popular data formats read by the integration platform:

The list of applications in each category is not a complete listing of all the supported formats. Delimited ASCII is so varied that it deserves a category by itself. A few formats appear in two categories because of possible variations in the particular data file you are working with.

Structured formats are data files in which both the data and the file structure are stored in the same file. (There may be additional memo files, such as in dBASE or xBASE.) Complete metadata should exist that determine file structure. When the integration platform reads these data files, the data is automatically "parsed" into fields and records. It is not necessary to know the schema of the file.

Structured file formats are the easiest type to view and transform. Some of the applications in this category are dBASE, DataEase, DataFlex, Excel, Goldmine, Lotus 1-2-3, Quattro Pro, SAS, SPLUS, SPSS, USMARC, and XDB.

For more information, search for "source and target" in the documentation.

Semi-structured formats are data files in which the data and some of the file structure is stored in the data file. Some metadata exists to define the files. For instance, there is often an index file that accompanies the data file—its file extension is often .idx.

When these data files are read, the data is usually parsed into records, but not into fields. When you look at these files in a source or target browser, you see clearly defined rows (records), but the columns (fields) are not defined. You must either specify an external dictionary file from which the integration platform reads the field structure, or manually define the fields in the Data Parser.

The data is often a mixture of text, packed, and binary data. It is usually recommended that you have a schema from which to work, or at least be familiar with packed and binary data types when defining the fields. However, this is not mandatory, since the Data Parser assists you visually, and allows "trial and error" efforts.

An application in this category is ASCII (Fixed).

Unstructured formats are data files in which only the data is stored. No metadata defines the structure of the files. There are no readable field and no record separators; therefore, you must either specify an external dictionary file from which the integration platform reads the record and field structure, or manually define the fields in the Data Parser.

This data is often a mixture of ASCII, EBCDIC, packed, and binary data. It is usually recommended to have a schema from which to work, or at least be familiar with packed and binary data types when defining the fields. However, this is not mandatory, since the Data Parser assists you visually, and allows "trial and error" efforts.

Some of the applications/formats in this category are ASCII (Fixed) and Binary.

For more information, search for "source and target" in the documentation.

Delimited ASCII data could easily fall into the category with other structured data file formats because it is indeed structured data. However, because delimited ASCII data can include so many variations, we have elected to give it a category of its own.

The rules that apply to all valid delimited ASCII data files are as follows:

• Each character of the data must be one of the first 128 ANSI characters. In other words, the data must be readable – it cannot contain any binary or packed data.

• Each record within the data file must contain the same number of fields. In other words, if one record contains 8 fields, every record must contain 8 fields.

If you are not familiar with the terms "record," "field," or "Null" character, see the Glossary.

Delimited ASCII data files may (or may not) contain a header at the beginning of the file. If a header is present, it may contain information that is totally irrelevant to the transformation, or it may contain field names which are relevant. If the header is irrelevant, set the Starting Offset in Source Properties to the byte length of the header so that the integration platform for skip the header. If the header contains field names, set the Header in Source Properties to True to instruct the integration platform to use that that information to name the fields.

Record separators, field separators, and field delimiters may include any one or any combination of the entire 256 ANSI character set (excluding the null character). The standard for most delimited ASCII files on Microsoft Windows operating systems is to use a comma as the field separator, a carriage-return and a line feed as the record separator, and quotation marks as delimiters. If the separators and delimiters are non-standard in your file, open Source Properties and set them to the appropriate values.

The integration platform is designed to handle many of the possible variations of delimited ASCII data. The Hex Browser can help you determine variations in the files you are working with.

For more information, see ASCII (Delimited).

This category includes the wide variety of SQL and relational database systems. Although the integration platform reads the data from these systems already parsed into fields and records, there are special features that necessitate putting this type of data into a separate category.

Relational database systems store data in multiple tables linked by primary key fields. When you use the integration platform to transform data from or to one of these systems, you typically transform only one table per transformation. It may be necessary to create and save many transformations to complete a data integration project.