Formats
A format describes the structure and representation of values in a data file. Given the format, the contents of a file can be interpreted as a sequence of records. Conversely, the format makes it possible to write a given sequence of records to a file. Formats are often text-based (such as CSV), but may also be binary.
Some examples of the information encapsulated in a format:
Separation of individual records. For instance, the use of UNIX-style newlines to delimit records.
Identification of individual field values within records. For example, the use of commas to separate fields in CSV or the quoting of field values to escape special characters.
Recognition of embedded metadata, such as comments.
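To make these three aspects concrete, the following is a minimal sketch in Python. It is not the DataFlow API: the `TextFormat` class, its field names, and the `#` comment convention are hypothetical and chosen only for illustration. The sketch bundles a record separator, a field delimiter with quoting, and a comment marker into one description, then uses that description both to read text as records and to write records back out.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class TextFormat:
    """Hypothetical format description (illustration only, not a DataFlow class)."""

    record_separator: str = "\n"  # UNIX-style newline delimits records
    field_delimiter: str = ","    # comma separates fields within a record
    quote_char: str = '"'         # quoting escapes special characters in a field
    comment_prefix: str = "#"     # assumed convention: such lines are metadata

    def read(self, text):
        """Interpret text as a list of records, each a list of field strings."""
        # Simplification: splitting on the separator first means quoted fields
        # that contain the record separator are not supported by this sketch.
        lines = [
            line
            for line in text.split(self.record_separator)
            if line and not line.startswith(self.comment_prefix)
        ]
        reader = csv.reader(
            lines, delimiter=self.field_delimiter, quotechar=self.quote_char
        )
        return list(reader)

    def write(self, records):
        """Render records as text, quoting fields only where necessary."""
        buffer = io.StringIO()
        writer = csv.writer(
            buffer,
            delimiter=self.field_delimiter,
            quotechar=self.quote_char,
            lineterminator=self.record_separator,
        )
        writer.writerows(records)
        return buffer.getvalue()


fmt = TextFormat()
text = '# exported sample\nid,name\n1,"Smith, Jane"\n2,Lee\n'
records = fmt.read(text)      # the comment line is skipped as metadata
round_trip = fmt.write(records)
```

Here `records` holds three records of two fields each, and the quoted value `Smith, Jane` survives as a single field despite containing the delimiter. Writing the records back reproduces the data lines without the comment, since the comment is metadata rather than a record.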
In most cases, formats are not used explicitly. The file-based operators provided with DataFlow, such as those outlined in Performing I/O Operations, manage the format internally. Users need only specify certain aspects of the format, and the operator handles the rest. However, if no existing file operator supports the format of a file, it may be necessary to define a specialized format for use with one of the generic file I/O operators.