File Connections
The following file connections are available:
Physical Data Connections
Intermediate Import and Export Format Connections
XML Support
eDoc Connections
Data File Formats
HDFS Connectivity
Physical Data Connections
The most basic connection is from a dataset to a physical file. There are two processes for connecting to physical files:
Raw Sequential Connection Process
Physical File Format Connection Process
Raw Sequential Connection Process
This process can be used to open, read, and parse data from any fixed record length sequential file, such as COBOL legacy data. This includes the ability to read ASCII or EBCDIC and text or binary data of virtually any type or style (for example, COBOL packed, reverse byte-order, old floating point formats, and blobs). You can define rules governing a particular flat file and its structure and then extract clean data records for transformation.
Because of its binary reading capability, the integration platform can extract data from unknown raw file formats. Most commercial applications store data records using fixed-length techniques. When this is the case, the integration platform can be used as a brute force extractor to open the file, navigate to the beginning of fixed data, and rip records out according to rules that you create.
For related information, see Binary.
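The brute-force extraction process described above can be sketched in code. The following Python example is a hypothetical illustration only: the record length, field offsets, EBCDIC codec (cp037), and field names are assumptions, not part of the platform.

```python
import struct

RECORD_LEN = 30  # rule: every record occupies exactly 30 bytes

def parse_record(raw: bytes) -> dict:
    """Slice one fixed-length record into fields by byte offset."""
    return {
        # bytes 0-9: EBCDIC-encoded name (decoded with the cp037 codec)
        "name": raw[0:10].decode("cp037").rstrip(),
        # bytes 10-13: big-endian 32-bit integer (reverse byte order on x86)
        "account": struct.unpack(">i", raw[10:14])[0],
        # bytes 14-29: remaining payload kept as an opaque blob
        "blob": raw[14:30],
    }

def read_records(path: str):
    """Navigate the file record by record and yield parsed rows."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD_LEN):
            if len(chunk) == RECORD_LEN:
                yield parse_record(chunk)
```

Rules of this kind (offsets, lengths, and encodings) are what you define in the platform when describing the structure of a flat file.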
Physical File Format Connection Process
This process is used to physically read and write in the native internal storage format of the file. For example, you can use this process with xBASE files. The integration platform can use this process when there is published information describing in detail the internal storage structure of the file.
The advantage of this process is that it provides fast access, high performance, and code that can be used across platforms. However, because it requires developers to keep up with changes to file formats, this process is not often used. It is, however, an option for certain file formats, particularly text formats and those that are old or obsolete.
Intermediate Import and Export Format Connections
When the integration platform cannot connect directly to source or target data, it uses intermediate import and export formats for data transformation.
Although all programs, applications, and software packages store data in native internal formats, most offer some primitive level of import and export support (for example, .sdf, .dbf, .wk1) for reading and writing external data. The integration platform can read and write most dialects of intermediate file (fixed ASCII, delimited ASCII, sequential, Lotus 123, Excel, and xBASE), modifying and mapping data between unequal sources and targets to perform transformations.
Much of this work involves modifying the structure of fields (for example, adding, deleting, rearranging, and changing field sizes) and modifying the data itself (for example, parsing names, formatting dates, and expanding codes). The integration platform provides a visual interface for performing this work quickly.
The following topic provides details about how the integration platform works with intermediate import and export file formats.
Importing Through Intermediate Data
When the integration platform cannot write directly to the native format of a target application, you can use its ability to import through intermediate data if the application has a minimum facility for importing external data files. Using such import capabilities enables you to insert data into complex internal native file systems and to perform indexing, transaction control, audit trail updates, and other advanced tasks. However, to import the data, target applications often require that it first be organized in particular ways.
The integration platform provides a utility that cleans and formats data for batch loading into SQL databases such as Oracle, SQL Server, and Sybase. Although the integration platform can transform data directly into SQL targets (with the native API or ODBC), SQL inserts of individual records into a database table are slow and therefore not practical for large amounts of data.
Most SQL databases have an off-line batch loading facility (for example, BCP for Sybase and SQL Server, SQL Loader for Oracle, DBLoad for Informix). These loader utilities have been optimized for the mass insertion of large amounts of data directly into database tables but demand that incoming data be formatted in specific ways. You can use the integration platform to map, manipulate, and transform the foreign source data into a format suitable for feeding into batch loader tools.
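As a rough sketch of this reshaping step, the following Python example normalizes source rows into a delimited file that a batch loader could ingest. The field names, source date format, and pipe-delimited output convention are all hypothetical.

```python
import csv
from datetime import datetime

def to_loader_format(rows, out_path):
    """Write rows as pipe-delimited records, with names split and dates in ISO format."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        for row in rows:
            # parse "Last, First" names into separate fields
            last, first = (part.strip() for part in row["name"].split(","))
            # reformat "MM/DD/YYYY" source dates to ISO 8601
            hired = datetime.strptime(row["hired"], "%m/%d/%Y").date().isoformat()
            writer.writerow([last, first, hired])
```

The resulting file would then be described to the loader tool (for example, in a SQL*Loader control file) in whatever layout that tool demands.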
Exporting Through Intermediate Data
When the integration platform cannot read data directly from the native format of a source application, you can use that application's ability to export data if it has a minimal facility for exporting data files. However, exported data often needs to be transformed into formats that downstream applications can use. You can use the integration platform to transform the data as needed for downstream use.
XML Support
You can read and write all dialects of XML, whether or not you have DTDs or schemas to define the file structure. Simply use XML as your source or target connector and optionally specify any DTD or schema files in the Source or Target Schema box.
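For comparison, schema-less XML can also be read with ordinary tooling. This hypothetical Python sketch (the invoice element and attribute names are invented) parses an XML document without any DTD or schema:

```python
import xml.etree.ElementTree as ET

def invoices_to_dicts(xml_text):
    """Parse <invoice> elements into plain dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        # attributes come from get(), child element text from findtext()
        {"id": inv.get("id"), "total": float(inv.findtext("total"))}
        for inv in root.iter("invoice")
    ]
```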
The following list does not include every XML flavor that the integration platform supports. You may be familiar with an industry standard that is not listed here. If the markup language uses the basic elements of standard XML markup, it is probably compatible with the XML metadata connector.
DataConnect supports XML, XML Schema, and DTD Editors from the Eclipse Web Tools Platform. For information about the XML Editors, see the following link:
http://help.eclipse.org/mars/index.jsp?topic=%2Forg.eclipse.wst.xmleditor.doc.user%2Ftopics%2Ftxedttag.html
To find standards and specification information on any of the markup languages listed below, use your favorite Web search engine.
ACORD XML – Association for Cooperative Operations Research and Development
Adex – Document type for newspaper classified ads
ADML – Astronomical Dataset Markup Language
AdXML – Advertising for XML
AIML – Astronomical Instrument Markup Language
AnthroGlobe Schema – Social Sciences metadata format
AppML – Application Markup Language
BoleroXML – Bolero.net's cross-industry XML standard
BSML – Bioinformatic Sequence Markup Language
CDF – Channel Definition Format
CIDS – Component Information Dictionary Standard
CIDX – Chemical Industry Data Exchange
CML – Chemical Markup Language
cXML – Commerce XML
CoopML – Cooperation Markup Language
CWMI – Common Warehouse Metamodel Interchange
DAML+OIL – DARPA Agent Markup Language + Ontology Inference Layer
DITA – Darwin Information Typing Architecture
DocBook XML – DTD for books, computer documentation
ebXML – Electronic Business Markup Language
Eda XML – Electronic Design Automation XML
esohXML – Environmental, Safety and Occupational Health XML
FpML – Financial products Markup Language
GDML – Geometry Description Mark-up Language
GEML – Gene Expression Markup Language
HumanML – Human Markup Language
JSML – JSpeech Markup Language
LMML – Learning Material Markup Language
LOGML – Log Markup Language
MathML – Mathematical Markup Language
MCF – Meta Content Framework
MoDL – Molecular Dynamics Language
MusicXML – Music XML
NetBeans XML Project – Open source XML editor for NetBeans Integrated Development Environment
NewsML – News Markup Language
NITF – News Industry Text Format
OMF – Weather Observation Definition Format
OAGI – Open Applications Group
OpenOffice.org XML Project – Sun Microsystems' XML file format used in the StarOffice suite
OSD – Open Software Description Format
PetroXML – Oil and Gas Field Operations DTD
P3P – Platform for Privacy Preferences
PMML – Predictive Model Markup Language
QEL – Quotation Exchange Language
rezML – XML DTD and style sheets for Resume and Job Listing structures
SMBXML – Small and Medium Business XML
SML – Spacecraft Markup Language
TranXML – XML for Transportation and Logistics
UBL – Universal Business Language
UGML – Unification Grammar Markup Language
VCML – Value Chain Markup Language
VIML – Virtual Instruments Markup Language
VocML – Vocabulary Markup Language
WSCI – Web Service Choreography Interface
X3D – Extensible 3D
XBEL – XML Bookmark Exchange Language
XBRL – eXtensible Business Reporting Language
XFRML – Extensible Financial Reporting Markup Language
XGMML – Extensible Graph Markup and Modeling Language
XLF – Extensible Logfile Format
XML/EDI Group – XML and Electronic Data Interchange (EDI) ebusiness initiative
XVRL – Extensible Value Resolution Language
XML Schema
XML Schema is a file that defines the content used in an XML document.
XML Schema provides data typing, which enables defining data by type (for example, character or integer). Schema reuse, or schema inheritance, enables tags referenced in one schema to be used in other schemas. XML Schema namespaces enable multiple schemas to be combined into one. Global attributes within XML Schema assign properties to all elements. XML Schema also allows associating Java classes for additional data processing. Authoring information adds improved documentation for schema designers.
Unlike DTD files, XML schemas are written in XML syntax. Although more verbose than DTD files, XML Schemas can be created with any XML tools. Examples of XML schemas can be found at www.xml.org.
In XML Schema, there is a basic difference between complex types, which allow elements in their content and may carry attributes, and simple types, which cannot have element content and cannot carry attributes. There is also a major distinction between definitions that create new types (both simple and complex), and declarations that enable the appearance in document instances of elements or attributes with specific names and types (both simple and complex).
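The distinction can be illustrated with a small, hypothetical schema fragment: a definition that creates a complex type, and a declaration that makes an element of that type legal in document instances.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- definition: a complex type with element content and an attribute -->
  <xs:complexType name="OrderType">
    <xs:sequence>
      <xs:element name="quantity" type="xs:int"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string"/>
  </xs:complexType>
  <!-- declaration: an element of that type may now appear in instances -->
  <xs:element name="order" type="OrderType"/>
</xs:schema>
```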
The following simple types are currently supported for XML Schema connections:
Binary
Boolean
byte
century
date
decimal
double
ENTITIES
ENTITY
float
ID
IDREF
IDREFS
int
integer
language
long
month
Name
NCName
NegativeInteger
NMTOKEN
NMTOKENS
nonNegativeInteger
nonPositiveInteger
NOTATION
PositiveInteger
QName
RecurringDate
recurringDay
recurringDuration
short
string
time
timeDuration
timeInstant
timePeriod
unsignedByte
unsignedInt
unsignedLong
unsignedShort
urlReference
year
Within XML Schema, new simple types are defined by derivation from existing simple types (built-ins and derived) through a technique called restriction. A new type must have a name different from the existing type, and the new type may constrain the legal range of values obtained from the existing type by applying one or more facets.
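A hypothetical example of such a derivation: a new simple type restricts xs:string by applying length and pattern facets.

```xml
<xs:simpleType name="PartNumber">
  <xs:restriction base="xs:string">
    <!-- facets constrain the legal range of values of the base type -->
    <xs:length value="8"/>
    <xs:pattern value="[A-Z]{3}-[0-9]{4}"/>
  </xs:restriction>
</xs:simpleType>
```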
XML Schema simple types are supported to the extent that the integration platform recognizes them and uses a type with that name. However, the integration platform has no way to process many of the simple types, such as timePeriod and century, so those types are used in name only within your transformations.
XML Schema (MS XDR Biztalk)
BizTalk uses XML to describe schemas with a notation convention called XML-Data Reduced. The abbreviated term, XDR, stands for the schema format that most recent Microsoft XML schema parsers understand. The advantage of using XML to describe a schema, as XDR conventions dictate, is that users can apply their existing XML skills to describing the schema.
XDR also has advantages over other forms of contract description in use, such as DTD. Schemas allow you to describe the names of tags and the order in which they appear. You can also specify that the content of one element is a string of characters and that the content of another is a number.
To set the XDR Schema for source or target files
1. In the map window, click the down arrow to the right of the Source or Target Structured Schema box and select XDR (XML Data Reduced) as the External Type.
2. Navigate to the schema file you want to use.
3. After you have connected to the file, two radio button options appear: Replace Existing Layout(s) and Append to Existing Layout(s). Select one, and click OK to save your selection.
Element and Attribute
XML documents consist of elements and attributes. If a field is an attribute, its Default Expression contains the attribute type (the attribute field name). The Default Expression of an element field, however, is empty.
Attribute of an Element
An element type may contain attributes. In this case, the attribute field is named element field name (element type)_attribute field name (attribute type). Because an attribute field could otherwise have the same name as another field in the same record, this convention avoids naming conflicts.
XML Schema (W3C XSDL)
This XML Schema is the current standard of the World Wide Web Consortium (W3C). XSDL stands for XML Schema Definition Language. For information about W3C standards, visit the W3C website (www.w3.org).
eDoc Connections
Electronic Document (eDoc) connectors are designed to enable mapping to and from electronic document exchange messages that are based on standard protocols and structures. Most eDoc connectors have schemas that you can download from Actian ESD, or use XML schemas (that is, DTD or XSD files) that you can download from the relevant standards websites. Examples of these connectors include, but are not limited to, EDI X12, HIPAA, SWIFT, and HL7.
Data File Formats
This section explains how the transformation tools categorize the various types of data file formats. Although most users see data as it appears on the integration platform user interface, the integration platform sees the data as it is stored in the data file.
Below are the most popular data formats read by the integration platform:
Structured Formats
Semi-structured Formats
Unstructured Formats
Delimited ASCII
The list of applications in each category is not a complete listing of all the supported formats. Delimited ASCII is so varied that it deserves a category by itself. A few formats appear in two categories because of possible variations in the particular data file you are working with.
Structured Formats
Structured formats are data files in which both the data and the file structure are stored in the same file. (There may be additional memo files, such as in dBASE or xBASE.) Complete metadata should exist that determines the file structure. When the integration platform reads these data files, the data is automatically "parsed" into fields and records. It is not necessary to know the schema of the file.
Structured file formats are the easiest type to view and transform. Some of the applications in this category are dBASE, DataFlex, Excel, Goldmine, Lotus 1-2-3, Quattro Pro, SAS, SPSS, and XDB.
Semi-structured Formats
Semi-structured formats are data files in which the data and some of the file structure are stored in the data file. Some metadata exists to define the files. For instance, there is often an index file that accompanies the data file; its file extension is often .idx.
When these data files are read, the data is usually parsed into records, but not into fields. When you look at these files in a source or target browser, you see clearly defined rows (records), but the columns (fields) are not defined. You must either specify an external dictionary file from which the integration platform reads the field structure, or manually define the fields in the Data Parser.
The data is often a mixture of text, packed, and binary data. It is usually recommended that you have a schema from which to work, or at least be familiar with packed and binary data types when defining the fields. However, this is not mandatory, since the Data Parser assists you visually, and allows "trial and error" efforts.
An application in this category is ASCII (Fixed).
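To illustrate one of the packed data types mentioned above, the following Python sketch decodes a COBOL COMP-3 (packed decimal) field. The sign convention shown (a 0xD nibble marks a negative value) is the common one; the scale and sample values are hypothetical.

```python
def unpack_comp3(raw: bytes, scale: int = 0) -> float:
    """Decode packed decimal: two digits per byte, sign in the low nibble of the last byte."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)    # high nibble
        digits.append(byte & 0x0F)  # low nibble
    digits.append(raw[-1] >> 4)     # last byte holds one digit plus the sign
    sign_nibble = raw[-1] & 0x0F
    value = 0
    for d in digits:
        value = value * 10 + d
    if sign_nibble == 0x0D:         # 0xD marks a negative value
        value = -value
    return value / (10 ** scale)
```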
Unstructured Formats
Unstructured formats are data files in which only the data is stored. No metadata defines the structure of the files. There are no readable field or record separators; therefore, you must either specify an external dictionary file from which the integration platform reads the record and field structure, or manually define the fields in the Data Parser.
Delimited ASCII
Delimited ASCII data could easily fall into the category with other structured data file formats because it is indeed structured data. However, because delimited ASCII data can include so many variations, we have elected to give it a category of its own.
The rules that apply to all valid delimited ASCII data files are as follows:
Each character of the data must be one of the first 128 ANSI characters. In other words, the data must be readable – it cannot contain any binary or packed data.
Delimited ASCII data files do not contain the Null character.
Each record within the data file must contain the same number of fields. In other words, if one record contains 8 fields, every record must contain 8 fields.
There must be an identifiable record separator and field separator.
If you are not familiar with the terms "record," "field," or "Null" character, see the Glossary.
Delimited ASCII data files may or may not contain a header at the beginning of the file. If a header is present, it may contain information that is totally irrelevant to the transformation, or it may contain field names, which are relevant. If the header is irrelevant, set the Starting Offset in Source Properties to the byte length of the header so that the integration platform skips the header. If the header contains field names, set the Header property in Source Properties to True to instruct the integration platform to use that information to name the fields.
Record separators, field separators, and field delimiters may include any one or any combination of the entire 256 ANSI character set (excluding the null character). The standard for most delimited ASCII files on Microsoft Windows operating systems is to use a comma as the field separator, a carriage-return and a line feed as the record separator, and quotation marks as delimiters. If the separators and delimiters are non-standard in your file, open Source Properties and set them to the appropriate values.
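Under the standard Windows conventions just described, such a file can be read with ordinary CSV tooling. This hypothetical Python sketch takes field names from a header row:

```python
import csv
import io

def read_delimited(text):
    """Read delimited ASCII with comma field separators, quotation-mark
    delimiters, and CR/LF record separators, using the header for field names."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=",",   # field separator
        quotechar='"',   # field delimiter
    )
    return list(reader)
```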
The integration platform is designed to handle many of the possible variations of delimited ASCII data. The Hex Browser can help you determine variations in the files you are working with.
HDFS Connectivity
This section contains information on reading from and writing to a Hadoop Distributed File System (HDFS). Once the following steps are taken, DataConnect can read from and write to HDFS just like any other file system.
1. Configure your HDFS instance to be mountable. This is done by enabling the NFS gateway. Follow the basic steps in https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
2. Mount the HDFS instance onto the Linux system where you plan to run DataConnect. While logged in as root, execute the mount command with the following parameters:
mount -t nfs -o vers=3,proto=tcp,nolock <server>:/ <mount point>
For example:
mount -t nfs -o vers=3,proto=tcp,nolock 192.168.1.50:/ /mnt/hdfs1
where 192.168.1.50 is the IP address of the NameNode (HDFS instance) and /mnt/hdfs1 is the location on the client machine where you want to mount the file system.
3. Install, configure, and launch DataConnect Studio IDE on the Linux system where HDFS was mounted. You can now create datasets that connect to the mounted HDFS file locations.
Note:
Both the HDFS system and the system where DataConnect is installed should have a user group with the same name containing a shared list of users. The users should have full permissions.
If you have multiple DataConnect worker machines, each worker will need to have the HDFS instance mounted.
Consider using macros for mountable file locations.
Due to limitations in either HDFS or the NFS gateway when overwriting existing files, set the target connector's output mode to Append. If you need to delete the contents of the file before each transformation, you can use the FileDelete script function.
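As a sketch only, assuming the platform's Basic-style script syntax and a hypothetical mount path, such a FileDelete call might look like this:

```
' Hypothetical: remove the prior output before the transformation runs,
' since the Append output mode would otherwise keep extending the file.
FileDelete("/mnt/hdfs1/output/results.txt")
```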