File Connections
The following file connections are available:
Physical Data Connections
The most basic connection is from a dataset to a physical file. There are two processes for connecting to physical files:
Raw Sequential Connection Process
This process can be used to open, read, and parse data from any fixed record length sequential file, such as COBOL legacy data. This includes the ability to read ASCII or EBCDIC and text or binary data of virtually any type or style (for example, COBOL packed, reverse byte-order, old floating point formats, and blobs). You can define rules governing a particular flat file and its structure and then extract clean data records for transformation.
Because of its binary reading capability, the integration platform can extract data from unknown raw file formats. Most commercial applications store data records using fixed-length techniques. When this is the case, the integration platform can be used as a brute force extractor to open the file, navigate to the beginning of fixed data, and rip records out according to rules that you create.
For related information, see Binary.
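For illustration only, the following Python sketch shows the kind of brute-force extraction described above: stepping through a fixed-length EBCDIC file and decoding a text field and a COBOL packed-decimal (COMP-3) field from each record. The file name, record length, starting offset, and field layout are assumed for the example; this is not the integration platform's own code.

# Sketch: raw sequential extraction from a fixed-length EBCDIC file.
# The file name, 40-byte record length, and field offsets are assumed.
RECORD_LEN = 40          # every record is exactly 40 bytes
DATA_START = 128         # assumed offset where the fixed data begins

def unpack_comp3(raw: bytes) -> int:
    """Decode a COBOL packed-decimal (COMP-3) field: two digits per byte,
    with the final nibble holding the sign (0xD means negative)."""
    digits = ""
    for b in raw[:-1]:
        digits += f"{b >> 4}{b & 0x0F}"
    digits += f"{raw[-1] >> 4}"
    value = int(digits)
    return -value if (raw[-1] & 0x0F) == 0x0D else value

with open("legacy.dat", "rb") as f:          # assumed file name
    f.seek(DATA_START)                       # navigate to the start of fixed data
    while (rec := f.read(RECORD_LEN)) and len(rec) == RECORD_LEN:
        name = rec[0:20].decode("cp037").rstrip()   # EBCDIC text field
        amount = unpack_comp3(rec[20:25])           # packed-decimal field
        print(name, amount)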
Physical File Format Connection Process
This process is used to physically read and write in the native internal storage format of the file. For example, you can use this process with xBASE files. The integration platform can use this process when there is published information describing in detail the internal storage structure of the file.
The advantage of this process is that it provides fast access, high performance, and code that can be used across platforms. However, because it requires developers to keep up with changes to file formats, this process is not often used. It is, however, an option for certain file formats, particularly text formats and those that are old or obsolete.
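As a rough illustration of reading a native internal storage format, the following Python sketch pulls the record count and sizes directly out of an xBASE (.dbf) file header, whose layout is published. The file name is assumed, and the sketch is not the integration platform's own code.

# Sketch: read the documented fixed header of an xBASE/dBASE (.dbf) file.
# Byte layout per the published DBF specification:
#   offset 4-7    number of records (little-endian uint32)
#   offset 8-9    header length in bytes (little-endian uint16)
#   offset 10-11  length of each record in bytes (little-endian uint16)
import struct

with open("customers.dbf", "rb") as f:       # assumed file name
    header = f.read(32)
    num_records, header_len, record_len = struct.unpack("<IHH", header[4:12])
    print(f"{num_records} records, header={header_len} bytes, record={record_len} bytes")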
Intermediate Import and Export Format Connections
When the integration platform cannot connect directly to source or target data, it uses intermediate import and export formats for data transformation.
Although all programs, applications, and software packages store data in native internal formats, most offer some primitive level of import and export support (for example, .sdf, .dbf, .wk1) for reading and writing external data. The integration platform can read and write most dialects of intermediate file (fixed ASCII, delimited ASCII, sequential, Lotus 123, Excel, and xBASE), modifying and mapping data between unequal sources and targets to perform transformations.
Much of this work involves modifying the structure of fields (for example, adding, deleting, rearranging, and changing field sizes) and modifying the data itself (for example, parsing names, formatting dates, and expanding codes). The integration platform provides a visual interface for performing this work quickly.
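As a simple illustration of this kind of restructuring outside the visual interface, the following Python sketch rearranges fields and reformats a date while copying one delimited ASCII file to another. The file names, field names, and date formats are assumed for the example.

# Sketch: restructure an intermediate delimited ASCII file.
# Assumed input columns: last_name, first_name, hire_date (MM/DD/YYYY)
# Assumed output columns: full_name, hire_date (YYYY-MM-DD)
import csv
from datetime import datetime

with open("export.csv", newline="") as src, open("import.csv", "w", newline="") as tgt:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(tgt, fieldnames=["full_name", "hire_date"])
    writer.writeheader()
    for row in reader:
        writer.writerow({
            "full_name": f"{row['first_name']} {row['last_name']}",          # merge fields
            "hire_date": datetime.strptime(row["hire_date"], "%m/%d/%Y")
                                 .strftime("%Y-%m-%d"),                      # reformat date
        })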
The following topics provide details about how the integration platform works with intermediate import and export file formats.
Importing Through Intermediate Data
When the integration platform cannot write directly to the native format of a target application, you can import through intermediate data if the application has at least a minimal facility for importing external data files. Using such import capabilities enables you to insert data into complex internal native file systems and to perform indexing, transaction control, audit trail updates, and other advanced tasks. However, to import the data, target applications often require that it first be organized in particular ways.
The integration platform provides a utility that cleans and formats data for batch loading into SQL databases such as Oracle, SQL Server, and Sybase. Although the integration platform can transform data directly into SQL targets (with the Native API or ODBC), SQL inserts of individual records into a database table are slow and therefore not practical for large amounts of data.
Most SQL databases have an offline batch loading facility (for example, BCP for Sybase and SQL Server, SQL*Loader for Oracle). These loader utilities are optimized for the mass insertion of large amounts of data directly into database tables, but they demand that incoming data be formatted in specific ways. You can use the integration platform to map, manipulate, and transform the foreign source data into a format suitable for feeding into batch loader tools.
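For illustration, the following Python sketch cleans source records into a pipe-delimited file of the kind a bulk loader such as BCP or SQL*Loader can consume. The file names, delimiter, and cleanup rules are assumed, and the loader control file itself is not shown.

# Sketch: prepare foreign source data for an offline bulk loader.
# Assumed requirements: pipe-delimited fields, no embedded pipes or newlines,
# and trimmed text values.
import csv

with open("source.csv", newline="") as src, open("load_ready.dat", "w", newline="") as tgt:
    for row in csv.reader(src):
        cleaned = [col.strip().replace("|", " ").replace("\n", " ") for col in row]
        tgt.write("|".join(cleaned) + "\n")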
Exporting Through Intermediate Data
When the integration platform cannot read data directly from the native format of a source application, you can use its ability to export data if the source application has a minimal facility for exporting data files. However, exported data often needs to be transformed into formats that downstream applications can use. You can use the integration platform to transform the data as needed for downstream use.
XML Support
You can read and write all dialects of XML, whether or not you have DTDs or schemas to define the file structure. Simply use XML as your source or target connector and optionally specify any DTD or schema files in the Source or Target Schema box.
The following list does not include every XML flavor that the integration platform supports. You may be familiar with an industry standard that is not listed here. If the markup language uses the basic elements of standard XML markup, it is probably compatible with the XML metadata connector.
Actian DataConnect supports XML, XML Schema, and DTD Editors from the Eclipse Web Tools Platform. For information about the XML Editors, see the Eclipse Web Tools Platform documentation.
To find standards and specification information on any of the markup languages listed below, use your favorite Web search engine.
• ACORD XML – Association for Cooperative Operations Research and Development
• Adex – Document type for newspaper classified ads
• ADML – Astronomical Dataset Markup Language
• AdXML – Advertising for XML
• AIML – Astronomical Instrument Markup Language
• AnthroGlobe Schema – Social Sciences metadata format
• AppML – Application Markup Language
• BoleroXML – Bolero.net's cross-industry XML standard
• BSML – Bioinformatic Sequence Markup Language
• CDF – Channel Definition Format
• CIDS – Component Information Dictionary Standard
• CIDX – Chemical Industry Data Exchange
• CML – Chemical Markup Language
• cXML – Commerce XML
• CoopML – Cooperation Markup Language
• CWMI – Common Warehouse Metamodel Interchange
• DAML+OIL – DARPA Agent Markup Language + Ontology Inference Layer
• DITA – Darwin Information Typing Architecture
• DocBook XML – DTD for books, computer documentation
• ebXML – Electronic Business Markup Language
• Eda XML – Electronic Design Automation XML
• esohXML – Environmental, Safety and Occupational Health XML
• FpML – Financial products Markup Language
• GDML – Geometry Description Mark-up Language
• GEML – Gene Expression Markup Language
• HumanML – Human Markup Language
• JSML – JSpeech Markup Language
• LMML – Learning Material Markup Language
• LOGML – Log Markup Language
• MathML – Mathematical Markup Language
• MCF – Meta Content Framework
• MoDL – Molecular Dynamics Language
• MusicXML – Music XML
• NetBeans XML Project – Open source XML Editor for NetBeans Integrated Development Environment
• NewsML – News Markup Language
• NITF – News Industry Text Format
• OMF – Weather Observation Definition Format
• OAGI – Open Applications Group
• OpenOffice.org XML Project – Sun Microsystems' XML file format used in the StarOffice suite
• OSD – Open Software Description Format
• PetroXML – Oil and Gas Field Operations DTD
• P3P – Platform for Privacy Preferences
• PMML – Predictive Model Markup Language
• QEL – Quotation Exchange Language
• rezML – XML DTD and style sheets for Resume and Job Listing structures
• SMBXML – Small and Medium Business XML
• SML – Spacecraft Markup Language
• TranXML – XML Transportation & Logistics
• UBL – Universal Business Language
• UGML – Unification Grammar Markup Language
• VCML – Value Chain Markup Language
• VIML – Virtual Instruments Markup Language
• VocML – Vocabulary Markup Language
• WSCI – Web Service Choreography Interface
• X3D – Extensible 3D
• XBEL – XML Bookmark Exchange Language
• XBRL – eXtensible Business Reporting Language
• XFRML – Extensible Financial Reporting Markup Language
• XGMML – Extensible Graph Markup and Modeling Language
• XLF – Extensible Logfile Format
• XML/EDI Group – XML and Electronic Data Interchange (EDI) ebusiness initiative
• XVRL – Extensible Value Resolution Language
XML Schema
An XML Schema is a file that defines the structure and content of an XML document.
XML Schema provides data typing, which enables you to define data by type (for example, character or integer). Schema reuse, or schema inheritance, enables tags referenced in one schema to be used in other schemas. XML Schema namespaces enable multiple schemas to be combined into one. Global attributes within XML Schema assign properties to all elements. XML Schema also allows you to associate Java classes for additional data processing, and authoring information provides improved documentation for schema designers.
Unlike DTD files, XML schemas are written in XML syntax. Although more verbose than DTD files, XML schemas can be created with any XML tool. Examples of XML schemas can be found at www.xml.org.
In XML Schema, there is a basic difference between complex types, which allow elements in their content and may carry attributes, and simple types, which cannot have element content and cannot carry attributes. There is also a major distinction between definitions that create new types (both simple and complex), and declarations that enable the appearance in document instances of elements or attributes with specific names and types (both simple and complex).
The following simple types are currently supported for XML Schema connections:
• binary
• boolean
• byte
• century
• date
• decimal
• double
• ENTITIES
• ENTITY
• float
• ID
• IDREF
• IDREFS
• int
• integer
• language
• long
• month
• Name
• NCName
• negativeInteger
• NMTOKEN
• NMTOKENS
• nonNegativeInteger
• nonPositiveInteger
• NOTATION
• positiveInteger
• QName
• recurringDate
• recurringDay
• recurringDuration
• short
• string
• time
• timeDuration
• timeInstant
• timePeriod
• unsignedByte
• unsignedInt
• unsignedLong
• unsignedShort
• uriReference
• year
Within XML Schema, new simple types are defined by derivation from existing simple types (built-ins and derived) through a technique called restriction. A new type must have a name different from the existing type, and the new type may constrain the legal range of values obtained from the existing type by applying one or more facets.
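The following Python sketch illustrates such a derivation by restriction. The schema content, element names, and facet values are invented for the example, and the sketch uses the third-party lxml library (not part of the integration platform) to validate two sample documents against the derived type.

# Sketch: derive a new simple type from xs:string by restriction with facets,
# then validate instance documents against it using lxml (pip install lxml).
from lxml import etree

XSD = b"""<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- New simple type: a string constrained to exactly two uppercase letters -->
  <xs:simpleType name="StateCode">
    <xs:restriction base="xs:string">
      <xs:length value="2"/>
      <xs:pattern value="[A-Z]{2}"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="State" type="StateCode"/>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))
print(schema.validate(etree.fromstring(b"<State>TX</State>")))     # True: within the facets
print(schema.validate(etree.fromstring(b"<State>Texas</State>")))  # False: violates the facets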
XML Schema simple types are supported to the extent that the integration platform recognizes them and uses a type with that name. However, the integration platform has no way to handle many of the simple types, such as timePeriod and century, so it uses those types in name only within your transformations.
XML Schema (MS XDR Biztalk)
BizTalk uses XML to describe schemas using a notation convention called XML-Data Reduced (XDR), the schema format that most recent Microsoft XML schema parsers understand. The advantage of using XML to describe a schema, as XDR conventions dictate, is that you can apply your existing XML skills to describing the schema.
XDR also offers advantages over other forms of contract description in use, such as DTDs. Schemas allow you to describe the names of tags and the order of tags. You can also specify that the content of one element is a string of characters, and that another element is actually a number.
To set the XDR Schema for source or target files
1. In the map window, click the down arrow to the right of the Source or Target Structured Schema box and select XDR (XML Data Reduced) as the External Type.
2. Navigate to the schema file you want to use.
3. After you have connected to the file, two options appear: Replace Existing Layout(s) and Append to Existing Layout(s). Select one and click OK to save your selection.
Element and Attribute
XML documents consist of elements and attributes. If a field is an attribute, its Default Expression contains the attribute type (attribute field name). However, the Default Expression of an element field is empty.
Attribute of an Element
An element type may contain attributes. In this case, the attribute field is named element field name (element type)_attribute field name (attribute type). Because one element field may have the same name as another in the same record, this naming convention avoids conflicts.
XML Schema (W3C XSDL)
This XML Schema is the current standard of the World Wide Web Consortium (W3C). XSDL stands for XML Schema Definition Language. For information on W3C standards, visit the W3C website.
eDoc Connections
Electronic Document (eDoc) connectors are designed to enable mapping to and from different electronic document exchange messages based on standard protocols and structures. Most eDoc connectors have schemas that you can download from Actian ESD, or they leverage XML schemas (that is, DTD or XSD files) that you can download from the relevant standards websites. Examples of these connectors include, but are not limited to, EDI X12, HIPAA, SWIFT, and HL7.
Data File Formats
This section explains how the transformation tools categorize the various types of data file formats. Although most users see data as it appears in the integration platform user interface, the integration platform sees the data as it is stored in the data file.
Below are the most popular data formats read by the integration platform:
• Structured Formats
• Semi-structured Formats
• Unstructured Formats
• Delimited ASCII
The list of applications in each category is not a complete listing of all the supported formats. Delimited ASCII is so varied that it deserves a category by itself. A few formats appear in two categories because of possible variations in the particular data file you are working with.
Structured Formats
Structured formats are data files in which both the data and the file structure are stored in the same file. (There may be additional memo files, such as in dBASE or xBASE.) Complete metadata exists that determines the file structure. When the integration platform reads these data files, the data is automatically "parsed" into fields and records. It is not necessary to know the schema of the file.
Structured file formats are the easiest type to view and transform. Some of the applications in this category are dBASE, Excel, Quattro Pro, SAS, and XDB.
Semi-structured Formats
Semi-structured formats are data files in which the data and some of the file structure are stored in the data file. Some metadata exists to define the files. For instance, there is often an index file that accompanies the data file; its file extension is often .idx.
When these data files are read, the data is usually parsed into records, but not into fields. When you look at these files in a source or target browser, you see clearly defined rows (records), but the columns (fields) are not defined. You must either specify an external dictionary file from which the integration platform reads the field structure, or manually define the fields in the Data Parser.
The data is often a mixture of text, packed, and binary data. It is usually recommended that you have a schema from which to work, or at least be familiar with packed and binary data types when defining the fields. However, this is not mandatory, since the Data Parser assists you visually and allows trial-and-error efforts.
An application in this category is ASCII (Fixed).
Unstructured Formats
Unstructured formats are data files in which only the data is stored. No metadata defines the structure of the files. There are no readable field or record separators; therefore, you must either specify an external dictionary file from which the integration platform reads the record and field structure, or manually define the fields in the Data Parser. For more information, see Using Extract Editor.
Delimited ASCII
Delimited ASCII data could easily fall into the category with other structured data file formats because it is indeed structured data. However, because delimited ASCII data can include so many variations, we have elected to give it a category of its own.
The rules that apply to all valid delimited ASCII data files are as follows:
• Each character of the data must be one of the first 128 ANSI characters. In other words, the data must be readable – it cannot contain any binary or packed data.
• Delimited ASCII data files do not contain the Null character.
• Each record within the data file must contain the same number of fields. In other words, if one record contains 8 fields, every record must contain 8 fields.
• There must be an identifiable record separator and field separator.
If you are not familiar with the terms "record," "field," or "Null" character, see the Glossary.
Delimited ASCII data files may or may not contain a header at the beginning of the file. If a header is present, it may contain information that is totally irrelevant to the transformation, or it may contain field names, which are relevant. If the header is irrelevant, set the Starting Offset in Source Properties to the byte length of the header so that the integration platform skips the header. If the header contains field names, set Header in Source Properties to True to instruct the integration platform to use that information to name the fields.
Record separators, field separators, and field delimiters may be any single character or combination of characters from the 256-character ANSI set (excluding the null character). The standard for most delimited ASCII files on Microsoft Windows operating systems is a comma as the field separator, a carriage return and line feed as the record separator, and quotation marks as delimiters. If the separators and delimiters are non-standard in your file, open Source Properties and set them to the appropriate values.
The integration platform is designed to handle many of the possible variations of delimited ASCII data. The Hex Browser can help you determine variations in the files you are working with.
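As an illustration of handling one such variation, the following Python sketch reads a delimited ASCII file that uses a semicolon field separator and quotation-mark delimiters rather than the Windows defaults. The file name, separator, and delimiter are assumed for the example.

# Sketch: read delimited ASCII with non-standard separators.
# Assumed file: semicolon-separated fields, quotation-mark delimiters,
# carriage-return/line-feed record separators, first record holds field names.
import csv

with open("orders.txt", newline="") as f:        # assumed file name
    reader = csv.DictReader(f, delimiter=";", quotechar='"')
    for record in reader:
        print(record)                            # each record is parsed into named fields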
HDFS Connectivity
This topic contains information on reading from and writing to a Hadoop Distributed File System (HDFS). After the following steps are taken, Actian DataConnect can read and write to HDFS just like any other file system.
1. Set up the HDFS instance with its NFS gateway enabled so that the file system can be exported over NFS.
2. Mount the HDFS instance onto the Linux system where you plan to run Actian DataConnect. Log in as root and execute the mount command with the following parameters:
mount -t nfs -o vers=3,proto=tcp,nolock <server>:/ <mount point>
For example:
mount -t nfs -o vers=3,proto=tcp,nolock 192.168.1.50:/ /mnt/hdfs1
Where 192.168.1.50 is the IP address of the Name Node (HDFS instance), and /mnt/hdfs1 is the location on the client machine where you’d like to mount the file system.
3. Install, configure, and launch the Actian DataConnect Studio IDE on the Linux system where HDFS was mounted. You will now be able to create datasets that can connect to the HDFS mounted file locations.
Note:
• Both the HDFS system and the system where Actian DataConnect is installed should have a user group with the same name containing a shared list of users. The users should have full permissions.
• If you have multiple Actian DataConnect worker machines, each worker will need to have the HDFS instance mounted.
• Consider using macros for mountable file locations.
• Because of limitations in either HDFS or the NFS gateway when overwriting existing files, the target connector's output mode should be set to Append. If you need to delete the contents of the file before each transformation, you can use the FileDelete script function.