Read I/O Operators
The DataFlow operator library includes several pre-built Input/Output operators. This section covers the Read operators and provides details on how to use them. For more information, refer to the following topics:
ReadAvro Operator
The
ReadAvro operator reads a data file previously written using the
Apache Avro serialization format. The Avro format is a commonly used binary format that offers data compression and the ability to be parsed in parallel. Metadata about the data, such as its schema and compression format, is serialized into the file, making it available to readers.
The operator will translate the Avro schema into appropriate DataFlow types when possible—some schemas are not supported for reading, as described later.
As DataFlow operates on records, it is generally expected that the source data will have a RECORD schema type. If this is not the case, the operator treats the schema as if it were a record with a single field named "value."
The output record type will have fields with the same names and in the same order as the source schema. Output fields are assigned a type based on the schema of the source field with the same name. In general, Avro schema types are assigned DataFlow types according to the following table.
For types not listed previously, the schema type may or may not be mapped to a DataFlow type. If attempting to read a source with schema types that cannot be mapped to DataFlow types, the operator will produce an error. The conditions under which other schema types are supported are as follows:
• Source fields with ARRAY or MAP schema types are never supported.
• Source fields with a RECORD schema type are supported only when reading Avro data written using the
WriteAvro operator; fields with DataFlow types that do not have analogues in Avro are written as nested records. Source fields using these same schemas will be mapped back into the original DataFlow type.
• Source fields with a UNION schema type are supported only if the union contains exactly two schema types, one of which is NULL. In this case, the type is determined using the non-NULL schema type of the union.
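As a sketch, an Avro schema of the supported nullable-union form looks like the following (the record and field names here are illustrative, not taken from the product):

```json
{
  "type": "record",
  "name": "Rating",
  "fields": [
    {"name": "userID", "type": "int"},
    {"name": "rating", "type": ["null", "int"], "default": null}
  ]
}
```

The rating field's two-branch union of null and int maps to a nullable DataFlow INT; a union of two non-NULL types, such as ["string", "int"], would not be supported.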
For information about creating files containing data in Avro format using DataFlow, see
WriteAvro Operator.
When reading Avro files written by DataFlow, there may be additional metadata information about the data embedded within the files. If the reader has been configured to use this metadata, then it can obtain information about the ordering and partitioning of the data when it was written, which can eliminate the need to re-sort or partition the data.
Code Examples
Because Avro files are self-contained with respect to metadata, it is generally not necessary to provide any information other than the location of the data. Following is an example use of the operator in Java.
Using the ReadAvro operator in Java
ReadAvro reader = graph.add(new ReadAvro("data/ratings.avro"));
The following example demonstrates using the Avro reader within RushScript.
var data = dr.readAvro({source:'data/ratings.avro'});
Properties
The
ReadAvro operator supports the following properties:
Ports
The
ReadAvro operator provides a single output port.
ReadORC Operator
The
ReadORC operator reads a data file previously written using the
Apache Optimized Row Columnar (ORC) File format. The ORC format is supported by Apache Hive.
The ORC format is column-oriented and self-describing. An ORC file is divided into stripes, or sections, and within each stripe the data is stored by column. An internal index tracks the data for each column within a stripe. This organization allows readers to efficiently skip the columns that are not required. Also, each column can apply a different compression method depending on its data type. Metadata about the ORC data, such as the schema and compression format, is serialized into the file and made available to readers.
The operator translates the ORC file schema into appropriate DataFlow types when possible. A few ORC data types are not supported for reading; columns with unsupported data types are omitted. The output record type has fields with the same names and in the same order as the source schema. Each output field is assigned a type based on the schema of the source field with the same name.
In general, DataFlow types are assigned to the ORC schema types as shown in the following table.
Several ORC types are not supported by DataFlow. If these types are found in an ORC file, they are ignored, and the reader logs a message for each column omitted because of an unsupported data type. The following ORC data types are not supported:
• LIST
• MAP
• STRUCT
• UNION
Column Pruning
Because the ORC format is columnar, we recommend limiting the columns read to only those required for downstream processing. Use the selectedFields property to specify the fields to read. For more information, see
Properties. The ORC columns not included in the list are not read. This optimization provides a performance boost, especially for files containing a large number of columns.
Note: Before running the workflow, ensure that the client configuration and the jar files are added to the classpath. For more information, see
Integrating DataFlow with Hadoop.
You must enable the datarush-hadoop-apache3 module to read ORC files from S3A and ABFS locations.
Code Examples
Because ORC files are self-contained with respect to metadata, it is generally not necessary to provide any information other than the location of the data. Because the ORC format is columnar, reducing the columns read can enhance performance; use the selectedFields property to specify the columns to read from a given ORC data set.
The following example demonstrates reading ORC file data using Java.
Using ReadORC in Java
ReadORC reader = graph.add(new ReadORC("data/ratings.orc"));
Using ReadORC in RushScript
var data = dr.readORC({source:'data/ratings.orc'});
Properties
The
ReadORC operator supports the following properties.
Ports
The
ReadORC operator provides a single output port.
ReadMDF Operator
This topic describes the DataFlow MDF read operator. For information about its KNIME node, see
MDF Reader.
The ReadMDF operator reads a data file previously written using the
ASAM MDF format. The MDF format is supported and maintained by ASAM.
MDF, or Measurement Data Format, is a binary file format used to store recorded and calculated data. It is frequently used in post-measurement processing, off-line evaluation, and long-term storage.
It offers efficient and high performance storage of large amounts of measurement data. The file format allows the storage of the raw measurement data along with associated metadata and corresponding conversion formulas so that the raw data can still be interpreted correctly and utilized through post-processing.
The operator will translate the MDF schema into appropriate DataFlow types whenever possible, although because of the frequent usage of unsigned types in MDF data, sometimes the type used by DataFlow must be wider than the original type specified in the metadata to prevent loss of scale or precision.
Since DataFlow operates on the concept of homogeneous data records within a given flow, a ReadMDF operator can extract only one record type from the file at a time, although multiple ReadMDF operators can read multiple record types concurrently from the same file.
The output record type will have fields with the names and types determined by the metadata provided in the file. The ordering of the fields will also correspond to the declaration order within the metadata, with the exception that the master channel will always be the first field even if it is not defined first.
The operator currently supports only primitive types that have an analog within DataFlow; therefore, the extraction of MIME data in various media formats is not supported.
Code Examples
Since MDF files are self-contained with respect to metadata, it is generally not necessary to provide any information other than the location of the data and the data group containing the specified record that should be extracted. Following is an example use of the operator in Java.
Using the ReadMDF operator in Java
ReadMDF reader = graph.add(new ReadMDF("data/output.mf4"));
reader.setDataChannel(1);
reader.setRecordId(1);
The following example demonstrates using the MDF reader within RushScript:
var data = dr.readMDF({source:'data/output.mf4', dataChannel:1, recordId:1});
Properties
The ReadMDF operator supports the following properties:
Additional properties are shared with delimited text.
Ports
The ReadMDF operator provides a single output port.
ReadParquet Operator
The
ReadParquet operator reads data previously written using the
Apache Parquet format. The Parquet format is supported by Apache Hive.
Parquet is a columnar file format used to store tabular data. Parquet supports efficient compression and encoding schemes and allows the compression scheme to be specified at the column level. It supports:
• Open source projects such as Apache Hadoop (MapReduce), Apache Hive, and Impala, as it presents the data in columnar format
• Compression codecs such as SNAPPY, GZIP, and LZO. The design allows integration with future codecs.
The ReadParquet operator uses Hive libraries through the shim layer and requires a Hadoop module configuration to be enabled, even if the workflow does not run on the cluster or access HDFS.
DataFlow automatically determines the equivalent data types from Parquet; the result is the output type of the reader. However, Parquet and DataFlow support different data types, and not all data in Parquet format can be read. If the operator attempts to read data that cannot be represented in DataFlow, an error is returned.
The primitive Parquet types are mapped to DataFlow as shown in the following table.
Note: Before running the workflow, ensure that the client configuration and the jar files are added to the classpath. For more information, see
Integrating DataFlow with Hadoop.
You must enable the datarush-hadoop-apache3 module to read Parquet files from S3A locations.
Code Examples
The following example demonstrates reading Parquet file data using Java.
Using ReadParquet in Java
// Path to read entire Hive table
ReadParquet reader = new ReadParquet("hdfs://10.100.10.41:8020//apps/hive/warehouse/CityParquet");
// Path to read specific partition of Hive table from HDFS
// ReadParquet reader = new ReadParquet("hdfs://10.100.10.41:8020//apps/hive/warehouse/CityParquet/000000_0");
// Path to read parquet file from Local file system
// ReadParquet reader = new ReadParquet("C:/Parquet/Cities.parquet");
graph.add(reader);
Using ReadParquet in RushScript
var data = dr.readParquet({source:"hdfs://10.100.10.41:8020//apps/hive/warehouse/CityParquet/000000_0"});
Properties
The
ReadParquet operator supports the following properties.
Ports
The
ReadParquet operator provides a single output port.
ReadFromJDBC Operator
The
ReadFromJDBC operator accesses relational database systems using a supplied JDBC driver. The JDBC driver must be in the class path of the DataFlow application. Each database provides a JDBC driver implementation that DataFlow can use to access data in the database. Reference the specific database to be accessed for driver-specific information.
The ReadFromJDBC operator can be used to read all of the columns from a specific table or to execute a provided query. The query provided can be a complex, multitable query. Follow the syntax guidelines of the database being queried.
The results of the query will be made available to the output port of the operator. The operator transforms the database column types to supported DataFlow scalar types. Some database-specific data types may not map well and will either be ignored or mapped to Java Object types.
The results of a database query executed through JDBC are returned through a ResultSet object. The ResultSet is used to iterate through the resultant rows and to access column data. The JDBC ResultSet class does not support multithreaded access. Given that, the default behavior of the ReadFromJDBC operator is to execute in nonparallel mode when provided a nonparameterized query.
To execute queries in parallel (and distributed), the ReadFromJDBC operator supports the use of parameterized queries. JDBC supports adding parameters to a query using the "?" character. Following is an example of a parameterized query. Note the use of the "?" character in the "where" clause.
Example of query with parameters
select * from lineitem where l_shipmode = ?
When used as the data query for the ReadFromJDBC operator, a parameterized query can be executed in parallel. A set of parameters must be supplied to the parallel workers executing the parameterized queries. The parameters can be supplied in one of the following ways:
• Through the optional input port of the ReadFromJDBC operator.
• Obtained by a parameter query supplied as a property to the operator ("parameterQuery"). The query is executed and the results are used as parameters to the parameterized query.
• An array of values is passed as a property to the ReadFromJDBC operator ("parameters").
Here is an example of a parameter query:
Query to gather parameters
select distinct l_shipmode from lineitem
Note that the parameter query is selecting a distinct set of values from the lineitem table. The values will be substituted for the "?" in the parameterized query.
The parameters are handled the same whether they are provided directly as objects, read from the input port, or queried through the parameter query. For each set (row) of parameters, the following occurs:
• The parameters are substituted within the parameterized data query. From our example, one of the parameter values is "RAIL". When substituted within the example data query, the resultant query is equivalent to select * from lineitem where l_shipmode = 'RAIL'.
• The query with the substituted parameters is executed against the database.
• The results of the query are streamed to the output of the operator.
When used with a parameterized query and provided query parameters, the ReadFromJDBC operator operates in parallel by creating multiple workers. The query parameters are distributed to the workers in round-robin fashion. The workers execute the data query after applying parameter substitution as described above.
The order of the parameter value is important. The order must match the order of the parameters (the "?") in the data query. This is true for parameter values from the optional input port, provided as objects or from the parameter query. The ReadFromJDBC operator does not have the context to determine which parameter values match which parameters. The ordering of the parameter values is left to the user.
When using a parameterized query, the number of parameter values provided must match the number of parameters in the query. If there is a mismatch in sizes, an exception will be raised and the operator will fail.
To obtain the best performance, the number of sets of query parameters should be greater than the configured parallelism. In our example parameter query, only 7 values are returned. In this case, having parallelism set to anything greater than 7 will be wasteful. Additional streams of execution will have no data to process.
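The round-robin distribution of parameter rows to workers described above can be sketched in plain Java (this is an illustrative sketch, not the DataFlow implementation; the seven ship-mode values shown are assumed example data):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of round-robin distribution of query parameter values to
// parallel workers, as described for the ReadFromJDBC operator.
public class RoundRobinSketch {
    // Assigns each parameter value to a worker by its index modulo
    // the number of workers.
    public static List<List<String>> distribute(List<String> params, int workers) {
        List<List<String>> assigned = new ArrayList<>();
        for (int i = 0; i < workers; i++) assigned.add(new ArrayList<>());
        for (int i = 0; i < params.size(); i++) {
            assigned.get(i % workers).add(params.get(i));
        }
        return assigned;
    }

    public static void main(String[] args) {
        // Seven example values, as in the parameter query above
        List<String> modes = Arrays.asList("RAIL", "AIR", "REG AIR", "TRUCK", "MAIL", "SHIP", "FOB");
        // With 3 workers, the 7 values are spread 3/2/2; with more than
        // 7 workers, some workers would receive no values at all.
        System.out.println(distribute(modes, 3));
    }
}
```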
Writing to database tables can be accomplished with the
WriteToJDBC Operator.
Code Examples
The following example demonstrates using the
ReadFromJDBC operator to access data using the provided SQL statement. Setting the fetch size and the SQL warning limit properties is optional. Default settings will be used if they are not set explicitly.
Either a table name or a SQL statement to execute can be specified. Using a table name is equivalent to using the statement select * from tableName. In this example, a table name is specified.
Using the ReadFromJDBC Operator in Java
ReadFromJDBC reader = graph.add(new ReadFromJDBC());
reader.setDriverName("com.mysql.jdbc.Driver");
reader.setUrl("jdbc:mysql://dbserver:3306/test");
reader.setUser("test");
reader.setPassword("test");
reader.setSqlWarningLimit(20);
reader.setTableName("tpchorders");
Using the ReadFromJDBC Operator in RushScript
var data = dr.readFromJDBC({driverName:'com.mysql.jdbc.Driver', url:'jdbc:mysql://dbserver:3306/test', user:'test', password:'test', sqlWarningLimit:20, tableName:'tpchorders'});
The following example uses a SQL statement directly. Using the SQL statement allows selection of only the desired fields. A complex statement can also be used to join tables together and have the results presented as a single output data set.
Specifying a SQL statement
ReadFromJDBC reader = graph.add(new ReadFromJDBC());
reader.setDriverName("com.mysql.jdbc.Driver");
reader.setUrl("jdbc:mysql://dbserver:3306/test");
reader.setUser("test");
reader.setPassword("test");
reader.setDataQuery("select o_orderkey, o_orderdate, o_totalprice from totalorders");
This example demonstrates using a parameterized query in Java:
Using a parameterized query in Java
ReadFromJDBC reader = graph.add(new ReadFromJDBC());
reader.setDriverName("com.mysql.jdbc.Driver");
reader.setUrl("jdbc:mysql://dbserver:3306/test");
reader.setUser("test");
reader.setPassword("test");
reader.setSqlWarningLimit(20);
reader.setDataQuery("select * from lineitem where l_shipmode = ?");
reader.setParameterQuery("select distinct l_shipmode from lineitem");
This example demonstrates using a parameterized query in RushScript:
Using a parameterized query in RushScript
var data = dr.readFromJDBC({
driverName:'com.mysql.jdbc.Driver',
url:'jdbc:mysql://dbserver:3306/test',
user:'test',
password:'test',
sqlWarningLimit:20,
dataQuery:'select * from lineitem where l_shipmode = ?',
parameterQuery:'select distinct l_shipmode from lineitem'});
The driver name and URL format are specific to each JDBC driver. See the documentation of the specific database being used for more information on these values.
Properties
The
ReadFromJDBC operator supports the following properties:
Ports
The
ReadFromJDBC operator supports one optional input port. This port is used to provide parameters to a parameterized data query.
The
ReadFromJDBC operator provides one output port:
ReadDelimitedText Operator
The
ReadDelimitedText operator reads a text file of delimited records as record tokens. Records are identified by the presence of a non-empty, user-defined record separator sequence between each individual record. Output records contain the same fields as the input text. The reader can also filter and/or reorder the fields of the output as necessary.
Delimited text supports up to three distinct user-defined sequences within a record, used to identify field boundaries:
• a field separator: found between individual fields. By default, this is the comma character (,).
• a field start delimiter: marking the beginning of a field value. By default, this is the double quote character (").
• a field end delimiter: marking the end of a field value. By default, this is the double quote character (").
The field separator cannot be empty. The start and end delimiters can be the same value. They can also both (but not individually) be empty, signifying the absence of field delimiters. It is not expected that all fields start and end with a delimiter, though if one starts with a delimiter it must end with one. Fields containing significant characters, such as whitespace and the record and field separators, must be delimited to avoid parsing errors. Should a delimited field need to contain the end delimiter, it is escaped from its normal interpretation by duplicating it. For instance, the value "ab""c" represents a delimited field value of ab"c.
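The doubled-delimiter escape can be sketched as follows (an illustrative sketch, not the DataFlow parser):

```java
// Sketch of undoing the doubled end-delimiter escape in a delimited
// field value, e.g. "ab""c" -> ab"c with the double quote delimiter.
public class DelimiterEscapeSketch {
    // Strips the surrounding delimiters and collapses each doubled
    // end delimiter back to a single occurrence.
    public static String unescape(String field, char delim) {
        if (field.length() >= 2 && field.charAt(0) == delim
                && field.charAt(field.length() - 1) == delim) {
            String inner = field.substring(1, field.length() - 1);
            String d = String.valueOf(delim);
            return inner.replace(d + d, d);
        }
        return field; // undelimited fields are taken as-is
    }

    public static void main(String[] args) {
        System.out.println(unescape("\"ab\"\"c\"", '"')); // prints ab"c
    }
}
```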
The reader supports incomplete specification of the separators and delimiters. By default, it will attempt to automatically discover these values based on analysis of a sample of the file. We strongly suggest that this discovery ability not be relied upon if these values are already known, as it cannot be guaranteed to produce desirable results in all cases.
The reader requires a schema to provide parsing and type information for the fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed through the API provided, although this metadata is often persisted externally. The
StructuredSchemaReader class provides support for reading in Pervasive Data Integrator structured schema descriptors (.schema files) for use with readers. Schemas can also be generated from
Record Token Types by using the
TextRecord.convert methods.
Because delimited text has explicit field markers, it is also possible to perform automated discovery of the schema based on the contents of the file; the reader provides a pluggable discovery mechanism to support this functionality. Custom mechanisms must implement the
TextRecordDiscoverer interface. Two implementations are provided in the operator library:
• A mechanism using pattern matching for determining field type. Values for a field are compared to the patterns; any patterns which do not match a field are discarded as possibilities. If multiple possibilities exist and the conflict is between numeric types (for example, integers and doubles), the wider of the two is chosen. Otherwise, conflicts are resolved by treating the field as a string. This is the default mechanism used by the operator.
The set of patterns used can be extended by providing additional patterns when setting the schemaDiscovery property. Alternatively, the
TextRecord.extendDefault method can be used to create a new discoverer using the supplied patterns in addition to the defaults. If the default patterns should not be included, create a
PatternBasedDiscovery object directly, specifying only the desired patterns.
• A mechanism that treats all fields as “raw” strings—that is, without white space trimming and not treating the empty string as NULL. Use the
TextRecord.TEXT_FIELD_DISCOVER constant to reference this mechanism.
Both built-in schema discoverers will generate a schema having as many fields as the longest analyzed row. Both use the header row, if present, to name the schema’s fields. Repetitions of the same name will be resolved by adding a suffix to avoid collision; any missing names will be generated as field<n>, where <n> is the field’s index in the schema.
Typically, the output of the reader includes all records in the file, both those with and without parsing errors. Fields that cannot be parsed are null-valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
Delimited text data may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. The reader must be told whether a header row exists. If it does, the parser will skip the header row; otherwise the first row is treated as a record and will appear in the output. If a header row does exist and any of the field names are blank, a field name will be generated. Generated field names take the form “fieldN” where N is the zero-based position of the field.
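The header-name handling described above can be sketched as follows. This is an illustrative sketch, not the DataFlow implementation; in particular, the exact suffix scheme used to resolve name collisions is an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of resolving field names from a header row: blank names are
// generated as "fieldN" (N is the zero-based position), and repeated
// names receive a numeric suffix to avoid collision (assumed scheme).
public class HeaderNamesSketch {
    public static List<String> resolve(List<String> header) {
        List<String> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            if (name == null || name.isEmpty()) name = "field" + i;
            String candidate = name;
            int n = 2;
            while (!seen.add(candidate)) candidate = name + n++;
            out.add(candidate);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [id, name, field2, name2]
        System.out.println(resolve(Arrays.asList("id", "name", "", "name")));
    }
}
```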
Delimited text files can be parsed in parallel under “optimistic” assumptions: namely, that parse splits do not occur in the middle of a delimited field value or immediately before an escaped record separator. This is assumed by default but can be disabled, with an accompanying reduction in scalability and performance.
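The idea behind optimistic splitting can be sketched as follows (an illustrative sketch, not the DataFlow implementation): a worker assigned an arbitrary split offset skips forward to just past the next record separator, so parsing always begins at a record boundary, while the previous worker reads through the record that straddles the boundary. This scan is only safe under the optimistic assumption that the offset does not fall inside a delimited field value containing the separator.

```java
// Sketch of locating the first record boundary at or after a split offset.
public class SplitBoundarySketch {
    public static int firstRecordStart(String data, int splitOffset, String recordSep) {
        if (splitOffset == 0) return 0; // the first split starts at a record boundary
        int sep = data.indexOf(recordSep, splitOffset);
        // If no separator remains, the whole tail belongs to the previous worker.
        return sep < 0 ? data.length() : sep + recordSep.length();
    }

    public static void main(String[] args) {
        String data = "a,1\nb,2\nc,3\n";
        // A split offset of 5 falls inside the record "b,2", so this
        // worker starts parsing at index 8, the start of "c,3".
        System.out.println(firstRecordStart(data, 5, "\n"));
    }
}
```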
When reading delimited text files there may be metadata information about the data embedded within the files. If the reader has been configured to use this metadata, it can obtain information about the ordering and partitioning of the data when it was written, which can eliminate the need to re-sort or partition the data.
Delimited text can be written using the
WriteDelimitedText Operator.
Code Examples
The first code example shows a simple usage of the reader. The path to the local file name is given as a parameter to the constructor. This could have also been set using the setSource() method. The field separator and header properties are set. Then a record type is built and used as the input schema. Note that the record type must be converted to an acceptable schema before being used by the reader. Also note that the record separator is not specified. It will be determined by the auto discovery mechanism of the reader.
Using the ReadDelimitedText operator example in Java
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/ratings.txt"));
reader.setFieldSeparator("::");
reader.setHeader(true);
RecordTokenType ratingsType = record(INT("userID"), INT("movieID"), INT("rating"), STRING("timestamp"));
reader.setSchema(TextRecord.convert(ratingsType));
Using the ReadDelimitedText operator in RushScript
var ratingsSchema = dr.schema().INT('userID').INT('movieID').INT('rating').STRING('timestamp');
var data = dr.readDelimitedText({source:'data/ratings.txt', fieldSeparator:'::', header:true, schema:ratingsSchema});
The snippet of data below is from the ratings.txt file and can be read using the code example above.
userID::movieID::rating::timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
This next example reads from a file in a Hadoop Distributed File System (HDFS). The hdfs URL scheme identifies the file as being contained within an HDFS file system. The authority section of the URL specifies the specific HDFS file system. The rest of the path indicates the file path within the HDFS instance. A schema is built for this data since it contains a date field. The format or pattern of the date field must be specified since it is non-standard.
Reading from HDFS with a date type
TextRecord schema =
SchemaBuilder.define(
SchemaBuilder.STRING("accountNumber"),
SchemaBuilder.STRING("clientName"),
SchemaBuilder.STRING("companyName"),
SchemaBuilder.STRING("streetAddress"),
SchemaBuilder.STRING("city"),
SchemaBuilder.STRING("state"),
SchemaBuilder.STRING("zip"),
SchemaBuilder.STRING("emailAddress"),
SchemaBuilder.DATE("birthDate", "MM/dd/yyyy"), // specify pattern for parsing the date
SchemaBuilder.STRING("accountCodes"),
SchemaBuilder.DOUBLE("standardPayment"),
SchemaBuilder.DOUBLE("payment"),
SchemaBuilder.DOUBLE("balance")
);
// Create a delimited text reader for the accounts data
ReadDelimitedText reader = graph.add(new ReadDelimitedText("hdfs://saturn.englab.local:9000/user/jfalgout/data/Accounts.txt"));
reader.setFieldSeparator(",");
reader.setHeader(true);
reader.setSchema(schema);
Following is a snippet of the data that can be read and parsed with the previous code example. Note that each field is surrounded with a double quote as the field delimiter. Also note the format of the "birthDate" field. It is a non-standard (not ISO) format. The schema used to parse the data specifies the pattern used to parse the date field.
"accountNumber","clientName","companyName","streetAddress","city","state","zip","emailAddress","birthDate","accountCodes","standardPayment","payment","balance"
"01-000667","George P Schell","Market Place Products","334 Hilltop Dr","Mentor","OH","44060-1930","warmst864@aol.com","02/28/1971","XA","101.00","100.00","15.89"
"01-002423","Marc S Brittan","Madson & Huth Communication Co","5653 S Blackstone Avenue, #3E","Chicago","IL","60637-4596","mapper@tcent.net","06/30/1975","BA","144.00","144.00","449.92"
"01-006063","Stephanie A Jernigan","La Salle Clinic","77565 Lorain","Akron","OH","44325-4002","dram@akron.net","11/02/1941","EB|CB","126.00","126.00","262.98"
"01-010474","Ernie Esser","Town & Country Electric Inc.","56 Pricewater","Waltham","MA","2453","hazel@bentley.net","12/15/1962","JA|RB","127.00","127.00","271.75"
"01-010852","Robert A Jacoby","Saturn of Baton Rouge","4001 Lafayette","Baton Rouge","LA","70803-4918","din33@norl.com","12/22/1985","ED|EA|RB|KA","142.00","150.00","423.01"
"01-011625","James C Felli","Bemiss Corp.","23A Carolina Park Circle","Spartanburg","SC","29303-9398","cadair@gw.com","02/21/1940","SB","151.00","155.00","515.41"
In the previous example, the schema could also be discovered, extending the default type patterns to recognize the date formats. This can be done in a fairly straightforward fashion:
Custom schema discovery
// Instead of constructing schema and calling reader.setSchema(schema)
TextDataType usDate = TextTypes.FORMATTED_DATE(new SimpleDateFormat("MM/dd/yyyy"));
List<TypePattern> patterns = Arrays.asList(new TypePattern("\\d{1,2}/\\d{1,2}/\\d+", usDate));
// Simple extension of default pattern-based discovery
reader.setSchemaDiscovery(patterns);
// Complete replacement of schema discoverer
// More interesting when using custom discovery implementation
TextRecordDiscoverer discoverer = TextRecord.extendDefault(patterns);
reader.setSchemaDiscovery(discoverer);
Properties
The
ReadDelimitedText operator supports the following properties:
Name | Type | Description |
---|---|---|
analysisDepth | int | The number of characters to read for performing schema discovery and structural analysis. |
autoDiscoverNewline | String | Determines if the record separator should be auto-discovered. Default: enabled. |
charset | Charset | The character set used by the data source. Default: ISO-8859-1. |
charsetName | String | The character set used by the data source by name. |
decodeBuffer | int | The size of the buffer, in bytes, used to decode character data. By default, this will be automatically derived using the character set and read buffer size. |
discoveryNullIndicator | String | The text value used to represent null values by default in discovered schemas. By default, this is the empty string. |
discoveryStringHandling | StringConversion | The default behavior for processing string-valued types in discovered schemas. |
encoding | | Properties that control character set encoding. |
errorAction | CodingErrorAction | The error action determines how to handle errors encoding the input data into the configured character set. The default action is to replace the faulty data with a replacement character. |
extraFieldAction | | How to handle fields found when parsing the record, but not declared in the schema. |
fieldDelimiter | String | Delimiter used to denote the boundaries of a data field. |
fieldEndDelimiter | String | Ending delimiter used to denote the boundaries of a data field. |
fieldErrorAction | | How to handle fields that cannot be parsed. |
fieldLengthThreshold | int | The maximum length allowed for a field value before it is considered an error. |
fieldSeparator | String | Delimiter used to define the boundary between data fields. |
fieldStartDelimiter | String | Starting delimiter used to denote the boundaries of a data field. |
header | boolean | Whether to expect a header row in the source. The header row contains field names. |
includeSourceInfo | boolean | Determines whether output records will include additional fields detailing origin information for the record. If true, records will have three additional fields: • sourcePath – the path of the file from which the record originates. If this is not known, it will be NULL. • splitOffset – the offset of the starting byte of the containing split in the source data. • recordOffset – the offset of the first character of the record text from the start of the containing split. If these names would collide with those defined in the source schema, they will be renamed to avoid collision. These fields are added as the first three of the output and are not affected by the selectedFields property. |
lineComment | String | The character sequence indicating a line comment. Lines beginning with this sequence are ignored. |
maxRowLength | int | The limit, in characters, for the first row. Zero indicates no maximum. |
missingFieldAction | | How to handle fields declared in the schema, but not found when parsing the record. If the configured action does not discard the record, the missing fields will be null-valued in the output. |
parseErrorAction | | How to handle all parsing errors. |
parseOptions | | The parsing options used by the reader. |
pessimisticSplitting | boolean | Configures whether pessimistic file splitting must be used. By default, this is disabled. Pessimistic splitting defines one file split per file (assumes the input files are not splittable). |
readBuffer | int | The size of the I/O buffer, in bytes, to use for reads. Default: 64K. |
readOnClient | boolean | Determines whether reads are performed by the client or in the cluster. By default, reads are performed in the cluster if executed in a distributed context. |
recordSeparator | String | Value to use as a record separator. |
recordWarningThreshold | int | The maximum number of records which can have parse warnings before failing. |
replacement | String | Replacement string to use when encoding error policy is replacement. Default: '?' |
selectedFields | List<String> | The list of input fields to include in the output. Use this to limit the fields written to the output. |
schema | | The record schema expected in the delimited text source. This property is mutually exclusive with schemaDiscovery; setting one causes the other to be ignored. By default, this property is unset. |
schemaDiscovery | | The schema discovery mechanism to use. This property is mutually exclusive with schema; setting one causes the other to be ignored. By default, a pattern-based mechanism is used. Supplying a list of pattern/type pairs uses the default discoverer extended with the supplied patterns. |
source | ByteSource, Path, or String | Source of the input data to parse as delimited text. |
splitOptions | | The configuration used in determining how to break the source into splits. |
useMetadata | boolean | Whether the reader should use any discovered metadata about the ordering and distribution. Default: false. |
Ports
The
ReadDelimitedText operator provides a single output port.
ReadFixedText Operator
Fixed text data contains fields that are not delimited as CSV files are. A schema defines each field and its type, offset, and length within a row of data. Data is parsed from each input row according to the defined position of each field. Field types can be specified along with patterns for parsing the data. Patterns are especially useful for date and timestamp field types.
The
ReadFixedText operator reads a text file of fixed-width records as record tokens. Records are identified by the presence of a non-empty, user-defined record separator sequence between each individual record or by the total length of the record if an empty or zero-length record separator is provided. Output records contain the same fields as the input file. The parser can also filter or reorder the fields of the output, as requested.
The reader requires a
FixedWidthTextRecord object to provide field position as well as parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the parser. These can be manually constructed through the API provided, although this metadata is often persisted externally.
StructuredSchemaReader provides support for reading in Pervasive Data Integrator structured schema descriptors (.schema files) for use with readers.
Typically, the output of the parsing includes all records in the file, both those with and without parsing errors. Fields that cannot be parsed are null-valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
Since record boundaries occur at known positions, fixed text files can be parsed in parallel.
Fixed-width text data can be written using the
WriteFixedText Operator.
Code Examples
The following example builds a schema that is used by the
ReadFixedText operator to read a fixed text format file.
Using the ReadFixedText operator in Java
// Create fixed text reader
ReadFixedText reader = graph.add(new ReadFixedText("data/AccountsFixed.txt"));
// Build the schema. Fields must be added in order of appearance in records.
// The field size must be exact as it determines the position of the field for parsing.
FixedWidthTextRecord schema = new FixedWidthTextRecord(new TextConversionDefaults(StringConversion.NULLABLE_TRIMMED));
schema.defineField("accountNumber", new PaddedTextType(TextTypes.STRING, 9, ' ', Alignment.LEFT));
schema.defineField("name", new PaddedTextType(TextTypes.STRING, 21, ' ', Alignment.LEFT));
schema.defineField("companyName", new PaddedTextType(TextTypes.STRING, 31, ' ', Alignment.LEFT));
schema.defineField("address", new PaddedTextType(TextTypes.STRING, 35, ' ', Alignment.LEFT));
schema.defineField("city", new PaddedTextType(TextTypes.STRING, 16, ' ', Alignment.LEFT));
schema.defineField("state", new PaddedTextType(TextTypes.STRING, 2, ' ', Alignment.LEFT));
schema.defineField("zip", new PaddedTextType(TextTypes.STRING, 10, ' ', Alignment.LEFT));
schema.defineField("emailAddress", new PaddedTextType(TextTypes.STRING, 25, ' ', Alignment.LEFT));
schema.defineField("birthDate", new PaddedTextType(TextTypes.FORMATTED_DATE(new SimpleDateFormat("MM/dd/yyyy")), 10, ' ', Alignment.LEFT));
schema.defineField("accountCodes", new PaddedTextType(TextTypes.STRING, 11, ' ', Alignment.LEFT));
schema.defineField("standardPayment", new PaddedTextType(TextTypes.JAVA_DOUBLE, 6, ' ', Alignment.LEFT));
schema.defineField("payment", new PaddedTextType(TextTypes.JAVA_DOUBLE, 7, ' ', Alignment.LEFT));
schema.defineField("balance", new PaddedTextType(TextTypes.JAVA_DOUBLE, 6, ' ', Alignment.LEFT));
// Set the schema of the reader.
reader.setSchema(schema);
An example of data that can be read with the above code fragment follows. Because of the wide nature of the data, the records will most likely appear across multiple lines of the display.
01-000667George P Schell Market Place Products 334 Hilltop Dr Mentor OH44060-1930warmst864@aol.com 02/28/1971XA 101.00100.00 15.89
01-002423Marc S Brittan Madson & Huth Communication Co 5653 S Blackstone Avenue, #3E Chicago IL60637-4596mapper@tcent.net 06/30/1975BA 144.00144.00 449.92
01-006063Stephanie A Jernigan La Salle Clinic 77565 Lorain Akron OH44325-4002dram@akron.net 11/02/1941EB|CB 126.00126.00 262.98
01-010474Ernie Esser Town & Country Electric Inc. 56 Pricewater Waltham MA2453 hazel@bentley.net 12/15/1962JA|RB 127.00127.00 271.75
01-010852Robert A Jacoby Saturn of Baton Rouge 4001 Lafayette Baton Rouge LA70803-4918din33@norl.com 12/22/1985ED|EA|RB|KA142.00150.00 423.01
01-011625James C Felli Bemiss Corp. 23A Carolina Park Circle Spartanburg SC29303-9398cadair@gw.com 02/21/1940SB 151.00155.00 515.41
01-018448Alan W Neebe Georgia State Credit Union PO Box 159 Demorest GA30535-1177delores@truett.com 01/31/1960MA|ED|SB 113.00120.00 131.89
01-018595Alexander Gose Office Support Services 436 Green Mountain Circle New Paltz NY12561-0023dams@matrix.net 06/19/1940EC 147.00147.00 477.09
The following example demonstrates using the ReadFixedText operator in RushScript. The schema is created in RushScript and passed to the operator.
Using the ReadFixedText operator in RushScript
// Build the schema. Fields must be added in order of appearance in records.
// The field size must be exact as it determines the position of the field for parsing.
var accountsFixedSchema = dr.schema({type:'FIXED'})
.nullable(true)
.trimmed(true)
.padChar(' ')
.alignment('LEFT')
.STRING("accountNumber", {size:9})
.STRING("clientName", {size:21})
.STRING("companyName", {size:31})
.STRING("streetAddress", {size:35})
.STRING("city", {size:16})
.STRING("state", {size:2})
.STRING("zip", {size:10})
.STRING("emailAddress", {size:25})
.DATE("birthDate", {pattern:'MM/dd/yyyy', size:10})
.STRING("accountCodes", {size:11})
.DOUBLE("standardPayment", {pattern:'0.00', size:6})
.DOUBLE("payment", {pattern:'0.00', size:7})
.DOUBLE("balance", {pattern:'0.00', size:6});
// Read the data
var data = dr.readFixedText({source:'/path/to/file.txt', schema:accountsFixedSchema});
Properties
The
ReadFixedText operator supports the following properties:
Ports
The
ReadFixedText operator provides a single output port.
ReadSource Operator
The
ReadSource operator reads a defined data source as a stream of records. The data source provides a sequence of bytes in some format that can be parsed into records that are assumed to be identical in logical structure. The mapping between physical and logical structure is encapsulated in a format descriptor, which must be provided.
This operator is low level, providing a generalized model for reading files in a distributed fashion. Typically, the
ReadSource operator is not directly used in a graph, instead being indirectly used through a composite operator such as one derived from
AbstractReader, providing a more appropriate interface to the end user.
Parallelized reads are implemented by breaking input files into independently parsed pieces, a process called splitting. Splits are then distributed to available partitions and parsed. When run on a distributed cluster, the reader makes an attempt to assign splits to machines where the I/O will be local, but non-local assignment may occur in order to provide work for all partitions. Distributed execution also makes an assumption that the specified data source is accessible from any machine. If this is not the case, the read operator must be made non-parallel by invoking the disableParallelism() method on the operator instance.
Not all formats support splitting; splitting generally requires a way of unambiguously identifying record boundaries. Formats indicate whether they support splitting; if not, each input file is treated as a single split. Even with a non-splittable format, reading multiple files can therefore still be parallelized. Some formats support splitting only in an “optimistic” fashion: under most circumstances splits can be handled, but in some edge cases splitting leads to parse failures. For these cases, the reader supports a “pessimistic” mode that assumes a format is non-splittable, regardless of what the format reports.
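To illustrate the splitting model, the following sketch (plain Java, not the DataFlow API; the class and method names are hypothetical) shows the usual rule that makes parallel parsing safe: a split parses only the records that start inside it, skipping any partial record at its start, and reading past its end to finish its final record. Every record then lands in exactly one split, so the splits can be parsed independently and in parallel.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Parse only records that *start* inside [splitStart, splitEnd).
    // A split that begins mid-record skips forward past the next separator;
    // a record straddling splitEnd is finished by the split that started it.
    static List<String> parseSplit(String data, int splitStart, int splitEnd, char sep) {
        int pos = splitStart;
        if (pos > 0) {
            // Skip the partial record at the start of the split (unless the
            // previous character is a separator, i.e. a record starts here).
            while (pos < data.length() && data.charAt(pos - 1) != sep) pos++;
        }
        List<String> records = new ArrayList<>();
        while (pos < splitEnd && pos < data.length()) {
            int next = data.indexOf(sep, pos);
            if (next < 0) next = data.length();
            records.add(data.substring(pos, next));
            pos = next + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\ndelta\n";
        List<String> all = new ArrayList<>();
        // Two fixed-size splits; "charlie" straddles the boundary at offset 13
        // but is parsed exactly once, by the first split.
        all.addAll(parseSplit(data, 0, 13, '\n'));
        all.addAll(parseSplit(data, 13, data.length(), '\n'));
        System.out.println(all);   // prints [alpha, bravo, charlie, delta]
    }
}
```

This also shows why unambiguous record boundaries are required: without them, a split cannot tell where the first complete record begins.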
The reader makes a best-effort attempt to validate the data source before execution, although it cannot always guarantee correctness, depending on the nature of the data source. This validation tries to prevent misconfigured graphs from executing; the reader may not execute until a late phase of the graph, where a failure could result in a significant amount of completed work being lost.
Tip... This is a low-level operator that typically is not directly used. It can be used with a custom data format. A custom data format may be needed to support a format not provided by the DataFlow library.
Code Example
This example code fragment demonstrates how to set up a reader for a generic file type.
Using the ReadSource operator
ReadSource reader = new ReadSource();
reader.setSource(new BasicByteSource("filesource"));
reader.setFormat(new DelimitedTextFormat(TextRecord.convert(record(INT("intfield"), STRING("stringfield"))),
new FieldDelimiterSettings(),
new CharsetEncoding()));
ParsingOptions options = new ParsingOptions();
options.setSelectedFields("stringfield");
reader.setParseOptions(options);
Properties
The
ReadSource operator supports the following properties:
Ports
The
ReadSource operator provides a single output port:
ReadLog Operator
Many applications and systems produce log data that is loosely structured. Generally, there is a specific format used to write the log data; however, this format is not always unambiguously reversible by a typical parser. Also, different fields might use different field separators and delimiters.
About the only generalities that can be made about log formats are that the records always contain an identifying field, usually a timestamp, and a message field consisting of the information produced by the log event. In such cases the log data may not be readable by a regular delimited or fixed text reader.
The
ReadLog operator reads a text file or alternative source consisting of log events from a particular application or system. The type of application or system producing the log records must be specified in advance through a property setting. The currently supported log types are enumerated by
SupportedLogType. Configuring the operator requires the user to either provide one of these enumerations or their own implementation of a particular
LogFormat. It should be noted that these settings are mutually exclusive.
In addition to specifying the log type, the format pattern may be set. This is a String that provides information about a log format when customization of the format is allowed. It is specific to the type of log being read, so the customization available depends on the log type. Additionally, the newline character used by the log files may be specified if a nondefault newline character is being used. By default, the newline character is determined automatically by examining the first few lines in the source.
The record flow generated by this operator is determined by the log type being read and the log format pattern provided during composition of the operator unless otherwise noted.
Supported Log Types
The
ReadLog operator supports a selection of common log formats. These are enumerated by
SupportedLogType. Custom log formats can be added by implementing the
LogFormat interface. A custom format would be instantiated and provided to the
ReadLog operator through the logFormat property. Certain log types can also be manually instantiated and provided to the
ReadLog operator when log-specific settings need to be changed, such as log4j’s logging levels.
The various supported log types are listed below.
Generic Log Data
The generic type can be used when the log data can be parsed using a regular expression but there is no dedicated format for the log. The schema is automatically generated by counting the number of groupings in the regular expression provided. The schema can also be set manually by creating a custom instance of the log format.
The generic format takes a valid Java regular expression string. The grouping of the regular expression defines the fields the individual records will be split into.
Default : "(.*)"
Example : "(\\d\\d.\\d+) (\\w+) (\\w+)"
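As an illustration of how the groupings define the output fields, this hedged sketch (plain Java regular expressions, not the ReadLog operator itself; the class and method names are illustrative) applies the example pattern above and extracts one field per capturing group:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenericLogSketch {
    // Each capturing group in the expression becomes one output field,
    // mirroring how the generic format derives its schema.
    static String[] parseLine(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) return null;          // unparseable line
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) fields[i] = m.group(i + 1);
        return fields;
    }

    public static void main(String[] args) {
        // Same pattern as the example above: three groups, so three fields
        Pattern p = Pattern.compile("(\\d\\d.\\d+) (\\w+) (\\w+)");
        String[] fields = parseLine(p, "01.3456 INFO startup");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
        // prints 01.3456 | INFO | startup
    }
}
```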
Common Log Format
The CLF type can be used when reading a web server log in common log format. NCSA common log format is specified at
http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format. Since CLF is well defined, it does not allow a format pattern to be specified.
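For reference, CLF records carry seven fields: host, identity, user, timestamp, request, status, and bytes. The following sketch (plain Java, not the ReadLog operator; the regular expression is an illustrative approximation of the format) parses one CLF line into those fields:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClfSketch {
    // Approximate pattern for the seven CLF fields:
    // host, identity, user, timestamp, request, status, bytes
    static final Pattern CLF = Pattern.compile(
        "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    static String[] parseClf(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) return null;
        String[] fields = new String[7];
        for (int i = 0; i < 7; i++) fields[i] = m.group(i + 1);
        return fields;
    }

    public static void main(String[] args) {
        String[] f = parseClf(
            "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326");
        System.out.println(f[0] + " requested \"" + f[4] + "\" -> status " + f[5]);
    }
}
```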
Combined Log Format
The Combined type can be used when reading a web server log in combined log format. NCSA combined log format is specified at
http://publib.boulder.ibm.com/tividd/td/ITWSA/ITWSA_info45/en_US/HTML/guide/c-logs.html#ncsa.
Combined takes a true or false string, which determines if the optional cookie field is included in the log.
Extended Log Format
The ELF type can be used when reading a web server log in extended log format. Extended log format is specified at
http://www.w3.org/TR/WD-logfile.html.
ELF will accept a string in the same form as a Fields directive as specified in the official format. If format discovery is enabled, it will scan the file for any directives and apply them appropriately.
Example : "#Fields: date time cs-method cs-uri"
GlassFish Logs
The GlassFish format supports reading logs produced by GlassFish servers. The GlassFish server log format is specified at
http://docs.oracle.com/cd/E18930_01/html/821-2416/abluk.html.
The format pattern supported by the GlassFish format consists of a string that specifies the date format used in the timestamps of the log. Any string supported by Java’s DateFormat class is acceptable.
Default : "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
Example : "dd-MM-yyyy HH:mm:ss"
Log4j Logs
The log4j format supports reading logs produced by the Apache log4j library for Java. More information about the library can be found at
http://logging.apache.org/log4j/1.2/.
The log4j format will accept a string in the same form as the conversion pattern that specifies the logging. More information about log4j conversion patterns can be found at
http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/EnhancedPatternLayout.html.
Default : "%r [%t] %-5p %c %x - %m%n"
Example : "%d{ISO8601} %p %c: %m%n"
Syslog Logs
The syslog format supports reading logs produced by syslogd and other BSD-compliant syslog producers. The BSD syslog format is specified by
RFC-3164.
The format pattern supported by the syslog format consists of a string that includes the current four-digit year and the signed four-digit offset from UTC separated with a single space.
Default : current year and timezone
Example : "2012 -0600"
Code Examples
This example code fragment demonstrates how to set up a reader for a log4j log file.
Using the ReadLog operator
ReadLog reader = graph.add(new ReadLog("data/log4jdata.log"));
reader.setLogType(SupportedLogType.LOG4J);
reader.setLogPattern("%d{ISO8601} %p %c: %m%n");
reader.setNewline("\n");
Using the ReadLog operator in RushScript
var data = dr.readLog({source:'data/log4jdata.log', logType:'LOG4J', logPattern:'%d{ISO8601} %p %c: %m%n', newLine:'\n'});
Properties
The
ReadLog operator supports the following properties:
Ports
The
ReadLog operator provides a single output port.
ReadARFF Operator
This topic describes the DataFlow ARFF read operator. For information on its KNIME node, see
ARFF Reader.
Sparse data is useful for data sets that contain a large number of fields where most of the fields do not have data values. This is mostly the case with numeric data, but it can also apply to enumerated data types. A common example is a data set containing a row per website user and a field per website page. Each field contains a count of the number of times the user has visited the specific page. Most users will visit only a fraction of the overall pages on the website. Using a sparse data representation allows the data set to be much smaller than a fully populated, dense data set.
DataFlow supports sparse data using the Attribute-Relation File Format (ARFF). The
ReadARFF operator is used to read sparse data stored in ARFF. Files using ARFF can be in either sparse or dense mode. This reader detects the mode and reads the data accordingly. ARFF files contain schema information. The schema is parsed and used to determine how to parse data records.
ARFF can be parsed in parallel under “optimistic” assumptions: namely, that parse splits do not occur in the middle of a delimited field value and somewhere before an escaped record separator. This is assumed by default, but can be disabled with an accompanying reduction of scalability and performance.
DataFlow uses ARFF data to represent sparse data, but ARFF can also be used to store dense data in a CSV style. The ARFF mode determines which format is used: sparse or dense. The reader automatically discovers the mode.
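To make the two modes concrete: in sparse mode, an ARFF data row lists only index/value pairs for the fields that have values, with unlisted fields defaulting to 0. The following sketch (plain Java, not the DataFlow API; the class and method names are hypothetical) expands a sparse row into its dense equivalent:

```java
import java.util.Arrays;

public class ArffSparseSketch {
    // Expand a sparse ARFF data row such as "{1 5, 3 2}" into a dense row.
    // Unlisted attributes default to "0", per the sparse ARFF convention.
    static String[] expandSparseRow(String row, int attributeCount) {
        String[] dense = new String[attributeCount];
        Arrays.fill(dense, "0");
        String body = row.trim();
        body = body.substring(1, body.length() - 1).trim();  // strip the braces
        if (!body.isEmpty()) {
            for (String entry : body.split(",")) {
                // Each entry is "<attribute index> <value>"
                String[] parts = entry.trim().split("\\s+", 2);
                dense[Integer.parseInt(parts[0])] = parts[1];
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(expandSparseRow("{1 5, 3 2}", 5)));
        // prints [0, 5, 0, 2, 0]
    }
}
```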
The ARFF metadata also contains two other data values: the relation name and comments. The relation name is specified as one of the metadata headers. Comments are lines that start with the "%" character. Comments are returned as a list of String values.
Data can be written in ARFF format using the
WriteARFF Operator.
Code Examples
Since ARFF includes metadata that contains field names and types, the schema for ARFF files does not have to be specified. The metadata can be accessed using the discoverMetadata() method on the reader after the data source has been configured. The metadata can be used to access the relation name, comments, ARFF mode, and data schema. The schema contains the field names and types along with patterns for parsing and formatting field values.
Using the ReadARFF operator in Java
// Create ARFF reader
ReadARFF reader = graph.add(new ReadARFF("data/weather.arff"));
// Get metadata for the configured data source
Analysis metadata = reader.discoverMetadata(FileClient.basicClient());
ARFFMode mode = metadata.getMode();
String relationName = metadata.getRelationName();
List<String> comments = metadata.getComments();
TextRecord schema = metadata.getSchema();
// Dump out metadata values
System.out.println("mode = " + mode);
System.out.println("relationName = " + relationName);
System.out.println("comments = " + comments);
System.out.println("schema = " + schema.getFieldNames());
Following is a snippet of output from running an application with the previous code fragment.
mode = DENSE
relationName = weather
comments = []
schema = [outlook, temperature, humidity, windy, play]
Using the ReadARFF operator in RushScript
var data = dr.readARFF({source:'data/weather.arff'});
The weather.arff file’s contents:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Properties
The
ReadARFF operator supports the following properties:
Ports
The
ReadARFF operator provides one output port:
ReadStagingDataset Operator
Staging data sets are used within DataFlow for writing intermediate data into a fast and efficient binary format. They are also very convenient since metadata is stored in a header section of the data set. When reading a data set, the metadata is used to determine the fields and types of data contained within the data set.
Text files can also be used for intermediate data access but incur the overhead of formatting and parsing. Text files can also introduce data errors because numeric data must be converted to and from textual formats. Staging data sets do not require this round trip to and from text and so bypass these issues.
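The text round trip issue is easy to demonstrate with plain Java (this illustrates the general problem, not DataFlow-specific behavior): formatting a double as text and parsing it back need not return the original value.

```java
import java.util.Locale;

public class TextRoundTrip {
    public static void main(String[] args) {
        double value = 0.1 + 0.2;                          // 0.30000000000000004
        // Format the value as a text field with six decimal places
        String text = String.format(Locale.ROOT, "%.6f", value);
        double parsed = Double.parseDouble(text);          // 0.3
        System.out.println(text + " equal? " + (value == parsed));
        // prints 0.300000 equal? false
    }
}
```

A binary staging format stores the original bit pattern, so no precision is lost between writes and reads.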
The DataFlow framework uses staging data sets in the following situations:
• In between application phases where one phase feeds data to the next
• Staging data to disk during merge sort operations
• Staging data to disk due to queue overflow in deadlock situations
There are several use cases for directly using staging data sets. A few examples are:
• A particular task that needs to be accomplished must be broken into several DataFlow applications. Data from each application must be passed to the next. Staging data sets can be used to capture the intermediate data, allowing for quick and easy access.
• Data within a database is being used to build a predictive model. This development requires many iterations of algorithm testing with the data. For this use case, the most efficient route is to first read the data from the source database and write it to staging data sets, then use the staged data as input to the predictive analytic development. The one-time read from the database is overhead; however, given that the source data will be read many times, the savings are substantial.
As with other file-based operators, the
ReadStagingDataset operator accepts several data sources:
• A Path reference
• A String reference containing:
– The path to a single data set
– A path to a directory containing many data sets with the same format and schema
– A path with wildcard characters within the file name that will be resolved to the matching data set files
Staging data sets can be written using the
WriteStagingDataset Operator.
Code Examples
This code example constructs a new
ReadStagingDataset operator by passing in the name of the file to read as a constructor parameter. Properties for setting the I/O buffer size and the selected fields are then set. The selected fields must contain field names that exist in the data set. Use the selectedFields property to constrain the fields read from the data set and provided to the output port of the operator.
Using the ReadStagingDataset operator in Java
ReadStagingDataset reader = graph.add(new ReadStagingDataset("results/ratings-stage"));
reader.setReadBuffer(128 * 1024);
reader.setSelectedFields(Arrays.asList(new String[] {"userID", "rating"}));
Using the ReadStagingDataset operator in RushScript
var data = dr.readStagingDataset({source:'results/ratings-stage', readBuffer:131072});
The operator also supports a static method for reading the metadata of a staging data set. The following code example depicts how to read and use the metadata.
Using staging dataset metadata
ReadStagingDataset reader = graph.add(new ReadStagingDataset("results/ratings-stage"));
reader.setReadBuffer(128 * 1024);
reader.setSelectedFields(Arrays.asList(new String[] {"userID", "rating"}));
DatasetMetadata metadata = reader.discoverMetadata();
long rowCount = metadata.getRowCount();
int blockSize = metadata.getBlockSize();
TokenType datasetType = metadata.getSchema();
DatasetStorageFormat format = metadata.getStorageFormat();
System.out.println("rowCount = " + rowCount);
System.out.println("blockSize = " + blockSize);
System.out.println("datasetType = " + datasetType);
System.out.println("format = " + format);
The output of the metadata information from the previous application is shown following. This metadata can be used within an application as needed.
rowCount = 1000
blockSize = 64
datasetType = {"type":"record","representation":"DENSE_BASE_NULL","fields":[{"userID":{"type":"int"}},{"movieID":{"type":"int"}},{"rating":{"type":"int"}},{"timestamp":{"type":"string"}}]}
format = COMPACT_ROW
Properties
The
ReadStagingDataset operator supports the following properties:
Ports
The
ReadStagingDataset operator provides a single output port:
ParseTextFields Operator
The
ParseTextFields operator is similar to the
ReadDelimitedText and other text-based read operators. However, instead of reading from a source of text, it takes a flow of records consisting of text fields as input. These input strings are then parsed into other value types as specified by the schema provided to the operator. Additionally, it emits a flow of records which failed parsing, allowing remediation to be performed on invalid records.
The parsed output will have the type specified by the schema. Output fields will contain the result of processing the input field of the same name according to the type information in the provided schema. Referenced input fields must either be string valued, in which case they are parsed according to the schema, or of a type which is assignable to the output field's type, in which case they are copied as-is.
If a field is present in the schema, but not in the input, the output field is NULL. If an input value is NULL, the resulting output field is NULL. Input fields without a matching field in the schema are dropped.
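The matching rules above can be sketched in a few lines of plain Java (this is not the operator's implementation; the class and method names are illustrative): the output takes its shape from the schema, schema fields missing from the input become null, and input fields absent from the schema are dropped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldMatchSketch {
    // Produce one output row from one input row of text fields.
    static Map<String, String> matchFields(Map<String, String> input, List<String> schemaFields) {
        Map<String, String> output = new LinkedHashMap<>();
        for (String field : schemaFields) {
            // null when the field is absent from the input or null-valued there;
            // input fields not named in the schema never reach the output
            output.put(field, input.get(field));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("acctNumber", "01-000667");    // not in schema: dropped
        input.put("company", "Market Place Products");
        System.out.println(matchFields(input, List.of("company", "acctCodes")));
        // prints {company=Market Place Products, acctCodes=null}
    }
}
```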
The rejected output has the same type as the input. Field values will match those of the failed input record.
ParseTextFields only performs semantic validation of the input; the input must have been already broken into field values. This can be accomplished by using
ReadDelimitedText with a schema consisting of all string fields or by custom operators. It should also be noted that a special schema discoverer
TextRecord.TEXT_FIELD_DISCOVER is provided for when the schema of the source field may not be known.
Code Examples
Following is an example usage of the parser. In this case, notice that the input field acctNumber is not defined in the schema and so is absent from the outputs. Conversely, acctCodes is present in the schema but not in the input; it is present in the output but will be NULL for every record. Additionally, note that the input fields that are in the schema appear in a different order.
Using ParseTextFields in Java
// Define a "raw" schema, identifying fields but not parsing content
RecordTokenType rawType = record(STRING("acctNumber"), STRING("company"),
    STRING("startDate"), STRING("balance"));
TextRecord rawSchema = TextRecord.convert(rawType, StringConversion.RAW);
// Read "raw" fields
ReadDelimitedText reader = graph.add(new ReadDelimitedText(...));
reader.setSchema(rawSchema);
// Define the parsing schema
TextRecord schema =
SchemaBuilder.define(
SchemaBuilder.STRING("company"),
SchemaBuilder.STRING("acctCodes"),
SchemaBuilder.DOUBLE("balance"),
SchemaBuilder.DATE("startDate", "MM/dd/yyyy") // specify pattern for parsing the date
);
// Parse the text records
ParseTextFields parser = graph.add(new ParseTextFields());
parser.setSchema(schema);
graph.connect(reader.getOutput(), parser.getInput());
Using ParseTextFields in RushScript
var rawschema = dr.schema()
.nullable(false).trimmed(false) // Want strings in raw form
.STRING("acctNumber")
.STRING("company")
.STRING("startDate")
.STRING("balance");
var rawdata = dr.readDelimitedText({source:..., schema:rawschema});
var schema = dr.schema()
.STRING('company')
.STRING('acctCodes')
.DOUBLE('balance')
.DATE('startDate','MM/dd/yyyy');
var parsed = dr.parseTextFields(rawdata, {schema:schema});
To better illustrate the behavior of the previous code, consider the following source data (all fields are string values).
This yields the data below on the output port. Here values are parsed as the appropriate type for the field. As the balance field is NULL in the source, it is NULL in the output.
Meanwhile, the rejects port will produce the following data, as the startDate field has the wrong format (all fields are string values).
Properties
The
ParseTextFields operator supports the following properties.
Ports
The
ParseTextFields operator provides two output ports.
ReadJSON Operator
This topic describes the DataFlow JSON read operator. For information about its KNIME node, see
JSON Reader.
The ReadJSON operator reads a JSON file containing key-value pairs or an array of objects as record tokens. It supports the JSON Lines format as described at
http://jsonlines.org/. The formatted text in JSON Lines contains a single JSON record per line. Each record is separated by a newline character.
In JSON, all field keys must start and end with a delimiter. A " (double quote) is typically used as the field delimiter. However, you may enable the allowSingleQuotes property to avoid parsing errors when single quotes are used. The ReadJSON operator uses the Jackson JSON parsing library to parse fields.
The reader may optionally specify a RecordTextSchema to provide parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed using the provided DataFlow API.
The
StructuredSchemaReader class provides support for reading the DataConnect structured schema descriptors (.schema files) to use with readers. Also, automated schema discovery can be performed based on the contents of the file because JSON text has explicit field markers. The reader provides a pluggable discovery mechanism to support this function. By default, the schema is automatically discovered with the initial assumption that all the fields are strings. The discovered fields are named using the available key fields.
Normally, the reader output includes all the parsed records, both with and without parsing errors. Fields that cannot be parsed have null values in the output. If required, the reader can be configured to filter the failed records from the output.
JSON text does not contain a header row since the keys in a JSON record define the fields in the output. JSON text files can be parsed in parallel with the "optimistic" assumption that the data is well formatted as per the JSON Lines standard.
Code Examples
The first code example shows a simple usage of the reader. The path to the local file name is provided as a parameter to the constructor. This can also be set using the setSource() method. A record type is built and used as the input schema. The record type must be converted to an acceptable schema before it is used by the reader.
Using the ReadJSON operator in Java
The following code shows how to use the ReadJSON operator in Java:
ReadJSON reader = graph.add(new ReadJSON("data/iris.jsonl"));
RecordTokenType schema = record(DOUBLE("sepallength"), DOUBLE("sepalwidth"), DOUBLE("petallength"), DOUBLE("petalwidth"), STRING("class"));
reader.setSchema(TextRecord.convert(schema));
Using ReadJSON operator in RushScript
The following code shows how to use the ReadJSON operator in RushScript:
var irisSchema = dr.schema().DOUBLE("sepallength"), DOUBLE("sepalwidth"), DOUBLE("petallength"), DOUBLE("petalwidth"), STRING("class");
var data = dr.readjson({source:'data/iris.jsonl ', schema: irisSchema});
The following data snippet is from the iris.jsonl file and can be read using the preceding code examples (either Java or RushScript).
{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"Iris-setosa"}
{"sepallength":4.9,"sepalwidth":3,"petallength":1.4,"petalwidth":0.2,"class":"Iris-setosa"}
{"sepallength":4.7,"sepalwidth":3.2,"petallength":1.3,"petalwidth":0.2,"class":"Iris-setosa"}
{"sepallength":7,"sepalwidth":3.2,"petallength":4.7,"petalwidth":1.4,"class":"Iris-versicolor"}
{"sepallength":6.4,"sepalwidth":3.2,"petallength":4.5,"petalwidth":1.5,"class":"Iris-versicolor"}
{"sepallength":6.9,"sepalwidth":3.1,"petallength":4.9,"petalwidth":1.5,"class":"Iris-versicolor"}
{"sepallength":6.3,"sepalwidth":3.3,"petallength":6,"petalwidth":2.5,"class":"Iris-virginica"}
{"sepallength":5.8,"sepalwidth":2.7,"petallength":5.1,"petalwidth":1.9,"class":"Iris-virginica"}
{"sepallength":7.1,"sepalwidth":3,"petallength":5.9,"petalwidth":2.1,"class":"Iris-virginica"}
{"sepallength":6.3,"sepalwidth":2.9,"petallength":5.6,"petalwidth":1.8,"class":"Iris-virginica"}
Properties
The
ReadJSON operator supports the following properties.
Name | Type | Description |
|---|
allowComments | Boolean | Determines whether the parser will allow using Java or C++ style comments (both '/'+'*' and '//' types) within parsed content or not. Default is False. |
allowUnquotedFieldNames | Boolean | Determines whether the parser will allow using unquoted field names. Default is False. |
allowSingleQuotes | Boolean | Determines whether the parser will allow using single quotes (apostrophe, character '\'') for quoting strings. Default is False. |
allowUnquotedControlChars | Boolean | Determines whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed). Default is False. |
allowBackslashEscapingAny | Boolean | Determines whether the parser will allow quoting any character using backslash quoting mechanism. If it is not enabled, only characters that are explicitly listed by JSON specification can be escaped. Default is False. |
allowNumericLeadingZeros | Boolean | Determines whether the parser will allow numbers to start with additional leading zeros. If the leading zeros are allowed for numbers in source, then this field must be set to True. Default is False. |
allowNonNumericNumbers | Boolean | Determines whether the parser is allowed to recognize "Not a Number" (NaN) token as legal floating point values. Default is False. |
analysisDepth | Int | Indicates the number of characters to read for performing schema discovery and structural analysis. Default is False. |
charset | Charset | Indicates the character set used by the data source. Default is UTF-8. |
charsetName | String | Indicates the character set used by the data source based on the name. |
decodeBuffer | Int | Indicates the buffer size (in bytes) used to decode character data. By default, this is automatically derived using the character set and read buffer size. |
discoveryNullIndicator | Int | Indicates the text value used to represent null values by default in discovered schemas. By default, this is the empty string. |
discoveryStringHandling | StringConversion | Indicates the default behavior for processing string-valued types in discovered schemas. |
encoding | | Indicates the properties that control character set encoding. |
errorAction | CodingErrorAction | Determines the action to be performed for errors encoding the input data into the configured character set. The default action replaces the faulty data with a replacement character. |
extraFieldAction | | Determines the action to be performed for fields that are found when parsing the record, but not declared in the schema. |
fieldErrorAction | | Determines the action to be performed for fields that cannot be parsed. |
fieldLengthThreshold | Int | Indicates the maximum length allowed for a field value before it is considered an error. |
includeSourceInfo | Boolean | Determines whether output records will include additional fields that provides origin information for the record. If true, records will have three additional fields: • sourcePath - Path of the file from which the record originates. If this is not known, the value is Null. • SplitOffset - Offset of the starting byte of the containing split in the source data. • recordOffset - Offset of the first character of the record text from the start of the containing split. If these names collide with the names defined in the source schema, they will be renamed to avoid collision. These fields are added as the first three of the output and are not affected by the selectedFields property. |
missingFieldAction | | Indicates how to handle fields declared in the schema, but not found when parsing the record. If the configured action does not discard the record, then the missing fields will be null-valued in the output. |
multilineFormat | | Determines whether the file should be parsed as a multiline JSON file, which allows each JSON record to span multiple lines. Otherwise the data must be in JSON lines format. Default is False. |
parseErrorAction | | Indicates the action to be performed for all parsing errors. |
parseOptions | | Indicates the parsing options used by the reader. |
pessimisticSplitting | Boolean | Configures whether pessimistic file splitting must be used. By default, this is disabled. Pessimistic splitting defines one file split per file (assumes the input files cannot be split). |
readBuffer | Int | Indicates the size of the I/O buffer (in bytes) to use for reads. Default is 64K. |
readOnClient | Boolean | Determines whether reads are performed by the client or in the cluster. By default, reads are performed in the cluster if executed in a distributed context. |
recordWarningThreshold | Int | Indicates the maximum number of records that can have parse warnings before it fails. |
replacement | String | Indicates the replacement string to be used when encoding error policy is replacement. Default is '?'. |
selectedFields | List<String> | Indicates the list of input fields to be included in the output. This sets a limit for the fields written to the output. |
schema | | Indicates the record schema expected in the delimited text source. This property is mutually exclusive with the schemaDiscovery property. If either of the two property is set, then the other property is ignored. By default, this property is not set. |
schemaDiscovery | | Indicates the schema discovery mechanism to be used. This property is mutually exclusive with the schema property. If either of the two property is set, then the other property is ignored. By default, a pattern-based mechanism is used. Providing a list of pattern/type pairs uses the default discoverer extended with the supplied patterns. |
source | | Indicates the source of the input data to be parsed as delimited text. |
splitOptions | | Indicates the configuration used to determine how to break the source into splits. |
Ports
The
ReadJSON operator provides a single output port.