Read I/O Operators
The DataFlow operator library includes several pre-built Input/Output operators. This section covers the Read operators and provides details on how to use them. For more information, refer to the following topics:
ReadAvro Operator
The
ReadAvro operator reads a data file previously written using the
Apache Avro serialization format. The Avro format is a commonly used binary format that offers data compression and the ability to be parsed in parallel. Metadata about the data, such as its schema and compression format, is serialized into the file, making it available to readers.
The operator will translate the Avro schema into appropriate DataFlow types when possible—some schemas are not supported for reading, as described later.
As DataFlow operates on records, it is generally expected that the source data will have a RECORD schema type. If this is not the case, the operator treats the schema as if it were a record with a single field named "value."
The output record type will have fields with the same names and in the same order as the source schema. Output fields are assigned a type based on the schema of the source field with the same name. In general, Avro schema types are assigned DataFlow types according to the following table.
For types not listed previously, the schema type may or may not be mapped to a DataFlow type. If attempting to read a source with schema types that cannot be mapped to DataFlow types, the operator will produce an error. The conditions under which other schema types are supported are as follows:
• Source fields with ARRAY or MAP schema types are never supported.
• Source fields with a RECORD schema type are supported only when reading Avro data written using the
WriteAvro operator; fields with DataFlow types that do not have analogues in Avro are written as nested records. Source fields using these same schemas will be mapped back into the original DataFlow type.
• Source fields with a UNION schema type are supported only if the union contains exactly two schema types, one of which is NULL. In this case, the type is determined using the non-NULL schema type of the union.
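As a sketch, an Avro schema of the supported nullable-union form looks like the following (the record and field names here are illustrative, not taken from the product):

```json
{
  "type": "record",
  "name": "Rating",
  "fields": [
    {"name": "userID", "type": "int"},
    {"name": "rating", "type": ["null", "int"], "default": null}
  ]
}
```

The rating field's two-branch union of null and int maps to a nullable DataFlow INT; a union of two non-NULL types, such as ["string", "int"], would not be supported.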
For information about creating files containing data in Avro format using DataFlow, see
WriteAvro Operator.
When reading Avro files written by DataFlow, there may be additional metadata information about the data embedded within the files. If the reader has been configured to use this metadata, then it can obtain information about the ordering and partitioning of the data when it was written, which can eliminate the need to re-sort or partition the data.
Code Examples
Because Avro files are self-contained with respect to metadata, it is generally not necessary to provide any information other than the location of the data. Following is an example use of the operator in Java.
Using the ReadAvro operator in Java
ReadAvro reader = graph.add(new ReadAvro("data/ratings.avro"));
The following example demonstrates using the Avro reader within RushScript.
var data = dr.readAvro({source:'data/ratings.avro'});
Properties
The
ReadAvro operator supports the following properties:
Ports
The
ReadAvro operator provides a single output port.
ReadORC Operator
The
ReadORC operator reads a data file previously written using the
Apache Optimized Row Columnar (ORC) File format. The ORC format is supported by Apache Hive.
The ORC format is column-oriented and self-describing. An ORC file is divided into stripes, or sections, and within each stripe the data is stored by column. An internal index tracks the data for each column within a stripe. This organization allows readers to efficiently skip the columns that are not required. Also, each column can apply a different compression method depending on its data type. Metadata about the ORC data, such as the schema and compression format, is serialized into the file and made available to readers.
The operator translates the ORC file schema into appropriate DataFlow types when possible. A few ORC data types are not supported for reading; columns with unsupported data types are omitted. The output record type has fields with the same names and in the same order as the source schema. Each output field is assigned a type based on the schema of the source field with the same name.
In general, DataFlow types are assigned to the ORC schema types as shown in the following table.
Several ORC types are not supported by DataFlow. If these types are found in an ORC file, they are ignored, and the reader logs a message for each column omitted because of an unsupported data type. The following ORC data types are not supported:
• LIST
• MAP
• STRUCT
• UNION
Column Pruning
Because the ORC format is columnar, we recommend limiting the columns read to only those required for downstream processing. Use the selectedFields property to specify the fields to read. For more information, see
Properties. The ORC columns not included in the list are not read. This optimization provides a performance boost, especially for files containing a large number of columns.
Note: Before running the workflow, ensure that the client configuration and the jar files are added to the classpath. For more information, see
Integrating DataFlow with Hadoop.
You must enable the datarush-hadoop-apache3 module to read ORC files from S3A and ABFS locations.
Code Examples
Because ORC files are self-contained with respect to metadata, it is generally not necessary to provide any information other than the location of the data. Because the ORC format is columnar, reducing the columns read can enhance performance; use the selectedFields property to specify the columns to read from a given ORC data set.
The following example demonstrates reading ORC file data using Java.
Using ReadORC in Java
ReadORC reader = graph.add(new ReadORC("data/ratings.orc"));
Using ReadORC in RushScript
var data = dr.readORC({source:'data/ratings.orc'});
Properties
The
ReadORC operator supports the following properties.
Ports
The
ReadORC operator provides a single output port.
ReadMDF Operator
This topic describes the DataFlow MDF read operator. For information about its KNIME node, see
MDF Reader.
The ReadMDF operator reads a data file previously written using the
ASAM MDF format. The MDF format is supported and maintained by ASAM.
MDF, or Measurement Data Format, is a binary file format used to store recorded and calculated data. It is frequently used in post-measurement processing, off-line evaluation, and long-term storage.
It offers efficient and high performance storage of large amounts of measurement data. The file format allows the storage of the raw measurement data along with associated metadata and corresponding conversion formulas so that the raw data can still be interpreted correctly and utilized through post-processing.
The operator will translate the MDF schema into appropriate DataFlow types whenever possible, although because of the frequent usage of unsigned types in MDF data, sometimes the type used by DataFlow must be wider than the original type specified in the metadata to prevent loss of scale or precision.
Since DataFlow operates on the concept of homogeneous data records within a given flow, a ReadMDF operator can extract only one record type from the file at a time, although multiple ReadMDF operators can read multiple record types concurrently from the same file.
The output record type will have fields with the names and types determined by the metadata provided in the file. The ordering of the fields will also correspond to the declaration order within the metadata, with the exception that the master channel will always be the first field even if it is not defined first.
The operator currently supports only primitive types that have an analog within DataFlow; therefore, the extraction of MIME data in various media formats is not supported.
Code Examples
Since MDF files are self-contained with respect to metadata, it is generally not necessary to provide any information other than the location of the data and the data group containing the specified record that should be extracted. Following is an example use of the operator in Java.
Using the ReadMDF operator in Java
ReadMDF reader = graph.add(new ReadMDF("data/output.mf4"));
reader.setDataChannel(1);
reader.setRecordId(1);
The following example demonstrates using the MDF reader within RushScript:
var data = dr.readMDF({source:'data/output.mf4', dataChannel:1, recordId:1});
Properties
The ReadMDF operator supports the following properties:
Additional properties are shared with delimited text.
Ports
The ReadMDF operator provides a single output port.
ReadParquet Operator
The
ReadParquet operator reads data previously written using the
Apache Parquet format. The Parquet format is supported by Apache Hive.
Parquet is a columnar file format used to store tabular data. Parquet supports efficient compression and encoding schemes and allows the compression scheme to be specified at the column level. It supports:
• Open source projects such as Apache Hadoop (MapReduce), Apache Hive, and Impala, as it presents the data in columnar format
• Compression codecs such as SNAPPY, GZIP, and LZO. The design allows integration with future codecs.
The ReadParquet operator uses Hive libraries through the shim layer and requires a Hadoop module configuration to be enabled, even if the workflow does not run on the cluster or access HDFS.
DataFlow automatically determines the equivalent data types from Parquet; the result is the output type of the reader. However, Parquet and DataFlow support different data types, and not all data in Parquet format can be read. If the operator attempts to read data that cannot be represented in DataFlow, an error is returned.
The primitive Parquet types are mapped to DataFlow as shown in the following table.
Note: Before running the workflow, ensure that the client configuration and the jar files are added to the classpath. For more information, see
Integrating DataFlow with Hadoop.
You must enable the datarush-hadoop-apache3 module to read Parquet files from S3A locations.
Code Examples
The following example demonstrates reading Parquet file data using Java.
Using ReadParquet in Java
// Path to read entire Hive table
ReadParquet reader = new ReadParquet("hdfs://10.100.10.41:8020//apps/hive/warehouse/CityParquet");
// Path to read specific partition of Hive table from HDFS
// ReadParquet reader = new ReadParquet("hdfs://10.100.10.41:8020//apps/hive/warehouse/CityParquet/000000_0");
// Path to read parquet file from Local file system
// ReadParquet reader = new ReadParquet("C:/Parquet/Cities.parquet");
graph.add(reader);
Using ReadParquet in RushScript
var data = dr.readParquet({source:"hdfs://10.100.10.41:8020//apps/hive/warehouse/CityParquet/000000_0"});
Properties
The
ReadParquet operator supports the following properties.
Ports
The
ReadParquet operator provides a single output port.
ReadFromJDBC Operator
The
ReadFromJDBC operator accesses relational database systems using a supplied JDBC driver. The JDBC driver must be in the class path of the DataFlow application. Each database provides a JDBC driver implementation that DataFlow can use to access data in the database. Reference the specific database to be accessed for driver-specific information.
The ReadFromJDBC operator can be used to read all of the columns from a specific table or to execute a provided query. The query provided can be a complex, multitable query. Follow the syntax guidelines of the database being queried.
The results of the query will be made available to the output port of the operator. The operator transforms the database column types to supported DataFlow scalar types. Some database-specific data types may not map well and will either be ignored or mapped to Java Object types.
The results of a database query executed through JDBC are returned through a ResultSet object. The ResultSet is used to iterate through the resultant rows and to access column data. The JDBC ResultSet class does not support multithreaded access. Given that, the default behavior of the ReadFromJDBC operator is to execute in nonparallel mode when provided a nonparameterized query.
To execute queries in parallel (and distributed), the ReadFromJDBC operator supports the use of parameterized queries. JDBC supports adding parameters to a query using the "?" character. Following is an example of a parameterized query. Note the use of the "?" character in the "where" clause.
Example of query with parameters
select * from lineitem where l_shipmode = ?
When used as the data query for the ReadFromJDBC operator, a parameterized query can be executed in parallel. A set of parameters must be supplied to the parallel workers executing the parameterized queries. The parameters can be supplied in one of the following ways:
• Through the optional input port of the ReadFromJDBC operator.
• Obtained by a parameter query supplied as a property to the operator ("parameterQuery"). The query is executed and the results are used as parameters to the parameterized query.
• An array of values is passed as a property to the ReadFromJDBC operator ("parameters").
Here is an example of a parameter query:
Query to gather parameters
select distinct l_shipmode from lineitem
Note that the parameter query is selecting a distinct set of values from the lineitem table. The values will be substituted for the "?" in the parameterized query.
The parameters are handled the same whether they are provided directly as objects, read from the input port, or queried through the parameter query. For each set (row) of parameters, the following occurs:
• The parameters are substituted within the parameterized data query. From our example, one of the parameter values is "RAIL". When substituted within the example data query, the resultant query is equivalent to select * from lineitem where l_shipmode = 'RAIL'.
• The query with the substituted parameters is executed against the database.
• The results of the query are streamed to the output of the operator.
When used with a parameterized query and provided query parameters, the ReadFromJDBC operator operates in parallel by creating multiple workers. The query parameters are distributed to the workers in round-robin fashion. The workers execute the data query after applying parameter substitution as described above.
The order of the parameter value is important. The order must match the order of the parameters (the "?") in the data query. This is true for parameter values from the optional input port, provided as objects or from the parameter query. The ReadFromJDBC operator does not have the context to determine which parameter values match which parameters. The ordering of the parameter values is left to the user.
When using a parameterized query, the number of parameter values provided must match the number of parameters in the query. If there is a mismatch in sizes, an exception will be raised and the operator will fail.
To obtain the best performance, the number of sets of query parameters should be greater than the configured parallelism. In our example parameter query, only 7 values are returned. In this case, having parallelism set to anything greater than 7 will be wasteful. Additional streams of execution will have no data to process.
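The round-robin distribution of parameter rows to workers described above can be sketched in plain Java (this is an illustrative sketch, not the DataFlow implementation; the seven ship-mode values shown are assumed example data):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of round-robin distribution of query parameter values to
// parallel workers, as described for the ReadFromJDBC operator.
public class RoundRobinSketch {
    // Assigns each parameter value to a worker by its index modulo
    // the number of workers.
    public static List<List<String>> distribute(List<String> params, int workers) {
        List<List<String>> assigned = new ArrayList<>();
        for (int i = 0; i < workers; i++) assigned.add(new ArrayList<>());
        for (int i = 0; i < params.size(); i++) {
            assigned.get(i % workers).add(params.get(i));
        }
        return assigned;
    }

    public static void main(String[] args) {
        // Seven example values, as in the parameter query above
        List<String> modes = Arrays.asList("RAIL", "AIR", "REG AIR", "TRUCK", "MAIL", "SHIP", "FOB");
        // With 3 workers, the 7 values are spread 3/2/2; with more than
        // 7 workers, some workers would receive no values at all.
        System.out.println(distribute(modes, 3));
    }
}
```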
Writing to database tables can be accomplished with the
WriteToJDBC Operator.
Code Examples
The following example demonstrates using the
ReadFromJDBC operator to access data using the provided SQL statement. Setting the fetch size and the SQL warning limit properties is optional. Default settings will be used if they are not set explicitly.
Either a table name or a SQL statement to execute can be specified. Using a table name is equivalent to using the statement select * from tableName. In this example, a table name is specified.
Using the ReadFromJDBC Operator in Java
ReadFromJDBC reader = graph.add(new ReadFromJDBC());
reader.setDriverName("com.mysql.jdbc.Driver");
reader.setUrl("jdbc:mysql://dbserver:3306/test");
reader.setUser("test");
reader.setPassword("test");
reader.setSqlWarningLimit(20);
reader.setTableName("tpchorders");
Using the ReadFromJDBC Operator in RushScript
var data = dr.readFromJDBC({driverName:'com.mysql.jdbc.Driver', url:'jdbc:mysql://dbserver:3306/test', user:'test', password:'test', sqlWarningLimit:20, tableName:'tpchorders'});
The following example uses a SQL statement directly. Using the SQL statement allows selection of only the desired fields. A complex statement can also be used to join tables together and have the results presented as a single output data set.
Specifying a SQL statement
ReadFromJDBC reader = graph.add(new ReadFromJDBC());
reader.setDriverName("com.mysql.jdbc.Driver");
reader.setUrl("jdbc:mysql://dbserver:3306/test");
reader.setUser("test");
reader.setPassword("test");
reader.setDataQuery("select o_orderkey, o_orderdate, o_totalprice from totalorders");
This example demonstrates using a parameterized query in Java:
Using a parameterized query in Java
ReadFromJDBC reader = graph.add(new ReadFromJDBC());
reader.setDriverName("com.mysql.jdbc.Driver");
reader.setUrl("jdbc:mysql://dbserver:3306/test");
reader.setUser("test");
reader.setPassword("test");
reader.setSqlWarningLimit(20);
reader.setDataQuery("select * from lineitem where l_shipmode = ?");
reader.setParameterQuery("select distinct l_shipmode from lineitem");
This example demonstrates using a parameterized query in RushScript:
Using a parameterized query in RushScript
var data = dr.readFromJDBC({
driverName:'com.mysql.jdbc.Driver',
url:'jdbc:mysql://dbserver:3306/test',
user:'test',
password:'test',
sqlWarningLimit:20,
dataQuery:'select * from lineitem where l_shipmode = ?',
parameterQuery:'select distinct l_shipmode from lineitem'});
The driver name and URL format are specific to each JDBC driver. See the documentation of the specific database being used for more information on these values.
Properties
The
ReadFromJDBC operator supports the following properties:
Ports
The
ReadFromJDBC operator supports one optional input port. This port is used to provide parameters to a parameterized data query.
The
ReadFromJDBC operator provides one output port:
ReadDelimitedText Operator
The
ReadDelimitedText operator reads a text file of delimited records as record tokens. Records are identified by the presence of a non-empty, user-defined record separator sequence between each individual record. Output records contain the same fields as the input text. The reader can also filter and/or reorder the fields of the output as necessary.
Delimited text supports up to three distinct user-defined sequences within a record, used to identify field boundaries:
• a field separator: found between individual fields. By default, this is the comma character (,).
• a field start delimiter: marking the beginning of a field value. By default, this is the double quote character (").
• a field end delimiter: marking the end of a field value. By default, this is the double quote character (").
The field separator cannot be empty. The start and end delimiters can be the same value. They can also both (but not individually) be empty, signifying the absence of field delimiters. It is not expected that all fields start and end with a delimiter, though if one starts with a delimiter it must end with one. Fields containing significant characters, such as whitespace and the record and field separators, must be delimited to avoid parsing errors. Should a delimited field need to contain the end delimiter, it is escaped from its normal interpretation by duplicating it. For instance, the value "ab""c" represents a delimited field value of ab"c.
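The doubled-delimiter escape can be sketched as follows (an illustrative sketch, not the DataFlow parser):

```java
// Sketch of undoing the doubled end-delimiter escape in a delimited
// field value, e.g. "ab""c" -> ab"c with the double quote delimiter.
public class DelimiterEscapeSketch {
    // Strips the surrounding delimiters and collapses each doubled
    // end delimiter back to a single occurrence.
    public static String unescape(String field, char delim) {
        if (field.length() >= 2 && field.charAt(0) == delim
                && field.charAt(field.length() - 1) == delim) {
            String inner = field.substring(1, field.length() - 1);
            String d = String.valueOf(delim);
            return inner.replace(d + d, d);
        }
        return field; // undelimited fields are taken as-is
    }

    public static void main(String[] args) {
        System.out.println(unescape("\"ab\"\"c\"", '"')); // prints ab"c
    }
}
```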
The reader supports incomplete specification of the separators and delimiters. By default, it will attempt to automatically discover these values based on analysis of a sample of the file. We strongly suggest that this discovery ability not be relied upon if these values are already known, as it cannot be guaranteed to produce desirable results in all cases.
The reader requires a schema to provide parsing and type information for the fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed through the API provided, although this metadata is often persisted externally. The
StructuredSchemaReader class provides support for reading in Pervasive Data Integrator structured schema descriptors (.schema files) for use with readers. Schemas can also be generated from
Record Token Types by using the
TextRecord.convert methods.
Because delimited text has explicit field markers, it is also possible to perform automated discovery of the schema based on the contents of the file; the reader provides a pluggable discovery mechanism to support this functionality. Custom mechanisms must implement the
TextRecordDiscoverer interface. Two implementations are provided in the operator library:
• A mechanism using pattern matching for determining field type. Values for a field are compared to the patterns; any patterns which do not match a field are discarded as possibilities. If multiple possibilities exist and the conflict is between numeric types (for example, integers and doubles), the wider of the two is chosen. Otherwise, conflicts are resolved by treating the field as a string. This is the default mechanism used by the operator.
The set of patterns used can be extended by providing additional patterns when setting the schemaDiscovery property. Alternatively, the
TextRecord.extendDefault method can be used to create a new discoverer using the supplied patterns in addition to the defaults. If the default patterns should not be included, create a
PatternBasedDiscovery object directly, specifying only the desired patterns.
• A mechanism that treats all fields as “raw” strings—that is, without white space trimming and not treating the empty string as NULL. Use the
TextRecord.TEXT_FIELD_DISCOVER constant to reference this mechanism.
Both built-in schema discoverers will generate a schema having as many fields as the longest analyzed row. Both use the header row, if present, to name the schema’s fields. Repetitions of the same name will be resolved by adding a suffix to avoid collision; any missing names will be generated as field<n>, where <n> is the field’s index in the schema.
Typically, the output of the reader includes all records in the file, both those with and without parsing errors. Fields that cannot be parsed are null-valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
Delimited text data may or may not have a header row. The header row is delimited as usual but contains the names of the fields in the data portion of the record. The reader must be told whether a header row exists. If it does, the parser will skip the header row; otherwise the first row is treated as a record and will appear in the output. If a header row does exist and any of the field names are blank, a field name will be generated. Generated field names take the form “fieldN” where N is the zero-based position of the field.
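The header-name handling described above can be sketched as follows. This is an illustrative sketch, not the DataFlow implementation; in particular, the exact suffix scheme used to resolve name collisions is an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of resolving field names from a header row: blank names are
// generated as "fieldN" (N is the zero-based position), and repeated
// names receive a numeric suffix to avoid collision (assumed scheme).
public class HeaderNamesSketch {
    public static List<String> resolve(List<String> header) {
        List<String> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            if (name == null || name.isEmpty()) name = "field" + i;
            String candidate = name;
            int n = 2;
            while (!seen.add(candidate)) candidate = name + n++;
            out.add(candidate);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [id, name, field2, name2]
        System.out.println(resolve(Arrays.asList("id", "name", "", "name")));
    }
}
```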
Delimited text files can be parsed in parallel under “optimistic” assumptions: namely, that parse splits do not occur in the middle of a delimited field value or immediately before an escaped record separator. This is assumed by default but can be disabled, with an accompanying reduction in scalability and performance.
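The idea behind optimistic splitting can be sketched as follows (an illustrative sketch, not the DataFlow implementation): a worker assigned an arbitrary split offset skips forward to just past the next record separator, so parsing always begins at a record boundary, while the previous worker reads through the record that straddles the boundary. This scan is only safe under the optimistic assumption that the offset does not fall inside a delimited field value containing the separator.

```java
// Sketch of locating the first record boundary at or after a split offset.
public class SplitBoundarySketch {
    public static int firstRecordStart(String data, int splitOffset, String recordSep) {
        if (splitOffset == 0) return 0; // the first split starts at a record boundary
        int sep = data.indexOf(recordSep, splitOffset);
        // If no separator remains, the whole tail belongs to the previous worker.
        return sep < 0 ? data.length() : sep + recordSep.length();
    }

    public static void main(String[] args) {
        String data = "a,1\nb,2\nc,3\n";
        // A split offset of 5 falls inside the record "b,2", so this
        // worker starts parsing at index 8, the start of "c,3".
        System.out.println(firstRecordStart(data, 5, "\n"));
    }
}
```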
When reading delimited text files there may be metadata information about the data embedded within the files. If the reader has been configured to use this metadata, it can obtain information about the ordering and partitioning of the data when it was written, which can eliminate the need to re-sort or partition the data.
Delimited text can be written using the
WriteDelimitedText Operator.
Code Examples
The first code example shows a simple usage of the reader. The path to the local file name is given as a parameter to the constructor. This could have also been set using the setSource() method. The field separator and header properties are set. Then a record type is built and used as the input schema. Note that the record type must be converted to an acceptable schema before being used by the reader. Also note that the record separator is not specified. It will be determined by the auto discovery mechanism of the reader.
Using the ReadDelimitedText operator example in Java
ReadDelimitedText reader = graph.add(new ReadDelimitedText("data/ratings.txt"));
reader.setFieldSeparator("::");
reader.setHeader(true);
RecordTokenType ratingsType = record(INT("userID"), INT("movieID"), INT("rating"), STRING("timestamp"));
reader.setSchema(TextRecord.convert(ratingsType));
Using the ReadDelimitedText operator in RushScript
var ratingsSchema = dr.schema().INT('userID').INT('movieID').INT('rating').STRING('timestamp');
var data = dr.readDelimitedText({source:'data/ratings.txt', fieldSeparator:'::', header:true, schema:ratingsSchema});
The snippet of data below is from the ratings.txt file and can be read using the code example above.
userID::movieID::rating::timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
This next example reads from a file in a Hadoop Distributed File System (HDFS). The hdfs URL scheme identifies the file as being contained within an HDFS file system. The authority section of the URL specifies the specific HDFS file system. The rest of the path indicates the file path within the HDFS instance. A schema is built for this data since it contains a date field. The format or pattern of the date field must be specified since it is non-standard.
Reading from HDFS with a date type
TextRecord schema =
SchemaBuilder.define(
SchemaBuilder.STRING("accountNumber"),
SchemaBuilder.STRING("clientName"),
SchemaBuilder.STRING("companyName"),
SchemaBuilder.STRING("streetAddress"),
SchemaBuilder.STRING("city"),
SchemaBuilder.STRING("state"),
SchemaBuilder.STRING("zip"),
SchemaBuilder.STRING("emailAddress"),
SchemaBuilder.DATE("birthDate", "MM/dd/yyyy"), // specify pattern for parsing the date
SchemaBuilder.STRING("accountCodes"),
SchemaBuilder.DOUBLE("standardPayment"),
SchemaBuilder.DOUBLE("payment"),
SchemaBuilder.DOUBLE("balance")
);
// Create a delimited text reader for the accounts data
ReadDelimitedText reader = graph.add(new ReadDelimitedText("hdfs://saturn.englab.local:9000/user/jfalgout/data/Accounts.txt"));
reader.setFieldSeparator(",");
reader.setHeader(true);
reader.setSchema(schema);
Following is a snippet of the data that can be read and parsed with the previous code example. Note that each field is surrounded with a double quote as the field delimiter. Also note the format of the "birthDate" field. It is a non-standard (not ISO) format. The schema used to parse the data specifies the pattern used to parse the date field.
"accountNumber","clientName","companyName","streetAddress","city","state","zip","emailAddress","birthDate","accountCodes","standardPayment","payment","balance"
"01-000667","George P Schell","Market Place Products","334 Hilltop Dr","Mentor","OH","44060-1930","warmst864@aol.com","02/28/1971","XA","101.00","100.00","15.89"
"01-002423","Marc S Brittan","Madson & Huth Communication Co","5653 S Blackstone Avenue, #3E","Chicago","IL","60637-4596","mapper@tcent.net","06/30/1975","BA","144.00","144.00","449.92"
"01-006063","Stephanie A Jernigan","La Salle Clinic","77565 Lorain","Akron","OH","44325-4002","dram@akron.net","11/02/1941","EB|CB","126.00","126.00","262.98"
"01-010474","Ernie Esser","Town & Country Electric Inc.","56 Pricewater","Waltham","MA","2453","hazel@bentley.net","12/15/1962","JA|RB","127.00","127.00","271.75"
"01-010852","Robert A Jacoby","Saturn of Baton Rouge","4001 Lafayette","Baton Rouge","LA","70803-4918","din33@norl.com","12/22/1985","ED|EA|RB|KA","142.00","150.00","423.01"
"01-011625","James C Felli","Bemiss Corp.","23A Carolina Park Circle","Spartanburg","SC","29303-9398","cadair@gw.com","02/21/1940","SB","151.00","155.00","515.41"
In the previous example, the schema could also be discovered, extending the default type patterns to recognize the date formats. This can be done in a fairly straightforward fashion:
Custom schema discovery
// Instead of constructing schema and calling reader.setSchema(schema)
TextDataType usDate = TextTypes.FORMATTED_DATE(new SimpleDateFormat("MM/dd/yyyy"));
List<TypePattern> patterns = Arrays.asList(new TypePattern("\\d{1,2}/\\d{1,2}/\\d+", usDate));
// Simple extension of default pattern-based discovery
reader.setSchemaDiscovery(patterns);
// Complete replacement of schema discoverer
// More interesting when using custom discovery implementation
TextRecordDiscoverer discoverer = TextRecord.extendDefault(patterns);
reader.setSchemaDiscovery(discoverer);
Properties
The
ReadDelimitedText operator supports the following properties:
Name | Type | Description |
---|---|---|
analysisDepth | int | The number of characters to read for performing schema discovery and structural analysis. |
autoDiscoverNewline | String | Determines if the record separator should be auto-discovered. Default: enabled. |
charset | Charset | The character set used by the data source. Default: ISO-8859-1. |
charsetName | String | The character set used by the data source by name. |
decodeBuffer | int | The size of the buffer, in bytes, used to decode character data. By default, this will be automatically derived using the character set and read buffer size. |
discoveryNullIndicator | String | The text value used to represent null values by default in discovered schemas. By default, this is the empty string. |
discoveryStringHandling | StringConversion | The default behavior for processing string-valued types in discovered schemas. |
encoding | | Properties that control character set encoding. |
errorAction | CodingErrorAction | The error action determines how to handle errors encoding the input data into the configured character set. The default action is to replace the faulty data with a replacement character. |
extraFieldAction | | How to handle fields found when parsing the record, but not declared in the schema. |
fieldDelimiter | String | Delimiter used to denote the boundaries of a data field. |
fieldEndDelimiter | String | Ending delimiter used to denote the boundaries of a data field. |
fieldErrorAction | | How to handle fields that cannot be parsed. |
fieldLengthThreshold | int | The maximum length allowed for a field value before it is considered an error. |
fieldSeparator | String | Delimiter used to define the boundary between data fields. |
fieldStartDelimiter | String | Starting delimiter used to denote the boundaries of a data field. |
header | boolean | Whether to expect a header row in the source. The header row contains field names. |
includeSourceInfo | boolean | Determines whether output records will include additional fields detailing origin information for the record. If true, records will have three additional fields: • sourcePath – the path of the file from which the record originates. If this is not known, it will be NULL. • splitOffset – the offset of the starting byte of the containing split in the source data. • recordOffset – the offset of the first character of the record text from the start of the containing split. If these names would collide with those defined in the source schema, they will be renamed to avoid collision. These fields are added as the first three of the output and are not affected by the selectedFields property. |
lineComment | String | The character sequence indicating a line comment. Lines beginning with this sequence are ignored. |
maxRowLength | int | The limit, in characters, for the first row. Zero indicates no maximum. |
missingFieldAction | | How to handle fields declared in the schema, but not found when parsing the record. If the configured action does not discard the record, the missing fields will be null-valued in the output. |
parseErrorAction | | How to handle all parsing errors. |
parseOptions | | The parsing options used by the reader. |
pessimisticSplitting | boolean | Configures whether pessimistic file splitting must be used. By default, this is disabled. Pessimistic splitting defines one file split per file (assumes the input files are not splittable). |
readBuffer | int | The size of the I/O buffer, in bytes, to use for reads. Default: 64K. |
readOnClient | boolean | Determines whether reads are performed by the client or in the cluster. By default, reads are performed in the cluster if executed in a distributed context. |
recordSeparator | String | Value to use as a record separator. |
recordWarningThreshold | int | The maximum number of records which can have parse warnings before failing. |
replacement | String | Replacement string to use when encoding error policy is replacement. Default: '?' |
selectedFields | List<String> | The list of input fields to include in the output. Use this to limit the fields written to the output. |
schema | | The record schema expected in the delimited text source. This property is mutually exclusive with schemaDiscovery; setting one causes the other to be ignored. By default, this property is unset. |
schemaDiscovery | | The schema discovery mechanism to use. This property is mutually exclusive with schema; setting one causes the other to be ignored. By default, a pattern-based mechanism is used. Supplying a list of pattern/type pairs uses the default discoverer extended with the supplied patterns. |
source | ByteSource, Path, or String | Source of the input data to parse as delimited text. |
splitOptions | | The configuration used in determining how to break the source into splits. |
useMetadata | boolean | Whether the reader should use any discovered metadata about the ordering and distribution. Default: false. |
Ports
The
ReadDelimitedText operator provides a single output port.
ReadFixedText Operator
Fixed text data contains fields that are not delimited as CSV files are. A schema defines each field and its type, offset, and length within a row of data. Data is parsed from each input row according to the defined position of each field. Field types can be specified along with patterns for parsing the data. Patterns are especially useful for date and timestamp field types.
The
ReadFixedText operator reads a text file of fixed-width records as record tokens. Records are identified by the presence of a non-empty, user-defined record separator sequence between each individual record or by the total length of the record if an empty or zero-length record separator is provided. Output records contain the same fields as the input file. The parser can also filter or reorder the fields of the output, as requested.
The reader requires a
FixedWidthTextRecord object to provide field position as well as parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the parser. These can be manually constructed through the API provided, although this metadata is often persisted externally.
StructuredSchemaReader provides support for reading in Pervasive Data Integrator structured schema descriptors (.schema files) for use with readers.
Typically, the output of the parsing includes all records in the file, both those with and without parsing errors. Fields that cannot be parsed are null-valued in the resulting record. If desired, the reader can be configured to filter failed records from the output.
Since record boundaries occur at known positions, fixed text files can be parsed in parallel.
Fixed-width text data can be written using the
WriteFixedText Operator.
Code Examples
The following example builds a schema that is used by the
ReadFixedText operator to read a fixed text format file.
Using the ReadFixedText operator in Java
// Create fixed text reader
ReadFixedText reader = graph.add(new ReadFixedText("data/AccountsFixed.txt"));
// Build the schema. Fields must be added in order of appearance in records.
// The field size must be exact as it determines the position of the field for parsing.
FixedWidthTextRecord schema = new FixedWidthTextRecord(new TextConversionDefaults(StringConversion.NULLABLE_TRIMMED));
schema.defineField("accountNumber", new PaddedTextType(TextTypes.STRING, 9, ' ', Alignment.LEFT));
schema.defineField("name", new PaddedTextType(TextTypes.STRING, 21, ' ', Alignment.LEFT));
schema.defineField("companyName", new PaddedTextType(TextTypes.STRING, 31, ' ', Alignment.LEFT));
schema.defineField("address", new PaddedTextType(TextTypes.STRING, 35, ' ', Alignment.LEFT));
schema.defineField("city", new PaddedTextType(TextTypes.STRING, 16, ' ', Alignment.LEFT));
schema.defineField("state", new PaddedTextType(TextTypes.STRING, 2, ' ', Alignment.LEFT));
schema.defineField("zip", new PaddedTextType(TextTypes.STRING, 10, ' ', Alignment.LEFT));
schema.defineField("emailAddress", new PaddedTextType(TextTypes.STRING, 25, ' ', Alignment.LEFT));
schema.defineField("birthDate", new PaddedTextType(TextTypes.FORMATTED_DATE(new SimpleDateFormat("MM/dd/yyyy")), 10, ' ', Alignment.LEFT));
schema.defineField("accountCodes", new PaddedTextType(TextTypes.STRING, 11, ' ', Alignment.LEFT));
schema.defineField("standardPayment", new PaddedTextType(TextTypes.JAVA_DOUBLE, 6, ' ', Alignment.LEFT));
schema.defineField("payment", new PaddedTextType(TextTypes.JAVA_DOUBLE, 7, ' ', Alignment.LEFT));
schema.defineField("balance", new PaddedTextType(TextTypes.JAVA_DOUBLE, 6, ' ', Alignment.LEFT));
// Set the schema of the reader.
reader.setSchema(schema);
An example of data that can be read with the above code fragment follows. Because of the wide nature of the data, the records will most likely appear across multiple lines of the display.
01-000667George P Schell Market Place Products 334 Hilltop Dr Mentor OH44060-1930warmst864@aol.com 02/28/1971XA 101.00100.00 15.89
01-002423Marc S Brittan Madson & Huth Communication Co 5653 S Blackstone Avenue, #3E Chicago IL60637-4596mapper@tcent.net 06/30/1975BA 144.00144.00 449.92
01-006063Stephanie A Jernigan La Salle Clinic 77565 Lorain Akron OH44325-4002dram@akron.net 11/02/1941EB|CB 126.00126.00 262.98
01-010474Ernie Esser Town & Country Electric Inc. 56 Pricewater Waltham MA2453 hazel@bentley.net 12/15/1962JA|RB 127.00127.00 271.75
01-010852Robert A Jacoby Saturn of Baton Rouge 4001 Lafayette Baton Rouge LA70803-4918din33@norl.com 12/22/1985ED|EA|RB|KA142.00150.00 423.01
01-011625James C Felli Bemiss Corp. 23A Carolina Park Circle Spartanburg SC29303-9398cadair@gw.com 02/21/1940SB 151.00155.00 515.41
01-018448Alan W Neebe Georgia State Credit Union PO Box 159 Demorest GA30535-1177delores@truett.com 01/31/1960MA|ED|SB 113.00120.00 131.89
01-018595Alexander Gose Office Support Services 436 Green Mountain Circle New Paltz NY12561-0023dams@matrix.net 06/19/1940EC 147.00147.00 477.09
The following example demonstrates using the ReadFixedText operator in RushScript. The schema is created in RushScript and passed to the operator.
Using the ReadFixedText operator in RushScript
// Build the schema. Fields must be added in order of appearance in records.
// The field size must be exact as it determines the position of the field for parsing.
var accountsFixedSchema = dr.schema({type:'FIXED'})
.nullable(true)
.trimmed(true)
.padChar(' ')
.alignment('LEFT')
.STRING("accountNumber", {size:9})
.STRING("clientName", {size:21})
.STRING("companyName", {size:31})
.STRING("streetAddress", {size:35})
.STRING("city", {size:16})
.STRING("state", {size:2})
.STRING("zip", {size:10})
.STRING("emailAddress", {size:25})
.DATE("birthDate", {pattern:'MM/dd/yyyy', size:10})
.STRING("accountCodes", {size:11})
.DOUBLE("standardPayment", {pattern:'0.00', size:6})
.DOUBLE("payment", {pattern:'0.00', size:7})
.DOUBLE("balance", {pattern:'0.00', size:6});
// Read the data
var data = dr.readFixedText({source:'/path/to/file.txt', schema:accountsFixedSchema});
Properties
The
ReadFixedText operator supports the following properties:
Ports
The
ReadFixedText operator provides a single output port.
ReadSource Operator
The
ReadSource operator reads a defined data source as a stream of records. The data source provides a sequence of bytes in some format that can be parsed into records that are assumed to be identical in logical structure. The mapping between physical and logical structure is encapsulated in a format descriptor, which must be provided.
This operator is low level, providing a generalized model for reading files in a distributed fashion. Typically, the
ReadSource operator is not directly used in a graph, instead being indirectly used through a composite operator such as one derived from
AbstractReader, providing a more appropriate interface to the end user.
Parallelized reads are implemented by breaking input files into independently parsed pieces, a process called splitting. Splits are then distributed to available partitions and parsed. When run on a distributed cluster, the reader makes an attempt to assign splits to machines where the I/O will be local, but non-local assignment may occur in order to provide work for all partitions. Distributed execution also makes an assumption that the specified data source is accessible from any machine. If this is not the case, the read operator must be made non-parallel by invoking the disableParallelism() method on the operator instance.
Not all formats support splitting; splitting generally requires a way of unambiguously identifying record boundaries. Formats indicate whether they support splitting; if not, each input file is treated as a single split. Even with a non-splittable format, reading multiple files can therefore still be parallelized. Some formats support splitting only in an “optimistic” fashion: under most circumstances splits can be handled, but in some edge cases splitting leads to parse failures. For these cases, the reader supports a “pessimistic” mode that assumes a format is non-splittable, regardless of what the format reports.
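To illustrate the splitting model, the following sketch (plain Java, not the DataFlow API; the class and method names are hypothetical) shows the usual rule that makes parallel parsing safe: a split parses only the records that start inside it, skipping any partial record at its start, and reading past its end to finish its final record. Every record then lands in exactly one split, so the splits can be parsed independently and in parallel.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Parse only records that *start* inside [splitStart, splitEnd).
    // A split that begins mid-record skips forward past the next separator;
    // a record straddling splitEnd is finished by the split that started it.
    static List<String> parseSplit(String data, int splitStart, int splitEnd, char sep) {
        int pos = splitStart;
        if (pos > 0) {
            // Skip the partial record at the start of the split (unless the
            // previous character is a separator, i.e. a record starts here).
            while (pos < data.length() && data.charAt(pos - 1) != sep) pos++;
        }
        List<String> records = new ArrayList<>();
        while (pos < splitEnd && pos < data.length()) {
            int next = data.indexOf(sep, pos);
            if (next < 0) next = data.length();
            records.add(data.substring(pos, next));
            pos = next + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\ndelta\n";
        List<String> all = new ArrayList<>();
        // Two fixed-size splits; "charlie" straddles the boundary at offset 13
        // but is parsed exactly once, by the first split.
        all.addAll(parseSplit(data, 0, 13, '\n'));
        all.addAll(parseSplit(data, 13, data.length(), '\n'));
        System.out.println(all);   // prints [alpha, bravo, charlie, delta]
    }
}
```

This also shows why unambiguous record boundaries are required: without them, a split cannot tell where the first complete record begins.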
The reader makes a best-effort attempt to validate the data source before execution, although it cannot always guarantee correctness, depending on the nature of the data source. This validation tries to prevent misconfigured graphs from executing; the reader may not execute until a late phase of the graph, where a failure could result in a significant amount of completed work being lost.
Tip... This is a low-level operator that typically is not directly used. It can be used with a custom data format. A custom data format may be needed to support a format not provided by the DataFlow library.
Code Example
This example code fragment demonstrates how to set up a reader for a generic file type.
Using the ReadSource operator
ReadSource reader = new ReadSource();
reader.setSource(new BasicByteSource("filesource"));
reader.setFormat(new DelimitedTextFormat(TextRecord.convert(record(INT("intfield"), STRING("stringfield"))),
new FieldDelimiterSettings(),
new CharsetEncoding()));
ParsingOptions options = new ParsingOptions();
options.setSelectedFields("stringfield");
reader.setParseOptions(options);
Properties
The
ReadSource operator supports the following properties:
Ports
The
ReadSource operator provides a single output port:
ReadLog Operator
Many applications and systems produce log data that is loosely structured. Generally, there is a specific format used to write the log data; however, this format is not always unambiguously reversible by a typical parser. Also, different fields might use different field separators and delimiters.
About the only generalities that can be made about log formats are that the records always contain an identifying field, usually a timestamp, and a message field consisting of the information produced by the log event. In such cases the log data may not be readable by a regular delimited or fixed text reader.
The
ReadLog operator reads a text file or alternative source consisting of log events from a particular application or system. The type of application or system producing the log records must be specified in advance through a property setting. The currently supported log types are enumerated by
SupportedLogType. Configuring the operator requires the user to either provide one of these enumerations or their own implementation of a particular
LogFormat. It should be noted that these settings are mutually exclusive.
In addition to specifying the log type, the format pattern may be set. This is a String that provides information about a log format when customization of the format is allowed. It is specific to the type of log being read, so the customization available depends on the log type. Additionally, the newline character used by the log files may be specified if a nondefault newline character is being used. By default, the newline character is determined automatically by examining the first few lines in the source.
The record flow generated by this operator is determined by the log type being read and the log format pattern provided during composition of the operator unless otherwise noted.
Supported Log Types
The
ReadLog operator supports a selection of common log formats. These are enumerated by
SupportedLogType. Custom log formats can be added by implementing the
LogFormat interface. A custom format would be instantiated and provided to the
ReadLog operator through the logFormat property. Certain log types can also be manually instantiated and provided to the
ReadLog operator when log-specific settings need to be changed, such as log4j’s logging levels.
The various supported log types are listed below.
Generic Log Data
The generic type can be used when the log data can be parsed using a regular expression but there is no dedicated format for the log. The schema is automatically generated by counting the number of groupings in the regular expression provided. The schema can also be set manually by creating a custom instance of the log format.
The generic format takes a valid Java regular expression string. The grouping of the regular expression defines the fields the individual records will be split into.
Default : "(.*)"
Example : "(\\d\\d.\\d+) (\\w+) (\\w+)"
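As an illustration of how the groupings define the output fields, this hedged sketch (plain Java regular expressions, not the ReadLog operator itself; the class and method names are illustrative) applies the example pattern above and extracts one field per capturing group:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenericLogSketch {
    // Each capturing group in the expression becomes one output field,
    // mirroring how the generic format derives its schema.
    static String[] parseLine(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) return null;          // unparseable line
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) fields[i] = m.group(i + 1);
        return fields;
    }

    public static void main(String[] args) {
        // Same pattern as the example above: three groups, so three fields
        Pattern p = Pattern.compile("(\\d\\d.\\d+) (\\w+) (\\w+)");
        String[] fields = parseLine(p, "01.3456 INFO startup");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
        // prints 01.3456 | INFO | startup
    }
}
```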
Common Log Format
The CLF type can be used when reading a web server log in common log format. NCSA common log format is specified at
http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format. Since CLF is well defined, it does not allow a format pattern to be specified.
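For reference, CLF records carry seven fields: host, identity, user, timestamp, request, status, and bytes. The following sketch (plain Java, not the ReadLog operator; the regular expression is an illustrative approximation of the format) parses one CLF line into those fields:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClfSketch {
    // Approximate pattern for the seven CLF fields:
    // host, identity, user, timestamp, request, status, bytes
    static final Pattern CLF = Pattern.compile(
        "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    static String[] parseClf(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) return null;
        String[] fields = new String[7];
        for (int i = 0; i < 7; i++) fields[i] = m.group(i + 1);
        return fields;
    }

    public static void main(String[] args) {
        String[] f = parseClf(
            "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache_pb.gif HTTP/1.0\" 200 2326");
        System.out.println(f[0] + " requested \"" + f[4] + "\" -> status " + f[5]);
    }
}
```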
Combined Log Format
The Combined type can be used when reading a web server log in combined log format. NCSA combined log format is specified at
http://publib.boulder.ibm.com/tividd/td/ITWSA/ITWSA_info45/en_US/HTML/guide/c-logs.html#ncsa.
Combined takes a true or false string, which determines if the optional cookie field is included in the log.
Extended Log Format
The ELF type can be used when reading a web server log in extended log format. Extended log format is specified at
http://www.w3.org/TR/WD-logfile.html.
ELF will accept a string in the same form as a Fields directive as specified in the official format. If format discovery is enabled, it will scan the file for any directives and apply them appropriately.
Example : "#Fields: date time cs-method cs-uri"
GlassFish Logs
The GlassFish format supports reading logs produced by GlassFish servers. The GlassFish server log format is specified at
http://docs.oracle.com/cd/E18930_01/html/821-2416/abluk.html.
The format pattern supported by the GlassFish format consists of a string that specifies the date format used in the timestamps of the log. Any string supported by Java’s DateFormat class is acceptable.
Default : "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
Example : "dd-MM-yyyy HH:mm:ss"
Log4j Logs
The log4j format supports reading logs produced by the Apache log4j library for Java. More information about the library can be found at
http://logging.apache.org/log4j/1.2/.
The log4j format will accept a string in the same form as the conversion pattern that specifies the logging. More information about log4j conversion patterns can be found at
http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/EnhancedPatternLayout.html.
Default : "%r [%t] %-5p %c %x - %m%n"
Example : "%d{ISO8601} %p %c: %m%n"
Syslog Logs
The syslog format supports reading logs produced by syslogd and other BSD-compliant syslog producers. The BSD syslog format is specified by
RFC-3164.
The format pattern supported by the syslog format consists of a string that includes the current four-digit year and the signed four-digit offset from UTC separated with a single space.
Default : current year and timezone
Example : "2012 -0600"
Code Examples
This example code fragment demonstrates how to set up a reader for a log4j log file.
Using the ReadLog operator
ReadLog reader = graph.add(new ReadLog("data/log4jdata.log"));
reader.setLogType(SupportedLogType.LOG4J);
reader.setLogPattern("%d{ISO8601} %p %c: %m%n");
reader.setNewline("\n");
Using the ReadLog operator in RushScript
var data = dr.readLog({source:'data/log4jdata.log', logType:'LOG4J', logPattern:'%d{ISO8601} %p %c: %m%n', newLine:'\n'});
Properties
The
ReadLog operator supports the following properties:
Ports
The
ReadLog operator provides a single output port.
ReadARFF Operator
This topic describes the DataFlow ARFF read operator. For information on its KNIME node, see
ARFF Reader.
Sparse data is useful for data sets that contain a large number of fields where most of the fields do not have data values. This is mostly the case with numeric data, but it can also apply to enumerated data types. A common example is a data set containing a row per website user and a field per website page. Each field contains a count of the number of times the user has visited the specific page. Most users will visit only a fraction of the overall pages on the website. Using a sparse data representation allows the data set to be much smaller than a fully populated, dense data set.
DataFlow supports sparse data using the Attribute-Relation File Format (ARFF). The
ReadARFF operator is used to read sparse data stored in ARFF. Files using ARFF can be in either sparse or dense mode. This reader detects the mode and reads the data accordingly. ARFF files contain schema information. The schema is parsed and used to determine how to parse data records.
ARFF can be parsed in parallel under “optimistic” assumptions: namely, that parse splits do not occur in the middle of a delimited field value and somewhere before an escaped record separator. This is assumed by default, but can be disabled with an accompanying reduction of scalability and performance.
DataFlow uses ARFF data to represent sparse data, but ARFF can also be used to store dense data in a CSV style. The ARFF mode determines which format is used: sparse or dense. The reader automatically discovers the mode.
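To make the two modes concrete: in sparse mode, an ARFF data row lists only index/value pairs for the fields that have values, with unlisted fields defaulting to 0. The following sketch (plain Java, not the DataFlow API; the class and method names are hypothetical) expands a sparse row into its dense equivalent:

```java
import java.util.Arrays;

public class ArffSparseSketch {
    // Expand a sparse ARFF data row such as "{1 5, 3 2}" into a dense row.
    // Unlisted attributes default to "0", per the sparse ARFF convention.
    static String[] expandSparseRow(String row, int attributeCount) {
        String[] dense = new String[attributeCount];
        Arrays.fill(dense, "0");
        String body = row.trim();
        body = body.substring(1, body.length() - 1).trim();  // strip the braces
        if (!body.isEmpty()) {
            for (String entry : body.split(",")) {
                // Each entry is "<attribute index> <value>"
                String[] parts = entry.trim().split("\\s+", 2);
                dense[Integer.parseInt(parts[0])] = parts[1];
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(expandSparseRow("{1 5, 3 2}", 5)));
        // prints [0, 5, 0, 2, 0]
    }
}
```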
The ARFF metadata also contains two other data values: the relation name and comments. The relation name is specified as one of the metadata headers. Comments are lines that start with the "%" character. Comments are returned as a list of String values.
Data can be written in ARFF format using the
WriteARFF Operator.
Code Examples
Since ARFF includes metadata that contains field names and types, the schema for ARFF files does not have to be specified. The metadata can be accessed using the discoverMetadata() method on the reader after the data source has been configured. The metadata can be used to access the relation name, comments, ARFF mode, and data schema. The schema contains the field names and types along with patterns for parsing and formatting field values.
Using the ReadARFF operator in Java
// Create ARFF reader
ReadARFF reader = graph.add(new ReadARFF("data/weather.arff"));
// Get metadata for the configured data source
Analysis metadata = reader.discoverMetadata(FileClient.basicClient());
ARFFMode mode = metadata.getMode();
String relationName = metadata.getRelationName();
List<String> comments = metadata.getComments();
TextRecord schema = metadata.getSchema();
// Dump out metadata values
System.out.println("mode = " + mode);
System.out.println("relationName = " + relationName);
System.out.println("comments = " + comments);
System.out.println("schema = " + schema.getFieldNames());
Following is a snippet of output from running an application with the previous code fragment.
mode = DENSE
relationName = weather
comments = []
schema = [outlook, temperature, humidity, windy, play]
Using the ReadARFF operator in RushScript
var data = dr.readARFF({source:'data/weather.arff'});
The weather.arff file’s contents:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Properties
The
ReadARFF operator supports the following properties:
Ports
The
ReadARFF operator provides one output port:
ReadStagingDataset Operator
Staging data sets are used within DataFlow for writing intermediate data into a fast and efficient binary format. They are also very convenient since metadata is stored in a header section of the data set. When reading a data set, the metadata is used to determine the fields and types of data contained within the data set.
Text files can also be used for intermediate data access but incur the overhead of formatting and parsing. Text files can also introduce data errors because numeric data must be converted to and from textual formats. Staging data sets do not require this round trip to and from text and so bypass these issues.
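The text round trip issue is easy to demonstrate with plain Java (this illustrates the general problem, not DataFlow-specific behavior): formatting a double as text and parsing it back need not return the original value.

```java
import java.util.Locale;

public class TextRoundTrip {
    public static void main(String[] args) {
        double value = 0.1 + 0.2;                          // 0.30000000000000004
        // Format the value as a text field with six decimal places
        String text = String.format(Locale.ROOT, "%.6f", value);
        double parsed = Double.parseDouble(text);          // 0.3
        System.out.println(text + " equal? " + (value == parsed));
        // prints 0.300000 equal? false
    }
}
```

A binary staging format stores the original bit pattern, so no precision is lost between writes and reads.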
The DataFlow framework uses staging data sets in the following situations:
• In between application phases where one phase feeds data to the next
• Staging data to disk during merge sort operations
• Staging data to disk due to queue overflow in deadlock situations
There are several use cases for directly using staging data sets. A few examples are:
• A particular task that needs to be accomplished must be broken into several DataFlow applications. Data from each application must be passed to the next. Staging data sets can be used to capture the intermediate data, allowing for quick and easy access.
• Data within a database is being used to build a predictive model. This development requires many iterations of algorithm testing with the data. For this use case, the most efficient route is to first read the data from the source database and write it to staging data sets, then use the staged data as input to the predictive analytic development. The one-time read from the database is overhead; however, given that the source data will be read many times, the savings are substantial.
As with other file-based operators, the
ReadStagingDataset operator accepts several data sources:
• A Path reference
• A String reference containing:
– The path to a single data set
– A path to a directory containing many data sets with the same format and schema
– A path with wildcard characters within the file name that will be resolved to the matching data set files
Staging data sets can be written using the
WriteStagingDataset Operator.
Code Examples
This code example constructs a new
ReadStagingDataset operator by passing in the name of the file to read as a constructor parameter. Properties for setting the I/O buffer size and the selected fields are then set. The selected fields must contain field names that exist in the data set. Use the selectedFields property to constrain the fields read from the data set and provided to the output port of the operator.
Using the ReadStagingDataset operator in Java
ReadStagingDataset reader = graph.add(new ReadStagingDataset("results/ratings-stage"));
reader.setReadBuffer(128 * 1024);
reader.setSelectedFields(Arrays.asList(new String[] {"userID", "rating"}));
Using the ReadStagingDataset operator in RushScript
var data = dr.readStagingDataset({source:'results/ratings-stage', readBuffer:131072});
The operator also supports a static method for reading the metadata of a staging data set. The following code example depicts how to read and use the metadata.
Using staging dataset metadata
ReadStagingDataset reader = graph.add(new ReadStagingDataset("results/ratings-stage"));
reader.setReadBuffer(128 * 1024);
reader.setSelectedFields(Arrays.asList(new String[] {"userID", "rating"}));
DatasetMetadata metadata = reader.discoverMetadata();
long rowCount = metadata.getRowCount();
int blockSize = metadata.getBlockSize();
TokenType datasetType = metadata.getSchema();
DatasetStorageFormat format = metadata.getStorageFormat();
System.out.println("rowCount = " + rowCount);
System.out.println("blockSize = " + blockSize);
System.out.println("datasetType = " + datasetType);
System.out.println("format = " + format);
The output of the metadata information from the previous application is shown following. This metadata can be used within an application as needed.
rowCount = 1000
blockSize = 64
datasetType = {"type":"record","representation":"DENSE_BASE_NULL","fields":[{"userID":{"type":"int"}},{"movieID":{"type":"int"}},{"rating":{"type":"int"}},{"timestamp":{"type":"string"}}]}
format = COMPACT_ROW
Properties
The
ReadStagingDataset operator supports the following properties:
Ports
The
ReadStagingDataset operator provides a single output port:
ParseTextFields Operator
The
ParseTextFields operator is similar to the
ReadDelimitedText and other text-based read operators. However, instead of reading from a source of text, it takes a flow of records consisting of text fields as input. These input strings are then parsed into other value types as specified by the schema provided to the operator. Additionally, it emits a flow of records which failed parsing, allowing remediation to be performed on invalid records.
The parsed output will have the type specified by the schema. Output fields will contain the result of processing the input field of the same name according to the type information in the provided schema. Referenced input fields must either be string valued, in which case they are parsed according to the schema, or of a type which is assignable to the output field's type, in which case they are copied as-is.
If a field is present in the schema, but not in the input, the output field is NULL. If an input value is NULL, the resulting output field is NULL. Input fields without a matching field in the schema are dropped.
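The matching rules above can be sketched in a few lines of plain Java (this is not the operator's implementation; the class and method names are illustrative): the output takes its shape from the schema, schema fields missing from the input become null, and input fields absent from the schema are dropped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldMatchSketch {
    // Produce one output row from one input row of text fields.
    static Map<String, String> matchFields(Map<String, String> input, List<String> schemaFields) {
        Map<String, String> output = new LinkedHashMap<>();
        for (String field : schemaFields) {
            // null when the field is absent from the input or null-valued there;
            // input fields not named in the schema never reach the output
            output.put(field, input.get(field));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("acctNumber", "01-000667");    // not in schema: dropped
        input.put("company", "Market Place Products");
        System.out.println(matchFields(input, List.of("company", "acctCodes")));
        // prints {company=Market Place Products, acctCodes=null}
    }
}
```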
The rejected output has the same type as the input. Field values will match those of the failed input record.
ParseTextFields only performs semantic validation of the input; the input must have been already broken into field values. This can be accomplished by using
ReadDelimitedText with a schema consisting of all string fields or by custom operators. It should also be noted that a special schema discoverer
TextRecord.TEXT_FIELD_DISCOVER is provided for when the schema of the source field may not be known.
Code Examples
Following is an example usage of the parser. In this case, notice that the input field acctNumber is not defined in the schema and so is absent from the outputs. Conversely, acctCodes is present in the schema but not in the input; it is present in the output but will be NULL for every record. Additionally, note that the input fields that are in the schema appear in a different order.
Using ParseTextFields in Java
// Define a "raw" schema, identifying fields but not parsing content
RecordTokenType rawType = record(STRING("acctNumber"), STRING("company"),
    STRING("startDate"), STRING("balance"));
TextRecord rawSchema = TextRecord.convert(rawType, StringConversion.RAW);
// Read "raw" fields
ReadDelimitedText reader = graph.add(new ReadDelimitedText(...));
reader.setSchema(rawSchema);
// Define the parsing schema
TextRecord schema =
SchemaBuilder.define(
SchemaBuilder.STRING("company"),
SchemaBuilder.STRING("acctCodes"),
SchemaBuilder.DOUBLE("balance"),
SchemaBuilder.DATE("startDate", "MM/dd/yyyy") // specify pattern for parsing the date
);
// Parse the text records
ParseTextFields parser = graph.add(new ParseTextFields());
parser.setSchema(schema);
graph.connect(reader.getOutput(), parser.getInput());
Using ParseTextFields in RushScript
var rawschema = dr.schema()
.nullable(false).trimmed(false) // Want strings in raw form
.STRING("acctNumber")
.STRING("company")
.STRING("startDate")
.STRING("balance");
var rawdata = dr.readDelimitedText({source:..., schema:rawschema});
var schema = dr.schema()
.STRING('company')
.STRING('acctCodes')
.DOUBLE('balance')
.DATE('startDate','MM/dd/yyyy');
var parsed = dr.parseTextFields(rawdata, {schema:schema});
To better illustrate the behavior of the previous code, consider the following source data (all fields are string values).
This yields the data below on the output port. Here values are parsed as the appropriate type for the field. As the balance field is NULL in the source, it is NULL in the output.
Meanwhile, the rejects port will produce the following data, as the startDate field has the wrong format (all fields are string values).
Properties
The
ParseTextFields operator supports the following properties.
Ports
The
ParseTextFields operator provides two output ports.
ReadJSON Operator
This topic describes the DataFlow JSON read operator. For information about its KNIME node, see
JSON Reader.
The ReadJSON operator reads a JSON file containing key-value pairs or an array of objects as record tokens. It supports the JSON Lines format as described at
http://jsonlines.org/. The formatted text in JSON Lines contains a single JSON record per line. Each record is separated by a newline character.
In JSON, all field keys must start and end with a delimiter. A " (double quote) is typically used as the field delimiter. However, you may enable the allowSingleQuotes property to avoid parsing errors when single quotes are used. The ReadJSON operator uses the Jackson JSON parsing library to parse fields.
The reader may optionally specify a RecordTextSchema to provide parsing and type information for fields. The schema, in conjunction with any specified field filter, defines the output type of the reader. This can be manually constructed using the provided DataFlow API.
The
StructuredSchemaReader class provides support for reading the DataConnect structured schema descriptors (.schema files) to use with readers. Also, automated schema discovery can be performed based on the contents of the file because JSON text has explicit field markers. The reader provides a pluggable discovery mechanism to support this function. By default, the schema is automatically discovered with the initial assumption that all the fields are strings. The discovered fields are named using the available key fields.
Normally, the reader output includes all the parsed records, both with and without parsing errors. Fields that cannot be parsed have null values in the output. If required, the reader can be configured to filter the failed records from the output.
JSON text does not contain a header row since the keys in a JSON record define the fields in the output. JSON text files can be parsed in parallel with the "optimistic" assumption that the data is well formatted as per the JSON Lines standard.
Code Examples
The first code example shows a simple usage of the reader. The path to the local file name is provided as a parameter to the constructor. This can also be set using the setSource() method. A record type is built and used as the input schema. The record type must be converted to an acceptable schema before it is used by the reader.
Using the ReadJSON operator in Java
The following code shows how to use the ReadJSON operator in Java:
ReadJSON reader = graph.add(new ReadJSON("data/iris.jsonl"));
RecordTokenType schema = record(DOUBLE("sepallength"), DOUBLE("sepalwidth"), DOUBLE("petallength"), DOUBLE("petalwidth"), STRING("class"));
reader.setSchema(TextRecord.convert(schema));
Using ReadJSON operator in RushScript
The following code shows how to use the ReadJSON operator in RushScript:
var irisSchema = dr.schema().DOUBLE("sepallength"), DOUBLE("sepalwidth"), DOUBLE("petallength"), DOUBLE("petalwidth"), STRING("class");
var data = dr.readjson({source:'data/iris.jsonl ', schema: irisSchema});
The following data snippet is from the iris.jsonl file and can be read using the preceding code examples (either Java or RushScript).
{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"Iris-setosa"}
{"sepallength":4.9,"sepalwidth":3,"petallength":1.4,"petalwidth":0.2,"class":"Iris-setosa"}
{"sepallength":4.7,"sepalwidth":3.2,"petallength":1.3,"petalwidth":0.2,"class":"Iris-setosa"}
{"sepallength":7,"sepalwidth":3.2,"petallength":4.7,"petalwidth":1.4,"class":"Iris-versicolor"}
{"sepallength":6.4,"sepalwidth":3.2,"petallength":4.5,"petalwidth":1.5,"class":"Iris-versicolor"}
{"sepallength":6.9,"sepalwidth":3.1,"petallength":4.9,"petalwidth":1.5,"class":"Iris-versicolor"}
{"sepallength":6.3,"sepalwidth":3.3,"petallength":6,"petalwidth":2.5,"class":"Iris-virginica"}
{"sepallength":5.8,"sepalwidth":2.7,"petallength":5.1,"petalwidth":1.9,"class":"Iris-virginica"}
{"sepallength":7.1,"sepalwidth":3,"petallength":5.9,"petalwidth":2.1,"class":"Iris-virginica"}
{"sepallength":6.3,"sepalwidth":2.9,"petallength":5.6,"petalwidth":1.8,"class":"Iris-virginica"}
Properties
The
ReadJSON operator supports the following properties.
Name | Type | Description |
|---|
allowComments | Boolean | Determines whether the parser will allow using Java or C++ style comments (both '/'+'*' and '//' types) within parsed content or not. Default is False. |
allowUnquotedFieldNames | Boolean | Determines whether the parser will allow using unquoted field names. Default is False. |
allowSingleQuotes | Boolean | Determines whether the parser will allow using single quotes (apostrophe, character '\'') for quoting strings. Default is False. |
allowUnquotedControlChars | Boolean | Determines whether the parser will allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed). Default is False. |
allowBackslashEscapingAny | Boolean | Determines whether the parser will allow quoting any character using backslash quoting mechanism. If it is not enabled, only characters that are explicitly listed by JSON specification can be escaped. Default is False. |
allowNumericLeadingZeros | Boolean | Determines whether the parser will allow numbers to start with additional leading zeros. If the leading zeros are allowed for numbers in source, then this field must be set to True. Default is False. |
allowNonNumericNumbers | Boolean | Determines whether the parser is allowed to recognize "Not a Number" (NaN) token as legal floating point values. Default is False. |
analysisDepth | Int | Indicates the number of characters to read for performing schema discovery and structural analysis. Default is False. |
charset | Charset | Indicates the character set used by the data source. Default is UTF-8. |
charsetName | String | Indicates the character set used by the data source based on the name. |
decodeBuffer | Int | Indicates the buffer size (in bytes) used to decode character data. By default, this is automatically derived using the character set and read buffer size. |
discoveryNullIndicator | Int | Indicates the text value used to represent null values by default in discovered schemas. By default, this is the empty string. |
discoveryStringHandling | StringConversion | Indicates the default behavior for processing string-valued types in discovered schemas. |
encoding | | Indicates the properties that control character set encoding. |
errorAction | CodingErrorAction | Determines the action to be performed for errors encoding the input data into the configured character set. The default action replaces the faulty data with a replacement character. |
extraFieldAction | | Determines the action to be performed for fields that are found when parsing the record, but not declared in the schema. |
fieldErrorAction | | Determines the action to be performed for fields that cannot be parsed. |
fieldLengthThreshold | Int | Indicates the maximum length allowed for a field value before it is considered an error. |
includeSourceInfo | Boolean | Determines whether output records will include additional fields that provides origin information for the record. If true, records will have three additional fields: • sourcePath - Path of the file from which the record originates. If this is not known, the value is Null. • SplitOffset - Offset of the starting byte of the containing split in the source data. • recordOffset - Offset of the first character of the record text from the start of the containing split. If these names collide with the names defined in the source schema, they will be renamed to avoid collision. These fields are added as the first three of the output and are not affected by the selectedFields property. |
missingFieldAction | | Indicates how to handle fields declared in the schema, but not found when parsing the record. If the configured action does not discard the record, then the missing fields will be null-valued in the output. |
multilineFormat | | Determines whether the file should be parsed as a multiline JSON file, which allows each JSON record to span multiple lines. Otherwise the data must be in JSON lines format. Default is False. |
parseErrorAction | | Indicates the action to be performed for all parsing errors. |
parseOptions | | Indicates the parsing options used by the reader. |
pessimisticSplitting | Boolean | Configures whether pessimistic file splitting must be used. By default, this is disabled. Pessimistic splitting defines one file split per file (assumes the input files cannot be split). |
readBuffer | Int | Indicates the size of the I/O buffer (in bytes) to use for reads. Default is 64K. |
readOnClient | Boolean | Determines whether reads are performed by the client or in the cluster. By default, reads are performed in the cluster if executed in a distributed context. |
recordWarningThreshold | Int | Indicates the maximum number of records that can have parse warnings before it fails. |
replacement | String | Indicates the replacement string to be used when encoding error policy is replacement. Default is '?'. |
selectedFields | List<String> | Indicates the list of input fields to be included in the output. This sets a limit for the fields written to the output. |
schema | | Indicates the record schema expected in the delimited text source. This property is mutually exclusive with the schemaDiscovery property. If either of the two property is set, then the other property is ignored. By default, this property is not set. |
schemaDiscovery | | Indicates the schema discovery mechanism to be used. This property is mutually exclusive with the schema property. If either of the two property is set, then the other property is ignored. By default, a pattern-based mechanism is used. Providing a list of pattern/type pairs uses the default discoverer extended with the supplied patterns. |
source | | Indicates the source of the input data to be parsed as delimited text. |
splitOptions | | Indicates the configuration used to determine how to break the source into splits. |
Ports
The
ReadJSON operator provides a single output port.