Schema Variables

Building DataFlow Applications : Building DataFlow Applications Using RushScript : RushScript Variables : Schema Variables

Share this page

Schema Variables

Schema variables are created using the dr.schema() function. Schema variables are used mainly by I/O operators such as Using the ReadDelimitedText Operator to Read Delimited Text, Using the ReadFixedText Operator to Read Fixed-width Text, Using the WriteDelimitedText Operator to Write Delimited Text, and Using the WriteFixedText Operator to Write Fixed-width Text.

After a schema variable is created, it can be used to define a schema or load an existing schema from a schema file. After a schema is defined, it can also be used to persist the schema definition to a local file for reuse later. Schema files can also be created within the RushAnalyzer environment and loaded using a schema variable.

Schema variables can also be used in places where a RecordTokenType is required. The scripting environment will do the conversion for you automatically.

The dr.schema() method takes an optional map parameter that allows specifying schema level properties. Since two types of schemas are supported, variable-length text and fixed-width text, the schema type can be optionally specified using the map parameter. If the type is not specified, a variable-length schema is created by default. To create a fixed-width schema, the type FIXED must be provided.

The following table lists the properties that can be optionally set via the dr.schema() function. When creating a fixed-width schema, the type property is required.

Property Name	Type	Description
type	String	The type of schema to create. By default, a variable length schema is created (useful for delimited text). To create a fixed-width text schema that can be used with the read/write fixed text operators, set to FIXED.
nullable	boolean	Indicates whether empty field values are treated as null values.
nullIndicator	String	Sets the text that represents a null value within a field. This value is used for all fields by default. Each field definition can override the null indicator for the field.
padChar	char	The character used to pad-fill field values as needed. This is only applicable for fixed-width schemas. The default value is a space character.
alignment	String	The alignment to use when formatting fixed text. This is only applicable for fixed-width schemas. The valid values are CENTER, LEFT, and RIGHT. Default: LEFT.

The following table lists the functions provided by schema variables. See Tokens and Types for more information about DataFlow data types. For types supporting a pattern specification, if a pattern is defined, the pattern will be used for date parsing and formatting. Each of the functions returns a schema variable allowing the functions to be chained for convenience.

Function Name	Input Parameters	Description
alignment	• String	The alignment to use when formatting fixed-width text. This is only applicable for fixed-width schemas. The valid values are CENTER, LEFT, and RIGHT. Default: LEFT.
BOOLEAN	• Field name • Properties (optional)	Creates a new field of type BOOLEAN.
DATE	• Field name • Properties (optional)	Creates a new field of type DATE.
DOUBLE	• Field name • Properties (optional)	Creates a new field of type DOUBLE.
ENUM (deprecated-use STRING)	• Field name • Enumerated values	Creates a new field of type ENUM.
ENUM (deprecated-use STRING)	• Field name • Properties • Enumerated values	Creates a new field of type ENUM. The provided values are used to build the enumerated type.
FLOAT	• Field name • Properties (optional)	Creates a new field of type FLOAT.
INT	• Field name • Properties (optional)	Creates a new field of type INT.
load	Local pathname	Loads a predefined schema from a local file.
LONG	• Field name • Properties (optional)	Creates a new field of type LONG.
nullable	boolean	Indicates whether or not empty field values are treated as null values.
nullIndicator	String value	Sets the text that represents a null value within a field. This value is used for all fields by default. Each field definition can override the null indicator for the field.
NUMERIC	• Field name • Properties (optional)	Creates a new field of type NUMERIC.
padChar	• char	The character used to pad fill field values as needed. This is only applicable for fixed-width schemas. Default: space character.
store	Local pathname	Stores the current schema definition into a local file. The schema file can be used later within scripting or within the RushAnalyzer environment.
STRING	• Field name • Properties (optional) • Allowed values (optional)	Creates a new field of type STRING. If allowed values are provided, the values are added to the domain of the field.
TIMEOFDAY	• Field name • Properties (optional)	Creates a new field of type TIME.
TIMESTAMP	• Field name • Properties (optional)	Creates a new field of type TIMESTAMP.
trimmed	boolean	Enable or disable string trimming if input field values.

The functions listed in the table above that create schema fields optionally accept a properties parameter that is specified as a JavaScript map. The following table lists the properties that are supported and on which data types they are applicable.

Property Name	Type	Supported on Types	Description
alignment	String	all types (for fixed-width text)	The alignment to use when formatting fixed-width text. This is only applicable for fixed-width schemas. The valid values are CENTER, LEFT, and RIGHT. Default: LEFT.
falseValue	String	boolean	The text that represents boolean false.
nullIndicator	String	all types	Text value that represents a null value within a field. Common settings are: NA, ?, <null> for example. Default: empty string.
nullable	boolean	all types	A boolean value that indicates if the field may contain null values. Default: true.
padChar	char	all types (for fixed-width text)	Specifies the character to use for padding when formatting fields for fixed-width text. Default: space character.
pattern	String	date, double, float, int, long, numeric, time of day, timestamp	Specifies how to parse and format field values. The pattern is type specific, that is, a pattern for a date fields is different from a pattern for floating point values. See the documentation immediately following this table for more information about patterns.
size	int	all types (for fixed-width text)	The fixed size of a field within a schema. This property is only supported for fixed-width schemas. There is no default value. When building fixed-width schemas, the size parameter must be provided on all fields.
trimmed	boolean	String	A boolean value that indicates whether or not to trim white space from field values. The default value is false.
trueValue	String	boolean	The text that represents boolean true.

Patterns for parsing and formatting numeric field types are supported. The following table specifies the special characters that are provided for building patterns.

Symbol	Meaning
0	Digit, zero-filled
#	Digit, zero values are absent
.	Decimal separator
,	Group separator
E	In scientific notation, separates the mantissa and the exponent. Does not need to be quoted.
;	Separates positive and negative subpatterns
%	Multiplies by 100 and show as percentage
'	Used to quote special characters

Here is an example of the numeric pattern ##,##0.00 and how numbers using it are formatted:

• 0 will format as 0.00

• 10323.353 will format as 10,323.35

• 443.1 will format as 443.10

Patterns for parsing and formatting date and timestamp types are also supported. The following table covers the special characters for these patterns.

Symbol	Meaning
G	Era designator
y	Year
M	Month in year
w	Week in year
D	Day in year
d	Day in month
e	Day of week (as number)
E	Day in week (as text)
a	am/pm marker
H	Hour in day (0-23)
k	Hour in day (1-24)
K	Hour in am/pm (0-11)
h	Hour in am/pm (1-12)
m	Minute in hour
s	Second in minute
S	Millisecond
z	Time zone (General)
Z	Time zone (RFC 822)

Within a date/timestamp pattern, repeating symbols can provide different patterns. For example, using MM for the month will generate a two-digit (zero-filled, one-based) representation of the month (such as 01 for January). However, using the pattern MMM will produce a three-character abbreviation of the month (such as Jan for January).

An example date pattern is MM/dd/yyyy. Applying this pattern results in:

• September 14, 1993, formats as 09/14/1993

• January 3, 2013, formats as 01/03/2013

The following JavaScript code fragment demonstrates creating a new schema variable, creating fields, and storing the schema definition for later use. Note the use of a pattern for certain fields. This is especially useful for date and timestamp typed fields. It can also be useful with numeric fields, especially in conjunction with an operator that formats data for writing.

Note that the syntax for setting properties in the field definition uses the JavaScript map notation. The properties are surrounded by curly braces. To set a property, provide its name, then a colon, followed by the property value. Use a comma to separate properties when more than one is being specified.

Creating and persisting a variable length schema

// Create a new variable length schema and add fields to it
var accountschema = dr.schema()
   .nullable(true)
   .trimmed(false)
   .STRING('accountNumber')
   .STRING('clientName')
   .STRING('companyName')
   .STRING('streetAddress')
   .STRING('city')
   .STRING('state')
   .STRING('zip')
   .STRING('emailAddress')
   .DATE('birthDate', {pattern:'MM/dd/yyyy'})
   .STRING('accountCodes')
   .DOUBLE('standardPayment', {pattern:'0.00'})
   .DOUBLE('payment', {pattern:'0.00'})
   .DOUBLE('balance', {pattern:'0.00'});

// Store the schema in a local file
accountSchema.store('/user/me/accounts.schema');

A null indicator can be specified for the schema. This indicator is applied to every field definition by default. If a field uses an alternate null indicator, it can be specified when the field is defined by setting the "nullIndicator" property. The following example demonstrates specifying a default null indicator for the schema and an alternate null indicator for several fields.

Specifying alternate null indicators per field

// Create a new schema variable and add fields to it
var accountschema = dr.schema()
   .nullable(true)
   .nullIndicator('<null>')
   .trimmed(false)
   .STRING('accountNumber', {nullIndicator:'?'})
   .STRING('clientName')
   .STRING('companyName')
   .STRING('streetAddress')
   .STRING('city')
   .STRING('state')
   .STRING('zip')
   .STRING('emailAddress')
   .DATE('birthDate', {pattern:'MM/dd/yyyy', nullIndicator:'NA'})
   .STRING('accountCodes', {nullIndicator:'N/A'})
   .DOUBLE('standardPayment', {pattern:'0.00'})
   .DOUBLE('payment', {pattern:'0.00'})
   .DOUBLE('balance', {pattern:'0.00'});

To load a schema variable from a persisted schema file, first create a new schema variable and then invoke the load() method. The following code fragment demonstrates this use case. Once loaded, the schema variable can be passed into an operator for usage.

Loading a schema variable from a local file

// Load the schema definition from a local file
var accountSchema = dr.schema().load('/user/me/accounts.schema');

// Use the schema variable in a read operation
var accountData = dr.readDelimitedText({source:'/user/me/accounts.txt', schema:accountSchema});

The following example demonstrates creating a fixed-width schema. Note that the type of the schema must be specified as FIXED. Also, each field specification must provide the size property. The alignment and padChar properties are optional. If not specified, they will take on their default values.

Create a fixed-width schema

// Create a fixed-width schema. Note that the type parameter must be specified when
// creating the schema. Also, the size parameter is required on each field.
// Alignment is optional for reading, but should be specified for writing.

var accountsFixedSchema = dr.schema({type:'FIXED'})
   .nullable(true)
   .trimmed(true)
   .STRING("accountNumber", {size:9, alignment:'RIGHT'})
   .STRING("clientName", {size:21, alignment:'LEFT'})
   .STRING("companyName", {size:31, alignment:'LEFT'})
   .STRING("streetAddress", {size:35, alignment:'LEFT'})
   .STRING("city", {size:16, alignment:'LEFT'})
   .STRING("state", {size:2, alignment:'LEFT'})
   .STRING("zip", {size:10, alignment:'LEFT'})
   .STRING("emailAddress", {size:25, alignment:'LEFT'})
   .DATE("birthDate", {pattern:'MM/dd/yyyy', size:10, alignment:'LEFT'})
   .STRING("accountCodes", {size:11, alignment:'RIGHT'})
   .DOUBLE("standardPayment", {pattern:'0.00', size:6, alignment:'RIGHT', padChar:'0'})
   .DOUBLE("payment", {pattern:'0.00', size:7, alignment:'RIGHT', padChar:'0'})
   .DOUBLE("balance", {pattern:'0.00', size:6, alignment:'RIGHT', padChar:'0'});

The following example demonstrates loading a fixed-width schema from a local file and its use by the ReadFixedText operator. Note that the type of the schema must be specified when creating the schema for loading.

Loading a fixed-width schema

// Create the schema as a fixed-width type and then load from a local file
var accountSchema = dr.schema({type:'FIXED'}).load('/user/me/accounts-fixed.schema');

// Use the schema variable in a read operation
var accountData = dr.readFixedText({source:'/user/me/accounts.txt', schema:accountSchema});

Using Enumerated Types

The RushScript environment preloads many DataFlow enumerated types into the JavaScript environment for direct usage in scripts. This is a convenience that prevents the user from having to import the enumerated types into JavaScript directly.

The RushScript environment will attempt to automatically convert textual representations of enumerated values into their corresponding type. For example, this allows scripts to use 'OVERWRITE' instead of WriteMode.OVERWRITE for operator properties that require a WriteMode enumerated type.

The preloaded enumerated types are listed below:

Enumerated Type	Values	Description
ARFFMode	• SPARSE • DENSE	Using the WriteARFF Operator to Write Sparse Data to specify the output format to use when writing data.
DatasetStorageFormat	• COMPACT_ROW • COLUMNAR	Using the WriteStagingDataset Operator to Write Staging Data Sets to determine the format used to store data.
DetailLevel	• SINGLE_PASS_ONLY • MULTI_PASS	Using the SummaryStatistics Operator to Calculate Data Statistics to define the detail level desired.
DistanceMeasure	• EUCLIDEAN • COSINE_SIMILARITY	Using the KMeans Operator to Compute K-Means to specify the distance measure.
JoinMode	• INNER • FULL_OUTER • LEFT_OUTER • RIGHT_OUTER	Using the Join Operator to Do Standard Relational Joins to specify the type of join to perform.
NormalizationMethod	• none • logit	Used within PMML to define the normalization function applied to modeling applications.
NormalizeMethod	• MINMAX • ZSCORE	Using the NormalizeValues Operator to Normalize Values to specify the type of normalization to apply.
OutputMode	• APPEND • OVERWRITEROWS • OVERWRITETABLE • UPDATE • DELETE	Using the WriteToJDBC Operator to Write to Databases to specify the mode of writing to the target table.
ParseErrorAction	• ERROR • WARN_AND_DISCARD • WARN • DISCARD • IGNORE	Describes the possible actions for handling record parsing errors. Used by file reader operators such as Using the ReadDelimitedText Operator to Read Delimited Text.
RankMode	• STANDARD • DENSE • ORDINAL	Using the Rank Operator to Rank Data to specify the type of ranking to use.
SampleMode	• BY_PERCENT • BY_SIZE	Using the SampleRandomRows Operator to Sample Data to specify the type of sampling to apply.
StringConversion	• RAW • NULLABLE_RAW • TRIMMED • NULLABLE_TRIMMED	Enumerates the possible conversions for string-valued text types.
UnreadableSourceAction	• IGNORE • WARN • FAIL	Specifies the behavior for handling data sources which are unreadable.
WriteMode	• CREATE_NEW • OVERWRITE • APPEND	Used by file writer operators such as Using the WriteDelimitedText Operator to Write Delimited Text, Using the WriteARFF Operator to Write Sparse Data, and others to specify how to handle creating the output files for writing.

The following code fragment demonstrates using the WriteMode enumerated type with the WriteDelimitedText operator.

Example using enumerated types

// Write a delimited text file using overwrite for the write mode
dr.writeDelimitedText(data, {target:'file.txt', mode:WriteMode.OVERWRITE});

This code fragment below accomplishes the same thing using a String value for the enumerated type. The RushScript framework will convert the String value into the correct enumerated value.

Using a string value for an enumerated type

// Write a delimited text file using overwrite for the write mode
dr.writeDelimitedText(data, {target:'file.txt', mode:'OVERWRITE'});

Using Helper Classes

The RushScript environment preloads several DataFlow classes for convenience. These classes provide static methods that can be invoked using the associated class name. The following table lists the DataFlow classes that are preloaded. A few examples are provided to demonstrate usage of the classes within RushScript.

Class Name	Description
Aggregation	Provides methods for creating the aggregations to perform Using the Group Operator to Compute Aggregations.
Arithmetic	Provides methods for creating functions that perform arithmetic operations.
Conditionals	Provides methods for creating conditional functions.
ConstantReference	Provides methods for creating functions that provide constant values.
Conversions	Provides methods for creating functions that provide data type conversions.
DateTime	Provides methods for creating functions that process date and timestamp data types.
DateTimeValue	Provides values that are needed by the DateTime functions.
FieldDerivation	Provides methods for creating field derivation specifications by Using the DeriveFields Operator to Compute New Fields.
FieldReference	Provides methods for creating functions that access fields.
Formatting	Provides methods for creating functions that provide data formatting functionality.
Predicates	Provides methods for creating functions that express predicate conditions.
ReplaceSpecification	Provides methods for creating missing value handling specifications by Using the ReplaceMissingValues Operator to Replace Missing Values.
ScalarType	Defines the scalar types available within the DataFlow framework.
SortKey	Provides methods for creating SortKey objects used with Using the Sort Operator to Sort Data Sets.
Statistics	Provides methods for creating functions that provide compute common statistics.
Strings	Provides methods for creating functions that provide common String utilities.
Tolerance	Provides methods for creating floating point tolerance types used with various assertion operators such as Using the AssertEqual Operator to Assert Data Equality.

The following code fragment demonstrates using the Aggregation class to create the aggregations to perform using the Group operator.

RushScript example using the Aggregation class

// Define the aggregations wanted
var aggs = [
Aggregation.count("sepal length").as("count-sl"),
    Aggregation.min("sepal length").as("min-sl"),
    Aggregation.max("sepal length").as("max-sl"),
    Aggregation.avg("sepal length").as("avg-sl"),
    Aggregation.stddev("sepal length").as("stddev-sl"),
    Aggregation.geoAvg("petal length").as("geoavg-pl"),
    Aggregation.skewness("petal length").as("skewness-pl")
];

// Group by the given keys using the defined aggregations
var results = dr.group(data, {keys:groupKeys, aggregations:aggs});

The Group operator allows setting the aggregations using a more SQL-like syntax. The following code fragment accomplishes the same aggregation using a simpler syntax.

Simpler syntax for Group

// Define the aggregations using a SQL-like syntax
var aggs = 'count("sepal length") as count_sl, min("sepal length") as min_sl, max("sepal length") as max_sl, avg("sepal length") as avg_sl, stddev("sepal length") as stddev_sl, geoavg("petal length") as geoavg_pl, skewness("petal length") as skewness_pl';

// Group by the given keys using the defined aggregations
var results = dr.group(data, {keys:groupKeys, aggregations:aggs});

The following code fragment builds several field derivations and applies them to the data within a data set. This example demonstrates building functions that get applied with the DerivedFields operator.

Creating and using functions

// Create the wanted field derivations using Arithmetic functions.
// Add a field values to a constant.
var d1 = FieldDerivation.derive('df1', Arithmetic.add('sepal length', 17.23));

// Add values of two fields.
var d2 = FieldDerivation.derive('df2', Arithmetic.add('sepal length', 'sepal width'));

// Multiply field values by a constant
var d3 = FieldDerivation.derive('df3', Arithmetic.mult('petal length', 425.32));

// Multiply values of two fields.
var d4 = FieldDerivation.derive('df4', Arithmetic.mult('petal length', 'petal width'));

// Chain functions creating: ('sepal length' + 'sepal width') / ('petal length' + 'petal width')
var d5 = FieldDerivation.derive('df5', Arithmetic.div(Arithmetic.add('sepal length', 'sepal width'), Arithmetic.add('petal length', 'petal width')));

// Use derive fields to apply the defined derivations
var results = dr.deriveFields(data, {derivedFields:[d1, d2, d3, d4, d5]});