Building DataFlow Applications : Building DataFlow Applications Using RushScript : RushScript Variables : Schema Variables
 
Share this page                  
Schema Variables
Schema variables are created using the dr.schema() function. Schema variables are used mainly by I/O operators such as Using the ReadDelimitedText Operator to Read Delimited Text, Using the ReadFixedText Operator to Read Fixed-width Text, Using the WriteDelimitedText Operator to Write Delimited Text, and Using the WriteFixedText Operator to Write Fixed-width Text.
After a schema variable is created, it can be used to define a schema or load an existing schema from a schema file. After a schema is defined, it can also be used to persist the schema definition to a local file for reuse later. Schema files can also be created within the RushAnalyzer environment and loaded using a schema variable.
Schema variables can also be used in places where a RecordTokenType is required. The scripting environment will do the conversion for you automatically.
The dr.schema() method takes an optional map parameter that allows specifying schema level properties. Since two types of schemas are supported, variable-length text and fixed-width text, the schema type can be optionally specified using the map parameter. If the type is not specified, a variable-length schema is created by default. To create a fixed-width schema, the type FIXED must be provided.
The following table lists the properties that can be optionally set via the dr.schema() function. When creating a fixed-width schema, the type property is required.
Property Name
Type
Description
type
String
The type of schema to create. By default, a variable length schema is created (useful for delimited text). To create a fixed-width text schema that can be used with the read/write fixed text operators, set to FIXED.
nullable
boolean
Indicates whether empty field values are treated as null values.
nullIndicator
String
Sets the text that represents a null value within a field. This value is used for all fields by default. Each field definition can override the null indicator for the field.
padChar
char
The character used to pad-fill field values as needed. This is only applicable for fixed-width schemas. The default value is a space character.
alignment
String
The alignment to use when formatting fixed text. This is only applicable for fixed-width schemas. The valid values are CENTER, LEFT, and RIGHT. Default: LEFT.
The following table lists the functions provided by schema variables. See Tokens and Types for more information about DataFlow data types. For types supporting a pattern specification, if a pattern is defined, the pattern will be used for date parsing and formatting. Each of the functions returns a schema variable allowing the functions to be chained for convenience.
Function Name
Input Parameters
Description
alignment
String
The alignment to use when formatting fixed-width text. This is only applicable for fixed-width schemas. The valid values are CENTER, LEFT, and RIGHT. Default: LEFT.
BOOLEAN
Field name
Properties (optional)
Creates a new field of type BOOLEAN.
DATE
Field name
Properties (optional)
Creates a new field of type DATE.
DOUBLE
Field name
Properties (optional)
Creates a new field of type DOUBLE.
ENUM (deprecated-use STRING)
Field name
Enumerated values
Creates a new field of type ENUM.
ENUM (deprecated-use STRING)
Field name
Properties
Enumerated values
Creates a new field of type ENUM. The provided values are used to build the enumerated type.
FLOAT
Field name
Properties (optional)
Creates a new field of type FLOAT.
INT
Field name
Properties (optional)
Creates a new field of type INT.
load
Local pathname
Loads a predefined schema from a local file.
LONG
Field name
Properties (optional)
Creates a new field of type LONG.
nullable
boolean
Indicates whether or not empty field values are treated as null values.
nullIndicator
String value
Sets the text that represents a null value within a field. This value is used for all fields by default. Each field definition can override the null indicator for the field.
NUMERIC
Field name
Properties (optional)
Creates a new field of type NUMERIC.
padChar
char
The character used to pad fill field values as needed. This is only applicable for fixed-width schemas. Default: space character.
store
Local pathname
Stores the current schema definition into a local file. The schema file can be used later within scripting or within the RushAnalyzer environment.
STRING
Field name
Properties (optional)
Allowed values (optional)
Creates a new field of type STRING. If allowed values are provided, the values are added to the domain of the field.
TIMEOFDAY
Field name
Properties (optional)
Creates a new field of type TIME.
TIMESTAMP
Field name
Properties (optional)
Creates a new field of type TIMESTAMP.
trimmed
boolean
Enable or disable string trimming if input field values.
The functions listed in the table above that create schema fields optionally accept a properties parameter that is specified as a JavaScript map. The following table lists the properties that are supported and on which data types they are applicable.
Property Name
Type
Supported on Types
Description
alignment
String
all types (for fixed-width text)
The alignment to use when formatting fixed-width text. This is only applicable for fixed-width schemas. The valid values are CENTER, LEFT, and RIGHT. Default: LEFT.
falseValue
String
boolean
The text that represents boolean false.
nullIndicator
String
all types
Text value that represents a null value within a field. Common settings are: NA, ?, <null> for example. Default: empty string.
nullable
boolean
all types
A boolean value that indicates if the field may contain null values. Default: true.
padChar
char
all types (for fixed-width text)
Specifies the character to use for padding when formatting fields for fixed-width text. Default: space character.
pattern
String
date, double, float, int, long, numeric, time of day, timestamp
Specifies how to parse and format field values. The pattern is type specific, that is, a pattern for a date fields is different from a pattern for floating point values. See the documentation immediately following this table for more information about patterns.
size
int
all types (for fixed-width text)
The fixed size of a field within a schema. This property is only supported for fixed-width schemas. There is no default value. When building fixed-width schemas, the size parameter must be provided on all fields.
trimmed
boolean
String
A boolean value that indicates whether or not to trim white space from field values. The default value is false.
trueValue
String
boolean
The text that represents boolean true.
Patterns for parsing and formatting numeric field types are supported. The following table specifies the special characters that are provided for building patterns.
Symbol
Meaning
0
Digit, zero-filled
#
Digit, zero values are absent
.
Decimal separator
,
Group separator
E
In scientific notation, separates the mantissa and the exponent. Does not need to be quoted.
;
Separates positive and negative subpatterns
%
Multiplies by 100 and show as percentage
'
Used to quote special characters
Here is an example of the numeric pattern ##,##0.00 and how numbers using it are formatted:
0 will format as 0.00
10323.353 will format as 10,323.35
443.1 will format as 443.10
Patterns for parsing and formatting date and timestamp types are also supported. The following table covers the special characters for these patterns.
Symbol
Meaning
G
Era designator
y
Year
M
Month in year
w
Week in year
D
Day in year
d
Day in month
e
Day of week (as number)
E
Day in week (as text)
a
am/pm marker
H
Hour in day (0-23)
k
Hour in day (1-24)
K
Hour in am/pm (0-11)
h
Hour in am/pm (1-12)
m
Minute in hour
s
Second in minute
S
Millisecond
z
Time zone (General)
Z
Time zone (RFC 822)
Within a date/timestamp pattern, repeating symbols can provide different patterns. For example, using MM for the month will generate a two-digit (zero-filled, one-based) representation of the month (such as 01 for January). However, using the pattern MMM will produce a three-character abbreviation of the month (such as Jan for January).
An example date pattern is MM/dd/yyyy. Applying this pattern results in:
September 14, 1993, formats as 09/14/1993
January 3, 2013, formats as 01/03/2013
The following JavaScript code fragment demonstrates creating a new schema variable, creating fields, and storing the schema definition for later use. Note the use of a pattern for certain fields. This is especially useful for date and timestamp typed fields. It can also be useful with numeric fields, especially in conjunction with an operator that formats data for writing.
Note that the syntax for setting properties in the field definition uses the JavaScript map notation. The properties are surrounded by curly braces. To set a property, provide its name, then a colon, followed by the property value. Use a comma to separate properties when more than one is being specified.
Creating and persisting a variable length schema
// Create a new variable length schema and add fields to it
var accountschema = dr.schema()
   .nullable(true)
   .trimmed(false)
   .STRING('accountNumber')
   .STRING('clientName')
   .STRING('companyName')
   .STRING('streetAddress')
   .STRING('city')
   .STRING('state')
   .STRING('zip')
   .STRING('emailAddress')
   .DATE('birthDate', {pattern:'MM/dd/yyyy'})
   .STRING('accountCodes')
   .DOUBLE('standardPayment', {pattern:'0.00'})
   .DOUBLE('payment', {pattern:'0.00'})
   .DOUBLE('balance', {pattern:'0.00'});

// Store the schema in a local file
accountSchema.store('/user/me/accounts.schema');
A null indicator can be specified for the schema. This indicator is applied to every field definition by default. If a field uses an alternate null indicator, it can be specified when the field is defined by setting the "nullIndicator" property. The following example demonstrates specifying a default null indicator for the schema and an alternate null indicator for several fields.
Specifying alternate null indicators per field
// Create a new schema variable and add fields to it
var accountschema = dr.schema()
   .nullable(true)
   .nullIndicator('<null>')
   .trimmed(false)
   .STRING('accountNumber', {nullIndicator:'?'})
   .STRING('clientName')
   .STRING('companyName')
   .STRING('streetAddress')
   .STRING('city')
   .STRING('state')
   .STRING('zip')
   .STRING('emailAddress')
   .DATE('birthDate', {pattern:'MM/dd/yyyy', nullIndicator:'NA'})
   .STRING('accountCodes', {nullIndicator:'N/A'})
   .DOUBLE('standardPayment', {pattern:'0.00'})
   .DOUBLE('payment', {pattern:'0.00'})
   .DOUBLE('balance', {pattern:'0.00'});
To load a schema variable from a persisted schema file, first create a new schema variable and then invoke the load() method. The following code fragment demonstrates this use case. Once loaded, the schema variable can be passed into an operator for usage.
Loading a schema variable from a local file
// Load the schema definition from a local file
var accountSchema = dr.schema().load('/user/me/accounts.schema');

// Use the schema variable in a read operation
var accountData = dr.readDelimitedText({source:'/user/me/accounts.txt', schema:accountSchema});
The following example demonstrates creating a fixed-width schema. Note that the type of the schema must be specified as FIXED. Also, each field specification must provide the size property. The alignment and padChar properties are optional. If not specified, they will take on their default values.
Create a fixed-width schema
// Create a fixed-width schema. Note that the type parameter must be specified when
// creating the schema. Also, the size parameter is required on each field.
// Alignment is optional for reading, but should be specified for writing.

var accountsFixedSchema = dr.schema({type:'FIXED'})
   .nullable(true)
   .trimmed(true)
   .STRING("accountNumber", {size:9, alignment:'RIGHT'})
   .STRING("clientName", {size:21, alignment:'LEFT'})
   .STRING("companyName", {size:31, alignment:'LEFT'})
   .STRING("streetAddress", {size:35, alignment:'LEFT'})
   .STRING("city", {size:16, alignment:'LEFT'})
   .STRING("state", {size:2, alignment:'LEFT'})
   .STRING("zip", {size:10, alignment:'LEFT'})
   .STRING("emailAddress", {size:25, alignment:'LEFT'})
   .DATE("birthDate", {pattern:'MM/dd/yyyy', size:10, alignment:'LEFT'})
   .STRING("accountCodes", {size:11, alignment:'RIGHT'})
   .DOUBLE("standardPayment", {pattern:'0.00', size:6, alignment:'RIGHT', padChar:'0'})
   .DOUBLE("payment", {pattern:'0.00', size:7, alignment:'RIGHT', padChar:'0'})
   .DOUBLE("balance", {pattern:'0.00', size:6, alignment:'RIGHT', padChar:'0'});
The following example demonstrates loading a fixed-width schema from a local file and its use by the ReadFixedText operator. Note that the type of the schema must be specified when creating the schema for loading.
Loading a fixed-width schema
// Create the schema as a fixed-width type and then load from a local file
var accountSchema = dr.schema({type:'FIXED'}).load('/user/me/accounts-fixed.schema');

// Use the schema variable in a read operation
var accountData = dr.readFixedText({source:'/user/me/accounts.txt', schema:accountSchema});
Using Enumerated Types
The RushScript environment preloads many DataFlow enumerated types into the JavaScript environment for direct usage in scripts. This is a convenience that prevents the user from having to import the enumerated types into JavaScript directly.
The RushScript environment will attempt to automatically convert textual representations of enumerated values into their corresponding type. For example, this allows scripts to use 'OVERWRITE' instead of WriteMode.OVERWRITE for operator properties that require a WriteMode enumerated type.
The preloaded enumerated types are listed below:
Enumerated Type
Values
Description
ARFFMode
SPARSE
DENSE
Using the WriteARFF Operator to Write Sparse Data to specify the output format to use when writing data.
DatasetStorageFormat
COMPACT_ROW
COLUMNAR
Using the WriteStagingDataset Operator to Write Staging Data Sets to determine the format used to store data.
DetailLevel
SINGLE_PASS_ONLY
MULTI_PASS
DistanceMeasure
EUCLIDEAN
COSINE_SIMILARITY
Using the KMeans Operator to Compute K-Means to specify the distance measure.
JoinMode
INNER
FULL_OUTER
LEFT_OUTER
RIGHT_OUTER
Using the Join Operator to Do Standard Relational Joins to specify the type of join to perform.
NormalizationMethod
none
logit
Used within PMML to define the normalization function applied to modeling applications.
NormalizeMethod
MINMAX
ZSCORE
Using the NormalizeValues Operator to Normalize Values to specify the type of normalization to apply.
OutputMode
APPEND
OVERWRITEROWS
OVERWRITETABLE
UPDATE
DELETE
Using the WriteToJDBC Operator to Write to Databases to specify the mode of writing to the target table.
ParseErrorAction
ERROR
WARN_AND_DISCARD
WARN
DISCARD
IGNORE
Describes the possible actions for handling record parsing errors. Used by file reader operators such as Using the ReadDelimitedText Operator to Read Delimited Text.
RankMode
STANDARD
DENSE
ORDINAL
Using the Rank Operator to Rank Data to specify the type of ranking to use.
SampleMode
BY_PERCENT
BY_SIZE
Using the SampleRandomRows Operator to Sample Data to specify the type of sampling to apply.
StringConversion
RAW
NULLABLE_RAW
TRIMMED
NULLABLE_TRIMMED
Enumerates the possible conversions for string-valued text types.
UnreadableSourceAction
IGNORE
WARN
FAIL
Specifies the behavior for handling data sources which are unreadable.
WriteMode
CREATE_NEW
OVERWRITE
APPEND
Used by file writer operators such as Using the WriteDelimitedText Operator to Write Delimited Text, Using the WriteARFF Operator to Write Sparse Data, and others to specify how to handle creating the output files for writing.
The following code fragment demonstrates using the WriteMode enumerated type with the WriteDelimitedText operator.
Example using enumerated types
// Write a delimited text file using overwrite for the write mode
dr.writeDelimitedText(data, {target:'file.txt', mode:WriteMode.OVERWRITE});
This code fragment below accomplishes the same thing using a String value for the enumerated type. The RushScript framework will convert the String value into the correct enumerated value.
Using a string value for an enumerated type
// Write a delimited text file using overwrite for the write mode
dr.writeDelimitedText(data, {target:'file.txt', mode:'OVERWRITE'});
Using Helper Classes
The RushScript environment preloads several DataFlow classes for convenience. These classes provide static methods that can be invoked using the associated class name. The following table lists the DataFlow classes that are preloaded. A few examples are provided to demonstrate usage of the classes within RushScript.
Class Name
Description
Aggregation
Provides methods for creating the aggregations to perform Using the Group Operator to Compute Aggregations.
Arithmetic
Provides methods for creating functions that perform arithmetic operations.
Conditionals
Provides methods for creating conditional functions.
ConstantReference
Provides methods for creating functions that provide constant values.
Conversions
Provides methods for creating functions that provide data type conversions.
DateTime
Provides methods for creating functions that process date and timestamp data types.
DateTimeValue
Provides values that are needed by the DateTime functions.
FieldDerivation
Provides methods for creating field derivation specifications by Using the DeriveFields Operator to Compute New Fields.
FieldReference
Provides methods for creating functions that access fields.
Formatting
Provides methods for creating functions that provide data formatting functionality.
Predicates
Provides methods for creating functions that express predicate conditions.
ReplaceSpecification
Provides methods for creating missing value handling specifications by Using the ReplaceMissingValues Operator to Replace Missing Values.
ScalarType
Defines the scalar types available within the DataFlow framework.
SortKey
Provides methods for creating SortKey objects used with Using the Sort Operator to Sort Data Sets.
Statistics
Provides methods for creating functions that provide compute common statistics.
Strings
Provides methods for creating functions that provide common String utilities.
Tolerance
Provides methods for creating floating point tolerance types used with various assertion operators such as Using the AssertEqual Operator to Assert Data Equality.
The following code fragment demonstrates using the Aggregation class to create the aggregations to perform using the Group operator.
RushScript example using the Aggregation class
// Define the aggregations wanted
var aggs = [
Aggregation.count("sepal length").as("count-sl"), 
    Aggregation.min("sepal length").as("min-sl"),
    Aggregation.max("sepal length").as("max-sl"),
    Aggregation.avg("sepal length").as("avg-sl"),
    Aggregation.stddev("sepal length").as("stddev-sl"),
    Aggregation.geoAvg("petal length").as("geoavg-pl"),
    Aggregation.skewness("petal length").as("skewness-pl")
];

// Group by the given keys using the defined aggregations
var results = dr.group(data, {keys:groupKeys, aggregations:aggs});
The Group operator allows setting the aggregations using a more SQL-like syntax. The following code fragment accomplishes the same aggregation using a simpler syntax.
Simpler syntax for Group
// Define the aggregations using a SQL-like syntax
var aggs = 'count("sepal length") as count_sl, min("sepal length") as min_sl, max("sepal length") as max_sl, avg("sepal length") as avg_sl, stddev("sepal length") as stddev_sl, geoavg("petal length") as geoavg_pl, skewness("petal length") as skewness_pl';

// Group by the given keys using the defined aggregations
var results = dr.group(data, {keys:groupKeys, aggregations:aggs});
The following code fragment builds several field derivations and applies them to the data within a data set. This example demonstrates building functions that get applied with the DerivedFields operator.
Creating and using functions
// Create the wanted field derivations using Arithmetic functions.
// Add a field values to a constant.
var d1 = FieldDerivation.derive('df1', Arithmetic.add('sepal length', 17.23));

// Add values of two fields.
var d2 = FieldDerivation.derive('df2', Arithmetic.add('sepal length', 'sepal width'));

// Multiply field values by a constant
var d3 = FieldDerivation.derive('df3', Arithmetic.mult('petal length', 425.32));

// Multiply values of two fields.
var d4 = FieldDerivation.derive('df4', Arithmetic.mult('petal length', 'petal width'));

// Chain functions creating: ('sepal length' + 'sepal width') / ('petal length' + 'petal width')
var d5 = FieldDerivation.derive('df5', Arithmetic.div(Arithmetic.add('sepal length', 'sepal width'), Arithmetic.add('petal length', 'petal width')));

// Use derive fields to apply the defined derivations
var results = dr.deriveFields(data, {derivedFields:[d1, d2, d3, d4, d5]});