Manipulating Records
DataFlow Operators for Manipulating Records
The DataFlow operator library includes several operators for manipulating the fields contained within record ports. These operators are covered in this section.
Covered Field Operations
Using the DeriveFields Operator to Compute New Fields
The
DeriveFields operator is used to derive one or more new fields from existing fields in the input data. New output fields are derived by applying predefined functions to fields within the input record data. One new output field is generated per function applied to the input data. The result is an output record flow that contains the input data plus the function results.
It is possible to overwrite existing fields with new derived values. It is also possible to omit input fields in the result, effectively applying a complete transformation to the record.
Applying multiple functions to an input record flow within a single DataFlow process is usually more efficient than applying each function in its own process. The gain comes mainly from reduced processor cache thrashing, fewer data copies, and less thread context switching.
Multiple derivations can be applied to the input data flow. Each derivation specifies two things: the name of the output field and the function to apply. The names of the input fields to the function are provided as part of the function specification.
Alternatively, a function can be passed as the input to another function. This type of function chaining lets you specify complex operations to perform on the input data. The output of each function in the chain is passed as input to the next function.
When applying multiple derivations in a single operator, the resulting fields from preceding functions can be used in subsequent functions. For example, if your input type contains the fields a, b, and c, you could apply the following derivations in a single operator:
a + b as d,
c * d as e
Even though d is not in the input type, it can be referenced in the second function since it is defined in a preceding function. However, the following would not be valid:
c * d as e,
a + b as d
since d has not yet been defined when the first function is evaluated.
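This ordering can be illustrated with a small plain-Java sketch. It mimics the semantics with an ordinary map and is not the DataFlow API: each derived field is added to the record before the next derivation is evaluated, which is why d is visible to the second derivation but not to one evaluated earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DerivationOrder {
    // Apply the derivations "a + b as d" and "c * d as e" in order.
    // Each derived field is added to the record before the next
    // derivation runs, so "d" is visible when "e" is computed.
    static Map<String, Double> derive(Map<String, Double> record) {
        record.put("d", record.get("a") + record.get("b")); // a + b as d
        record.put("e", record.get("c") * record.get("d")); // c * d as e
        return record;
    }

    public static void main(String[] args) {
        Map<String, Double> record = new LinkedHashMap<>();
        record.put("a", 2.0);
        record.put("b", 3.0);
        record.put("c", 4.0);
        derive(record);
        System.out.println(record.get("d")); // 5.0
        System.out.println(record.get("e")); // 20.0
    }
}
```

Swapping the two put calls in derive would fail in the same way the invalid derivation order does: "d" would be looked up before it exists.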
A list of all available functions is in
Available Functions.
Using Expressions to Derive Fields
Field derivations can also be specified by using the expression language. Individual FieldDerivations can be created by passing them a scalar-valued function expression. These FieldDerivations can then be supplied to a
DeriveFields operator as normal.
Additionally, any number of field derivations can be specified by passing a single function derivation expression directly to the
DeriveFields operator. The syntax of a field derivation expression is:
<expression 1> as <field name 1>[, <expression 2> as <field name 2>, ...]
For example, you could pass the following expression directly to DeriveFields, assuming your input had a birthdate field:
dateTimeValue(birthdate, "MONTH") as month, dateTimeValue(birthdate, "DAY") as day, dateTimeValue(birthdate, "YEAR") as year
As with field names used elsewhere within expressions, the output field name can be surrounded by back-ticks if it contains non-alphanumeric characters, such as in the expression:
avg(Exam1, Exam2, Exam3) as `Exam Average`
For more information about syntax and available functions, see
Expression Language.
Code Example
The following example derives a new field named "discounted_price" from the input fields "l_extendedprice" and "l_discount". An expression is used to express how to calculate the new output field. The expression is set as the "derivedFields" property of the operator.
Using the DeriveFields operator in Java
// Calculate the discounted price
DeriveFields deriveFields = graph.add(new DeriveFields());
deriveFields.setDerivedFields("l_extendedprice * (1 - l_discount) as discounted_price");
or
deriveFields.setDerivedFields(
    new SimpleFieldDerivation("fieldy", new FieldReference("fieldx")),
    new SimpleFieldDerivation("discounted_price",
        mult("l_extendedprice", sub(1, "l_discount"))));
Using the DeriveFields operator in RushScript
// Calculate the discounted price
var results = dr.deriveFields(data, {derivedFields:'l_extendedprice * (1 - l_discount) as discounted_price'});
or
dr.deriveFields(data, {derivedFields: 'fieldx as fieldy, ' +
'l_extendedprice * (1 - l_discount) as discounted_price'});
The following code fragment demonstrates chaining functions together. In the example, a field derivation is created that converts an input value into a double and then multiplies the result by the value of Pi. The resultant value is stored in the defined output field.
Using the API directly to chain functions
// Create a field derivation converting an input field into a double and multiplying it by pi
FieldDerivation fd = FieldDerivation.derive("outputField", Arithmetic.mult(Conversions.asDouble("inputField"), 3.14159));
DeriveFields deriveFields = graph.add(new DeriveFields(fd));
See the JavaDoc reference documentation for each function to determine the types of expressions and values that can be passed into the function.
Properties
The
DeriveFields operator has the following properties.
Ports
The
DeriveFields operator provides a single input port.
The
DeriveFields operator provides a single output port.
Using the DiscoverEnums Operator to Discover Enumerated Data Types
Enumerated data types contain categorical data. This data type contains a limited, pre-defined set of values. For example, a field containing gender types can be defined as enumerated. The field contains two possible values: male and female.
The
DiscoverEnums operator performs discovery of allowed values for a specified set of string fields, converting the specified fields into enumerated values. This operator works in two passes:
• The first pass performs a distributed discovery of allowed values.
• The second pass performs the conversions into the enumerated type.
Tip... Discovering enumerated types requires an additional pass over the input data. If the possible values of the enumerated type are already known, it is more efficient to define the enumerated type when the data is initially sourced.
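The discovery pass can be sketched in plain Java; this is an illustration of the semantics, not the DataFlow API. The first pass amounts to collecting the distinct values seen in each included string field, and the second pass then converts the field to an enumerated type whose domain is exactly that set.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class EnumDiscoverySketch {
    // First pass: scan a string field and collect its distinct values.
    // The second pass would convert the field to an enumerated type
    // whose allowed values are exactly this set.
    static Set<String> discoverAllowedValues(List<String> fieldValues) {
        return new LinkedHashSet<>(fieldValues);
    }

    public static void main(String[] args) {
        List<String> gender = List.of("male", "female", "male", "male", "female");
        System.out.println(discoverAllowedValues(gender)); // [male, female]
    }
}
```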
Code Example
The following example converts a field named "class" into an enumerated type. This is useful when all of the possible values of the enumerated type are not known beforehand. Some operators require that data fields be enumerated types.
Using the DiscoverEnums operator in Java
// Discover the "class" enumerated type
DiscoverEnums discoverEnums = graph.add(new DiscoverEnums());
discoverEnums.setIncludedFields(Arrays.asList("class"));
Using the DiscoverEnums operator in RushScript
var results = dr.discoverEnums(data, {includedFields:'class'});
Properties
The
DiscoverEnums operator supports the following properties.
Ports
The
DiscoverEnums operator provides a single input port.
The
DiscoverEnums operator provides a single output port.
Using the MergeFields Operator to Merge Fields
The
MergeFields operator merges two streams of data with an equivalent number of rows into one. The output type contains all the fields from the left input followed by those from the right input. If any names exist in both the left and right input, the collision is resolved by renaming the fields from the right.
This operator is nonparallel because the equivalence of input sizes cannot be guaranteed in a parallel context; using this operator will force a gather of the input to a single machine. Even then, you must take care to ensure the ordering of input records is consistent with the desired result. Generally, this operator only makes sense when used inside a
DeferredCompositeOperator where these guarantees can be made.
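The row-by-row merge semantics can be sketched in plain Java. This is an illustration only, not the DataFlow API, and the "_right" rename suffix is a stand-in for whatever renaming scheme the operator actually applies to colliding field names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeFieldsSketch {
    // Merge two equal-length record streams row by row. Each output row
    // holds the left fields followed by the right fields; a right-side
    // field whose name collides with a left field is renamed.
    static List<Map<String, Object>> merge(List<Map<String, Object>> left,
                                           List<Map<String, Object>> right) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("inputs must have the same row count");
        }
        List<Map<String, Object>> merged = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            Map<String, Object> row = new LinkedHashMap<>(left.get(i));
            for (Map.Entry<String, Object> e : right.get(i).entrySet()) {
                String name = row.containsKey(e.getKey())
                        ? e.getKey() + "_right" : e.getKey();
                row.put(name, e.getValue());
            }
            merged.add(row);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> left = List.of(Map.of("id", 1));
        List<Map<String, Object>> right = List.of(Map.of("id", 7, "score", 9));
        System.out.println(merge(left, right));
    }
}
```

The row-count check mirrors why the operator is nonparallel: the equivalence of input sizes must hold for a positional merge to be meaningful.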
Code Example
Using the MergeFields operator in Java
// Create the merge fields operator
MergeFields merge = graph.add(new MergeFields());
// Connect the source ports to the inputs of merge fields
graph.connect(leftSource, merge.getLeftInput());
graph.connect(rightSource, merge.getRightInput());
// Connect the output of merge fields to an operator downstream
graph.connect(merge.getOutput(), target);
Using the MergeFields operator in RushScript
// Merge two data sources
var mergedData = dr.mergeFields(leftData, rightData);
Properties
The
MergeFields operator has no properties.
Ports
The
MergeFields operator provides the following input ports.
The
MergeFields operator provides a single output port.
Using the RemoveFields Operator to Remove Fields
The
RemoveFields operator removes a subset of fields from the input records. Only those fields not named in the filter are copied to the output; if a field specified in the filter does not exist in the input, it is ignored. The relative order of fields within the input data is unchanged.
Tip... Removing extraneous fields can provide a performance boost. Always try to keep flows to the minimum set of fields possible. This promotes maximum efficiency of the overall application.
Code Example
The following example uses the
RemoveFields operator to remove fields "field1" and "field4" from an input flow. Downstream consumers of the
RemoveFields operator access the trimmed flow without the extraneous fields.
Using the RemoveFields operator in Java
// Create the remove fields operator
RemoveFields removeFields = graph.add(new RemoveFields());
removeFields.setFieldNames(Arrays.asList("field1", "field4"));
Using the RemoveFields operator in RushScript
var results = dr.removeFields(data, {fieldNames:['field1','field4']});
Properties
The
RemoveFields operator provides the following properties.
Ports
The
RemoveFields operator provides a single input port.
The
RemoveFields operator provides a single output port.
Using the RetainFields Operator to Retain Fields
The
RetainFields operator preserves a subset of fields from the input records. Only those fields named in the filter are copied to the output; if a field specified in the filter does not exist in the input, it is ignored. The relative order of fields within the input data is unchanged. To require the existence of certain fields, use the
SelectFields operator instead.
Tip... Removing extraneous fields can provide a performance boost. Always try to keep flows to the minimum set of fields possible. This promotes maximum efficiency of the overall application.
Code Example
The following code example uses the
RetainFields operator to retain only "field1" and "field4" from the source data. Downstream operators from
RetainFields will only have access to "field1" and "field4" data.
Using the RetainFields operator in Java
// Create the retain fields operator
RetainFields retainer = graph.add(new RetainFields());
retainer.setFieldNames(Arrays.asList("field1", "field4"));
// Connect the source port to the retain fields operator
graph.connect(source, retainer.getInput());
// Connect retain fields output to a downstream operator.
// This data only contains "field1" and "field4".
graph.connect(retainer.getOutput(), target);
Using the RetainFields operator in RushScript
// Retain only fields field1 and field4
var results = dr.retainFields(data, {fieldNames:['field1', 'field4']});
Properties
The
RetainFields operator provides one property.
Ports
The
RetainFields operator provides a single input port.
The
RetainFields operator provides a single output port.
Using the SelectFields Operator to Select Fields
The
SelectFields operator preserves a subset of fields from the input records. Although similar to the
RetainFields operator, there are two significant behavioral differences:
• The fields are reordered according to their order in the filter.
• All fields specified in the filter must exist in the input data or an error will be issued.
Note: Removing extraneous fields can provide a performance boost. Always try to keep flows to the minimum set of fields possible. This promotes maximum efficiency of the overall application.
Code Example
The following code example uses the
SelectFields operator to select only "field4" and "field1" from the source data. Downstream operators from
SelectFields will have access only to "field4" and "field1" data. The order of the fields within the flow is changed to "field4", "field1". This feature is useful when fields are wanted in a specific order (for example, when writing to a file).
Using the SelectFields operator in Java
// Create the select fields operator
SelectFields select = graph.add(new SelectFields());
select.setFieldNames(Arrays.asList("field4", "field1"));
// Connect the source port to the select fields operator
graph.connect(source, select.getInput());
// Connect select fields output to a downstream operator.
// This data only contains "field4" and "field1", in that order.
graph.connect(select.getOutput(), target);
Using the SelectFields operator in RushScript
// Select fields: field4, field1 in that order
var results = dr.selectFields(data, {fieldNames:['field4', 'field1']});
Properties
The
SelectFields operator provides one property.
Ports
The
SelectFields operator provides a single input port.
The
SelectFields operator provides a single output port.
Using the RemapFields Operator to Rename Fields
The
RemapFields operator rearranges and renames fields in a record. Two options are provided for handling source fields not specifically referenced in the field renaming:
• Unmapped fields are removed from the result. This achieves the same effect as combining the renaming operation with the
SelectFields operator.
• Unmapped fields are kept in the result, with the names of the fields remaining the same as in the source. For example, if the source schema is [A,B,C] and the mapping is {B -> Z}, the resulting schema will be [A,Z,C].
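The keep-unmapped option can be sketched in plain Java; this is an illustration of the semantics, not the DataFlow API. Mapped fields are renamed in place while unmapped fields keep their names and positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RemapSketch {
    // Keep-unmapped behavior: mapped fields are renamed in place,
    // unmapped fields keep their names and positions.
    static List<String> remapKeepUnmapped(List<String> schema,
                                          Map<String, String> mapping) {
        List<String> result = new ArrayList<>();
        for (String field : schema) {
            result.add(mapping.getOrDefault(field, field));
        }
        return result;
    }

    public static void main(String[] args) {
        // [A, B, C] with the mapping {B -> Z} yields [A, Z, C]
        System.out.println(remapKeepUnmapped(List.of("A", "B", "C"), Map.of("B", "Z")));
    }
}
```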
Code Example
The following code example creates a
RemapFields operator and configures it using a convenience method on the
FieldRemappings class. The input field named "source1" will be renamed to "target1". All other fields will be left unchanged.
Using the RemapFields operator in Java
// Create the remap fields operator
RemapFields remap = graph.add(new RemapFields());
// Use a FieldRemappings convenience method to specify that the
// given field should be renamed, leaving all other fields unchanged
remap.setFieldRemapping(FieldRemappings.reorder(FieldRemappings.map("source1","target1")));
// Connect the remap operator to a data source
graph.connect(source, remap.getInput());
Using the RemapFields operator in RushScript
// Define and apply the remapping
var remapping = FieldRemappings.reorder(FieldRemappings.map("source1", "target1"));
var dataRemap = dr.remapFields(data, {fieldRemapping:remapping});
Properties
The
RemapFields operator provides one property.
Ports
The
RemapFields operator provides a single input port.
The
RemapFields operator provides a single output port.
Using the SplitField Operator to Split Fields
The
SplitField operator splits a string field into multiple fields based on a specified pattern.
The SplitField operator has three properties:
• splitField—The name of the field to split. The field must be of type String.
• splitPattern—A regular expression describing the delimiter used for splitting.
• resultMapping—A map of integers to strings, detailed below.
The contents of the split field will be split using the defined split pattern, resulting in an array of substrings. The key of the result mapping corresponds to an index within this array, and the associated value defines the output field in which to place the substring.
For example, if you have a record with a field named "time" containing times in the format 18:30:00, you could use the following
SplitField operator to split the time into "hour", "minute", and "second" fields.
HashMap<Integer,String> map = new HashMap<Integer,String>();
map.put(0,"hour");
map.put(1,"minute");
map.put(2,"second");
SplitField splitter = new SplitField("time",":",map);
Code Example
This example uses the
SplitField operator to split a String field representing a customer's email address into the local and domain part.
Using the SplitField operator in Java
// Split an email address field into its components
SplitField split = graph.add(new SplitField());
split.setSplitField("emailAddress");
split.setSplitPattern("@");
HashMap<Integer,String> map = new HashMap<Integer,String>();
map.put(0, "emailUser");
map.put(1, "emailDomain");
split.setResultMapping(map);
Using the SplitField operator in RushScript
// Split an email address field into its components
var results = dr.splitField(data, {splitField:'emailAddress', splitPattern:'@', resultMapping:{0:'emailUser', 1:'emailDomain'}});
Properties
The
SplitField operator provides the following properties.
Ports
The
SplitField operator provides a single input port.
The
SplitField operator provides a single output port.
Using the RowsToColumns Operator to Convert Rows to Columns (Pivot)
The
RowsToColumns operator pivots data from a narrow representation (rows) into a wider representation (columns).
The data is first segmented into groups using a defined set of group keys. The ordering of the group keys is important as it defines how the data is partitioned and ordered for the pivot operation. A pivot key field contains the distinct values that will be used as the pivot point. This field must be a string. A column is created in the output for each distinct value of the pivot key. An aggregation is defined which is performed on each data grouping defined by the group keys and the pivot key. The result of the aggregation for each unique value of the pivot key appears in the appropriate output column.
The following table provides an example set of data that we want to pivot by Region. There are only four regions: North, South, East and West. For each item, we want to show the total sales per region. Items can show up multiple times in a region since the data is also organized by store.
To accomplish this pivot, the ItemID will be used as the group key. The Region will be used as the pivot key. And the Sales column will be the pivot value, aggregating by summing the values. The pivot key values are: "North", "South", "East", and "West". The result of the pivot is shown in the following table.
Note that the sales total for the West region for item 1 is empty. Scanning the input data shows that no sales were present in the West region for item 1. Item 1 did have two sales values for the North region. Those values (1000 and 800) are summed and the total (1800) appears in the North region column for item 1. Columns showing a value of '?' indicate a null or nonexistent value.
The key concepts to understand in using the RowsToColumns operator are:
• The use of a set of columns to segment the data into groups for pivoting. This set of columns is called the group keys. The example used the ItemID as the group key.
• A categorical valued column whose distinct values are used as columns in the output. This is the pivot key. The example used the Region as the pivot key.
• A column that can be aggregated for each data grouping and within each pivot key. These are the pivot values. The example used the Sales column as the pivot value.
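These three concepts can be sketched in plain Java; this is an illustration of the grouping and aggregation semantics, not the DataFlow API. The two North sales values for item 1 (1000 and 800) come from the example above; the East value is invented sample data.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PivotSketch {
    // One input row: the group key (itemId), the pivot key (region),
    // and the pivot value (sales).
    record Sale(int itemId, String region, double sales) {}

    // Group rows by itemId and sum the sales within each (itemId, region)
    // pair; each distinct region becomes a column of the output row.
    static Map<Integer, Map<String, Double>> pivot(List<Sale> rows) {
        Map<Integer, Map<String, Double>> result = new LinkedHashMap<>();
        for (Sale s : rows) {
            result.computeIfAbsent(s.itemId(), k -> new HashMap<>())
                  .merge(s.region(), s.sales(), Double::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Sale> rows = List.of(
                new Sale(1, "North", 1000.0),
                new Sale(1, "North", 800.0),
                new Sale(1, "East", 500.0));
        Map<String, Double> item1 = pivot(rows).get(1);
        System.out.println(item1.get("North")); // 1800.0 (1000 + 800 summed)
        System.out.println(item1.get("West"));  // null (no West sales for item 1)
    }
}
```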
Code Examples
The following code examples demonstrate using the
RowsToColumns operator to pivot the data provided in the previous example. Note that the pivot key values must be specified explicitly. Also note that the DataFlow
Expression Language can be used to specify the aggregations to calculate on the pivot values.
Using the RowsToColumns operator in Java
RowsToColumns pivot = app.add(new RowsToColumns());
pivot.setGroupKeys("ItemID");
pivot.setPivotKey("Region");
pivot.setPivotKeyValues("North", "South", "East", "West");
pivot.setAggregations("sum(Sales)");
pivot.setPivotColumnPattern("{0}_{1}");
Multiple aggregations can be used on the pivot values. This will result in multiple sets of results for each pivot key value. Also, a limited set of pivot key values can be provided, resulting in fewer output columns. This can be useful if only a subset of the pivot values are needed in the output. Both of these capabilities are demonstrated in this code example.
Using multiple aggregations and limited pivot values in Java
RowsToColumns pivot = app.add(new RowsToColumns());
pivot.setGroupKeys("ItemID");
pivot.setPivotKey("Region");
// Only want the North and South regions sales values pivoted
pivot.setPivotKeyValues("North", "South");
// Generate a sum and average of sales per region
pivot.setAggregations("sum(Sales), avg(Sales)");
Using the RowsToColumns operator in RushScript
var regions = ['North', 'South', 'East', 'West'];
var results = dr.rowsToColumns(data, {groupKeys:'ItemID', pivotKey:'Region', pivotKeyValues:regions, aggregations:'sum(Sales)', pivotColumnPattern:"{0} {1}"});
The following RushScript example demonstrates using limited pivot key values and multiple aggregations.
Using multiple aggregations and limited pivot values in RushScript
var regions = ['North', 'South'];
var results = dr.rowsToColumns(data, {groupKeys:'ItemID', pivotKey:'Region', pivotKeyValues:regions, aggregations:'sum(Sales), avg(Sales)'});
Properties
The
RowsToColumns operator provides the following properties.
Ports
The
RowsToColumns operator provides a single input port.
The
RowsToColumns operator provides a single output port.
Using the ColumnsToRows Operator to Convert Columns to Rows (Unpivot)
The
ColumnsToRows operator transposes values from row columns into multiple rows. This operator is primarily used to unpivot wide records into a narrower form with repeating key values, typically to move data into a more normalized form.
In the following example the multiple phone numbers and addresses for a given record are unpivoted to a more normalized contact, type, address and number format.
Example source:
Unpivoted output:
Users define one or more pivot families to describe the output fields and the source fields or constant values that will be mapped to them in the output. Target field data types are determined by finding the widest common scalar data type of the family’s unpivot elements. An exception can occur if no common type is possible. Each keyFamily and valueFamily must define the same number of elements.
Users may optionally define groupKeyFields that will be the fixed or repeating value portion of the multiple rows. If this property is unset, the group key fields will be determined by the remainder of the source fields not specified as pivot elements.
Note: If, during the columns to rows process, all mapped values to the pivot columns are found to be null, that row will not be produced in the output. In the example above, if no pivotKeyFamily were defined, the Sally White "home" record would not have appeared in the output because both elements are null. Only because a pivotKeyFamily was defined does this record appear in the output.
Code Example
ColumnsToRows operator in Java
// Create the columns to rows operator
ColumnsToRows c2r = graph.add(new ColumnsToRows());
// Define a pivotKeyFamily of constant values
c2r.definePivotKeyFamily("type", "home", "business");
/*
 * NOTE: The order in which the pivotValueFamilies are defined implies the order of those fields in the output.
 * If a pivotKeyFamily is defined, it will appear after the fixed portion of the output fields but before the pivot family fields.
 */
// Define a pivotValueFamily "address" containing source input fields "Home Address" and "Business Address"
c2r.definePivotValueFamily("address", "Home Address", "Business Address");
// Define another pivotValueFamily "number" containing source input fields "Home Phone" and "Business Phone"
c2r.definePivotValueFamily("number", "Home Phone", "Business Phone");
// Optionally specify the key (fixed) fields
c2r.setGroupKeyFields("Contact");
// Connect the source port to the inputs of ColumnsToRows
graph.connect(sourceFlow, c2r.getInput());
// Connect the output of ColumnsToRows to an operator downstream
graph.connect(c2r.getOutput(), target);
ColumnsToRows operator in RushScript
var flow = dr.columnsToRows(sourceFlow, {groupKeyFields:["Contact"], keyMap:{"type":["home", "business"]}, valueMap:{"address":["Home Address", "Business Address"], "number":["Home Phone", "Business Phone"]}, valueOrder:["address","number"] });
Properties
The
ColumnsToRows operator supports the following properties.
Ports
The
ColumnsToRows operator provides a single input port.
The
ColumnsToRows operator provides a single output port.