Joining Data
The following script reads two data sources, joins them, and then computes summary statistics over the joined data to produce a statistics model. A few noteworthy items:
The dr.schema() function creates the schemas that the readers use. A schema can also be loaded from a local file using the schema's load() function.
The dr.makeJoinKeys() function creates the join keys, because the key fields have different names on the left-hand and right-hand sides.
The join mode is set to LEFT_OUTER to perform a left outer join. Note that the join mode is passed as a string here; it could also have been specified as JoinMode.LEFT_OUTER.
Summary statistics are generated from the joined data. The statistics model is written in PMML format and persisted to a local file.
RushScript Example: Joining Data
// Define rating schema
var ratingschema = dr.schema()
    .nullable(true)
    .trimmed(true)
    .INT("r_userID")
    .INT("r_movieID")
    .DOUBLE("r_rating")
    .INT("r_timestamp");

// Define movie schema
var movieschema = dr.schema()
    .nullable(true)
    .trimmed(true)
    .INT('m_movieID')
    .STRING('m_movieName')
    .STRING('m_genre');

// Read ratings
var ratings = dr.readDelimitedText({source:'input/ratings.txt', schema:ratingschema, fieldSeparator:"::", header:true});

// Read movies
var movies = dr.readDelimitedText({source:'input/movies.txt', schema:movieschema, fieldSeparator:"::", header:true});

// Create keys for join
var keys = dr.makeJoinKeys(['r_movieID'], ['m_movieID']);

// Use left outer join in case any movie definitions are missing
var results = dr.join(ratings, movies, {joinKeys:keys, joinMode:'LEFT_OUTER', mergeLeftAndRightKeys:true});

// Run summary statistics on the combined data
var model = dr.summaryStatistics(results, {detailLevel:DetailLevel.MULTI_PASS});

// Store the stats model in PMML form
dr.writePMML(model, {targetPathName:'output/summary-pmml.xml'});
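
To make the effect of the LEFT_OUTER join mode concrete, the following is a minimal sketch in plain JavaScript. It does not use the RushScript dr API and is not how DataFlow implements the join. The leftOuterJoin helper and the sample rows are hypothetical, made up for illustration, and the sketch ignores the mergeLeftAndRightKeys option. It only shows the general left outer join behavior when the key fields have different names on each side.

```javascript
// Illustration only: plain JavaScript, not the RushScript dr API.
// leftOuterJoin is a hypothetical helper; the sample rows are made up.
function leftOuterJoin(left, right, leftKey, rightKey, rightFields) {
    // Index the right-hand rows by their key value
    var index = Object.create(null);
    right.forEach(function (r) {
        var k = String(r[rightKey]);
        (index[k] = index[k] || []).push(r);
    });

    var out = [];
    left.forEach(function (l) {
        // A left row with no match is still emitted once, paired with null
        var matches = index[String(l[leftKey])] || [null];
        matches.forEach(function (r) {
            var row = {};
            Object.keys(l).forEach(function (f) { row[f] = l[f]; });
            // Right-hand fields are null when there is no matching row
            rightFields.forEach(function (f) { row[f] = r ? r[f] : null; });
            out.push(row);
        });
    });
    return out;
}

// Made-up sample data using the field names from the schemas above
var sampleRatings = [
    { r_userID: 1, r_movieID: 10, r_rating: 4.5 },
    { r_userID: 2, r_movieID: 99, r_rating: 3.0 }  // no matching movie
];
var sampleMovies = [
    { m_movieID: 10, m_movieName: 'Movie A', m_genre: 'Drama' }
];

var joined = leftOuterJoin(sampleRatings, sampleMovies,
    'r_movieID', 'm_movieID', ['m_movieID', 'm_movieName', 'm_genre']);
console.log(JSON.stringify(joined, null, 2));
```

In this sketch, the rating for movie 99 has no matching movie definition, so it stays in the result with its movie fields set to null. An inner join would have dropped that row, which is why the script above uses a left outer join in case any movie definitions are missing.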