Association Rule Mining Operators
Association Rule Mining (ARM) is an unsupervised learning technique for finding relationships of items contained within transactions. ARM is typically applied to sales transactions where transactions contain line items. Each line item provides a transaction identifier and an item identifier. ARM uses this relationship (transaction to item) to find relationships between sets of items. The two main results of ARM are frequent item sets and association rules.
Frequent item sets capture sets of items that are considered frequent, along with some statistics about the sets. Items and item sets are considered frequent if they meet a minimum support threshold. The minimum support is usually expressed as a percentage of the transactions within a data set. For example, setting the minimum support to 5% means that an item is frequent only if it appears in at least 5% of the transactions within a data set. If a data set contains 5 million transactions (unique transactions, not just records), then an item must appear in at least 250,000 transactions to be considered frequent.
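The threshold arithmetic above can be sketched in plain Java. This is a minimal illustration and not part of the DataFlow API; the class and method names are hypothetical. It uses BigDecimal so that converting a fractional support into a transaction count is not affected by floating-point rounding.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper, not a DataFlow class.
class MinSupportThreshold {

    /** Smallest number of transactions an item must appear in to meet minSupport. */
    static long minTransactionCount(long totalTransactions, double minSupport) {
        if (totalTransactions < 0 || minSupport < 0.0 || minSupport > 1.0) {
            throw new IllegalArgumentException("expected a non-negative count and 0 <= minSupport <= 1");
        }
        // BigDecimal.valueOf(double) uses the decimal text of the value, so 0.05 stays exactly 0.05.
        return BigDecimal.valueOf(minSupport)
                .multiply(BigDecimal.valueOf(totalTransactions))
                .setScale(0, RoundingMode.CEILING)
                .longValueExact();
    }

    public static void main(String[] args) {
        // 5% of 5 million transactions
        System.out.println(minTransactionCount(5_000_000L, 0.05)); // prints 250000
    }
}
```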
Item sets can have a cardinality of 1 (meaning one item per set) up to a maximum value, usually denoted as k. Frequent item sets are useful in finding sets of items that appear together in transactions and may be opportunities for up-sell and cross-sell activities.
Association rules are another result of ARM. Association rules provide information about frequent item sets and how they interact with each other. From association rules, it can be determined how the presence of one set of items within a transaction influences the presence of another set of items. The rules consist of:
Antecedent
Specifies the frequent item set driving the rule (antecedent implies consequent)
Consequent
Specifies the frequent items included in a transaction due to the antecedent
Confidence
Specifies the confidence of a rule, defined as:
Confidence(Antecedent => Consequent) = Support(Antecedent U Consequent) / Support(Antecedent)
A confidence of 25% implies that the rule is correct for 25% of the transactions containing the antecedent items. The higher the confidence value, the higher the strength of the rule.
Lift
Specifies the lift of a rule, defined as:
Lift(Antecedent => Consequent) = Support(Antecedent U Consequent) / (Support(Antecedent) * Support(Consequent))
Lift is the ratio of the observed support of the combined item sets to the support expected if the item sets were independent of each other. A lift greater than 1 indicates that the two item sets occur together more often than independence would predict, which suggests the rule is meaningful. A lift near 1 suggests the item sets are independent. In general, confidence alone is not sufficient to judge the validity of a rule: a rule can have high confidence but low lift, implying the rule may not be useful.
Support
Specifies the ratio of the number of transactions where the antecedent and consequent are present to the total number of transactions.
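The confidence, lift, and support definitions above can be checked with a small worked example. The following plain-Java sketch is illustrative only: the class name, method names, and transaction counts are hypothetical and are not taken from DataFlow. It assumes 1,000 transactions, of which 200 contain the antecedent, 250 contain the consequent, and 100 contain both.

```java
// Hypothetical helper, not a DataFlow class.
class RuleMetrics {

    /** Fraction of all transactions that contain the given item set. */
    static double support(long matchingTransactions, long totalTransactions) {
        return (double) matchingTransactions / totalTransactions;
    }

    /** Support(A U C) / Support(A) */
    static double confidence(double supportBoth, double supportAntecedent) {
        return supportBoth / supportAntecedent;
    }

    /** Support(A U C) / (Support(A) * Support(C)) */
    static double lift(double supportBoth, double supportAntecedent, double supportConsequent) {
        return supportBoth / (supportAntecedent * supportConsequent);
    }

    public static void main(String[] args) {
        long total = 1000;                             // made-up counts for illustration
        double supportA = support(200, total);         // antecedent: 0.2
        double supportC = support(250, total);         // consequent: 0.25
        double supportBoth = support(100, total);      // antecedent and consequent: 0.1

        // confidence = 0.1 / 0.2, lift = 0.1 / (0.2 * 0.25)
        System.out.println("support    = " + supportBoth);
        System.out.println("confidence = " + confidence(supportBoth, supportA));
        System.out.println("lift       = " + lift(supportBoth, supportA, supportC));
    }
}
```

With these counts the rule holds for half of the transactions containing the antecedent (confidence 0.5), and the item sets occur together twice as often as independence would predict (lift 2).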
DataFlow provides operators for calculating the frequent items and the association rules. If only the frequent items are wanted, use the FrequentItems operator. For a combination of the frequent items and association rules, use the FPGrowth operator. The ConvertARMModel operator can be used to convert an association PMML model into other formats. The following sections describe each operator.
FrequentItems Operator
The FrequentItems operator computes the frequency of items within the given transactions. Two fields are needed in the input flow: one to identify transactions and another to identify items.
A minimum support value must also be specified: the fraction of transactions (a value between 0 and 1) in which an item must appear to be considered frequent. Lowering this threshold typically returns more items. An optional label field may also be specified; it is used for display purposes instead of the transaction identifiers.
The main output of this operator is the set of frequent items. Two fields are contained in the output: the item field from the input and the frequency count of the items that are above the minimum support value.
The operator also outputs a PMML port that contains an association model. The model will be partially filled in with frequent items and transaction statistics.
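The counting described above can be approximated in plain Java as follows. This is a sketch under stated assumptions, not the operator's implementation: the class name, the use of string pairs for records, and the sample data are all hypothetical. It counts the distinct transactions each item appears in and keeps the items whose share of transactions meets the minimum support.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not a DataFlow class.
class FrequentItemCounter {

    /**
     * Each record is a {transactionId, itemId} pair. Returns item -> number of
     * distinct transactions containing it, for items meeting minSupport.
     */
    static Map<String, Integer> frequentItems(List<String[]> records, double minSupport) {
        Set<String> allTransactions = new HashSet<>();
        Map<String, Set<String>> transactionsByItem = new TreeMap<>();
        for (String[] record : records) {
            allTransactions.add(record[0]);
            transactionsByItem.computeIfAbsent(record[1], k -> new HashSet<>()).add(record[0]);
        }
        Map<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : transactionsByItem.entrySet()) {
            int count = entry.getValue().size();
            if ((double) count / allTransactions.size() >= minSupport) {
                frequent.put(entry.getKey(), count);
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        // Made-up ratings: five users (transactions), three movies (items).
        List<String[]> ratings = List.of(
                new String[] {"u1", "m1"}, new String[] {"u1", "m2"},
                new String[] {"u2", "m1"}, new String[] {"u3", "m1"},
                new String[] {"u4", "m2"}, new String[] {"u5", "m3"});
        // m1 appears in 3 of 5 transactions, m2 in 2 of 5, m3 in 1 of 5.
        System.out.println(frequentItems(ratings, 0.4)); // prints {m1=3, m2=2}
    }
}
```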
Code Example
In this example we are using a movie ratings data set to determine which movies have been rated by at least 40% of the users.
Using the FrequentItems operator in Java
// Create frequent items operator with minimum support set to 40%
FrequentItems freqItems = graph.add(new FrequentItems("userID", "movieID", 0.4D));
The following example demonstrates using the operator in RushScript. Note that the operator has two output ports: the frequent items and a PMML model. Access the returned data set container variable with the port names to access a specific port.
Using the FrequentItems operator in RushScript
var freqItem = dr.frequentItems(data, {txnFieldName:'userID', itemFieldName:'movieID', minSupport:0.4});
Properties
The FrequentItems operator provides the following properties.
| Name | Type | Description |
| --- | --- | --- |
| txnFieldName | String | The name of the input field containing the transaction identifier. |
| itemFieldName | String | The name of the input field containing the item name. |
| labelFieldName | String | The name of the input field containing the optional labels. |
| minSupport | double | The minimum support percentage. Must be a value between 0 and 1. |
Ports
The FrequentItems operator provides a single input port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| input | | getInput() | The data on which the frequency is calculated. |
The FrequentItems operator provides the following output ports.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| output | | getOutput() | Output data with item field and frequency count field. |
| model | | getModel() | Outputs partially filled association model built from the data. |
FPGrowth Operator
The FPGrowth operator implements the FP-growth algorithm, creating a PMML model that contains generated item sets and association rules. The FP-growth algorithm is optimized to require only two passes over the input data. Other ARM algorithms such as Apriori require many passes over the data.
The input data is required to have two fields: a transaction identifier and an item identifier. The transaction identifier discriminates transactions. The item identifier indicates items within transactions.
Transactions are assumed to be in line-item order (one input record per item in a transaction). Transaction records are also assumed to be contiguous. An optional label field may also be specified; it is used for display purposes instead of the transaction identifiers.
The operator has two outputs: the frequent item sets and a PMML model. The frequent item sets are output in pivoted form with a set identifier and one item per row. The output model is a PMML-based association model. The model contains frequent items, frequent item sets, and association rules.
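The contiguity assumption and the pivoted output form described above can be illustrated with a short plain-Java sketch. It is not the FP-growth algorithm and not DataFlow code: the class name, method names, and the zero-based set identifiers are assumptions made for illustration. The first method shows why contiguity matters: a new transaction starts whenever the transaction identifier changes, so records of one transaction that are not adjacent would be treated as separate transactions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a DataFlow class.
class ContiguousTransactions {

    /** Groups {transactionId, itemId} records, assuming each transaction's records are adjacent. */
    static List<List<String>> group(List<String[]> records) {
        List<List<String>> transactions = new ArrayList<>();
        String currentId = null;
        List<String> current = null;
        for (String[] record : records) {
            // A change of identifier starts a new transaction.
            if (current == null || !record[0].equals(currentId)) {
                currentId = record[0];
                current = new ArrayList<>();
                transactions.add(current);
            }
            current.add(record[1]);
        }
        return transactions;
    }

    /** Pivots item sets into rows of {setId, item}, one item per row. */
    static List<String[]> pivot(List<List<String>> itemSets) {
        List<String[]> rows = new ArrayList<>();
        for (int setId = 0; setId < itemSets.size(); setId++) {
            for (String item : itemSets.get(setId)) {
                rows.add(new String[] {Integer.toString(setId), item});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[] {"t1", "bread"}, new String[] {"t1", "milk"},
                new String[] {"t2", "bread"});
        System.out.println(group(records)); // prints [[bread, milk], [bread]]

        for (String[] row : pivot(List.of(List.of("bread", "milk"), List.of("eggs")))) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```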
Code Example
In this example, the BPM sample data is read and used as input into the FPGrowth operator. Downstream from the FPGrowth operator, the PMML model is persisted to a local file. The PMML model contains the frequent items, frequent item sets, association rules, and various metrics about the model.
Using the FPGrowth operator in Java
// Create the FP-Growth operator.
FPGrowth fpgrowth = graph.add(new FPGrowth());
fpgrowth.setTxnFieldName("txnID");             // field containing transaction ID
fpgrowth.setItemFieldName("itemID");           // field containing item ID
fpgrowth.setMinSupport(0.02);                  // min support for frequent items
fpgrowth.setMinConfidence(0.2);                // min confidence for rules
fpgrowth.setK(2);                              // highest cardinality of item sets wanted
fpgrowth.setAnnotationText("This is an ARM model built using DataFlow");       
Using the FPGrowth operator in RushScript
var model = dr.fpgrowth(data, {
    txnFieldName:'txnID',
    itemFieldName:'itemID',
    minSupport:0.02,
    minConfidence:0.2,
    k:2,
    annotationText:'This is an ARM model built using DataFlow'
    });
Properties
The FPGrowth operator provides the following properties.
| Name | Type | Description |
| --- | --- | --- |
| annotationText | String | The text provided is added as an annotation to the output PMML model. This can be used to tag a model for later reference. The text is opaque and is added to the annotation element within the model. |
| k | int | The largest item set cardinality wanted. The default value is zero, which means that item sets of all cardinalities are discovered. |
| txnFieldName | String | The transaction identifier field name property. |
| itemFieldName | String | The item field name property. |
| labelFieldName | String | The optional label field name property. |
| minSupport | double | The minimum support property. This property is a floating point number between 0 and 1 (exclusive) that defines the percentage of transactions an item must participate in to be considered frequent. |
| minConfidence | double | The minimum confidence threshold. The minimum confidence value must be between 0.0 and 1.0 (exclusive). |
Ports
The FPGrowth operator provides a single input port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| input | | getInput() | The input data. Must have at least two fields. |
The FPGrowth operator provides two output ports.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| itemSets | | getItemSets() | The discovered frequent item sets. |
| model | | getOutput() | Outputs the generated PMML-based association model. |
ConvertARMModel Operator
The ConvertARMModel operator converts a PMML model created by the FPGrowth operator into other formats. PMML is a standard format for data mining models developed by the Data Mining Group (DMG).
The operator currently supports a single output format: GEXF. GEXF is a standard format accepted by many graph visualization tools. It was originally developed for the open source Gephi tool and is also supported by other visualization tools.
When converting into the GEXF format, the frequent item sets of the PMML model are used. Each frequent item is added as a node in the graph. Attributes are added to each node indicating the support metric for the item.
Each relationship between items is added as an edge. Edges are weighted by how many times items are found in frequent item sets together. After the GEXF file is created, the Gephi tool can be launched to visualize the results. The Gephi tool is not included as part of DataFlow. You can download it from http://gephi.org/.
A path name is required to specify where to persist the converted model. When executed, the operator applies the specified conversion of the input PMML model into the specified format. The results are written to the specified file. If the file exists already, it will be overwritten.
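The node and edge construction described above can be sketched in plain Java. This is an assumption-laden illustration and not the operator's actual output: the class and method names, the sample item sets, and the exact choice of GEXF elements are hypothetical, and the file produced by ConvertARMModel may differ. The sketch weights each item pair by the number of frequent item sets containing both items, then renders nodes with a support attribute and weighted edges as GEXF 1.2 text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not a DataFlow class.
class GexfSketch {

    /** For each unordered item pair, counts the frequent item sets containing both items. */
    static Map<List<String>, Integer> edgeWeights(List<List<String>> itemSets) {
        Map<List<String>, Integer> weights = new LinkedHashMap<>();
        for (List<String> itemSet : itemSets) {
            // Sort and de-duplicate so (a, b) and (b, a) map to the same key.
            List<String> items = new ArrayList<>(new TreeSet<>(itemSet));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    weights.merge(List.of(items.get(i), items.get(j)), 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    /** Escapes characters that are unsafe inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders items as nodes (with a support attribute) and pairs as weighted edges. */
    static String toGexf(Map<String, Double> itemSupport, Map<List<String>, Integer> weights) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n");
        xml.append("  <graph defaultedgetype=\"undirected\">\n");
        xml.append("    <attributes class=\"node\">\n");
        xml.append("      <attribute id=\"0\" title=\"support\" type=\"double\"/>\n");
        xml.append("    </attributes>\n");
        xml.append("    <nodes>\n");
        for (Map.Entry<String, Double> node : itemSupport.entrySet()) {
            String id = escape(node.getKey());
            xml.append("      <node id=\"").append(id).append("\" label=\"").append(id).append("\">\n");
            xml.append("        <attvalues><attvalue for=\"0\" value=\"")
               .append(node.getValue()).append("\"/></attvalues>\n");
            xml.append("      </node>\n");
        }
        xml.append("    </nodes>\n");
        xml.append("    <edges>\n");
        int edgeId = 0;
        for (Map.Entry<List<String>, Integer> edge : weights.entrySet()) {
            xml.append("      <edge id=\"").append(edgeId++)
               .append("\" source=\"").append(escape(edge.getKey().get(0)))
               .append("\" target=\"").append(escape(edge.getKey().get(1)))
               .append("\" weight=\"").append(edge.getValue()).append("\"/>\n");
        }
        xml.append("    </edges>\n  </graph>\n</gexf>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Made-up frequent item sets and support values.
        List<List<String>> itemSets = List.of(
                List.of("bread", "milk"),
                List.of("bread", "milk", "eggs"),
                List.of("eggs", "milk"));
        Map<String, Double> support = new LinkedHashMap<>();
        support.put("bread", 0.6);
        support.put("milk", 0.7);
        support.put("eggs", 0.4);
        System.out.println(toGexf(support, edgeWeights(itemSets)));
    }
}
```

In this sample, bread and milk appear together in two item sets, so their edge receives a weight of 2, while bread and eggs share one item set and receive a weight of 1.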
Code Examples
Using the ConvertARMModel operator in Java
// Create converter and set properties.
ConvertARMModel converter = graph.add(new ConvertARMModel());
converter.setConversionType(ConversionType.GEXF);
converter.setTarget("/path/to/output/results.gexf");

// Connect FP-Growth learner model output with converter input
graph.connect(learner.getOutput(), converter.getInput());
Using the ConvertARMModel operator in RushScript
dr.convertARMModel(model, {conversionType:'GEXF', target:'/path/to/output/results.gexf'});
Properties
The ConvertARMModel operator has the following properties.
| Name | Type | Description |
| --- | --- | --- |
| conversionType | ConversionType | The type of conversion to apply. This property is set using an enumerated type. Default: GEXF. |
| target | String | Pathname of the target file to create with the results of the conversion. |
Ports
The ConvertARMModel operator provides a single input port.
| Name | Type | Get Method | Description |
| --- | --- | --- | --- |
| input | | getInput() | The input PMML model to convert. The model must be an association model built by the FPGrowth operator or another ARM operator. |
The ConvertARMModel operator has no output ports.
Last modified date: 03/10/2025