Association Rule Mining Operators
Association Rule Mining (ARM) is an unsupervised learning technique for finding relationships among items contained within transactions. ARM is typically applied to sales transactions, where each transaction contains line items. Each line item provides a transaction identifier and an item identifier. ARM uses this transaction-to-item relationship to find relationships between sets of items. The two main results of ARM are frequent item sets and association rules.
Frequent item sets capture sets of items that are considered frequent, along with some statistics about the sets. Items and item sets are considered frequent if they meet a minimum support threshold. The minimum support is usually quantified as a percentage of transactions within a data set. For example, setting the minimum support to 5% implies that an item is only frequent if it appears in at least 5% of the transactions within a data set. If a data set contains 5 million transactions (unique transactions, not just records), then an item must appear in at least 250,000 transactions to be considered frequent.
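The threshold arithmetic above can be sketched in plain Java. This is an illustration only and is not part of the DataFlow API; the class name MinSupportCount and the method minTransactionCount are hypothetical. The minimum support is passed as a decimal string so that BigDecimal computes the 5% example exactly.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper, not a DataFlow class.
final class MinSupportCount {

    /**
     * Returns the smallest number of transactions an item must appear in
     * to meet the given minimum support (for example "0.05" for 5%).
     */
    static long minTransactionCount(long totalTransactions, String minSupport) {
        return new BigDecimal(minSupport)
                .multiply(BigDecimal.valueOf(totalTransactions))
                .setScale(0, RoundingMode.CEILING) // round fractional counts up
                .longValueExact();
    }

    public static void main(String[] args) {
        // 5% of 5 million unique transactions
        System.out.println(minTransactionCount(5_000_000L, "0.05"));
    }
}
```

Rounding up reflects that an item appearing in a fractional number of transactions is not possible, so the threshold is the next whole transaction count.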
Item sets can have a cardinality of 1 (meaning one item per set) up to a maximum value, usually denoted as k. Frequent item sets are useful in finding sets of items that appear together in transactions and may be opportunities for up-sell and cross-sell activities.
Association rules are another result of ARM. Association rules provide information about frequent item sets and how they interact with each other. From association rules, it can be determined how the presence of one set of items within a transaction influences the presence of another set of items. The rules consist of:
Antecedent
Specifies the frequent item set driving the rule (antecedent implies consequent)
Consequent
Specifies the frequent item set that tends to appear in a transaction when the antecedent is present
Confidence
Specifies the confidence of a rule, defined as:
Confidence(Antecedent => Consequent) = Support(Antecedent U Consequent) / Support(Antecedent)
A confidence of 25% implies that the rule is correct for 25% of the transactions containing the antecedent items. The higher the confidence value, the higher the strength of the rule.
Lift
Specifies the lift of a rule, defined as:
Lift(Antecedent => Consequent) = Support(Antecedent U Consequent) / (Support(Antecedent) * Support(Consequent))
Lift is the ratio of the observed support of the combined item sets to the support expected if the antecedent and consequent were independent of each other. A lift near 1 indicates the two item sets are close to independent, while a lift well above 1 indicates they appear together more often than chance would predict, which strengthens the case for the rule. In general, confidence alone cannot be used to judge the validity of a rule. A rule can have high confidence but a low lift, implying the rule may not be useful.
Support
Specifies the ratio of the number of transactions where the antecedent and consequent are present to the total number of transactions.
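The three metrics defined above can be worked through with raw transaction counts. The following is a minimal sketch in plain Java, not a DataFlow class; the name RuleMetrics and the sample counts are made up for illustration. The lift formula is rewritten in terms of counts: dividing each support by the total N gives (both * N) / (antecedent * consequent).

```java
// Hypothetical helper, not a DataFlow class.
final class RuleMetrics {

    /** Support(A U C): fraction of all transactions containing both item sets. */
    static double support(long countBoth, long totalTransactions) {
        return (double) countBoth / totalTransactions;
    }

    /** Confidence(A => C) = Support(A U C) / Support(A). */
    static double confidence(long countBoth, long countAntecedent) {
        return (double) countBoth / countAntecedent;
    }

    /** Lift(A => C) = Support(A U C) / (Support(A) * Support(C)). */
    static double lift(long countBoth, long countAntecedent,
                       long countConsequent, long totalTransactions) {
        return ((double) countBoth * totalTransactions)
                / ((double) countAntecedent * countConsequent);
    }

    public static void main(String[] args) {
        // Made-up counts: 1000 transactions, antecedent in 200,
        // consequent in 250, both together in 50.
        System.out.println(support(50, 1000));
        System.out.println(confidence(50, 200));
        System.out.println(lift(50, 200, 250, 1000));
    }
}
```

With these counts the rule has a confidence of 25% but a lift of exactly 1, meaning the consequent appears with the antecedent no more often than it would by chance. Raising the joint count to 100 doubles both the confidence (50%) and the lift (2).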
DataFlow provides operators for calculating the frequent items and the association rules. If only the frequent items are wanted, use the FrequentItems operator. For a combination of the frequent items and association rules, use the FPGrowth operator. The ConvertARMModel operator can be used to convert an association PMML model into other formats. For more information, refer to the following topics:
FrequentItems Operator
The FrequentItems operator computes the frequency of items within the given transactions. Two fields are needed in the input flow: one to identify transactions and another to identify items.
A minimum support value must also be specified as the fraction of transactions an item must participate in to be considered frequent. Lowering this threshold typically returns more frequent items. An optional label field may also be specified that will be used for display purposes instead of the transaction identifiers.
The main output of this operator is the set of frequent items. Two fields are contained in the output: the item field from the input and the frequency count of the items that are above the minimum support value.
The operator also outputs a PMML port that contains an association model. The model will be partially filled in with frequent items and transaction statistics.
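To make the computation concrete, the following plain Java sketch shows one way frequent items could be counted from line-item records. It is not the operator's implementation and does not use the DataFlow API; the class FrequentItemCount, its method, and the sample data are hypothetical. Each record is assumed to be a two-element array of transaction identifier and item identifier.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not a DataFlow class.
final class FrequentItemCount {

    /**
     * Counts, for each item, the number of distinct transactions containing it
     * and keeps only the items that meet the minimum support.
     * Each row is {transactionId, itemId}.
     */
    static Map<String, Integer> frequentItems(List<String[]> rows, double minSupport) {
        Set<String> transactions = new HashSet<>();
        Map<String, Set<String>> txnsByItem = new HashMap<>();
        for (String[] row : rows) {
            transactions.add(row[0]);
            txnsByItem.computeIfAbsent(row[1], k -> new HashSet<>()).add(row[0]);
        }
        // Support is measured against unique transactions, not records.
        double threshold = minSupport * transactions.size();
        Map<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : txnsByItem.entrySet()) {
            if (e.getValue().size() >= threshold) {
                frequent.put(e.getKey(), e.getValue().size());
            }
        }
        return frequent;
    }
}
```

For example, with four transactions where item A appears in three, B in two, and C in one, a minimum support of 0.5 keeps A and B and drops C.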
Code Example
This example uses a movie ratings data set to determine which movies have been rated by at least 40% of the users.
Using the FrequentItems operator in Java
// Create frequent items operator with minimum support set to 40%
FrequentItems freqItems = graph.add(new FrequentItems("userID", "movieID", 0.4D));
The following example demonstrates using the operator in RushScript. Note that the operator has two output ports: the frequent items and a PMML model. Use the port names on the returned data set container variable to access a specific port.
Using the FrequentItems operator in RushScript
var freqItem = dr.frequentItems(data, {txnFieldName:'userID', itemFieldName:'movieID', minSupport:0.4});
Properties
The FrequentItems operator provides the following properties.
Ports
The FrequentItems operator provides a single input port.
The FrequentItems operator provides the following output ports.
FPGrowth Operator
The FPGrowth operator implements the FP-growth algorithm, creating a PMML model that contains generated item sets and association rules. The FP-growth algorithm is optimized to require only two passes over the input data. Other ARM algorithms, such as Apriori, require many passes over the data.
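The two passes can be sketched in plain Java as follows. This follows the general FP-growth algorithm as commonly described and is not DataFlow's implementation: the first pass counts items, and the second pass drops infrequent items from each transaction and orders the rest by descending frequency, which is the order in which they would be inserted into the FP-tree. Tree construction and mining are omitted, the class FpGrowthPasses is hypothetical, and each item is assumed to appear at most once per transaction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two data passes; not a DataFlow class.
final class FpGrowthPasses {

    /** Pass 1: count how many transactions contain each item. */
    static Map<String, Integer> countItems(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> txn : transactions) {
            for (String item : txn) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Pass 2: keep only frequent items in each transaction and sort them by
     * descending count (ties broken by item name) for FP-tree insertion.
     */
    static List<List<String>> orderForTree(List<List<String>> transactions,
                                           Map<String, Integer> counts,
                                           int minCount) {
        List<List<String>> ordered = new ArrayList<>();
        for (List<String> txn : transactions) {
            List<String> kept = new ArrayList<>();
            for (String item : txn) {
                if (counts.get(item) >= minCount) {
                    kept.add(item);
                }
            }
            kept.sort((a, b) -> {
                int byCount = Integer.compare(counts.get(b), counts.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            });
            ordered.add(kept);
        }
        return ordered;
    }
}
```

Because frequent items are ordered the same way in every transaction, transactions that share frequent items share tree prefixes, which is what lets the algorithm avoid further passes over the raw data.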
The input data is required to have two fields: a transaction identifier and an item identifier. The transaction identifier discriminates transactions. The item identifier indicates items within transactions.
Transactions are assumed to be in line item order (one input record per item in a transaction). Transaction records are also assumed to be contiguous, meaning all records for a given transaction appear together in the input. An optional label field may also be specified that will be used for display purposes instead of the transaction identifiers.
The operator has two outputs: the frequent item sets and a PMML model. The frequent item sets are output in pivoted form with a set identifier and one item per row. The output model is a PMML-based association model. The model contains frequent items, frequent item sets, and association rules.
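The pivoted form can be pictured with a short plain Java sketch. The exact field names and set identifier values produced by the operator are not stated here, so the zero-based set identifier and the two-column row layout below are assumptions made for illustration, and the class PivotedItemSets is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of pivoted item set rows; not a DataFlow class.
final class PivotedItemSets {

    /**
     * Flattens item sets into rows of {setId, item}, one item per row.
     * The set identifier here is simply the zero-based position of the set.
     */
    static List<String[]> pivot(List<List<String>> itemSets) {
        List<String[]> rows = new ArrayList<>();
        for (int setId = 0; setId < itemSets.size(); setId++) {
            for (String item : itemSets.get(setId)) {
                rows.add(new String[] {String.valueOf(setId), item});
            }
        }
        return rows;
    }
}
```

An item set with two members therefore yields two rows that share the same set identifier, and the original sets can be recovered by grouping rows on that identifier.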
Code Example
In this example, the BPM sample data is read and used as input into the FPGrowth operator. Downstream from the FPGrowth operator, the PMML model is persisted to a local file. The PMML model contains the frequent items, frequent item sets, association rules, and various metrics about the model.
Using the FPGrowth operator in Java
// Create the FP-Growth operator.
FPGrowth fpgrowth = graph.add(new FPGrowth());
fpgrowth.setTxnFieldName("txnID"); // field containing transaction ID
fpgrowth.setItemFieldName("itemID"); // field containing item ID
fpgrowth.setMinSupport(0.02); // min support for frequent items
fpgrowth.setMinConfidence(0.2); // min confidence for rules
fpgrowth.setK(2); // highest cardinality of item sets wanted
fpgrowth.setAnnotationText("This is an ARM model built using DataFlow");
Using the FPGrowth operator in RushScript
var model = dr.fpgrowth(data, {
txnFieldName:'txnID',
itemFieldName:'itemID',
minSupport:0.02,
minConfidence:0.2,
k:2,
annotationText:'This is an ARM model built using DataFlow'
});
Properties
The FPGrowth operator provides the following properties.
Ports
The FPGrowth operator provides a single input port.
The FPGrowth operator provides two output ports.
ConvertARMModel Operator
The ConvertARMModel operator converts a PMML model created by the FPGrowth operator into other formats. PMML is a standard format for data mining models developed by the Data Mining Group (DMG).
The operator currently supports one output format: GEXF. This format is a standard accepted by many graph visualization tools. GEXF was originally developed for the open source Gephi tool and is also supported by other visualization tools.
When converting into the GEXF format, the frequent item sets of the PMML model are used. Each frequent item is added as a node in the graph. Attributes are added to each node indicating the support metric for the item.
Each relationship between items is added as an edge. Edges are weighted by how many times items are found in frequent item sets together. After the GEXF file is created, the Gephi tool can be launched to visualize the results. The Gephi tool is not included as part of DataFlow. You can download it from http://gephi.org/.
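The edge weighting described above can be sketched in plain Java. This is an interpretation of that description, not the operator's actual code, and it does not write a GEXF file; the class ItemSetEdgeWeights and the "a|b" edge key format are hypothetical. Each unordered pair of items that appears in the same frequent item set adds 1 to the weight of the edge between them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of edge weighting; not a DataFlow class.
final class ItemSetEdgeWeights {

    /**
     * Returns a map from "itemA|itemB" (names in sorted order) to the number
     * of frequent item sets that contain both items.
     */
    static Map<String, Integer> edgeWeights(List<List<String>> frequentItemSets) {
        Map<String, Integer> weights = new TreeMap<>();
        for (List<String> itemSet : frequentItemSets) {
            // Sort a copy so each pair always produces the same key.
            List<String> sorted = new ArrayList<>(itemSet);
            Collections.sort(sorted);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    weights.merge(sorted.get(i) + "|" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return weights;
    }
}
```

Item sets of cardinality 1 contribute nodes but no edges, so only sets with two or more items affect the weights.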
A path name is required to specify where to persist the converted model. When executed, the operator converts the input PMML model into the specified format and writes the results to the specified file. If the file already exists, it is overwritten.
Code Examples
Using the ConvertARMModel operator in Java
// Create converter and set properties.
ConvertARMModel converter = graph.add(new ConvertARMModel());
converter.setConversionType(ConversionType.GEXF);
converter.setTarget("/path/to/output/results.gexf");
// Connect FP-Growth learner model output with converter input
graph.connect(learner.getOutput(), converter.getInput());
Using the ConvertARMModel operator in RushScript
dr.convertARMModel(model, {conversionType:'GEXF', target:'/path/to/output/results.gexf'});
Properties
The ConvertARMModel operator has the following properties.
Ports
The ConvertARMModel operator provides a single input port.
The ConvertARMModel operator has no output ports.