DF 8.2 | Association Rule Mining Operators


Association Rule Mining Operators

Association Rule Mining (ARM) is an unsupervised learning technique for finding relationships among items contained within transactions. ARM is typically applied to sales transactions, where each transaction contains line items. Each line item provides a transaction identifier and an item identifier. ARM uses this transaction-to-item relationship to find relationships between sets of items. The two main results of ARM are frequent item sets and association rules.

Frequent item sets capture sets of items that are considered frequent, along with some statistics about the sets. Items and item sets are considered frequent if they meet a minimum support threshold. The minimum support is usually expressed as a percentage of the transactions within a data set. For example, setting the minimum support to 5% means that an item is frequent only if it appears in at least 5% of the transactions within a data set. If a data set contains 5 million transactions (unique transactions, not just records), then an item must appear in at least 250,000 transactions to be considered frequent.
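The arithmetic above can be sketched in a few lines of plain Java. This is an illustrative helper only: the class and method names are assumptions and are not part of the DataFlow API.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustrative helper (hypothetical name), not part of the DataFlow API.
final class SupportThreshold {

    /** Smallest number of transactions an item must appear in to meet minSupport. */
    static long minTransactionCount(long totalTransactions, double minSupport) {
        if (minSupport < 0.0 || minSupport > 1.0) {
            throw new IllegalArgumentException("minSupport must be between 0 and 1");
        }
        // BigDecimal avoids binary floating-point error in the product before rounding up.
        return new BigDecimal(Double.toString(minSupport))
                .multiply(BigDecimal.valueOf(totalTransactions))
                .setScale(0, RoundingMode.CEILING)
                .longValueExact();
    }

    public static void main(String[] args) {
        // The example from the text: 5% of 5 million transactions.
        System.out.println(minTransactionCount(5_000_000L, 0.05)); // prints 250000
    }
}
```

Rounding up matters when the product is fractional: with 10 transactions and a minimum support of 25%, an item needs 3 transactions, not 2.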

Item sets can have a cardinality of 1 (meaning one item per set) up to a maximum value, usually denoted as k. Frequent item sets are useful in finding sets of items that appear together in transactions and may be opportunities for up-sell and cross-sell activities.

Association rules are another result of ARM. Association rules provide information about frequent item sets and how they interact with each other. From association rules, it can be determined how the presence of one set of items within a transaction influences the presence of another set of items. The rules consist of:

Antecedent

Specifies the frequent item set driving the rule (antecedent implies consequent)

Consequent

Specifies the frequent item set expected to appear in a transaction when the antecedent is present

Confidence

Specifies the confidence of a rule, defined as:

Confidence(Antecedent => Consequent) = Support(Antecedent U Consequent) / Support(Antecedent)

A confidence of 25% implies that the rule is correct for 25% of the transactions containing the antecedent items. The higher the confidence value, the higher the strength of the rule.

Lift

Specifies the lift of a rule, defined as:

Lift(Antecedent => Consequent) = Support(Antecedent U Consequent) / (Support(Antecedent) * Support(Consequent))

Lift is the ratio of the observed support of the combined item sets to the support expected if the two item sets were independent of each other. A lift of 1 indicates independence. A lift greater than 1 indicates that the two item sets occur together more often than expected, which supports the rule. Confidence alone is not sufficient for judging a rule: a rule can have high confidence but a low lift, implying the rule may not be useful.

Support

Specifies the ratio of the number of transactions where the antecedent and consequent are present to the total number of transactions.
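To make the three metrics concrete, the following plain Java sketch computes support, confidence, and lift directly from the definitions above. It is not how the DataFlow operators compute them internally. The class name, method names, and sample data are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only (hypothetical names); mirrors the formulas in the text.
final class RuleMetrics {

    /** Fraction of transactions that contain every item in the given set. */
    static double support(List<Set<String>> transactions, Set<String> items) {
        long hits = 0;
        for (Set<String> txn : transactions) {
            if (txn.containsAll(items)) {
                hits++;
            }
        }
        return (double) hits / transactions.size();
    }

    /** Support(A U C) / Support(A). */
    static double confidence(List<Set<String>> txns, Set<String> antecedent, Set<String> consequent) {
        return support(txns, union(antecedent, consequent)) / support(txns, antecedent);
    }

    /** Support(A U C) / (Support(A) * Support(C)). */
    static double lift(List<Set<String>> txns, Set<String> antecedent, Set<String> consequent) {
        return support(txns, union(antecedent, consequent))
                / (support(txns, antecedent) * support(txns, consequent));
    }

    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> both = new HashSet<>(a);
        both.addAll(b);
        return both;
    }

    static Set<String> setOf(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        // Four made-up transactions.
        List<Set<String>> txns = Arrays.asList(
                setOf("bread", "milk"),
                setOf("bread", "butter"),
                setOf("bread", "milk", "butter"),
                setOf("milk"));
        Set<String> bread = setOf("bread");
        Set<String> milk = setOf("milk");
        System.out.println("support    = " + support(txns, union(bread, milk)));
        System.out.println("confidence = " + confidence(txns, bread, milk));
        System.out.println("lift       = " + lift(txns, bread, milk));
    }
}
```

In this sample the rule bread => milk holds in 2 of the 3 transactions containing bread, so its confidence is about 67%, yet its lift is below 1 because milk appears in 3 of the 4 transactions overall. That is the situation described above, where confidence alone overstates a rule.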

DataFlow provides operators for calculating the frequent items and the association rules. If only the frequent items are wanted, use the FrequentItems operator. For a combination of the frequent items and association rules, use the FPGrowth operator. The ConvertARMModel operator can be used to convert an association PMML model into other formats. For more information, refer to the following topics:

• FrequentItems Operator

• FPGrowth Operator

• ConvertARMModel Operator

FrequentItems Operator

The FrequentItems operator computes the frequency of items within the given transactions. Two fields are needed in the input flow: one to identify transactions and another to identify items.

A minimum support value must also be specified as the percentage of transactions, expressed as a value between 0 and 1, in which an item must appear to be considered frequent. Lowering this threshold typically returns more items. An optional label field may also be specified that will be used for display purposes instead of the transaction identifiers.

The main output of this operator is the set of frequent items. Two fields are contained in the output: the item field from the input and the frequency count of the items that are above the minimum support value.
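The counting that produces this output can be approximated in plain Java as follows. This is a minimal single-threaded sketch of the idea, assuming each input record is a (transaction identifier, item identifier) pair. It is not the operator's implementation, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only (hypothetical names); not the FrequentItems operator itself.
final class FrequentItemsSketch {

    /**
     * Returns item -> number of distinct transactions containing it,
     * keeping only items that meet the minimum support.
     * Each row is {transactionId, itemId}.
     */
    static Map<String, Integer> frequentItems(List<String[]> lineItems, double minSupport) {
        Set<String> allTxns = new HashSet<>();
        Map<String, Set<String>> txnsPerItem = new TreeMap<>();
        for (String[] row : lineItems) {
            allTxns.add(row[0]);
            // A set per item so repeated records in one transaction count once.
            txnsPerItem.computeIfAbsent(row[1], k -> new HashSet<>()).add(row[0]);
        }
        Map<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : txnsPerItem.entrySet()) {
            int count = e.getValue().size();
            if ((double) count / allTxns.size() >= minSupport) {
                frequent.put(e.getKey(), count);
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        // Made-up ratings: five users, three movies, one duplicated record.
        List<String[]> rows = Arrays.asList(
                new String[] {"u1", "m1"}, new String[] {"u1", "m1"},
                new String[] {"u2", "m1"}, new String[] {"u3", "m1"},
                new String[] {"u1", "m2"}, new String[] {"u4", "m2"},
                new String[] {"u5", "m3"});
        System.out.println(frequentItems(rows, 0.4)); // prints {m1=3, m2=2}
    }
}
```

Support is measured against the number of distinct transactions rather than the number of records, which is why the duplicated (u1, m1) record does not inflate the count for m1.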

The operator also outputs a PMML port that contains an association model. The model will be partially filled in with frequent items and transaction statistics.

Code Example

In this example we are using a movie ratings data set to determine which movies have been rated by at least 40% of the users.

Using the FrequentItems operator in Java

// Create frequent items operator with minimum support set to 40%
FrequentItems freqItems = graph.add(new FrequentItems("userID", "movieID", 0.4D));

The following example demonstrates using the operator in RushScript. Note that the operator has two output ports: the frequent items and a PMML model. To access a specific port, use its port name on the returned data set container variable.

Using the FrequentItems operator in RushScript

var freqItem = dr.frequentItems(data, {txnFieldName:'userID', itemFieldName:'movieID', minSupport:0.4});

Properties

The FrequentItems operator provides the following properties.

Name	Type	Description
txnFieldName	String	The name of the input field containing the transaction identifier.
itemFieldName	String	The name of the input field containing the item name.
labelFieldName	String	The name of the input field containing the optional labels.
minSupport	double	The minimum support percentage. Must be a value between 0 and 1.

Ports

The FrequentItems operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The data on which the frequency is calculated.

The FrequentItems operator provides the following output ports.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Output data with item field and frequency count field.
model	PMMLPort	getModel()	Outputs partially filled association model built from the data.

FPGrowth Operator

The FPGrowth operator implements the FP-growth algorithm, creating a PMML model that contains generated item sets and association rules. The FP-growth algorithm is optimized to require only two passes over the input data. Other ARM algorithms, such as Apriori, require many passes over the data.
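The two passes can be illustrated with a textbook-style sketch in plain Java: the first pass counts item frequencies, and the second pass inserts each transaction into a prefix tree (the FP-tree) with its frequent items ordered from most to least frequent. The sketch covers tree construction only and does not mine item sets from the tree. It is not the operator's implementation, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative only (hypothetical names); builds an FP-tree, does not mine it.
final class FpTreeSketch {

    static final class Node {
        final String item;
        int count;
        final Map<String, Node> children = new HashMap<>();
        Node(String item) { this.item = item; }
    }

    /** Pass 1: count how many transactions contain each item. */
    static Map<String, Integer> countItems(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> txn : transactions) {
            for (String item : new HashSet<>(txn)) {
                counts.merge(item, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Pass 2: insert each transaction's frequent items, most frequent first. */
    static Node buildTree(List<List<String>> transactions, int minCount) {
        Map<String, Integer> counts = countItems(transactions);
        Comparator<String> byFrequency = (a, b) -> {
            int c = Integer.compare(counts.get(b), counts.get(a)); // descending count
            return c != 0 ? c : a.compareTo(b);                    // ties: alphabetical
        };
        Node root = new Node(null);
        for (List<String> txn : transactions) {
            List<String> kept = new ArrayList<>();
            for (String item : new HashSet<>(txn)) {
                if (counts.get(item) >= minCount) {
                    kept.add(item);
                }
            }
            kept.sort(byFrequency);
            Node node = root;
            for (String item : kept) {
                // Shared prefixes reuse existing nodes and only bump the count.
                node = node.children.computeIfAbsent(item, Node::new);
                node.count++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        List<List<String>> txns = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("b", "c"),
                Arrays.asList("a", "b", "c"), Arrays.asList("b"));
        Node root = buildTree(txns, 2);
        Node b = root.children.get("b");
        System.out.println("b=" + b.count + ", b->a=" + b.children.get("a").count);
    }
}
```

Because every transaction starts with its most frequent item, transactions that share items tend to share a path in the tree, which keeps the structure compact enough to mine without rereading the input.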

The input data is required to have two fields: a transaction identifier and an item identifier. The transaction identifier discriminates transactions. The item identifier indicates items within transactions.

Transactions are assumed to be in line item order (one input record per item in a transaction). Transaction records are also assumed to be contiguous. An optional label field may also be specified that will be used for display purposes instead of the transaction identifiers.

The operator has two outputs: the frequent item sets and a PMML model. The frequent item sets are output in pivoted form with a set identifier and one item per row. The output model is a PMML-based association model. The model contains frequent items, frequent item sets, and association rules.
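As an illustration of the pivoted form, the following plain Java sketch flattens a list of item sets into (set identifier, item) rows. The numbering of set identifiers from 0 and the class and method names are assumptions; the actual field names and identifier values produced by the operator are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only (hypothetical names and identifier scheme).
final class PivotItemSets {

    /** Emits one {setId, item} row per item in each item set. */
    static List<String[]> pivot(List<List<String>> itemSets) {
        List<String[]> rows = new ArrayList<>();
        for (int setId = 0; setId < itemSets.size(); setId++) {
            for (String item : itemSets.get(setId)) {
                rows.add(new String[] {Integer.toString(setId), item});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> itemSets = Arrays.asList(
                Arrays.asList("bread"),
                Arrays.asList("bread", "milk"));
        for (String[] row : pivot(itemSets)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

An item set of cardinality n therefore occupies n rows that share one set identifier, so grouping the output by that identifier reassembles the sets.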

Code Example

In this example, the BPM sample data is read and used as input into the FPGrowth operator. Downstream from the FPGrowth operator, the PMML model is persisted to a local file. The PMML model contains the frequent items, frequent item sets, association rules, and various metrics about the model.

Using the FPGrowth operator in Java

// Create the FP-Growth operator.
FPGrowth fpgrowth = graph.add(new FPGrowth());
fpgrowth.setTxnFieldName("txnID");             // field containing transaction ID
fpgrowth.setItemFieldName("itemID");           // field containing item ID
fpgrowth.setMinSupport(0.02);                  // min support for frequent items
fpgrowth.setMinConfidence(0.2);                // min confidence for rules
fpgrowth.setK(2);                              // highest cardinality of item sets wanted
fpgrowth.setAnnotationText("This is an ARM model built using DataFlow");

Using the FPGrowth operator in RushScript

var model = dr.fpgrowth(data, {
    txnFieldName:'txnID',
    itemFieldName:'itemID',
    minSupport:0.02,
    minConfidence:0.2,
    k:2,
    annotationText:'This is an ARM model built using DataFlow'
    });

Properties

The FPGrowth operator provides the following properties.

Name	Type	Description
annotationText	String	The text provided will be added as an annotation to the output PMML model. This can be used to tag a model for later reference. This is opaque text added to the annotation element within the model.
k	int	The largest item set cardinality wanted. The default value is zero, which means that item sets of all cardinalities are discovered.
txnFieldName	String	The transaction identifier field name property.
itemFieldName	String	The item field name property.
labelFieldName	String	The optional label field name property.
minSupport	double	The minimum support property. This property is a floating point number between 0 and 1 (exclusive) that defines the percentage of transactions an item must participate in to be considered frequent.
minConfidence	double	The minimum confidence threshold. The minimum confidence value must be between 0.0 and 1.0 (exclusive).

Ports

The FPGrowth operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data. Must have at least two fields.

The FPGrowth operator provides two output ports.

Name	Type	Get Method	Description
itemSets	RecordPort	getItemSets()	The discovered frequent item sets.
model	PMMLPort	getOutput()	Outputs the generated PMML-based association model.

ConvertARMModel Operator

The ConvertARMModel operator converts a PMML model created by the FPGrowth operator into other formats. PMML (Predictive Model Markup Language) is a standard format for data mining models developed by the Data Mining Group (DMG).

The operator currently supports one output format: GEXF (Graph Exchange XML Format). GEXF was originally developed for the open source Gephi tool and is accepted by many other graph visualization tools.

When converting into the GEXF format, the frequent item sets of the PMML model are used. Each frequent item is added as a node in the graph. Attributes are added to each node indicating the support metric for the item.

Each relationship between items is added as an edge. Edges are weighted by how many times items are found in frequent item sets together. After the GEXF file is created, the Gephi tool can be launched to visualize the results. The Gephi tool is not included as part of DataFlow. You can download it from http://gephi.org/.
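The node and edge construction described above can be sketched in plain Java as follows. This is a rough illustration under several assumptions: item names are used directly as node identifiers and are assumed to contain neither the "|" character nor XML special characters, the per-node support attributes are omitted, and the exact layout of the file written by the operator may differ. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only (hypothetical names); not the ConvertARMModel implementation.
final class ItemSetGraph {

    /** Edge weight = number of frequent item sets containing both items. Key is "itemA|itemB". */
    static Map<String, Integer> edgeWeights(List<List<String>> itemSets) {
        Map<String, Integer> weights = new TreeMap<>();
        for (List<String> set : itemSets) {
            List<String> items = new ArrayList<>(new TreeSet<>(set)); // sorted, de-duplicated
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    weights.merge(items.get(i) + "|" + items.get(j), 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    /** Renders a minimal GEXF 1.2 document with one node per item and one weighted edge per pair. */
    static String toGexf(List<List<String>> itemSets) {
        Set<String> nodes = new TreeSet<>();
        for (List<String> set : itemSets) {
            nodes.addAll(set);
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n");
        xml.append("  <graph defaultedgetype=\"undirected\">\n    <nodes>\n");
        for (String n : nodes) {
            xml.append(String.format("      <node id=\"%s\" label=\"%s\"/>\n", n, n));
        }
        xml.append("    </nodes>\n    <edges>\n");
        int id = 0;
        for (Map.Entry<String, Integer> e : edgeWeights(itemSets).entrySet()) {
            String[] ends = e.getKey().split("\\|");
            xml.append(String.format(
                    "      <edge id=\"%d\" source=\"%s\" target=\"%s\" weight=\"%d\"/>\n",
                    id++, ends[0], ends[1], e.getValue()));
        }
        xml.append("    </edges>\n  </graph>\n</gexf>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        List<List<String>> itemSets = Arrays.asList(
                Arrays.asList("bread"), Arrays.asList("milk"),
                Arrays.asList("bread", "milk"),
                Arrays.asList("bread", "milk", "butter"));
        System.out.print(toGexf(itemSets));
    }
}
```

In the sample data, bread and milk appear together in two frequent item sets, so their edge carries weight 2, while the edges involving butter carry weight 1.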

A path name is required to specify where to persist the converted model. When executed, the operator converts the input PMML model into the specified format and writes the result to the specified file. If the file already exists, it is overwritten.

Code Examples

Using the ConvertARMModel operator in Java

// Create converter and set properties.
ConvertARMModel converter = graph.add(new ConvertARMModel());
converter.setConversionType(ConversionType.GEXF);
converter.setTarget("/path/to/output/results.gexf");

// Connect the model output of the FP-Growth operator (here named learner) to the converter input
graph.connect(learner.getOutput(), converter.getInput());

Using the ConvertARMModel operator in RushScript

dr.convertARMModel(model, {conversionType:'GEXF', target:'/path/to/output/results.gexf'});

Properties

The ConvertARMModel operator has the following properties.

Name	Type	Description
conversionType	ConversionType	The type of conversion to apply. This property is set using an enumerated type. Default: GEXF.
target	String	Pathname of the target file to create with the results of the conversion.

Ports

The ConvertARMModel operator provides a single input port.

Name	Type	Get Method	Description
input	PMMLPort	getInput()	The input PMML model to convert. The model must be an association model built by the FPGrowth operator or another ARM operator.

The ConvertARMModel operator has no output ports.

Last modified date: 03/10/2025