DF 8.2 | Naive Bayes Operators

Naive Bayes Operators

The Naive Bayes algorithm uses Bayes’ theorem to compute the probability of an event given that another, related event has already been observed.

The Naive Bayes learner produces a model that the Naive Bayes predictor applies as a simple probabilistic classifier. The classifier applies Bayes’ theorem under a strong (naive) assumption that the input fields are independent of one another given the class.
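For reference, the independence assumption leads to the following standard textbook formulation. This is the general form of a naive Bayes classifier, not a statement about DataFlow's internal implementation; the symbols c (a class value) and x_1, ..., x_n (the input field values) are introduced here for illustration only.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The predicted class is the one with the largest product of the class prior and the per-field conditional probabilities.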

One advantage of a naive Bayes classifier is that it requires only a relatively small amount of training data to estimate the parameters necessary for classification.

DataFlow provides operators to produce and use naive Bayes classifiers. The learner is used to determine the classification rules for a particular data set while the predictor can apply these rules to a data set. For more information, refer to the following topics:

• NaiveBayesLearner Operator

• NaiveBayesPredictor Operator

NaiveBayesLearner Operator

The NaiveBayesLearner operator is responsible for building a Naive Bayes PMML model from input data. The base algorithm used is specified at http://www.dmg.org/v4-0-1/NaiveBayes.html, with the following differences:

• Supports prediction based on numerical data. For numerical fields, probability is computed under the assumption of a Gaussian distribution.

• Uses Laplace smoothing in place of the "threshold" parameter.

• Provides an option to count missing values. If selected, a missing value is treated like any other single distinct value, and its probability is calculated from the ratio of missing to non-missing values.

• Performs calculations in terms of log-likelihood rather than likelihood, to avoid underflow on data with a large number of fields.
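The Gaussian and log-likelihood points above can be illustrated with a small, standalone sketch. It is plain Java and does not use the DataFlow API; the class name LogLikelihoodSketch and its methods are hypothetical, and the scoring shown is the generic textbook calculation rather than the operator's actual code. It compares a plain product of per-field Gaussian densities with the equivalent sum of log densities for a record with 100 numerical fields.

```java
import java.util.Arrays;

// Illustrative only: not part of the DataFlow library.
public class LogLikelihoodSketch {

    /** Natural log of the Gaussian density with the given mean and variance, evaluated at x. */
    public static double gaussianLogDensity(double x, double mean, double variance) {
        double diff = x - mean;
        return -0.5 * Math.log(2.0 * Math.PI * variance) - (diff * diff) / (2.0 * variance);
    }

    /** Class score in log space: log prior plus the sum of per-field log densities. */
    public static double logScore(double logPrior, double[] x, double[] means, double[] variances) {
        double score = logPrior;
        for (int i = 0; i < x.length; i++) {
            score += gaussianLogDensity(x[i], means[i], variances[i]);
        }
        return score;
    }

    /** The same score computed as a plain product of densities (prone to underflow). */
    public static double productScore(double prior, double[] x, double[] means, double[] variances) {
        double score = prior;
        for (int i = 0; i < x.length; i++) {
            score *= Math.exp(gaussianLogDensity(x[i], means[i], variances[i]));
        }
        return score;
    }

    public static void main(String[] args) {
        int fields = 100;
        double[] x = new double[fields];
        Arrays.fill(x, 4.0);                      // every field is 4 standard deviations from the mean
        double[] means = new double[fields];      // all 0.0
        double[] variances = new double[fields];
        Arrays.fill(variances, 1.0);

        // The product of 100 small densities underflows to 0.0 in double precision.
        System.out.println("product: " + productScore(0.5, x, means, variances));
        // The log-space score stays finite, so classes can still be compared.
        System.out.println("log:     " + logScore(Math.log(0.5), x, means, variances));
    }
}
```

Because the logarithm is monotonic, comparing log scores selects the same winning class as comparing the raw products would, without the loss of precision.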

Code Example

This example uses Naive Bayes to build a predictive classification model from the Iris data set, using the field "class" as the target column. The resulting PMML model is persisted and can then be used with the NaiveBayesPredictor operator to predict target values.

Using the NaiveBayesLearner operator in Java

// Run Naive Bayes using "class" as the target column.
// All other columns are used as learning columns by default.
NaiveBayesLearner nbLearner = graph.add(new NaiveBayesLearner());
nbLearner.setTargetColumn("class");

Using the NaiveBayesLearner operator in RushScript

// Run Naive Bayes using "class" as the target column.
// All other columns are used as learning columns by default.
var model = dr.naiveBayesLearner(data, {targetColumn:'class'});

Properties

The NaiveBayesLearner operator provides the following properties.

Name	Type	Description
learningColumns	List<String>	The list of columns to be used to predict the output value. Default of empty list means "everything but targetColumn".
targetColumn	String	The name of the column to be predicted. Must be a column of type string.

Ports

The NaiveBayesLearner operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data. String fields are assumed to be categorical. Double fields are assumed to be numerical. All other fields are ignored.

The NaiveBayesLearner operator provides a single output port.

Name	Type	Get Method	Description
model	PMMLPort	getModel()	The Naive Bayes PMML model.

NaiveBayesPredictor Operator

The NaiveBayesPredictor operator applies a previously built Naive Bayes model to the input data. The base algorithm used is specified at http://www.dmg.org/v4-0-1/NaiveBayes.html, with the following differences:

• Supports prediction based on numerical data. For numerical fields, probability is computed under the assumption of a Gaussian distribution.

• Uses Laplace smoothing in place of the "threshold" parameter.

• Performs calculations in terms of log-likelihood rather than likelihood.

Code Example

Using the NaiveBayesPredictor operator in Java

// Create the Naive Bayes predictor operator and add it to a graph
NaiveBayesPredictor predictor = graph.add(new NaiveBayesPredictor());
predictor.setAppendProbabilities(false);

// Connect the predictor to an input data source and a model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use

Using the NaiveBayesPredictor operator in RushScript

// Apply a naive Bayes model to the given data
var classifiedData = dr.naiveBayesPredictor(model, data, {appendProbabilities:false});

Properties

The NaiveBayesPredictor operator provides the following properties.

Name	Type	Description
appendProbabilities	boolean	Whether to include probabilities in the prediction. Default: true.
laplaceCorrector	double	The Laplace corrector to be used. The Laplace corrector is a way to handle "zero" counts in the training data. Otherwise a value that was never observed in the training data results in zero probability. The default of 0.0 means no correction. The "threshold" value specified in the PMML model will always be ignored in favor of the Laplace corrector specified on a NaiveBayesPredictor.
ignoreMissingValues	boolean	Whether to ignore missing values. If set to true, missing values are ignored for the purposes of prediction; otherwise missing values are considered when calculating probability distribution. Default: true.
probabilityPrefix	String	The field name prefix to use for probabilities. Default: "probability_"
winnerField	String	The name of the winner field to output. Default: "winner"
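To make the effect of the laplaceCorrector property more concrete, the following standalone sketch shows the common additive-smoothing formula. It is an assumption that this form matches what the operator computes; the exact formula is not given here, and the class LaplaceCorrectorSketch and its method are hypothetical names that are not part of the DataFlow API. With a corrector of 0.0 the estimate reduces to the plain observed frequency, which is consistent with the documented default of no correction.

```java
// Illustrative only: not part of the DataFlow library.
public class LaplaceCorrectorSketch {

    /**
     * Estimates P(value | class) with additive smoothing.
     *
     * @param valueCount     training rows in the class that have this field value
     * @param classCount     total training rows in the class
     * @param distinctValues number of distinct values the field can take
     * @param corrector      the Laplace corrector; 0.0 means no correction
     */
    public static double smoothedProbability(long valueCount, long classCount,
                                             int distinctValues, double corrector) {
        return (valueCount + corrector) / (classCount + corrector * distinctValues);
    }

    public static void main(String[] args) {
        // A value never observed with this class: 0 of 50 rows, field has 3 distinct values.
        // Without correction the probability is zero, which rules the class out entirely.
        System.out.println(smoothedProbability(0, 50, 3, 0.0));
        // With a corrector of 1.0 the estimate becomes 1/53, small but non-zero.
        System.out.println(smoothedProbability(0, 50, 3, 1.0));
    }
}
```

A non-zero corrector keeps a single unseen value from forcing the whole class probability to zero, at the cost of slightly flattening the estimated distribution.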

Ports

The NaiveBayesPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The input data to which the Naive Bayes model is applied.
model	PMMLPort	getModel()	Naive Bayes model in PMML to apply.

The NaiveBayesPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the model to the input data.

Last modified date: 03/10/2025