DF 8.2 | Regression Analysis

Regression Analysis

Regression analysis comprises a range of techniques for modeling and analyzing the relationship between a dependent variable and one or more independent variables.

Regression analysis is applied when you want to understand how the typical value of the dependent variable changes as the independent variables are varied. It is therefore useful in finding the relationship between variables. This information can then be used to predict the expected values of a variable given the current values of the other variables.

DataFlow provides several operators that can be used for regression analysis. For more information, refer to the following topics:

• LinearRegressionLearner Operator

• RegressionPredictor Operator

• LogisticRegressionLearner Operator

• LogisticRegressionPredictor Operator

LinearRegressionLearner Operator

The LinearRegressionLearner operator performs a multivariate linear regression on the given training data. The output is a PMML model describing the resultant regression model. The model consists of the y-intercept and coefficients for each of the given independent variables.

A dependent variable must be specified. This is a field in the input that is the target of the linear regression model. One or more independent variables are also required from the input data.

This operator supports numeric as well as categorical data as input. The linear regression is performed using an Ordinary Least Squares (OLS) fit. Dummy Coding is used to handle categorical variables.

Dummy coding requires that, for each categorical variable, one value from its domain is chosen to serve as the reference for all other values in that domain when the model is computed. Specifying reference values through the operator's API is optional. If no reference value is specified for a categorical variable, one is chosen at random from its domain.
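The effect of dummy coding can be illustrated with a minimal plain-Java sketch. It is not part of the DataFlow API: the class `DummyCodingSketch` and its `encode` method are hypothetical names introduced here, and the operator's internal encoding may differ in detail. The sketch maps a categorical value to one 0/1 indicator per nonreference value, so the reference value becomes all zeros.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative dummy coding; hypothetical helper, not DataFlow API. */
final class DummyCodingSketch {

    /**
     * Encodes a categorical value as 0/1 indicators, one per nonreference
     * value of the domain (in domain order). The reference value itself
     * is encoded as all zeros.
     */
    static int[] encode(List<String> domain, String reference, String value) {
        if (!domain.contains(reference) || !domain.contains(value)) {
            throw new IllegalArgumentException("reference and value must be in the domain");
        }
        List<String> nonReference = new ArrayList<>(domain);
        nonReference.remove(reference);
        int[] indicators = new int[nonReference.size()];
        int index = nonReference.indexOf(value);
        if (index >= 0) {
            indicators[index] = 1; // the reference value never reaches this branch
        }
        return indicators;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("blue", "green", "red");
        System.out.println(Arrays.toString(encode(colors, "blue", "blue"))); // [0, 0]
        System.out.println(Arrays.toString(encode(colors, "blue", "red")));  // [0, 1]
    }
}
```

With "blue" as the reference, every other color is measured relative to it, which is why the reference value carries a coefficient of 0 in the fitted model.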

The output is an estimate of coefficients for the model:

\[
Y = a + (b_1 x_1 + \dots + b_n x_n) + (0 \cdot w_1^{\mathrm{ref}} + c_{1,1} w_{1,1} + \dots + c_{1,k_1} w_{1,k_1} + \dots + 0 \cdot w_m^{\mathrm{ref}} + c_{m,1} w_{m,1} + \dots + c_{m,k_m} w_{m,k_m})
\]

where

• $a$ is the constant term (also known as the intercept)

• $n$ is the number of numeric input variables

• $b_i$, $0 < i \le n$, is the coefficient for the numeric input variable $x_i$

• $m$ is the number of categorical input variables

• $w_i^{\mathrm{ref}}$, $0 < i \le m$, is the reference value of the categorical variable $w_i$

• $k_i$, $0 < i \le m$, is the domain size of the categorical variable $w_i$

• $c_{i,j}$, $0 < i \le m$, $0 < j \le k_i$, is the coefficient for the $j$th nonreference value $w_{i,j}$ of the $i$th categorical input variable $w_i$
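To make the model equation concrete, the following plain-Java sketch evaluates it for a single record. It is an illustration under stated assumptions rather than DataFlow code: the class `LinearModelSketch`, its fields, and the coefficient values used with it are hypothetical. Each numeric variable contributes its coefficient times its value, and each categorical variable contributes the coefficient of its current value, or 0 when that value is the reference.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative evaluation of a dummy-coded linear model; not DataFlow API. */
final class LinearModelSketch {
    final double intercept;
    /** Numeric variable name -> coefficient b_i. */
    final Map<String, Double> numericCoefficients = new HashMap<>();
    /** Categorical variable name -> (nonreference value -> coefficient c_ij). */
    final Map<String, Map<String, Double>> categoricalCoefficients = new HashMap<>();

    LinearModelSketch(double intercept) {
        this.intercept = intercept;
    }

    /** Computes Y for one record; every numeric variable must have a value. */
    double predict(Map<String, Double> numericValues, Map<String, String> categoricalValues) {
        double y = intercept;
        for (Map.Entry<String, Double> entry : numericCoefficients.entrySet()) {
            y += entry.getValue() * numericValues.get(entry.getKey());
        }
        for (Map.Entry<String, Map<String, Double>> entry : categoricalCoefficients.entrySet()) {
            String value = categoricalValues.get(entry.getKey());
            // The reference value has no coefficient entry, so it contributes 0.
            // This sketch also treats values it has never seen as contributing 0.
            y += entry.getValue().getOrDefault(value, 0.0);
        }
        return y;
    }
}
```

For example, with intercept 1.0, a coefficient of 2.0 for x1, and coefficients 0.5 for "green" and -1.0 for "red" on x2 (reference "blue"), a record with x1 = 3 and x2 = "blue" evaluates to 7.0, and the same record with x2 = "red" evaluates to 6.0.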

The following assumptions are made about the nature of input data:

• Independent variables must be linearly independent from each other.

• Dependent variable must be noncategorical (that is, continuous and not discrete).

• All variables loosely follow the normal distribution.

Code Example

This example uses linear regression to create a predictive model based on a simple data set. It uses the field "y" as the dependent variable and the fields "x1" and "x2" as the independent variables, where "x2" is categorical. The example produces a PMML model that can be persisted and then used with the RegressionPredictor operator to predict data values.

Using the LinearRegressionLearner operator in Java

// Run Linear Regression with y as the dependent variable field
// and x1 and x2 as the independent variable fields. x2 is a
// categorical variable field.
LinearRegressionLearner lrl = graph.add(new LinearRegressionLearner());
lrl.setDependentVariable("y");
lrl.setIndependentVariables("x1", "x2");

// Passing in reference values for categorical variable fields
// is optional. If no reference value is passed in for a
// categorical variable field, a value from that field's
// domain is randomly chosen.
// Requires java.util.Map and java.util.HashMap.
Map<String, String> varToRef = new HashMap<String, String>();
varToRef.put("x2", "blue");
lrl.setReferenceValues(varToRef);

Using the LinearRegressionLearner operator in RushScript

// Run Linear Regression with y as the dependent variable field
// and x1 and x2 as the independent variable fields. x2 is a
// categorical variable field.
//
// Passing in reference values for categorical variable fields
// is optional. If no reference value is passed in for a
// categorical variable field, a value from that field's
// domain is randomly chosen.
var results = dr.linearRegressionLearner(data, {
    dependentVariable: 'y',
    independentVariables: ['x1', 'x2'],
    referenceValues: {'x2': 'blue'}
});

Properties

The LinearRegressionLearner operator provides the following properties.

Name	Type	Description
dependentVariable	String	The field name of the dependent variable to use in the calculations.
independentVariables	String...	A list of fields to use as independent variables in the calculations.
referenceValues	Map<String,String>	A mapping from categorical variable field names to reference values.
singularityThreshold	Double	A threshold value used to determine effective singularity in LU decomposition. Default: Double.MIN_VALUE

Ports

The LinearRegressionLearner operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The training data used to compute the linear regression.

The LinearRegressionLearner operator provides a single output port.

Name	Type	Get Method	Description
output	PMMLPort	getOutput()	The model describing the resulting regression model.

RegressionPredictor Operator

The RegressionPredictor operator applies a previously built regression model to the input data. The model defines the independent variables used to create the model. The input data must contain the same dependent and independent variable fields as the training data that was used to build the model. The predicted value is added to the output dataflow as an additional field. All input fields are also transferred to the output.

Code Example

Using the RegressionPredictor operator in Java

// Create the regression predictor operator and add it to a graph
RegressionPredictor predictor = graph.add(new RegressionPredictor());

// Connect the predictor to an input data and model source
graph.connect(dataSource.getOutput(), predictor.getInput());
graph.connect(modelSource.getOutput(), predictor.getModel());

// The output of the predictor is available for downstream operators to use

Using the RegressionPredictor operator in RushScript

// Apply a regression model to the given data
var results = dr.regressionPredictor(model, data);

Properties

The RegressionPredictor operator provides the following properties.

Name	Type	Description
predictedFieldSuffix	String	The suffix to be added to the target field name. The suffix is set to "(predicted)" by default.
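As a rough illustration of how an output record relates to its input, the plain-Java sketch below copies all input fields and adds one field for the predicted value. It is not DataFlow code: the class `PredictedFieldSketch` and the method `withPrediction` are hypothetical, and forming the new field name by directly concatenating the target field name and the suffix is an assumption based on the property description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative record transformation; hypothetical helper, not DataFlow API. */
final class PredictedFieldSketch {
    /** Default suffix as documented for predictedFieldSuffix. */
    static final String DEFAULT_SUFFIX = "(predicted)";

    /**
     * Returns a copy of the input record with one extra field holding the
     * predicted value. Assumption: field name = target field name + suffix.
     */
    static Map<String, Object> withPrediction(Map<String, Object> input,
                                              String targetField,
                                              String suffix,
                                              double predicted) {
        Map<String, Object> output = new LinkedHashMap<>(input); // keep all input fields
        output.put(targetField + suffix, predicted);
        return output;
    }
}
```

Under that assumption, a record with fields "x1" and "y" would leave the predictor with the additional field "y(predicted)".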

Ports

The RegressionPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	Input data to which the regression model is applied.
model	PMMLPort	getModel()	Regression model in PMML to apply.

The RegressionPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the regression model to the input data. Contains a new field with the predicted value.

LogisticRegressionLearner Operator

The LogisticRegressionLearner operator fits a logistic regression model to the given training data using stochastic gradient descent, a probabilistic algorithm suited to very large data sets. The output is a PMML model describing the resultant classification model.

A single dependent variable must be specified. This is a field in the input data that is the target of the logistic regression model. One or more independent variables are also required from the input data.

This operator supports numerical and categorical data as input.

Take the following considerations into account when composing the operator:

• The dependent variable must be categorical.

• Very low learning rates can cause the algorithm to take a long time to converge. It is often better to set the learning rate too high and let the algorithm lower it automatically.

• Each learning rate adjustment takes an iteration to perform, so if the learning rate is set high, increase the maxIterations property accordingly.

• The smaller the training data set, the higher the maxIterations property should be set.

• The algorithm used by the operator is much more accurate when the training data set is relatively large.

• Input data that includes null or infinite values can produce nonsensical results.
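The interplay between the learning rate and the iteration count described above can be sketched in plain Java. This is a simplified two-class illustration, not the operator's actual algorithm: the class `SgdLogisticSketch`, the rule of halving the rate when a pass increases the loss, and the way the ridge penalty is applied are all assumptions made for this example. Note how a rejected pass still consumes one iteration, which is why a high starting rate calls for a larger iteration budget.

```java
import java.util.Random;

/** Illustrative binary logistic regression trained with SGD; not DataFlow code. */
final class SgdLogisticSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Linear score w[0] + w[1]*row[0] + ...; w[0] is the intercept. */
    static double score(double[] row, double[] w) {
        double z = w[0];
        for (int j = 0; j < row.length; j++) {
            z += w[j + 1] * row[j];
        }
        return z;
    }

    /** Average log loss of the weights over the training data. */
    static double logLoss(double[][] x, int[] y, double[] w) {
        double total = 0.0;
        for (int i = 0; i < x.length; i++) {
            // Clamp probabilities so the logarithm stays finite.
            double p = Math.min(Math.max(sigmoid(score(x[i], w)), 1e-12), 1.0 - 1e-12);
            total -= (y[i] == 1) ? Math.log(p) : Math.log(1.0 - p);
        }
        return total / x.length;
    }

    static double[] train(double[][] x, int[] y, double learningRate,
                          double ridge, int maxIterations, long seed) {
        Random random = new Random(seed);
        double[] w = new double[x[0].length + 1];
        double rate = learningRate;
        double bestLoss = logLoss(x, y, w);
        int[] order = new int[x.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Fisher-Yates shuffle: visit the rows in a seeded random order.
            for (int i = order.length - 1; i > 0; i--) {
                int k = random.nextInt(i + 1);
                int tmp = order[i];
                order[i] = order[k];
                order[k] = tmp;
            }
            double[] candidate = w.clone();
            for (int i : order) {
                double error = sigmoid(score(x[i], candidate)) - y[i];
                candidate[0] -= rate * error; // intercept is not penalized
                for (int j = 0; j < x[i].length; j++) {
                    candidate[j + 1] -= rate * (error * x[i][j] + ridge * candidate[j + 1]);
                }
            }
            double loss = logLoss(x, y, candidate);
            if (loss > bestLoss) {
                rate /= 2.0; // pass made things worse: discard it and lower the rate
            } else {
                w = candidate;
                bestLoss = loss;
            }
        }
        return w;
    }
}
```

Calling `SgdLogisticSketch.train(x, y, 0.5, 0.0, 50, 42L)` on a small, cleanly separable data set returns weights whose loss is no higher than that of the all-zero starting point, because passes that increase the loss are rejected.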

Code Example

This example uses logistic regression to create a predictive model based on the Iris data set. It uses the "class" field as the dependent variable and the remaining fields as the independent variables. This operator produces a PMML model that may be persisted. The resultant model can be used with the LogisticRegressionPredictor operator to predict data values.

Using the LogisticRegressionLearner operator in Java

// Run Logistic Regression with "class" as the target field
// and the sepal and petal length and width as independent variables
LogisticRegressionLearner learner = graph.add(new LogisticRegressionLearner());
learner.setLearningColumns(Arrays.asList("sepal length","sepal width","petal length","petal width"));
learner.setTargetColumn("class");

Using the LogisticRegressionLearner operator in RushScript

// Run Logistic Regression with "class" as the target field
// and the sepal and petal length and width as independent variables
var model = dr.logisticRegressionLearner(data, {targetColumn:'class', learningColumns:['sepal length', 'sepal width', 'petal length', 'petal width']});

Properties

The LogisticRegressionLearner operator provides the following properties.

Name	Type	Description
learningColumns	List<String>	A list of fields to use as independent variables in the calculations.
targetColumn	String	The field name of the dependent variable.
ridge	double	The regularization constant, also called lambda. The regularization constant penalizes very large coefficients and is sometimes necessary for convergence. Must be greater than or equal to 0.0 and less than 1.0. It should generally be small.
learningRate	double	The learning rate that should be used at the start of the computation. This is a maximum value; the algorithm may reduce the rate if it is likely to result in divergence. The rate must be positive and should be less than 1.0. If set much higher, additional iterations will likely need to be used to adjust the rate to a more reasonable value, although this is not always the case.
maxIterations	int	The maximum number of iterations attempted before generating a model. Note that more iterations can produce a more accurate model at the cost of much greater run time.
tolerance	double	The strictness of the convergence criteria as a fraction of the total length of the coefficient vector. Note that a threshold much higher than the learning rate or above 1.0 can result in premature convergence detection.
seed	long	The seed for the random number generator used by the algorithm. The main use of the seed is to randomly reorder the input. Note that even with the same seed, results may vary based on engine settings such as the number of partitions.
maxDistinctNominalValues	int	The maximum number of distinct nominal values to allow. Attributes with more than this number of distinct values will be filtered from the model. Default: 1000.
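One way to read the tolerance property is as a relative bound on how far the coefficient vector moves between iterations. The plain-Java sketch below illustrates that reading only; the class `ConvergenceSketch` and the exact test (change in length at most tolerance times the current length) are assumptions, and the operator's actual criterion may differ.

```java
/** Illustrative relative convergence test; an assumed reading, not DataFlow code. */
final class ConvergenceSketch {

    /** Euclidean length of a vector. */
    static double norm(double[] v) {
        double sum = 0.0;
        for (double value : v) {
            sum += value * value;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns true when the coefficients moved by no more than the given
     * fraction of the current coefficient vector's length.
     */
    static boolean hasConverged(double[] previous, double[] current, double tolerance) {
        double[] delta = new double[current.length];
        for (int i = 0; i < current.length; i++) {
            delta[i] = current[i] - previous[i];
        }
        return norm(delta) <= tolerance * norm(current);
    }
}
```

Under this reading, a tolerance close to or above 1.0 would accept almost any step as converged, which matches the warning about premature convergence detection.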

Ports

The LogisticRegressionLearner operator provides a single input port.

Name	Type	Get Method	Description
input	RecordPort	getInput()	The training data used to compute the logistic regression.

The LogisticRegressionLearner operator provides a single output port.

Name	Type	Get Method	Description
output	PMMLPort	getModelOutput()	The model describing the resulting classification model.

LogisticRegressionPredictor Operator

The LogisticRegressionPredictor operator implements a maximum-likelihood classifier using regression models to predict the relative likelihoods of the various categories. The operator is used to apply a previously built classification model to the input data.

The model defines the independent variables used to create the model. The input data must contain the same dependent and independent variable fields as the training data that was used to build the model.

The predicted categories are added to the output dataflow as an additional field. All input fields are also transferred to the output. The output also includes a confidence score for each category of the dependent variable.

Note: The LogisticRegressionPredictor currently only supports models produced by the LogisticRegressionLearner operator.
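The idea of picking a winner from relative likelihoods can be sketched in plain Java. This is an illustration, not the operator's implementation: the class `WinnerSketch` is hypothetical, and converting per-category scores to probabilities with a softmax is an assumption about how the likelihoods are normalized. The category names and scores are made-up sample values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative winner selection over category scores; not DataFlow code. */
final class WinnerSketch {

    /** Converts raw per-category scores into probabilities that sum to 1 (softmax). */
    static Map<String, Double> probabilities(Map<String, Double> scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double score : scores.values()) {
            max = Math.max(max, score);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            // Subtracting the maximum keeps Math.exp from overflowing.
            double weight = Math.exp(entry.getValue() - max);
            result.put(entry.getKey(), weight);
            sum += weight;
        }
        for (Map.Entry<String, Double> entry : result.entrySet()) {
            entry.setValue(entry.getValue() / sum);
        }
        return result;
    }

    /** Returns the category with the highest probability. */
    static String winner(Map<String, Double> probabilities) {
        String best = null;
        double bestProbability = -1.0;
        for (Map.Entry<String, Double> entry : probabilities.entrySet()) {
            if (entry.getValue() > bestProbability) {
                bestProbability = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("setosa", 2.0);      // hypothetical scores
        scores.put("versicolor", 0.5);
        scores.put("virginica", -1.0);
        Map<String, Double> p = probabilities(scores);
        System.out.println(winner(p) + " " + p);
    }
}
```

The per-category probabilities correspond to the confidence scores in the operator's output, and the highest-scoring category corresponds to the winner field.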

Code Example

Using the LogisticRegressionPredictor operator in Java

// Create the logistic regression predictor operator and add it to a graph
LogisticRegressionPredictor classifier = graph.add(new LogisticRegressionPredictor());

// Connect the predictor to an input data and model source
graph.connect(dataSource.getOutput(), classifier.getInput());
graph.connect(modelSource.getOutput(), classifier.getModel());

// The output of the classifier is available for downstream operators to use

Using the LogisticRegressionPredictor operator in RushScript

// Classify the input data using the given model
var classifiedData = dr.logisticRegressionPredictor(model, data);

Properties

The LogisticRegressionPredictor operator provides the following properties.

Name	Type	Description
winnerField	String	The name of the winner field to output. Default: "winner"

Ports

The LogisticRegressionPredictor operator provides the following input ports.

Name	Type	Get Method	Description
input	RecordPort	getInput()	Input data to which the classification model is applied.
model	PMMLPort	getModel()	Classification model in PMML to apply.

The LogisticRegressionPredictor operator provides a single output port.

Name	Type	Get Method	Description
output	RecordPort	getOutput()	Results of applying the classification model to the input data. Contains a new field with the winner category and a field with the probability score for each category of the input dependent variable.

Last modified date: 03/10/2025