DataFlow Preferences in KNIME

Installing and Configuring DataFlow > Installing and Configuring DataFlow > Installing DataFlow on KNIME > Configuring DataFlow in KNIME > DataFlow Preferences in KNIME

Was this helpful?

In the KNIME window that is opened by the File > Preferences command, you can select a combination of Actian settings to configure DataFlow behavior:

• Actian Profiles

• Actian Vector Connections

• Managing Actian Vector Connections

• Actian PSQL Connections

• Cluster Plugin Preferences

• HBase Clusters

• Managing JDBC Connections

• Remote File Systems

Actian Profiles

You can use the preferences dialog to set the workflow run-time settings that are specific to DataFlow. To open the Preferences dialog, select File > Preferences from the KNIME menu. The Preferences dialog displays multiple profiles for which you can configure the settings. You can also set a default profile. The DataFlow executor uses the specified profile in the Job Selection Management tab of the workflow Configure dialog.

The following is an example of the Preferences dialog:

/download/attachments/20480646/ActianProfilepage.jpg?version=1&modificationDate=1426229421854&api=v2

Profiles

The Profiles section provides a list of available profiles. Select a profile. The settings are displayed in the Details section.

Adding a Profile

Click Add to add a new profile. In the New profile dialog, enter a name for the profile. Following is an example of the dialog:

/download/attachments/20480646/addProfile.jpg?version=1&modificationDate=1426229780821&api=v2

Removing a Profile

Click Remove to remove the selected profile.

Details

The Details section displays the properties based on the selected profile. Select a property to edit the value.

The following table provides all the properties and their descriptions.

Property	Default Value	Description
Parallelism	0	Specifies the parallelism setting.
Minimum parallelism	0	Specifies the minimum value of parallelism required to execute a job. This property is used when executing a job on a busy cluster. If the required value of parallelism is not reached with the current resources, then the minimum parallelism setting is used to determine the minimum value that is acceptable. To ensure that the accurate number of partitions of parallelism are allocated, set the same value for both minimum parallelism and parallelism levels.
Execute in cluster	false	Specifies whether the workflow must be executed on a cluster or locally. Set the value to true to execute on a cluster or false to execute it locally.
Cluster URL	dr://localhost:1099	Specifies the URL, such as dr://hostname:1099, of the cluster, which is used to execute the workflow. When integrating with Apache YARN, use yarn as the scheme of the cluster URL. For example: yarn://hostname:1099.
Scheduler queue	default	Specifies the name of scheduler queue. This is used when scheduling the jobs. The scheduler queue name is valid only when a cluster is used for job execution. Currently, scheduler queue names are supported only if YARN is used to execute a job.
User extension paths (comma delimited)	not applicable	Specifies the list of archived files used by the workflow that runs on a YARN cluster. This is required when you need additional files such as customer components, jar files, and scripts for execution, and these files must be distributed across all the nodes used to run the DataFlow job.
Use socks proxy	false	Specifies whether to use the socks proxy to contact the cluster. This property sets the socket provider with the specified host and port. Set the value to true to use the socks proxy.
Socks proxy host	localhost	Specifies the socks proxy host name.
Socks proxy port	1080	Specifies the socks proxy port number.
Maximum number of retries	0	Specifies the maximum number of retries for network communication. The default value (0) does not allow retries.
Writeahead	2	Specifies the number of unread batches that a port can publish before blocking.
Spooling threshold	0	Specifies the threshold for the queue. If the queue exceeds the threshold limit, it is routed to the disk until it shrinks. This saves memory. The default value (0) indicates that the system should set a valid threshold based on the available resources.
Batch size	1024	Specifies the batch size that the operator ports should use to publish the data tokens that are written. Data are published for readers until a full batch is ready or the end of data is reached.
Collect engine statistics	true	Specifies whether to collect the engine statistics. Set the value to true for the workflow to collect the engine statistics.
Autosize writeahead (based on number of readers)	true	Specifies whether to auto-size the write ahead based on the number of readers. Set the value to true to auto-size the write ahead.
Subgraph history size	10	Specifies the maximum number of subgraphs to store in memory while tracking the execution history. By default, the last 10 subgraphs for any given operator are stored.
Storage management path	not applicable	Specifies the path to store the temporary and intermediate files during graph execution.
Visualization sampling percentage	0.0	Specifies the percentage of total data in the workflow that is sampled for use with monitoring plugins.
Visualization sampling seed	37	Specifies the seed that should be used when determining the data for sampling. Setting the sampling percentage to 100 has no effect.
Sort buffer	10M	Specifies the size of the buffer used to sort the data in memory.
Sort IO Buffer	0	Specifies the buffer size (in bytes) used for all I/O operations initiated by the Sort operator. The value is in kilobytes (k), megabytes (m), or gigabytes (g). The default size (0) indicates that the framework should set the buffer size at run time based on the current resource availability.
Sort Max Merge	0	Specifies the maximum number of intermediate sort segments to merge concurrently. Adjust this setting to control the resources required to merge sorted segments and create the final sorted result.
Dump file path	not applicable	Sets the dumpFilePath engine configuration to the specified value.

Note: For some of the properties, KNIME may have different default values.

Managing Actian Profiles

The Preferences dialog lets you select the default profile, add, or remove a profile.

Default Profile

If the workflow does not have a profile, then it uses the default profile selected. This profile is used when running DataFlow nodes in default KNIME job manager.

To add a profile

1. Click Add in the Preferences dialog.

2. In the New profile dialog, enter a name for the profile.

To remove a profile

In the Preferences dialog, select the profile that you want to remove and click Remove.

Actian Vector Connections

You must create the Actian Vector connections to use in the Load Actian Vector and Load Actian Vector On Hadoop nodes in KNIME.

To maintain the connections

1. In KNIME window, go to File > Preferences > Actian > Actian Vector Connections.

The Preferences page is displayed and provides a list of the connections configured currently.

2. Provide each connection a unique name for identification. The list is available by its connection name.

The following is a sample of the Preferences dialog.

Managing Actian Vector Connections

The Actian Vector Connections page lets you add, edit, or remove a connection.

Adding a Vector Connection

To add a Vector connection

1. Click Add on the Actian Vector Connections dialog.

The Vector Connection Information dialog opens.

2. Enter all of the following fields (mandatory):

Connection name

Unique name for this connection. This field is for labeling purposes only and is not used to create a connection to Actian Vector.

Host

Host name of the Actian Vector server.

Instance

Name of the Actian Vector instance. VW is the default instance name.

Database

Name of the database.

User

Database user account used for connections. If you are using Vector on Hadoop, this account may be different from the SSH account used by vwload, which is set in the configuration of the node itself.

Password

Database user account password.

Verify password

Reenter the password.

3. Select Advanced Properties to add any additional JDBC properties that are required for the connection. You also may disable using the default port in which case the Instance property should use the new port value instead of the logical instance.

4. Click Test Connection to verify the connection.

If the test succeeds, click OK to save the connection. Click Cancel to exit without saving.

Note: The dialog displays error messages if any of the settings are invalid or if a connection cannot be made. Update the settings as indicated by the error message.

The following is an example of the dialog. The required information has been entered and the connection tested successfully.

/download/attachments/20480650/image2013-4-29%2013%3A58%3A0.png?version=1&modificationDate=1367261881000&api=v2

Connecting to Actian Avalanche

When connecting to an Avalanche instance, ensure that you have added the property encryption with the value on to the Advanced Properties. In addition, the default port must be unchecked when you set the custom port used by the database.

Editing a Vector Connection

1. Click Edit in the Actian Vector Connections page to reset an existing connection.

2. On the Vector Connection Information dialog, enter the required changes and click OK to save.

Removing a Vector Connection

To delete an existing connection, select it on the Connections page and click Remove.

The connection is deleted.

Actian PSQL Connections

Actian Express 2.0 can use data from a PSQL database. For this, you must create an Actian PSQL connection for use by the Actian DataFlow IO nodes Database Reader and Database Writer.

For more information about using AAP for PSQL, including the configuration procedure and AAP for PSQL release notes, see Actian Analytics for PSQL at the PSQL website.

Connection settings are managed as KNIME preferences. The Actian PSQL Connections preference page provides a list of the currently configured connections. Each connection must have a unique name in the list.

To access these settings, in the KNIME window select File > Preferences. In the Preferences dialog, expand Actian and select Actian PSQL Connections to view the list of PSQL connections currently configured.

/download/attachments/20480647/PSQLConnectionpage.JPG?version=1&modificationDate=1423219411175&api=v2

Managing Actian PSQL Database Connections

The Actian PSQL Connections page lets you add, edit, or remove a connection.

Adding a PSQL Connection

To create a new connection

1. Click Add on the Actian PSQL Connections page.

The Actian PSQL Connection Information dialog opens:

/download/attachments/20480647/AddPSQLConnectPage.PNG?version=1&modificationDate=1423219536768&api=v2

2. Enter the following settings (mandatory):

Connection name

Name for this connection. Each connection must have a unique name. The name only identifies a group of settings in this dialog and is not used to create a PSQL connection.

Server

PSQL server name.

Port

Port number used by the PSQL server. Default: 1583.

Database

Name of the database.

User

Database user account used for connections, if needed.

Password

Database user account password, if needed.

Verify password

Reenter the password.

3. Click Test Connection to verify the connection.

If the connection test is successful, click OK to save the settings. Click Cancel to close the dialog without adding the connection.

Note: The dialog displays error messages if any of the settings are not valid or if a connection cannot be made. Update the settings and test the connection again.

Editing a PSQL Connection

Click Edit on the Actian PSQL Connections dialog to reset an existing connection.

On the Actian PSQL Connection Information dialog, enter the required changes and click OK to save.

Removing a PSQL Connection

To delete an existing connection, select it on the Actian PSQL Connections dialog and click Remove.

A confirmation message is not displayed and the connection is deleted.

Cluster Plugin Preferences

The Actian Cluster Plugin Preferences page lets you select which plugins to use when running in a cluster.

You can select the plugins in KNIME preferences that can be used when running a cluster.

To select a preference

1. Go to File > Preferences.

2. In the Preferences dialog, expand Actian and click Cluster Plugin.

The following options are available:

Core DataFlow

Includes only the basic DataFlow plugins.

Core DataFlow + Core KNIME

Includes the basic DataFlow and KNIME plugins.

All Plugins

Includes all plugins available to KNIME.

Core DataFlow + Core KNIME + Custom

Includes all the core plugins and any additional plugins specified by the Custom Plugins dialog.

3. Click Select All to select all or Clear All to clear all the selected plugins.

4. Click Discover Fragments or Remove Fragments to further customize the jar files from installed plugins which are included in the class path.

5. Click OK to save the selected plugins.

The following image is an example of the dialog.

HBase Clusters

You can use the HBase Clusters preference page in KNIME to configure the HBase clusters that are accessed by the HBase Reader and Writer nodes.

To configure, go to File > Preferences. In the Preferences dialog, expand Actian and click HBase Clusters.

The following image is a sample of the HBase Clusters dialog.

The HBase Clusters section provides the list of clusters that are already configured. Each cluster has a name used for identification within KNIME. When a cluster is selected, its properties are displayed in the Properties section.

Adding an HBase Cluster

To add a cluster

1. Click Add in the HBASE Clusters page.

2. In the Add Cluster dialog, enter a name for the cluster and click OK.

3. Select the cluster and add the properties in the Properties section of the dialog.

For more information about the list of properties and managing the properties, see HBase Cluster Properties.

Renaming an HBase Cluster

To rename a cluster

1. Click Rename in the HBase Clusters page.

2. On the Rename Cluster dialog, enter the new name and click OK.

After renaming the cluster, you must reconfigure the nodes using that cluster with the new cluster name.

For more information about the list of properties and managing the properties, see HBase Cluster Properties.

Removing an HBase Cluster

Select the cluster and click Remove on the HBase Clusters page.

In the confirmation message that is displayed, click Yes to remove the cluster.

Importing or Exporting an HBase Cluster

You can import the cluster from a file or URL using Import/Export in the HBase Clusters page.

Import from File...

Imports a cluster from a file. You will be prompted to select a file from the local file system. This file should be in the same format as Hadoop XML configuration files.

Import from URL...

Imports a cluster from a URL. You will be prompted to enter a URL. The file returned by this URL should be in the same format as Hadoop XML configuration files.

Export to File...

Exports the currently selected cluster to a file. You will be prompted to select a file location on the local file system. The file will be in the same format as Hadoop XML configuration files.

Testing a Connection

To test the connection of the selected cluster, click Test Connection. The results are reported.

HBase Cluster Properties

This section lists the properties for the selected HBase cluster. The minimum properties needed for connecting to an HBase cluster are:

fs.default.name

The HDFS URL (such as hdfs://headnode:8020)

hbase.rootdir

The root directory where HBase data is stored (such as hdfs://headnode:8020/hbase)

hbase.zookeeper.quroum

The nodes running ZooKeeper

hbase.zookeeper.property.clientPort

The ZooKeeper client port.

Any additional Hadoop or HBase property can be added here as well. Some properties that may occasionally need to be defined include:

hive.metastore.uris

The Hive Metastore URI, if using HCatalog (such as thrift://headnode:9083)

zookeeper.znode.parent

The parent znode in ZooKeeper used by HBase. This should be defined if the default value of /hbase is not being used by your cluster.

Note: For a few Hortonworks clusters, you may need to define this property with a value of /hbase-unsecure.

You may add, edit, or remove properties.

Adding a Property

To add a new property, click Add in the Properties section of the HBase Clusters dialog.

In the displayed dialog, enter a property name and value, and click OK to save.

Editing a Property

To edit a property, select the property and click Edit in the Properties section of the HBase Clusters dialog.

In the displayed dialog, enter a new value for the property and click OK to save.

Removing a Property

To remove a property, select the property and click Remove in the Properties section of the HBase Clusters dialog.

In the confirmation dialog, click OK to remove the property.

Managing JDBC Connections

This JDBC Connections page lets you manage JDBC database connections. After connections are created and tested, they can be used by the Database Reader and Database Writer nodes within RushAnalytics.

To configure JDBC connections, go to File > Preferences. In the Preferences dialog, expand Actian and click JDBC Connections.

The following image is a sample of the JDBC Connections dialog.

/download/attachments/20480651/image2013-5-3%2015%3A42%3A32.png?version=1&modificationDate=1367613751000&api=v2

The JDBC Connections page provides a listing of the currently configured connections. Each connection should be provided a unique name for identification. The listing is by connection name.

Adding a JDBC Connection

To add a connection

1. Click Add to create a new connection.

The JDBC Connection Information dialog opens for setting the necessary information for a Vector database connection.

2. Enter the following settings (mandatory) for a Vector database connection:

Connection name

The unique name for this connection. This field is for bookkeeping purposes only and is not used to create a connection to Vector.

Select Driver

The name of the JDBC driver to use. Select an already configured database driver. See Database Drivers for more information about installing a database driver.

URL

The connection URL for the database instance. Each driver defines the format of the connection URL. When a driver is selected, a template URL is filled in.

User

The database user account used for connections.

Password

The database user account password.

Verify password

Re-type in the password to verify the contents.

Server hosts

Some databases supported distributed operations where multiple database hosts can be used. Supply the host names separated by commas. This is optional.

3. Click Test Connection to test the connection.

If the connection test is successful, click OK to save the connection configuration. Click Cancel to close the dialog without adding the connection.

Note: The dialog displays error messages if any of the settings are invalid or if a connection cannot be made to the database. Update the provided values as indicated by the error message and test again.

The following screenshot shows the dialog used to create or edit a JDBC connection. The required information has been input and the connection tested successfully.

/download/attachments/20480651/image2013-4-29%2013%3A45%3A11.png?version=1&modificationDate=1367261112000&api=v2

Editing a JDBC Connection

Click Edit in the JDBC Connections page to reset an existing connection.

Enter the required changes and click OK to save.

Removing a JDBC Connection

To delete an existing connection, select it in the JDBC Connections page and click Remove.

The selected connection is deleted.

Module Preferences

The Module Preferences page lets you select the version of the various modules available.

To select, go to File > Preferences. In the Preferences dialog, expand Actian and click Module Preferences.

The following image is a sample of the Module Preferences dialog.

Click OK after selecting the version of the module.

Remote File Systems

The Remote File Systems preference page lets you define known remote file systems for browsing.

To access the page, go to File > Preferences. In the Preferences dialog, expand Actian and click Remote File Systems, which provides the currently configured remote file systems.

The following image is a sample of the Remote File Systems dialog.

Known remote file systems

Lists all the currently configured remote file systems that are available for browsing.

Adding a Remote File System

To add a remote file system

1. Click Add in the Remote File System page.

2. On the displayed dialog, enter a path identifying a remote file system.

Note: A local file path must not be entered.

3. Click OK to save the remote file system.

Removing a Remote File System

To delete an existing remote file system, select it and click Remove.

The selected file system is deleted.

Importing a Remote File System

To import a list of known remote file systems, click Import. Browse to select the file to import.

Exporting a Remote File System

To export the current known remote file systems, click Export. Browse to the file location to export.

Last modified date: 01/03/2025