Vector 6.2 | Python UDFs

Python UDFs

Python UDFs must be configured before use. For more information, see the chapter “Configuring User-defined Functions” in the System Administrator Guide.

NumPy Requirement

When using unsecured UDFs, NumPy must be installed in the Python installation. If using the default Python 3.6 for Vector, use the following command:

pip3 install numpy

When using secured UDFs, NumPy is pre-installed in the container and does not require a separate install.

Using Imports

Python imports are located using PYTHONPATH. The /opt/Actian/vectorwise-udf/import directory is automatically added to PYTHONPATH when using secure UDFs.
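As a quick diagnostic, the following sketch checks whether the import directory appears on Python's module search path. The helper name import_dir_visible is hypothetical, and the sketch relies on the standard Python behavior that PYTHONPATH entries are added to sys.path at interpreter startup.

```python
import sys

# Directory named in the text above.
UDF_IMPORT_DIR = "/opt/Actian/vectorwise-udf/import"


def import_dir_visible(path_entries, import_dir=UDF_IMPORT_DIR):
    """Return True if import_dir is one of the given search-path entries.

    Hypothetical helper for illustration; trailing slashes are ignored.
    """
    wanted = import_dir.rstrip("/")
    return any(
        isinstance(entry, str) and entry.rstrip("/") == wanted
        for entry in path_entries
    )


# Check the interpreter that is running this code.
print(import_dir_visible(sys.path))
```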

NumPy UDF

Numerical data is processed much faster with NumPy arrays than with conventional Python data types. In addition, many machine learning libraries operate on NumPy arrays, so database data can be passed directly to these libraries without conversion to or from Python objects.

Although NumPy UDFs are written in Python (you must select Python as the language), there are several important differences between NumPy and Python UDFs.

NumPy Import

NumPy is imported automatically as “np”, so your UDF does not need its own NumPy import statement.

UDF names must be prefixed with numpy__ (two underscore characters).

When you prefix a Python UDF name with numpy__, such as “numpy__myudf”, argument processing changes to take advantage of NumPy’s fast array-based APIs, and the UDF receives database data a vector at a time rather than a tuple at a time.

Best with Numeric Data

NumPy’s key strength is processing vectors of numeric data. Although other data types such as strings and dates are supported in NumPy UDFs, many of these types revert to the underlying Python data type rather than a NumPy array.

To realize the benefit of NumPy array processing, data must be accessed with NumPy APIs. Converting to Python objects will negate any performance benefit and may even be slower.
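The difference can be sketched as follows. Both functions compute the same result, but only the first stays within NumPy APIs. The second converts every element to a Python object and back, which is the pattern to avoid. The function names and the int32 sample data are illustrative assumptions, not part of Vector.

```python
import numpy as np


def scale_vectorized(values, factor):
    """Stay in NumPy: one array operation, no per-item Python objects."""
    return values * factor


def scale_via_python_objects(values, factor):
    """Anti-pattern: round-trips every element through a Python object."""
    return np.array([v * factor for v in values.tolist()], dtype=values.dtype)


# Sample data standing in for one vector of database values.
values = np.arange(1024, dtype=np.int32)
fast = scale_vectorized(values, 3)
slow = scale_via_python_objects(values, 3)

# Same answer either way; only the first form keeps the NumPy benefit.
print(np.array_equal(fast, slow))
```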

UDFs will receive and return vectors rather than a tuple at a time.

NumPy UDF arguments and return value differ from basic Python UDFs. When writing NumPy UDFs, you must handle arguments and the return value as follows:

1. Every non-constant argument is a NumPy array or a Python array, and all such arrays are the same size.

2. Constant argument values will be an array of size 1. Your code should assume any argument can be a constant value and check the size.

3. A variable called vector_size, which contains the current vector size, is automatically provided to the UDF code. The vector size can range from 1 to the currently configured vector size (the default is 1024).

4. The UDF must return a single NumPy or Python array that is exactly the size of the provided vector_size variable.
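The rules above can be sketched in plain Python as follows. In a real NumPy UDF, the arguments and vector_size are supplied by Vector. Here they are passed in explicitly so the logic can be run outside the database. The function name greatest_body is hypothetical, and the case where every argument is a constant while vector_size is greater than 1 is an assumption that the sketch guards against defensively.

```python
import numpy as np


def greatest_body(a, b, vector_size):
    """Element-wise maximum of two arguments, either of which may be a constant.

    Hypothetical stand-in for a NumPy UDF body; vector_size is a parameter
    here, whereas Vector provides it as a variable.
    """
    args = [np.atleast_1d(np.asarray(arg)) for arg in (a, b)]

    # Rules 1 and 2: each argument is either full-size or a size-1 constant.
    for position, arg in enumerate(args, start=1):
        if len(arg) not in (1, vector_size):
            raise ValueError(
                f"Argument {position} has unexpected size {len(arg)}"
            )

    # NumPy broadcasting applies a size-1 constant to every item.
    result = np.maximum(args[0], args[1])

    # Rule 4: the returned array must have exactly vector_size items,
    # even if every argument was a constant (assumed possible).
    return np.broadcast_to(result, (vector_size,)).copy()


print(greatest_body(np.array([1, 5, 3]), np.array([4]), 3))
print(greatest_body([7], [2], 4))
```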

Below is a simple NumPy UDF that adds two arguments and returns the result. To illustrate the vector_size variable, it rejects constants (single-value arrays). However, the UDF works correctly if those checks are removed, in which case the constant value is added to every array item.

create or replace function numpy__add(a integer, b integer) return (integer) as language python source='
if len(a) < vector_size:
    raise Exception("Argument 1 is a constant")
if len(b) < vector_size:
    raise Exception("Argument 2 is a constant")
return np.add(a,b)
',commutative=true;\g
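To try the logic of numpy__add outside the database, the body can be wrapped in an ordinary Python function. This is only a local test sketch: the wrapper name numpy_add_body and the explicit vector_size parameter are assumptions, because Vector normally supplies the arguments, np, and vector_size itself.

```python
import numpy as np


def numpy_add_body(a, b, vector_size):
    """Local copy of the UDF body, with vector_size passed in explicitly."""
    if len(a) < vector_size:
        raise Exception("Argument 1 is a constant")
    if len(b) < vector_size:
        raise Exception("Argument 2 is a constant")
    return np.add(a, b)


a = np.array([1, 2, 3, 4], dtype=np.int32)
b = np.array([10, 20, 30, 40], dtype=np.int32)
constant = np.array([5], dtype=np.int32)

# Two full-size vectors are added item by item.
total = numpy_add_body(a, b, vector_size=4)

# A size-1 argument is rejected by the checks.
try:
    numpy_add_body(a, constant, vector_size=4)
    rejected = False
except Exception as exc:
    rejected = str(exc) == "Argument 2 is a constant"

# Without the checks, np.add broadcasts the constant to every item.
broadcast_total = np.add(a, constant)
```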

Nulls Not Supported

NumPy UDFs do not support nullskip='ignore' and do not support NULL arguments or NULL return values.

Data Types Supported by NumPy

Vector data types map to NumPy types as shown in the following table.

Values of a supported NumPy type are normally returned as a NumPy array. Alternatively, the UDF can return a plain Python array whose items have the expected Python data type. For example, a UDF defined to return a timestamp is expected to return Python datetime values or values that Python can convert to a datetime.

For data types that NumPy does not support, the UDF receives Python objects instead.
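As an illustration of the timestamp case, the sketch below shifts a datetime64 array with NumPy and then converts it to a plain Python list of datetime objects. The sample values and the microsecond unit are assumptions. The unit that Vector actually uses is not stated here.

```python
import datetime

import numpy as np

# Sample timestamps standing in for a vector of database values.
stamps = np.array(
    ["2024-06-28T12:00:00", "2024-06-29T08:30:00"], dtype="datetime64[us]"
)

# Array arithmetic stays in NumPy: add one hour to every item.
shifted = stamps + np.timedelta64(1, "h")

# With a microsecond unit, tolist() yields datetime.datetime objects.
as_python = shifted.tolist()
all_datetimes = all(isinstance(item, datetime.datetime) for item in as_python)
```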

Vector Type	NumPy Output Type
CHAR	NPY_UNICODE*
VARCHAR	NPY_UNICODE*
NCHAR	NPY_UNICODE*
NVARCHAR	NPY_UNICODE*
INTEGER1	NPY_BYTE
INTEGER2	NPY_SHORT
INTEGER4	NPY_INT
INTEGER8	NPY_LONGLONG
DECIMAL	Python Decimal
FLOAT	NPY_FLOAT
FLOAT4	NPY_FLOAT
ANSIDATE	NPY_DATETIME64
TIME WITHOUT TIMEZONE	NPY_DATETIME64
TIME WITH TIMEZONE	Python DateTime with TZ set
TIME WITH LOCAL TIMEZONE	NPY_DATETIME64
TIMESTAMP WITHOUT TIMEZONE	NPY_DATETIME64
TIMESTAMP WITH TIMEZONE	Python DateTime with TZ set
TIMESTAMP WITH LOCAL TIMEZONE	NPY_DATETIME64
INTERVAL YEAR TO MONTH	NPY_INT
INTERVAL DAY TO SECOND	Python type (varies)
MONEY	NPY_DOUBLE
IP4	NPY_INT
IP6	Python Long
UUID	Python Long
BOOLEAN	NPY_BOOL
*CHAR and VARCHAR values up to 1024 are NPY_UNICODE in a NumPy array, and those over 1024 are Python Py_UNICODE strings in a Python array. The choice is based on the UDF argument size defined at creation, so the UDF knows which type it will receive.
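The NPY_ names in the table are NumPy C-API type identifiers. As a rough guide, the sketch below lists the Python-level dtypes that conventionally correspond to them, so that a UDF can inspect an argument's dtype and compare. The dictionary name, the datetime64 unit, and the string width are assumptions for illustration. Vector may use different units or widths.

```python
import numpy as np

# Conventional Python-level counterparts of the C-API type names.
NPY_TO_DTYPE = {
    "NPY_BYTE": np.dtype(np.byte),
    "NPY_SHORT": np.dtype(np.short),
    "NPY_INT": np.dtype(np.intc),
    "NPY_LONGLONG": np.dtype(np.longlong),
    "NPY_FLOAT": np.dtype(np.single),
    "NPY_DOUBLE": np.dtype(np.double),
    "NPY_BOOL": np.dtype(np.bool_),
    "NPY_DATETIME64": np.dtype("datetime64[us]"),  # unit is an assumption
    "NPY_UNICODE": np.dtype("U10"),  # width 10 is an arbitrary example
}

for name, dtype in NPY_TO_DTYPE.items():
    print(f"{name:15s} kind={dtype.kind} itemsize={dtype.itemsize}")
```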

Last modified date: 06/28/2024