Python UDFs
Python UDFs must be configured before use. For more information, see the chapter “Configuring User-defined Functions” in the System Administrator Guide.
NumPy Requirement
When using unsecured UDFs, NumPy must be installed in the Python installation. If using the default Python 3.6 for Vector, use the following command:
pip3 install numpy
When using secured UDFs, NumPy is pre-installed in the container and does not require a separate install.
Using Imports
Python imports are located using PYTHONPATH. When using secured UDFs, the /opt/Actian/vectorwise-udf/import directory is automatically added to PYTHONPATH.
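As a rough illustration of how Python resolves imports through its search path (generic Python behavior, not Vector-specific code), the sketch below writes a hypothetical helper module named my_udf_helpers into a temporary directory that stands in for the import directory, adds that directory to sys.path, and imports the module. The module name and its double function are illustrative assumptions.

```python
import importlib
import os
import sys
import tempfile

# Temporary directory standing in for a directory listed in PYTHONPATH.
import_dir = tempfile.mkdtemp()

# Hypothetical helper module; the name and contents are illustrative only.
with open(os.path.join(import_dir, "my_udf_helpers.py"), "w") as f:
    f.write("def double(x):\n    return x * 2\n")

# PYTHONPATH entries are placed on sys.path when the interpreter starts;
# appending here has the same effect for this running interpreter.
sys.path.append(import_dir)
importlib.invalidate_caches()

import my_udf_helpers

result = my_udf_helpers.double(21)
print(result)
```

A module placed in a directory on the search path can then be imported by name from the UDF source in the same way.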
NumPy UDF
Numerical data is processed much faster with NumPy arrays than with conventional Python data types. In addition, many machine learning libraries use NumPy array processing, so database data can be provided to these libraries directly, without conversion to or from Python objects.
Although NumPy UDFs are written in Python (you must select Python as the language), there are several important differences between NumPy and Python UDFs.
NumPy Import
NumPy is imported automatically as “np”, so your UDF does not need its own NumPy import statement.
UDF names must be prefixed with numpy__ (two underscore characters).
When you prefix a Python UDF name with numpy__, such as “numpy__myudf”, argument processing is changed to take advantage of NumPy’s fast array-based APIs and UDFs receive database data a vector at a time rather than a tuple at a time.
Best with Numeric Data
NumPy’s key strength is processing vectors of numeric data. Although other data types such as strings and dates are supported in NumPy UDFs, many of these types revert to the underlying Python data type rather than a NumPy array.
To realize the benefit of NumPy array processing, data must be accessed with NumPy APIs. Converting to Python objects will negate any performance benefit and may even be slower.
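To illustrate the difference, here is a minimal sketch run outside the database with made-up sample data. It compares an element-by-element Python loop with the equivalent NumPy array expression. Both produce the same values, but only the second stays inside NumPy's array APIs. The function names are hypothetical.

```python
import numpy as np

# Sample vector standing in for a column of INTEGER4 values (illustrative data).
a = np.arange(1024, dtype=np.intc)

# Slow pattern: every element becomes a Python object and is handled in a loop.
def scale_with_python_objects(values):
    return [int(v) * 2 + 1 for v in values]

# Fast pattern: one array expression, no per-element Python objects.
def scale_with_numpy(values):
    return values * 2 + 1

slow = scale_with_python_objects(a)
fast = scale_with_numpy(a)
```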
UDFs will receive and return vectors rather than a tuple at a time.
NumPy UDF arguments and return value differ from basic Python UDFs. When writing NumPy UDFs, you must handle arguments and the return value as follows:
1. Every non-constant argument is either a NumPy or Python array of the same size.
2. Constant argument values will be an array of size 1. Your code should assume any argument can be a constant value and check the size.
3. A variable called vector_size is automatically provided to the UDF code that contains the current vector size. Note that the vector size can range from 1 to the currently set vector size. (Default size is 1024.)
4. The UDF must return a single NumPy or Python array that is exactly the size of the provided vector_size variable.
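The rules above can be sketched as a plain Python function that is testable outside the database. Here numpy_udf_body and expand_constant are hypothetical names, and vector_size is passed as a parameter only because this sketch runs outside Vector, where the variable would be provided automatically.

```python
import numpy as np

def expand_constant(arg, vector_size):
    """Return arg as an array of length vector_size.

    A constant arrives as an array of size 1 and is repeated;
    any other argument is returned as a NumPy array unchanged.
    """
    arr = np.asarray(arg)
    if arr.size == 1 and vector_size > 1:
        return np.repeat(arr, vector_size)
    return arr

def numpy_udf_body(a, b, vector_size):
    # Hypothetical UDF logic: multiply two integer arguments.
    a = expand_constant(a, vector_size)
    b = expand_constant(b, vector_size)
    result = np.multiply(a, b)
    # Rule 4: the returned array must have exactly vector_size elements.
    if len(result) != vector_size:
        raise ValueError("result size does not match vector_size")
    return result

# A full vector combined with a constant (an array of size 1).
out = numpy_udf_body(np.array([1, 2, 3, 4]), np.array([10]), 4)
# Both arguments constant, with a vector of size 1.
single = numpy_udf_body(np.array([3]), np.array([5]), 1)
```

Expanding constants explicitly keeps the result size equal to vector_size even when every argument is a constant.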
Below is a simple NumPy UDF that adds two arguments and returns the result. To demonstrate the vector_size variable, it rejects constants (single-value arrays). The UDF also works if those checks are removed, in which case the constant value is added to every array item.
create or replace function numpy__add(a integer, b integer) return (integer) as language python source='
if len(a) < vector_size:
    raise Exception("Argument 1 is a constant")
if len(b) < vector_size:
    raise Exception("Argument 2 is a constant")
return np.add(a,b)
',commutative=true;\g
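The behavior described for constants follows from NumPy broadcasting, which can be checked outside the database. In this sketch the arrays and the vector_size value are made-up sample inputs.

```python
import numpy as np

vector_size = 4  # sample value; Vector supplies this inside a UDF
a = np.array([1, 2, 3, 4], dtype=np.intc)
b_constant = np.array([100], dtype=np.intc)  # a constant arrives as an array of size 1

# With the length checks removed, np.add broadcasts the single value
# across every element of the full-size argument.
total = np.add(a, b_constant)
print(total)
```

Note that np.add applied to two size-1 arrays yields a size-1 array, which matches the required result size only when vector_size is 1.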
Nulls Not Supported
NumPy UDFs do not support nullskip='ignore' and do not support NULL arguments or NULL return values.
Data Types Supported by NumPy
Vector data types map to NumPy types as shown in the following table.
Values of supported NumPy types are normally returned as a NumPy array. However, the UDF can also return a plain Python array with the expected Python data type. For example, UDFs that are defined to return a timestamp are expected to return a Python datetime type or a type that Python can convert to a datetime.
For data types not supported by NumPy, the UDF receives Python objects.
| Vector Type | NumPy Output Type |
| --- | --- |
| CHAR | NPY_UNICODE* |
| VARCHAR | NPY_UNICODE* |
| NCHAR | NPY_UNICODE* |
| NVARCHAR | NPY_UNICODE* |
| INTEGER1 | NPY_BYTE |
| INTEGER2 | NPY_SHORT |
| INTEGER4 | NPY_INT |
| INTEGER8 | NPY_LONGLONG |
| DECIMAL | Python Decimal |
| FLOAT | NPY_FLOAT |
| FLOAT4 | NPY_FLOAT |
| ANSIDATE | NPY_DATETIME64 |
| TIME WITHOUT TIMEZONE | NPY_DATETIME64 |
| TIME WITH TIMEZONE | Python DateTime with TZ set |
| TIME WITH LOCAL TIMEZONE | NPY_DATETIME64 |
| TIMESTAMP WITHOUT TIMEZONE | NPY_DATETIME64 |
| TIMESTAMP WITH TIMEZONE | Python DateTime with TZ set |
| TIMESTAMP WITH LOCAL TIMEZONE | NPY_DATETIME64 |
| INTERVAL YEAR TO MONTH | NPY_INT |
| INTERVAL DAY TO SECOND | Python type (varies) |
| MONEY | NPY_DOUBLE |
| IP4 | NPY_INT |
| IP6 | Python Long |
| UUID | Python Long |
| BOOLEAN | NPY_BOOL |
*CHAR and VARCHAR values up to 1024 are NPY_UNICODE in a NumPy array; longer ones are Python Py_UNICODE strings in a Python array. The choice is based on the argument size defined when the UDF is created, so the UDF knows which type it will receive.
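For orientation, the NPY_* names in the table are NumPy C-API type numbers. The sketch below lists the Python-level NumPy scalar types that correspond to them in NumPy itself, and builds a small datetime64 array of the kind a timestamp-returning UDF could produce. The sample values and the dictionary name are made up, and the mapping from Vector column types remains as given in the table; it is not derived from this code.

```python
import datetime
import numpy as np

# Python-level NumPy types corresponding to the C-API type numbers in the table.
npy_to_python_level = {
    "NPY_BYTE": np.byte,
    "NPY_SHORT": np.short,
    "NPY_INT": np.intc,
    "NPY_LONGLONG": np.longlong,
    "NPY_FLOAT": np.single,
    "NPY_DOUBLE": np.double,
    "NPY_BOOL": np.bool_,
    "NPY_UNICODE": np.str_,
    "NPY_DATETIME64": np.datetime64,
}

for name, np_type in npy_to_python_level.items():
    print(name, np.dtype(np_type))

# A timestamp result expressed as a datetime64 array (second resolution) ...
stamps = np.array(["2024-06-28T12:00:00", "2024-06-28T12:00:01"],
                  dtype="datetime64[s]")
# ... or as plain Python datetime objects in a Python list.
as_python = stamps.tolist()
```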
Last modified date: 06/28/2024