Mastering Python UDFs in TDengine: From Beginner to Pro

Chait Diwadkar
Chait Diwadkar
/
Share on LinkedIn

With the release of TDengine 3.0.4.0, we unveiled an exciting new feature: the ability to create User-Defined Functions (UDFs) using Python. This innovative capability delivers unparalleled flexibility for database operations while making customization more accessible—even for those new to programming. With Python UDFs, users can easily tailor and manage their databases, seamlessly integrating these custom functions into SQL queries as if they were built-in. In this article, we’ll explore how to leverage this feature effectively and help you take the first steps toward creating your own database customizations.

Getting Started with UDFs

Creating a UDF

Here’s how to create a Python UDF function in TDengine:

       CREATE  [OR REPLACE] [AGGREGATE] FUNCTION function_name
                               as library_path  OUTPUTTYPE output_type  [BUFSIZE buffer_size] [LANGUAGE 'C‘ | 'Python']

Options Explained:

  1. CREATE [OR REPLACE]: Use CREATE for new functions and add OR REPLACE to update existing ones.
  2. AGGREGATE: Optional. Indicates an aggregate function. Without it, the function defaults to a scalar function.
  3. function_name: The name of the UDF, which can be up to 64 bytes long. Names exceeding this limit will be truncated.
  4. OUTPUTTYPE: Specifies the output data type of the UDF. Supported types include:
Serial NumberSupported Data Type
1TIMESTAMP
2INT
3INT UNSIGNED
4BIGINT
5BIGINT UNSIGNED
6FLOAT
7DOUBLE
8BINARY
9SMALLINT
10SMALLINT UNSIGNED
11TINYINT
12TINYINT UNSIGNED
13BOOL
14NCHAR
  1. BUFSIZE: Sets the memory buffer size for aggregate functions (up to 256 KB). This buffer is allocated for the lifetime of the aggregation process, making it useful as a global variable.
  2. LANGUAGE: Specifies the programming language. Currently, TDengine supports Python and C.

Deleting a UDF

To remove a UDF, use the following command:

DROP FUNCTION function_name;

Viewing UDFs

To view all created UDFs:

SHOW FUNCTIONS;

For detailed information:

SELECT * FROM information_schema.ins_functions\G;

Setting Up the Environment

Before diving into Python UDF development, ensure your environment is ready:

  1. CMake: Minimum version 3.0.2.
  2. GCC: Required for compiling the shared library (.so file). Minimum version: 7.5.
  3. Python: Version 3.7 or higher.

Install the UDF plugin with the following command:

python3 -m pip install taospyudf
ldconfig

Your development environment is now ready!

Writing UDFs

UDFs in TDengine fall into two categories: scalar and aggregate functions. Before diving into the details of the functions, let’s first take a look at the process of how custom functions are invoked:

How UDFs Work

TDengine processes data in blocks for efficiency. When a UDF is invoked, the input data is a block, and you can access any cell in the block using the data(row, col) method. This approach minimizes the overhead of calls between the C framework and Python, boosting performance.

  • Scalar Functions: Must return the same number of rows as the input.
  • Aggregate Functions: Aggregate data and return a single row.

Creating a Scalar Function

Let’s create a Python UDF equivalent of the CONCAT string concatenation function:

  1. Write the Function

The function must include the following methods:

  • init: Called once during the initialization of the UDF module. Use this to perform any necessary setup tasks.
  • destroy: Called once when the UDF module exits. Use this for cleanup tasks.
  • process: The main function that processes each incoming data block. Use the shape() method to retrieve the number of rows and columns in the data block.
    • nrows: Returns the number of rows in the data block.
    • ncols: Returns the number of columns, which corresponds to the number of parameters passed to the concat() function.

Return Values:

  • A scalar function must return a list. If the return value is not a list, an error will occur.
  • The number of elements in the returned list must match the number of rows (nrows) in the data block. Mismatched row counts will result in an error.
  1. Create the function
create function py_concat as 'https://eujqw4hwudm.exactdn.com/home/py_concat.py' outputtype varchar(256) language 'Python';
  1. Execute the function Use the UDF like any built-in function:
select sf_concat(factory_name,room_name), concat(factory_name,room_name) from devices;

Creating an Aggregate Function

Aggregate functions perform computations on data and output a single aggregated result. Let’s create a Python UDF for counting rows:

  1. Write the Function

Implementation Principles:

  • During the start initialization callback, the value 0 is stored in the buffer (buf) as the initial value for counting.
  • In the reduce function, as data blocks are processed, any non-None values are added to the cumulative count. The result is stored back into the buffer (buf), which is passed as a parameter during subsequent calls to reduce. This allows the buffer to be reused across iterations.
  • Finally, in the finish function, the buffer (buf) is passed as a parameter. The value stored in the buffer is retrieved and returned as the final result, representing the total count.

Return Value:

  • The return type of the function must match the OUTPUTTYPE specified when the UDF function was created. If the return type is incorrect, an error will be triggered.
  • Returning a None object is allowed, but it must be handled appropriately.
  1. Create the UDF
create aggregate function af_count as ''https://eujqw4hwudm.exactdn.com/home/af_count.py'' outputtype bigint bufsize 4096 language 'Python';
  1. Execute the UDF
select af_count(col1) from devices;

Data Type Mappings

Python interacts with C through specific data type mappings. Key points include:

TDengine Data TypeMapped to Python Object
TINYINT / TINYINT UNSIGNED / SMALLINT / SMALLINT UNSIGNED / INT / INT UNSIGNED / BIGINT / BIGINT UNSIGNEDint
FLOAT / DOUBLEfloat
BOOLbool
BINARY / NCHAR / VARCHARbytes
TIMESTAMPint
JSONNot supported
  1. Binary, NCHAR, and VARCHAR map to Python’s bytes object. Use appropriate decoding:
    1. For binary: bytes.decode('utf-8')
    2. For nchar: bytes.decode('utf_32_le')
  2. When returning a str object:
    1. For binary: str.encode('utf-8')
    2. For nchar: str.encode('utf_32_le')

Development Tips

Updating Function Code

The .py file path specified during UDF creation is only used at that moment. The file’s content is stored in the mnode, making the function accessible across the cluster. After creation, the .py file is no longer needed.

To update the UDF code, you must update mnode with the new content. Use the OR REPLACE option to overwrite the existing function and apply the updated code for future calls.

CREATE [OR REPLACE] FUNCTION function_name AS library_path OUTPUTTYPE output_type [LANGUAGE 'C' | 'Python'];

Logging

Python UDFs in TDengine do not support direct debugging of Python code but do support logging. You can use commonly used libraries like logging within UDF functions. It is recommended to output logs to a file for review. Note that messages printed using the print function will not be visible, so avoid using it for output. For example:

Exception Handling

Raise exceptions to terminate queries if needed. Exceptions are logged in taosudfpy.log.

Checking UDF Framework Logs

If the UDF framework returns an error, check the log files in the TDengine log directory for details:

  1. taospyudf.log Logs Python UDF activity, including function loading, execution errors, and exceptions. Use this log when debugging Python UDFs.
  2. udfdlog.0 Logs framework-level issues for all UDF languages (C, Python, etc.). Check this log for broader framework errors, though it’s rarely needed.

Common Errors and Solutions

Serial NumberError CodeError DescriptionExplanation
10x80002901udf is stoppingThe UDF framework is shutting down and cannot provide external services.
20x80002902udf pipe read errorA data read error occurred during communication between the taosd and udfd processes.
30x80002903udf pipe connect errorA connection error occurred during communication between the taosd and udfd processes.
40x80002904udf no pipeFailed to create a pipe during communication between the taosd and udfd processes.
50x80002905udf load failureFailed to load the UDF module.
60x80002906udf invalid stateThe UDF framework is in an invalid state (e.g., during initialization) and cannot provide external services.
70x80002907udf invalid function inputInvalid parameters were provided to the UDF function.
80x80002908udf no function handleInternal error: no function handle was found.
90x80002909udf invalid bufsizeThe BUFSIZE specified for an aggregate function was invalid or exceeded the limit.
100x8000290Audf invalid output typeThe actual data type does not match the OUTPUTTYPE specified when creating the UDF function.
110x8000290Budf program language not supportedThe specified programming language is not supported.
120x8000290Cudf function execution failureThe UDF function encountered an error during execution. Check logs for more details.
130x8000290Dudf raise exceptionThe UDF function raised an exception or terminated explicitly, causing the query to stop.

Errors 10 and 12 are the most common issues when developing Python UDFs. For detailed causes, refer to the taospyudf.log file.

Examples and Resources

TDengine’s open-source repository provides several Python UDF test examples for reference:

Steps to Run Test Cases:

  1. Ensure Python3 is correctly installed.
  2. Clone the TDengine OSS code.
  3. Navigate to the TDengine/tests/system-test/ directory.
  4. Run the test case using:
python3 test.py -f others/udfpy_main.py

Notes:

1. Common Issues with UDF Scalar Functions

  • Row Count Mismatch: The number of rows returned by a scalar function must match the number of input rows.
    • If the logic is complex, it’s easy to miss rows during processing. Be cautious to avoid such errors.

2. OUTPUTTYPE Mismatch

  • Output Type Consistency: The return type of the UDF must match the OUTPUTTYPE specified during its creation.
    • Mismatched types will result in an error.
  • Scalar Functions: The return value of the process function must be a list, and all elements in the list must match the data type specified by OUTPUTTYPE.
  • Aggregate Functions: The final return value in the finish function must match the OUTPUTTYPE as a Python data type.

3. Error: Unable to Load libtaospyudf.so

  • Check Installation: Ensure the taospyudf plugin was installed properly.
    • The library is typically installed in /usr/local/lib/libtaospyudf.so or Python’s plugin directory.
    • Use the command below to locate the file:
find / -name 'libtaospyudf.so'
  • Missing File: If the file is missing, the plugin installation may have failed. Reinstall the plugin to resolve the issue.
  • File Exists but Still Fails:
    • Use ldd to check the library’s dependencies:
ldd /usr/local/lib/libtaospyudf.so
  • Missing Python dependencies are a common issue. For example, an improperly installed Python environment can cause errors.
  • Solution: Reinstall Python (e.g., Python 3.9) correctly to resolve the dependency issue.

Conclusion

The Python UDF feature in TDengine is an exciting step forward, offering unmatched customization for your database. This is just the beginning—we’re continuously improving and expanding its capabilities. We invite you to explore this feature, unleash your creativity, and share your feedback to make it even better!

  • Chait Diwadkar

    Chait Diwadkar is Director of Solution Engineering at TDengine. Prior to joining TDengine he was in the biotechnology industry in technical marketing, professional services, and product management roles and supported customers in pharma, medical devices and diagnostics on analytical chemistry, and genetic platforms.