BigQuery: User-Defined Functions to supercharge your SQL
As we all know Google is continuously improving and expanding the Google Cloud Platform and BigQuery has truly benefited from this. Taking a simple scalable data warehouse into what is now a comprehensive and dynamic system. As you may guess, I’m a massive fan of BigQuery.
BigQuery offers a huge amount of functionality including DDL and DML, however, sometimes, there simply isn’t the function to help you complete what you are doing. Often in the past, I would end up accessing BigQuery via Python or SAS and applying the missing function.
However User Defined Functions cover some of the current gaps, and since they are now persistent (I’ll explain this further later) you can share and re-use these with your colleagues.
So what is a UDF?
A UDF simply is a function you create that applies a logic not available within the standard SQL function to your data. Generally, you will be using Javascript, SQL, simple logic or base Python (albeit in a rather hacky approach).
An Example
Until recently, BigQuery did not have a function to apply the proper format. On a quick Stack Overflow search, there were some really great examples of UDFs using SQL to generate this function.
Below shows the SQL logic being generated into a temporary function which is then applied to your string. Whilst this function is now defunct since recently BigQuery has introduced INITCAP, I wanted to show you the effect of the UDF compared to the INITCAP function.
So how do you generate your own SQL UDFs?
Using Bigquery’s comprehensive guides, you can see the general structure is very similar to some DDL statements, however, it can take a little getting used to.
Steps to create your first UDF
Creation line to generate a persistent function or a temp. Remember to consider your DDL at this point to ensure you don’t accidentally generate an error.
CREATE [OR REPLACE] [TEMPORARY | TEMP] FUNCTION [IF NOT EXISTS]
The name of your function, if you are creating a temp function you only need the function name. I generally recommend using a literal value that defined what you are planning to achieve where possible and ensure you maintain your database guidelines like using snake case and core naming conventions.
[[`project_name`.]dataset_name.]function_name
You will then need to add what parameter you expect to return plus what data type it returns. One note on this, you can have multiple params returned as long as they are comma-delimited. I’ve added examples here since I personally took a few attempts to work out what to return.
([named_parameter[, ...]])
[RETURNS data_type]
#Example
(userAge INT64)
#Multi value example
(userAge INT64, gender STRING)
Here’s where the fun starts — adding in your SQL expression. This can be a Scalar query or a simple script. Just think how much time you could save on joining back onto your data to apply logic. For any SAS users, this is very similar to a macro.
AS (sql_expression/logic/js)#Google's example
CREATE TEMP FUNCTION
addFourAndDivideAny(x ANY TYPE,
y ANY TYPE) AS ( (x + 4) / y );
SELECT
addFourAndDivideAny(3 /*x*/ ,3 /*y*/ ) AS integer_output
So let's see this in action:
Imagine a scenario where you have a dataset where you need to flatten multiple fields in your GA4 or Firebase dataset. Normally you would need to write out the standard script of:
(SELECT VALUE.INT_VALUE FROM UNNEST(EVENT_PARAMS) WHERE KEY = 'CCC')
However, that is super painful to repeat multiple times and likely to produce errors, so using a UDF, scripting and the fab SQL execute statement, I have generated the code below, which will automatically unnest your Firebase or GA4 data.
Be careful to select the specific date frame your reviewing and select specific events you wish to review. You can even remove the SQL execute and simply copy the query into another window or a tool such as Python.
So how do you save a UDF to your project?
I mentioned earlier that recently Google introduced persistent UDFs, eg ones which can be saved in your database and this is relatively simple to do. You simply need a destination dataset and the name of your UDF. This will then save in your dataset and include a function flag to make them easy to spot. You can then see the function values when you click on them and apply them to any code using the dataset.function_name.
CREATE FUNCTION UDFS.ga4_firebase
In fact, a great UDF community has formed an open-sourced UDF project which you can actually apply to your own work, including many statistical techniques such as P Values and much more simply by using the following syntax. Be sure to check them out:
SELECT bqutil.fn.udf_name(variable)
I hope you have enjoyed this brief intro to UDFs and have plenty of fun trying them out in your code. My next blog post will be focused on creating JS UDFs as well as accessing these when stored on the cloud (which open up a world of statistical opportunities).