Google BigQuery automation and Google Analytics 4 daily files
As you may already know, Firebase and Google Analytics 4 both have a great feature in that they provide the daily event data in BigQuery tables. If you have selected the streaming option you will see an intraday table for today, and stacked daily event tables. The intraday table for yesterday is normally moved to a daily event table around 4–7 am (UTC).
Often if you wish to build automated secondary tables or reports from these tables, you will use the _table_suffix functionality as mentioned in my previous blog post: Getting started with GA4 BigQuery.
Recently I saw a thread around delays and missing with the daily and intraday tables being populated. This rarely happens, however when it does, if you haven’t scripted effectively, you may need to manually back run missing data, which we all know can be a painful and time consuming experiance.
Why not simply use the _table_suffix= current_date()-1?
When you use this functionality and your table is appending yesterday’s data, if the file is missing, your code will simply look at the next table when it runs tomorrow. Resulting in the dreaded message, what happened on this date?
Along comes SQL Scripting the rescue
I am a great fan of scripting SQL, mainly as the more time I free up from processing reporting and tables, the more time I have to do the fun stuff like analysing and building cool models. And thankfully over the past two years, BigQuery and GCP have been heavily investing in making sure scripting functionality is available to support this.
In this case, I will be sharing functionality for a daily update to a table that used the Google analytics 4 tables. This functionality will work with any stacked tables with a date suffix, and I will be writing further blogs on enhancing this functionality further.
Step 1: Declare your variables
Declare statements must sit at the top of your code and serve to generate local variables within your code. These can be populated with default values, simple outcomes, or even use SQL to generate dynamic declare values.
The declare statement required a name, a data type alongside the optional default outcome.
DECLARE date_value DATE DEFAULT CURRENT_DATE();
In this process, we will be generating five statements, with no default outcome. These will be populated in subsequent steps, either by hardcoding or SQL statements
# checks for the ga4 table
DECLARE ga4_table_updated date;
#checks if the table has been updated and contains data within it
DECLARE secondary_table INT64;
#selects the date the updates start from
DECLARE secondary_table_update date;
# selects table name
DECLARE dest_table_name string;
# selects table name
DECLARE source_table_name string;
# selects date_var in your destination table
DECLARE date_var string;
Step 2: Add your source and destination tables
Simply add your destination table and your source event files, this code will only accommodate one destination, and requires your Google Analytics dataset.
/******************************************************/
#MANUAL UPDATES
/******************************************************/
SET
dest_table_name = 'testing.exist_tests' ;
SET
source_table_name = 'mydatasource.analytics_222222222' ;
SET
date_var = 'event_date';
Step3: Generate declared variables:
The next table requires a destination table which contains a date, all variables in this part of the code are automated and no manual change is needed. This is because we are writing a string variable which is then executed. The variable is then input into the declare statement.
Because the second statement, utilises the secondary_table_update variable, we use a SET statement. In order to support automation, the second part of the query will check if the table has already been updated today.
/******************************************************/
#CHECK THE DESTINATION TABLE MAX DATE
/******************************************************/EXECUTE IMMEDIATE (
'SELECT MAX('|| date_var ||' ) FROM '|| dest_table_name
) INTO secondary_table_update;/******************************************************/
#CREATE VARIABLE 1 OR 0 TO RUN OR NOT DEPENDANT ON IF TABLE HAS ALREADY BEEN UPDATED
/******************************************************/
SET
secondary_table =
(SELECT
(CASE WHEN secondary_table_update = (DATE_SUB(DATE(current_date()), INTERVAL 1 day)) THEN 1 ELSE 0 END)) ;
Step 4: If the variables are generated then check the last date the source file was created;
There are many ways to check if a table exists, including the nifty exist function. In this case we are using DML to review the contents of the analytics dataset to get the latest table. Additionally we parse the table name to remove non numeric data and generate a date. Once again as we are using the table name generated at the beginning of the code, we use execute.
We also only run this statement if the destination table hasn’t been updated today.
/******************************************************/
#IF THE TABLE HAS NOT BEEN UPDATED TODAY IT WILL RUN THE REMAINDER OF THE CODE;
/******************************************************/
IF secondary_table =1 THEN/******************************************************/
#THIS STEP CHECKS THE LATEST FILE AVAILABLE
/******************************************************/
EXECUTE IMMEDIATE (
'SELECT MAX(PARSE_DATE("%Y%m%d",'|| "REGEXP_REPLACE(table_name,r'[^0-9.]', ''))) AS date_value FROM `"|| source_table_name
||"`.INFORMATION_SCHEMA.TABLES WHERE table_name LIKE 'events_2%'" )
INTO ga4_table_updated;
Step 5: Add your code
The second IF statement will only run if your Google Analytics tables have updated since the destination table has updated. One thing you may want to consider is adding even more scripting into this part of the code so you don’t have to change the source or the destination table.
/******************************************************/
# ONLY WILL RUN THIS PART IF THE GA4 TABLE IS GREATER THEN THE TABLE UPDATES
/******************************************************/
IF ga4_table_updated > secondary_table_update THEN/******************************************************/
#ADD QUERY HERE (MUST HAVE A DATE VARIABLE)
/******************************************************/CREATE OR REPLACE TABLE
testing.exist_tests AS
SELECT
DISTINCT DATE(TIMESTAMP_MICROS(event_timestamp)) AS event_date,
stream_id
FROM
`mydatasource.analytics_222222222.events_*`
where _table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE_ADD(secondary_table_update, INTERVAL 1 DAY))
AND FORMAT_DATE('%Y%m%d',ga4_table_updated) ;/******************************************************/
#ENDS THE IF;
/******************************************************/
END IF ;
END IF;
Finally — in the BigQuery UI the output you will only see the pieces of code where the script fit the eligibility criteria.
There are many other ways of running this process, including pub/sub and having a datawarehouse update reporting table, however I generally find this an effective solution with limited coding. For the full code please see my github gist.