BigQuery – Getting started on the BigQuery Google Analytics 4 and Firebase datasets

Lace Chantelle Rogers
5 min read · Aug 24, 2021

--

BigQuery — I’m a little in love with this resource

Working with apps and marketing data, you can imagine I spend a huge amount of time working with the GA4 (also known as Firebase) data streams. This has helped me build up a repertoire of tips and tricks for handling dates and unnesting data, plus useful syntax I use daily to build analysis, reporting and models (all the fun stuff).

So firstly let me introduce you to the Google Analytics 4 eCommerce public dataset which was released last week:

SELECT
  DISTINCT event_name
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
ORDER BY
  1

This data contains 92 tables of daily event-based GA4 data, meaning you can really get your teeth into building data models even if your own app or web GA4 container is not launched yet. I also highly recommend familiarising yourself with the GA4 export schema, which you will lean on constantly when working with GA4 streamed datasets.
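As a quick sanity check, you can confirm the date range and number of daily tables by aggregating the table suffix itself; here is a minimal sketch against the same public dataset:

SELECT
  MIN(_TABLE_SUFFIX) AS first_day,
  MAX(_TABLE_SUFFIX) AS last_day,
  COUNT(DISTINCT _TABLE_SUFFIX) AS table_count
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`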

The dataset is located in the US, so make sure you adjust your processing location in the BigQuery interface or you will see the following error.

Error from selecting EU as the processing location

Getting that table efficiently

You may notice that in the query above I've used a wildcard on the table name; this is because all GA4, GA360 and Firebase event tables are suffixed with 'YYYYMMDD'. These are not partitioned tables, but they are stacked, and therefore a wildcard can be used. This also lets you automate your queries instead of hardcoding each date (which no one wants to do).

The drop-down for any GA4 stream
How the tables appear when using the interface

However, with the benefits of the wildcard comes a risk: accidentally running queries across your entire history of tables, which for companies with huge datasets could generate significant cost over time. To mitigate this you need to incorporate _table_suffix into your WHERE statement.

In the query below the current date is formatted to match the suffix and used to filter it, but you can also use DECLARE statements and other methods to populate the suffix (there's a sketch of that after the query). I also highly recommend placing the _table_suffix filter at the top of the WHERE statement, as it can reduce processing cost and time.

SELECT
  DISTINCT event_date
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY))
  AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
ORDER BY
  1
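And the DECLARE variant mentioned above, as a minimal sketch in which the variable names are my own; note that cost pruning behaviour can differ when the suffix filter is not a constant expression, so check the bytes-processed estimate:

DECLARE start_suffix STRING DEFAULT FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY));
DECLARE end_suffix STRING DEFAULT FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY));

SELECT
  DISTINCT event_date
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN start_suffix AND end_suffix
ORDER BY
  1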

If you are joining the stacked table onto another table and require the _table_suffix filter, place it in the ON clause; however, if you are joining onto the stacked table, use the WHERE statement (see the sketch below). Always check the query processing bar in BigQuery, or use dry runs, to assess the bytes processed.
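For the first case, here is a minimal sketch; the lookup table and its columns are hypothetical, so swap in your own. Putting the suffix filter in the WHERE clause here would effectively turn the LEFT JOIN into an INNER JOIN, which is why it sits in the ON clause instead.

SELECT
  c.channel_group,
  COUNT(e.event_name) AS events
FROM
  `my-project.reference.source_lookup` AS c # hypothetical lookup table
LEFT JOIN
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*` AS e
ON
  c.source = e.traffic_source.source
  AND e._TABLE_SUFFIX BETWEEN '20210101' AND '20210102'
GROUP BY
  1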

BigQuery built-in processing prediction

Dates, dates and more dates

So now that you have mastered the schema and the table wildcards, you will notice there are quite a few date fields. It is super important to use the right ones and to use them consistently across your reporting. So what do these dates really mean, and what tricks are there for using them?

event_date: This is quite literally the date the record was added to the table. It's in a string format, which means if you want it as a date you need to parse it.

SELECT
  PARSE_DATE('%Y%m%d', '20200101') AS date
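Applied to the table itself, that looks something like this (limited to a single day's table to keep the bytes processed down):

SELECT
  event_date,
  PARSE_DATE('%Y%m%d', event_date) AS parsed_date
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  _TABLE_SUFFIX = '20210101'
LIMIT
  10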

event_timestamp: This is the time in microseconds, which is different from GA360 and UA data where it is in seconds, so you will need to use TIMESTAMP_MICROS instead of TIMESTAMP_SECONDS to generate a timestamp. Something to note is that the event_timestamp is when the event was triggered on the device rather than when it reached the server. This means that on apps where customers can use features offline, the device will cache the events until it is connected to the internet. In addition, devices with a poor internet connection may cache data. This data is stored for up to 72 hours or 10 MB, which is why you will often see the event_timestamp generate dates which vary from the event_date.

In addition to the above, if a user has adjusted their device time the timestamp can appear wildly out… I think we all know certain games where we used to change the device time to get extra levels… well, this will show up in your event_timestamp.

SELECT
  event_timestamp,
  TIMESTAMP_MICROS(event_timestamp) AS timestamp, # generate a readable timestamp
  event_server_timestamp_offset # offset between collection and upload time, in microseconds
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
ORDER BY
  1

event_previous_timestamp is also in microsecond format and a super powerful field for establishing when the event last occurred (it only reports on the same event type). It will be null if there are no previous occurrences of the event.
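A minimal sketch of how you might use it, calculating the gap since the previous occurrence of the same event (filtering out rows where the field is null):

SELECT
  event_name,
  TIMESTAMP_MICROS(event_timestamp) AS event_time,
  TIMESTAMP_MICROS(event_previous_timestamp) AS previous_event_time,
  TIMESTAMP_DIFF(TIMESTAMP_MICROS(event_timestamp), TIMESTAMP_MICROS(event_previous_timestamp), SECOND) AS seconds_since_last
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  _TABLE_SUFFIX = '20210101'
  AND event_previous_timestamp IS NOT NULL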

device.time_zone_offset_seconds: If an app or website is available worldwide, you will often want to compare behaviour in local time. This is easily done using TIMESTAMP_ADD; you must use the _ADD function because the offset can be negative, and adding a negative interval handles the subtraction for you.

Interestingly, you can use this method to see how commuter times impact trends and how commuting behaviour varies and correlates across the world. It's also super useful if you want to time sales, promos and messaging based on analysis. Have fun!

SELECT
  TIMESTAMP_MICROS(event_timestamp) AS timestamp,
  TIMESTAMP_ADD(TIMESTAMP_MICROS(event_timestamp), INTERVAL device.time_zone_offset_seconds SECOND) AS local_time
FROM
  `firebase-public-project.analytics_153293282.events_*`
WHERE
  device.time_zone_offset_seconds IS NOT NULL
ORDER BY
  1

Finally, to user or not to user

You may have noticed a slight change in naming conventions if you are used to the GA360 dataset: client_id is noticeably missing. Never fear, user_pseudo_id is to the rescue. This is effectively the instance ID that allows you to track a device's journey. However, as with cookies, you will find that when a user deletes the app or clears their data, the existing user_pseudo_id is no longer tracked and a new one is generated.

In comes the user_id. If your developers have set this up correctly, you can track a user by their ID across devices and services. One note, however: a user_id can be associated with multiple user_pseudo_id values, and it won't match across devices until the user signs in on each of them. When used correctly this variable is powerful and will help you design user journeys and support reporting of user-level KPIs.

You must also ensure your user_id abides by Google's privacy rules, e.g. do not put their email address in it!

SELECT
  DISTINCT COALESCE(user_id, user_pseudo_id) AS id,
  user_id,
  user_pseudo_id
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
ORDER BY
  1
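To see the point about one user_id spanning several devices, here is a quick sketch counting pseudo IDs per signed-in user; note the public sample may not populate user_id, so this is best run against your own export:

SELECT
  user_id,
  COUNT(DISTINCT user_pseudo_id) AS device_instances
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20210101' AND '20210131'
  AND user_id IS NOT NULL
GROUP BY
  1
ORDER BY
  2 DESC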

The GA4 and Firebase datasets are incredibly powerful tools for interrogating your data; however, as you can see, it's important to familiarise yourself with the schema and the data first. Once you have, the world is your oyster for analysing your app and web data.

--


Lace Chantelle Rogers

Leading data and analytics at https://www.bigquery.co.uk/. GCP, BigQuery and GA4 expert specialising in data engineering, data science and Brighton cloud events.