GCP Data warehousing: It’s all in the BigQuery DDL

It’s always a challenge making BigQuery DataWarehouses more efficient. So today I want to share with you the joys of DDL and how you can reduce the time it takes to produce tables from your queries. In the past 6 months BigQuery’s DDL and DML functionality has exploded with plenty of new features and massive time savers, which mean less time spent trying to keep your BigQuery warehouse tidy.

So firstly- what even is DDL and why would you even use it?

DDL stands for Data Definition Language and is primarily used to create and modify tables. It is a universal term for many types of SQL, but in BigQuery it is used for:

  • Creating tables and views
  • Modifying tables and views
  • Deleting tables and views
  • Creating User-Defined Functions
  • A little more about UDFs in a later article (as they’re great)

It’s used in the console and within scripting to help frame and design your table structure. Within the BigQuery console, it saves lots of time and means you can automate code with things like clustering when it’s not currently available on the Scheduling interface.

So you have written a code that you wish to run daily. You could use the BigQuery scheduling tools, however, you also want to add clustering and descriptions to the variable names.

The code

The above is a simple query from the public datasets and available here

Firstly you should ask yourself: do you want to insert, create, or replace your table?

Insert into … if you want to append data into the table, just ensure you are not accidentally adding duplicates into the table

Create table … If you want to create a new table, however, this will only work as a one-off

Create or replace table … You’ll overwrite the existing table or create if it doesn’t exist, so remember to retain the data you wish to carry over

Create table if not exists… You’ll create a table if it doesn’t already exist, but again be sure to retain data you wish to carry over

All of these are have great uses but also should be used with caution (see our last article Link if things do go wrong)

Here is a sample for you to test out using DDL table creation queries in your own project

For more examples take a look at our repository for this post, github

So you’ve decided to create or replace your table, so how do you enhance your queries further?

There are so many options when it comes to DDL these include

  • Partitions
  • Clustering
  • Adding descriptions to columns and tables
  • Data retention policy

In this example, we create a table that has a lot of features, we will discuss features such as partitioning and how they can improve your table structure and query in future articles.

The code

This will use “create table”, this table will generate clusters and partitions which will help you and your colleagues subsequently use this table efficiently. Make sure you use the create and replace should your table already exist, however if you have not previously added partitioning or clustering you should append historical data and generate a new table.

The result

What this looks like on the resulting table. Please note that if you overwrite a table you will lose the descriptions.

You will also see more information on the Details tab- this includes the options to add descriptions, labels and data retention. You can manually edit these values, but including them in the code means you and automate your process if you wish this process to run regularly.

I hope you find this article useful, and DDL helps you automate some of your daily tasks to support your day to day work.

If you want to discuss more about DDL or the other great features and resources available via GCP and BigQuery, drop me a message.

Lead data specialist, and programming obsessive. Specialising in data team development. Love statistical languages and focusing on the GCP platform on apps.