BigQuery: Importing your data with SQL from Google Cloud and Drive storage

Lace Chantelle Rogers
3 min read · Mar 29, 2022

How often have you been asked to upload a Google Sheet or CSV into BigQuery, especially a sheet which updates regularly? The great news is that there is built-in functionality within BigQuery DDL (Data Definition Language) to let you do this without upskilling on other Google Cloud Platform tools or manually using the user interface.

There is a simple and elegant solution in BigQuery DDL: the CREATE EXTERNAL TABLE statement, which lets you connect to Drive or Cloud Storage and generate an external table.

An external table is effectively a view onto the source CSV or sheet. These tables are backed by storage external to BigQuery, for example Google Cloud Storage or Google Drive, and they reflect changes made in that source storage. So if you have a lookup reference doc you're likely to update, this may be the perfect solution for you.

One thing to note is that external tables do not have a preview option, and anyone querying the table from outside BigQuery, such as via Python or SAS, will need access to the source.

BigQuery supports the following external data sources (a quick Cloud Storage sketch follows the list; the rest of this post focuses on Drive):

  • Bigtable
  • Cloud Storage
  • Drive
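
For a Cloud Storage file, the statement you will see in the next section is identical apart from the format and a gs:// URI. Here is a minimal sketch, assuming a hypothetical bucket and file name; for Cloud Storage the URI may also contain a single * wildcard to pick up several files of the same layout:

# ***************************************************************
# From Cloud Storage (hypothetical bucket and file names)
# ***************************************************************
CREATE OR REPLACE EXTERNAL TABLE
  `project.testing.external_table_gcs`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/lookups/cities.csv'],
  skip_leading_rows = 1
);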

Basic code

The code itself requires a URI, and the options available very much depend on the format of the document. For example, if you use Google Sheets, you can specify the range to read (super handy if you only want specific columns), as sketched after the code block below.

Once connected, the table will continue to update, reflecting changes made in the source document.

# ***************************************************************
# From Drive: the source is a Google Sheet, so the format is
# GOOGLE_SHEETS (use CSV for a plain .csv file on Drive)
# ***************************************************************
CREATE OR REPLACE EXTERNAL TABLE
  `project.testing.external_table`
# ***************************************************************
# See the documentation for the full set of options; I have
# added the most common ones
# ***************************************************************
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/xxxxxxxxxxxxxxxxxxxxxxxx'],
  skip_leading_rows = 1,
  ignore_unknown_values = TRUE,
  description = 'extracted from drive',
  max_bad_records = 5
);
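
As mentioned above, with Google Sheets you can also restrict the read to a named tab or cell range via the sheet_range option. A minimal sketch, assuming a hypothetical tab called Lookup:

# ***************************************************************
# Only read columns A to D of the tab 'Lookup' (hypothetical)
# ***************************************************************
CREATE OR REPLACE EXTERNAL TABLE
  `project.testing.external_table`
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/xxxxxxxxxxxxxxxxxxxxxxxx'],
  sheet_range = 'Lookup!A1:D100',
  skip_leading_rows = 1
);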

Since this only generates an external table, meaning other tools will still need OAuth2 access to the source, it is well worth extending this code and scheduling it to generate a native table.

A native table is data stored within BigQuery and will not reflect changes to the source until refreshed. This means that anyone with the right permissions will be able to use the table without OAuth2 errors. It also allows you to rename columns, add descriptions, and reformat the data into the structure you wish the table to have.

The code below demonstrates changing column names and types, and adding descriptions to the columns as well as the table, plus adding clustering and table labels (more on these in a later blog post).

CREATE OR REPLACE TABLE
  `project.testing.native_table`
# column names and descriptions
(
  criteria_id INT64 OPTIONS(description = 'Coded value'),
  name STRING OPTIONS(description = 'City name'),
  rec STRING OPTIONS(description = 'Grouping short name in upper case, 4 characters'),
  standardised_rec STRING OPTIONS(description = 'Grouping short name')
)
CLUSTER BY
  rec,
  name
# table options
OPTIONS (
  description = "Generate native table from external table",
  labels = [("digital", "lookup")]
) AS
# SQL
SELECT
  CAST(CID AS INT64) AS criteria_id,
  name,
  rec,
  STD_REC AS standardised_rec
FROM
  `project.testing.external_table`;

If you then schedule these two pieces of code, I suggest adding a step to drop the external table, if it exists, to keep your dataset tidy and prevent confusion if other team members are using the table.

# *************************************************************
# Drop external table
# *************************************************************
DROP TABLE IF EXISTS `project.testing.external_table`;
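
If you would rather run everything as a single scheduled query, the three statements can sit in one script in this order. A condensed sketch reusing the tables above, with the external table options trimmed for brevity:

# 1. (Re)create the external table pointing at the sheet
CREATE OR REPLACE EXTERNAL TABLE `project.testing.external_table`
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/xxxxxxxxxxxxxxxxxxxxxxxx']
);

# 2. Materialise the data as a native table
CREATE OR REPLACE TABLE `project.testing.native_table` AS
SELECT
  CAST(CID AS INT64) AS criteria_id,
  name,
  rec,
  STD_REC AS standardised_rec
FROM
  `project.testing.external_table`;

# 3. Tidy up so only the native table remains
DROP TABLE IF EXISTS `project.testing.external_table`;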

As mentioned at the beginning of the post, there are many ways of handling data from Cloud Storage, Bigtable and Drive, so it's always worth considering the following:

  • Will the data be updated at any point?
  • How extensive is the wrangling required, and is your data structured?
  • Who will use this table, and how?
  • Is it a source supported by external tables?

Enjoy BigQuerying as always and be sure to message me with suggestions for other posts.


Lace Chantelle Rogers

https://www.bigquery.co.uk/, leading data and analytics. GCP, BigQuery and GA4 expert specialising in data engineering, data science and Brighton cloud events.