BigQuery: A greener way to warehouse
Nowadays it’s incredibly easy to upload, store, query and edit data, but how often do we consider the carbon (and other) footprint of doing so? I attended an eye-opening conference hosted by Silicon Brighton which prompted me to ask: how can I make my warehouses more sustainable?
I am not an expert in green policy; however, with an engineering skillset I can contribute to and support the journey toward greener warehouses, and so can you, in a few simple steps.
Google Cloud Platform, as ever, has already considered this, and I will take you through a few steps you can take, without huge upheaval, to start taming your warehouse’s carbon footprint.
Google has recently released the new Carbon Footprint tool, which is tied to your billing account. Simply follow the link and, if you have access to a billing account, you can see gross estimates of your carbon footprint. The tool is in preview at the moment, so more features will no doubt be added over time, but it already breaks emissions down by service, which helps you establish where to focus your initial efforts.
Steps you can take today for a greener data warehouse
Before getting started, I highly recommend reading this article, which shows some of the approaches GCP is taking to move toward cloud sustainability.
When you get started on your data warehouse green plan, it’s worth spending time mapping out your warehouse from source to reporting. This will enable you to understand what you’ll need, or already have in place, and help identify opportunities to create a greener warehouse. I personally use tools like Visio, but there are plenty of other tools to help with this challenge, including the good old whiteboard.
Simple steps to a greener warehouse
Storage Region
Which region will your data be stored in, and is it viable to store your datasets in a low-carbon data centre? GCP provides handy icons to highlight regions with low carbon emissions, and is adding such sites all the time. That said, remember to consider where your other data lives and whether you have legal restrictions on where your data can be stored or processed.
One important note: if you choose to migrate your current datasets to one of these locations, use Google’s BigQuery Data Transfer Service, and move every dataset you may want to join in the future, since datasets must be in the same location to be queried together.
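As a minimal sketch, assuming europe-north1 (one of the regions flagged as low-CO2 at the time of writing) suits your legal and latency constraints, and using a placeholder project and dataset name, a dataset can be created in that location with a single DDL statement:

```sql
-- Create a dataset in a low-carbon region (region, project and dataset
-- names are illustrative; check the leaf icon in the region picker first).
CREATE SCHEMA IF NOT EXISTS `my-project.green_dataset`
OPTIONS (
  location = 'europe-north1',
  description = 'Data stored in a low-carbon region'
);
```

New tables created in this dataset will then live in the chosen region by default, so picking the location well once saves repeated decisions later.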
Do you need this data?
It’s super easy nowadays to simply upload all data from all sources into your cloud-based warehouse, but have you considered whether it will be used, or whether it is even required any more?
If you need to retain data that you are unlikely to use, move it to Cloud Storage using the Coldline or Archive storage classes. This also carries the benefit of being cheaper than BigQuery storage and will make your active warehouse easier to manage.
If the data needs transforming, use Cloud Storage as a data lake and set up automated lifecycle rules to move files into Coldline storage once they have been transformed and loaded into BigQuery.
Don’t be afraid to delete data (this sends a shiver down any data engineer's spine). If you know you’ll never need the data and it is not retained for legal reasons, consider deletion. This is even more true for transformed data where you may already be storing the raw data in Cloud Storage.
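If outright deletion feels too final, BigQuery can tidy up for you via table expiration. A hedged sketch, with hypothetical project, dataset and table names:

```sql
-- Let staging data expire automatically 30 days from now
-- (project, dataset and table names are illustrative).
ALTER TABLE `my-project.staging.raw_events`
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
);

-- Or, when you are certain the data is no longer needed:
DROP TABLE IF EXISTS `my-project.staging.raw_events_backup`;
```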
A question always worth asking when storing, transforming or managing data: just because you can, should you?
Are you duplicating your data?
How often do you end up with the same source in multiple lakes and warehouses across your company? The good news is that GCP provides both BigQuery Omni and data transfer tools, which enable you to query data held by other providers such as AWS, for example reading directly from S3 buckets. This means you do not need to duplicate data stored with other cloud providers; simply access it via GCP’s native tools, as sketched below. You can also use tools such as Python, SAS and R to combine and process sources locally for one-off analyses.
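As an illustration, and assuming you have already set up a BigQuery Omni connection to AWS (the connection, dataset and bucket names below are hypothetical), an external table over S3 data looks roughly like this:

```sql
-- Query Parquet files sitting in S3 without copying them into BigQuery.
-- The dataset must live in the matching AWS location (here aws-us-east-1);
-- connection, dataset and bucket names are illustrative.
CREATE EXTERNAL TABLE `my-project.aws_lake.orders`
WITH CONNECTION `aws-us-east-1.my_s3_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-company-datalake/orders/*']
);
```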
Auditing your warehouse
How often do you check your data warehouse and lake for unused tables, scheduled queries and background processes? This can feel like a huge minefield, and, let’s be honest, an uphill task if you have inherited a warehouse or lake. But never fear, INFORMATION_SCHEMA is here. This is a set of metadata views which let you check when tables were last updated, how many rows they hold and much, much more. I personally create small summary tables that let me review what is in my warehouse and alert me when a table has not been updated for some time.
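A minimal sketch of that kind of check, using a placeholder project and dataset; this one uses the __TABLES__ metadata view, which sits alongside INFORMATION_SCHEMA and exposes last-modified times, row counts and sizes:

```sql
-- Find tables that have not been updated for over 90 days
-- (project and dataset names are illustrative).
SELECT
  table_id,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified,
  row_count,
  ROUND(size_bytes / POW(10, 9), 2) AS size_gb
FROM `my-project.analytics.__TABLES__`
WHERE TIMESTAMP_MILLIS(last_modified_time)
      < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
ORDER BY last_modified;
```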
Couple this with audit log management, where you can see table dependencies alongside who is actually using the data, and you can spot opportunities for table deletion with confidence.
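For a lighter-weight view of who is querying which tables, the INFORMATION_SCHEMA JOBS views can be queried directly; a sketch follows, where the region qualifier and the 30-day look-back window are assumptions you should adjust:

```sql
-- Who queried which tables in the last 30 days, and how many bytes did it cost?
-- (change region-eu to wherever your jobs run).
SELECT
  ref.dataset_id AS dataset_id,
  ref.table_id AS table_id,
  user_email,
  COUNT(*) AS query_count,
  SUM(total_bytes_processed) AS bytes_processed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS ref
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY dataset_id, table_id, user_email
ORDER BY bytes_processed DESC;
```

Tables that never appear in this output are strong candidates for archiving or deletion.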
Effective controls, such as ensuring only certain people can create tables, help make this task manageable and prevent your warehouses from sprawling out of control.
Likewise, having oversight of your company's projects allows a cohesive approach to overall GCP data management.
Switch it off
If it’s not useful, turn it off. This seems simple, but how often are we afraid of switching off ETL processes? If you have Compute Engine instances running code which is now redundant, find out what they are doing, and do the same for scheduled queries and Data Studio assets, which also process data. This comes with the added benefit of impressing your finance officer with the reduction in GCP spend.
Keep it in GCP
Where possible, build your processes to be self-contained within GCP; this includes when using ETL tools and Cloud Run. If the data is already in your BigQuery warehouse, transform it there rather than extracting it and transferring it to a third-party tool. Likewise, regularly checking which native data transfer connectors are available means you may no longer need custom pipelines to get your Play Store or Google Ads data into BigQuery.
Code effectively
We all know efficient code reduces cost, but have you ever considered that it also reduces your carbon footprint? Every query you run consumes compute resources. So when you’re building something new, keep up to date with new BigQuery features and develop against a sample rather than the full table.
Using DDL and ASSERT statements means you will not process data if it is not available or if there is an error in the table, as sketched below.
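Two small examples of what this looks like in practice, with illustrative table names: developing against a sample with TABLESAMPLE, and guarding a transformation with ASSERT so the heavy processing never runs against missing data.

```sql
-- Develop and test against a 1% sample instead of scanning the full table
-- (table names are illustrative).
SELECT user_id, event_name
FROM `my-project.analytics.events`
TABLESAMPLE SYSTEM (1 PERCENT);

-- Guard a scheduled transformation: fail fast (and skip the heavy work)
-- if today's source data has not landed yet.
ASSERT (
  SELECT COUNT(*)
  FROM `my-project.staging.raw_events`
  WHERE event_date = CURRENT_DATE()
) > 0 AS 'No rows for today - source load has not completed';
```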
For data-heavy processes like machine learning, plan before you jump into the code and establish whether you can reduce the number of variables or data sources used.
Plan your tables
Map out how your data will be used and design tables around the output requirements. For example, when building a Google Analytics 4 Data Studio report from the BigQuery export, consider creating daily aggregated tables that update each morning, rather than connecting to the granular tables or custom SQL (a sketch follows below). Likewise, enable a daily PDF report for people who will review it daily but won’t need to filter the data.
You could also align the data freshness settings in Data Studio with when the source table updates, meaning far fewer data processing jobs.
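A minimal sketch of such a daily aggregate, assuming the standard GA4 BigQuery export (the analytics_123456 and reporting dataset IDs are placeholders) and run as a scheduled query each morning:

```sql
-- Rebuild a rolling 90-day summary for the Data Studio report
-- (dataset IDs are placeholders for your GA4 export and reporting datasets).
CREATE OR REPLACE TABLE `my-project.reporting.daily_ga4_summary` AS
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS report_date,
  event_name,
  COUNT(DISTINCT user_pseudo_id) AS users,
  COUNT(*) AS events
FROM `my-project.analytics_123456.events_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
GROUP BY report_date, event_name;
```

Data Studio then only ever touches the small summary table, rather than rescanning the granular event tables every time someone opens the report.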
Keep reading and learning
Take this course to become a pro: Cloud Skills Boost’s “A Tour of Cloud Sustainability”.
There are many, many ways of making your warehouse more sustainable, and small, simple steps make the journey to a greener warehouse far more manageable. This is by no means an exhaustive list, so be sure to follow Google’s regular updates on its green policy and tooling and, if you can, attend its first green sustainability conference.