BigQuery: Soundex function and UTMs reducing hardcoding on spelling errors

What is Soundex

In a nutshell, it indexes words based on the sound pronounced in English. It generated a three-letter code which you can use to group and ultimately re-classify mis-spelling and terms into grouped variables. It ignores non-latin characters, which means when you have that pesky underscore and hyphen in your UTM you can regroup without having to strip and work with more complicated string functions.

What is a UTM

A UTM put it simply is an addition to the website you are promoting or connecting to, which allows you to track where your visit originated from. They are super easy to produce by adding a set range of text to the end of your URL. You’ll then be able to leverage and utilise this additional data to analyse where your site or app visit originated from. And for us, BigQuery users, see this in the GA4 traffic records.

The format:

URL https://www.website/podcast/

Where does Soundex come in?

Whilst the UTM parameters are super powerful, they are also easy to produce erroneous data, plus some networks, such as Facebook can have multiple groupings, depending on if it was mobile, app, web or social. Often there are variances in how teams record UTMs, including the pesky misspellings which inherently occur or variances in historic data.

Sample Code

SELECT
DISTINCT
name,
SOUNDEX(name) AS mediated_name
FROM
`data_examples.utm_dummy`
WHERE name IS NOT NULL
ORDER BY
2,
1

The output

From this one small function, you can see my terrible spelling for email has been resolved, coupled with additional code, you can actually automatically re-code your values based on position or the largest page views. One note is to ensure you generate a look-up or reference to ensure you are not then erroneously re-classifying names.

Automating re-classification

Soundex functions can wrap around string functions, alongside many other approaches. Below is a quick example of how I have automated reclassification of sources to help group my Facebook marketing performance.

WITH BASE AS (
SELECT
DISTINCT
medium,
name,
source,
SOUNDEX(CASE WHEN SUBSTRING(source,1,3) LIKE '%.%' THEN SPLIT(source, '.')[ OFFSET (1)] ELSE source END) AS soundex,
(CASE
WHEN SUBSTRING(source,1,3) LIKE '%.%' THEN SPLIT(source, '.')[ OFFSET (1)]
ELSE source END ) AS tidy_name,
FROM
`data_examples.utm_dummy`
GROUP BY
1,2,3,4,5 )
SELECT
*,
FIRST_VALUE(tidy_name) OVER (PARTITION BY soundex ORDER BY name) AS lead_name
FROM
base
ORDER BY
2,
1

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store