BigQuery is Google Cloud's managed data warehouse. It is a great place to store data for analytical queries, and it can complete complex queries quickly against large datasets. However, it can be tricky to write accurate queries against BigQuery because of the duplicate data problem.
Why can BigQuery tables have duplicate data?
BigQuery stores its data in a column-based format called Capacitor. In a column-based storage format, each column of data is stored separately. Column-based storage confers many benefits, which I discuss in my previous blog post. However, it has a significant drawback compared to row-based systems.
In a row-based storage system you can enforce a constraint like "each row of data must have a unique identifier." When you insert a row with an existing identifier, the system forces you to choose: overwrite the existing data for that identifier, or cancel the insertion. Column-based storage systems have no such concept of uniqueness. This means that when you insert a row with an existing identifier into BigQuery, you end up with two rows sharing the same identifier.
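To make the contrast concrete, here is a toy sketch using SQLite (a row-based store) as a stand-in. The table and column names are made up for illustration; the point is that a row-based engine can reject or overwrite a duplicate identifier at write time, which BigQuery cannot do.

```python
import sqlite3

# A row-based store (SQLite here, purely as an illustration) can enforce
# uniqueness at write time with a primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visit_id INTEGER PRIMARY KEY, page TEXT)")
conn.execute("INSERT INTO visits VALUES (1, '/home')")

# Inserting the same identifier again is rejected outright...
duplicate_rejected = False
try:
    conn.execute("INSERT INTO visits VALUES (1, '/about')")
except sqlite3.IntegrityError:
    duplicate_rejected = True

# ...or we can explicitly choose to overwrite the existing row instead.
conn.execute("INSERT OR REPLACE INTO visits VALUES (1, '/about')")
rows = conn.execute("SELECT * FROM visits").fetchall()
print(duplicate_rejected, rows)  # True [(1, '/about')]
```

Either way, the row-based engine guarantees there is exactly one row per identifier. A column-based store like BigQuery simply appends the second row.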
But why can't we just write each row of data into BigQuery once? Then we wouldn't have this problem at all!
Why is there duplicated data in our BigQuery tables?
Making a software system do something "exactly once" is a notoriously difficult problem. The current state of the art is to settle for a system doing something "at least once." Unfortunately, this means we should expect our software systems to write duplicate rows into BigQuery.
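A minimal sketch of why "at least once" produces duplicates: if the writer never receives an acknowledgement, it cannot tell whether the write failed or only the ack was lost, so the safe move is to retry. The ack sequence below is hard-coded to make the scenario deterministic; in a real pipeline it would be network behavior.

```python
table = []  # stands in for a BigQuery table: appends only, no uniqueness

def write_with_retries(row, acks):
    # acks simulates the ack outcome the network gives us on each attempt.
    for ack_received in acks:
        table.append(row)  # the write itself always lands in the table
        if ack_received:
            return         # ack arrived, so the client stops retrying

# Visit 1's first ack is lost, so the client retries and the row lands twice.
write_with_retries({"visitId": 1}, acks=[False, True])
write_with_retries({"visitId": 2}, acks=[True])

ids = [row["visitId"] for row in table]
print(ids)  # [1, 1, 2] — visit 1 was written twice
```

The client did nothing wrong: retrying was the only way to guarantee the row was written at all. The duplicate is the price of that guarantee, and it has to be cleaned up at read time.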
As a side note, not every application requires data to be perfectly correct. Sometimes it is fine to have a little duplicate data and for queries to be a little off. But some applications cannot tolerate incorrect data, and we should figure out how to design those systems too!
How can we deduplicate data stored in BigQuery?
Let's look at an example of how to deduplicate data stored in BigQuery. BigQuery offers a public dataset of anonymized Google Analytics data we can use. The `bigquery-public-data.google_analytics_sample.ga_sessions_*` tables hold about 1M rows. But how many duplicates do they have? We can find out with a query like:
WITH duplicates_per_session AS (
  SELECT
    COUNT(*) AS duplicates
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  GROUP BY
    visitId
)
SELECT
  duplicates,
  COUNT(*) AS num_sessions
FROM
  duplicates_per_session
GROUP BY
  duplicates
Just as we expected, most visits (869,452 of them) have a single row of data, but some have more than one. We want one row of data per visit! There are many ways to get there; here is the one I prefer:
SELECT
  agg.t.*
FROM (
  SELECT
    ARRAY_AGG(STRUCT(t)
      ORDER BY visitStartTime DESC
      LIMIT 1)[SAFE_OFFSET(0)] agg
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*` t
  GROUP BY
    visitId
)
This query groups by visitId, keeps the row with the latest visitStartTime, and discards the rest. We can save this query as a BigQuery view to reuse it easily.
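The "keep the newest row per identifier" logic can be sketched outside SQL as well. Here is a toy Python version over a few hand-made rows; the visitId and visitStartTime field names come from the dataset, while the pageviews field is made up for illustration.

```python
# Hand-made rows, including one duplicated visitId with a newer timestamp.
rows = [
    {"visitId": 1, "visitStartTime": 100, "pageviews": 3},
    {"visitId": 1, "visitStartTime": 200, "pageviews": 5},  # duplicate, newer
    {"visitId": 2, "visitStartTime": 150, "pageviews": 1},
]

# Group by visitId, keeping only the row with the latest visitStartTime
# (the equivalent of ORDER BY visitStartTime DESC, then taking the first row).
latest = {}
for row in rows:
    seen = latest.get(row["visitId"])
    if seen is None or row["visitStartTime"] > seen["visitStartTime"]:
        latest[row["visitId"]] = row

deduplicated = sorted(latest.values(), key=lambda r: r["visitId"])
print(deduplicated)
```

The duplicate older row for visit 1 is dropped, leaving exactly one row per visitId, which is what the SQL query above does at warehouse scale.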
Unfortunately, this query is not cheap. Can we optimize it so our queries are faster and cheaper?
Can we do better?
BigQuery offers a feature called materialized views, which store query results and update them incrementally. This means those results can be read without rerunning the full computation every time they are needed. Maybe we can use a BigQuery materialized view to deduplicate our data? That would make queries against our deduplicated data fast, inexpensive, and accurate!
Unfortunately, this is not an option yet. Every reasonable piece of deduplication logic I could find relies on functions like the ARRAY_AGG and SAFE_OFFSET used above, and BigQuery materialized views do not yet support them.
We can use materialized views to optimize many other parts of our querying, but not deduplication. For now, the best solution for deduplicating BigQuery data is an ordinary (non-materialized) view that deduplicates the data every time it is queried.