Hands-on experience with BigQuery: how to query with Google Analytics Data
This article is for Marketers who want to learn how to analyze your Google Analytics data in BigQuery and work with partitioned tables and nested fields.
What you’ll expect in this article:
- Learn how Google Analytics data is structured
- See how to work with partitioned tables and nested fields
- Practice querying your Google Analytics data
Let’s start by discussing why you need to put your Google Analytics data in Google BigQuery?
The primary and foremost reason is that you want to have complete control of your GA data if Google ever decides to remove your GA property for a reason. And there, you will have your data stored in BigQeury for backup.
GA 360 and GA 4 allow you to store your data in BigQuery. Still, if you are using Universal Analytics, you may need to use the external connector .i.e Supermetrics, Fivetran, Bluemetrix, Stitch, etc., to assist you with the data migration to BigQuery.
Partitioned Table
According to Google Cloud reference, a partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
Partitioned table is not used only for Google Analytics, but for every data. And you would always need to store your data in it.
Nested Fields
BigQuery experts always recommend denormalizing your data in a single place. So simply put, nested fields are the repeated fields nested in one record. For example, your GA visit can have several hits, page views, events, etc. And the details will be stored as nested fields at this particular session.
Query and analyse your GA Data in BigQuery
In this guide, we’ll use GA sample in the public data sets to do some queries in BigQuery. To find the GA public data, simply click on +ADD Data > explore public sets > search Google Analytics > click on GA sample and verify it.
You can see the google_analytics_sample datasets listed under bigquery-public-data. By checking if table is segmented by the date of ingestion, you can recognise this dataset is built with partition table.
For the nested field, go to your preview, and you can find plenty of repeated fields, e.g. one session with all information about the hits stored in nested fields.
To work with dates and partition tables, click Query, and the last date available for this data will always show at the end.
So says you want to retrieve the data for more than the last day, you will need to follow the instructions below by inserting the * key to select the specific timeframe and apply condition with where_TABLE_SUFFIX between ‘specific date’ and ‘ specific date’. Using _TABLE_SUFFIX will give Google attention to the date in GA data, and you have to follow the date format in the columns date.
Note that when you run Queryquery, you will retrieve each page view per date, but if you want to know the sum of total pages per date?
This is the next step you will need to do — adding the sum (totals.pageviews) and renaming the field title as total_pageviews and group by date.
There is another way where you can enjoy more functions with table suffix. By adding the following queries:
FORMAT_DATE (‘%Y%m%d’, DATE_SUB(CURRENT_DATE(), INTERVAL 1095 DAY))
AND FORMAT_DATE (‘%Y%m%d’, DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY date
ORDER BY date
This means you subtract the total page view from the last 1095 days to the day before, with the corresponding format to display data accordingly. You can, of course, change the DAY to YEAR/MONTH to give you the exact data. This is the typical function that you can use for your BigQuery date data, and learn more about it here. So whenever you are working with the partitioned table, you will always need to add TABLE SUFFIX for your filter to function.
Working with Nested Fields
Before going ahead, this big query cookbook link is something you would need to check in order to proceed. There are many standard SQL you can use for practice, and we are going to the sequence of hits, for example:
#standardSQL
SELECT
fullVisitorId,
visitId,
visitNumber,
hits.hitNumber AS hitNumber,
hits.page.pagePath AS pagePath
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) as hits
WHERE
_TABLE_SUFFIX BETWEEN ‘20170701’ AND ‘20170731’
AND
hits.type=”PAGE”
ORDER BY
fullVisitorId,
visitId,
visitNumber,
hitNumber
— — — — — — — — — — — — — — — — — — — — — — — — — — — — —
How you can interpretQueryquery is that you need to select your normal field, which is
#standardSQL
SELECT
fullVisitorId,
visitId,
visitNumber,
and these fields are nested.
hits.hitNumber AS hitNumber,
hits.page.pagePath AS pagePath
To retrieve the nested fields, you must select from your table and choose the sub-table containing the nested fields.
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) as hits
So you might question how to identify nested fields? You will see from the table listed with the record and repeated, which is the nested field you are looking for.
And here, you specify the filter and order by, and then you hit RUN.
WHERE
_TABLE_SUFFIX BETWEEN ‘20170701’ AND ‘20170731’
AND
hits.type=”PAGE”
ORDER BY
fullVisitorId,
visitId,
visitNumber,
hitNumber
From the result, you can see the real sequence of the page path. It is always good to note that you need to use the UNNEST ( field ) as NAME if you want to retrieve the nested field.
Imagine creating a query to retrieve the totals of page views, hits and transactions for the date. What you can do is to add the sum () and run it
#standardSQL
SELECT
date,
SUM ( totals.pageviews) AS pageviews,
SUM ( totals.hits) AS hits,
SUM ( totals.transactions) AS transactions,
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) as hits
GROUP BY
date
ORDER BY
date
You will notice that the null number from your result, so what you can do is to apply the following function:
By using WITH function, we are basically telling Query to give a name to the subquery ( SELECT to ORDER BY) and select all (*) of the data from table_total with the condition applied (where).