What is a funnel analysis?
Funnel analysis is the practice of mapping and analyzing a series of events that leads to a defined goal. For example, consider a user who lands on an e-commerce website and searches for a product, clicks on one of the results to view its details, adds it to the cart, and finally completes the purchase.
As product managers, we might want to see how many users who search for a product end up buying it, and if they don't buy it, at which step they abandon it. Funnel analysis is the simple answer to this problem. So how do we build a funnel analysis tool? This blog covers all the necessary ingredients, plus some tips and hacks.
Data store and query engine
When we talk about analytics data, the volume is enormous, and the chosen platform should be able to read and write data at the scale of GBs within seconds. While evaluating different data stores and query engines, we found Druid to be fast, but it is best suited for aggregated data, report building, etc., and could not provide the flexibility we expected for funnel analysis in our case. We chose BigQuery for its high speed and the flexibility it offers to organize and query data. On top of that, BigQuery provides two magic ingredients, partitioning and clustering, which made our lives easier: they help whenever we filter our data. For an organization with mutually exclusive use cases, i.e. multiple business verticals such as an e-commerce site, a video streaming platform, or a payments platform, it is also advisable to create multiple smaller tables instead of one large table that holds data for all your projects. This keeps concerns separate and also makes analytics faster.
So for our use case, we will start making maximum use of these magic ingredients. As opposed to the most common practice of keeping all the analytics data in one place, we will first divide our larger dataset into smaller mutually exclusive project tables like one table for the e-commerce platform analytics data, one for the payments platform, one for the video streaming platform, and so on.
Partitioning and clustering are the keys to improving the performance of any query. We need to identify the most commonly used filters to choose the partitioning columns optimally. In this case, we will partition the tables on the date field, since a date range is a mandatory filter for any analytics query. The next most important column for us is the event name — product_searched, product_viewed, etc. — on which we will be building our funnel, so we cluster the table on the event field.
Note: Clustering can be done on multiple columns based on the use case.
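As a sketch of this layout (the dataset name is illustrative; the table and column names simply match the queries used later in this post), the BigQuery DDL could look like:

```sql
-- Illustrative table for the e-commerce vertical:
-- partitioned on the date column, clustered on the event name.
CREATE TABLE ecommerce_analytics.funnel_table (
  user_id    STRING,
  event_name STRING,
  timestamp  TIMESTAMP,
  event_date DATE
)
PARTITION BY event_date
CLUSTER BY event_name;
```

With this layout, a query that filters on event_date only scans the matching partitions, and the clustering on event_name further prunes the blocks that are read.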
How to formulate the query
Let’s say we want to see how many users complete product purchases. The relevant steps for the same are
- product_searched — user searches for a product using keywords
- product_viewed — user views the details of the product
- product_added_to_cart — user decides to buy the item
- product_purchase_completed — user makes the payment
This is a simple 4-step funnel. Here, we will count the number of unique users who completed this journey between the 10th and the 20th of June. We will build the query using the following steps.
Steps to build the funnel
Filtering
First, we want to filter the relevant data. We add a date filter so that only events between the 10th and the 20th of June are considered, along with filters for the relevant events:
SELECT event_name, timestamp, user_id
FROM funnel_table
WHERE event_date between "2023-06-10" and "2023-06-20"
and event_name in
('product_searched',
'product_viewed',
'product_added_to_cart',
'product_purchase_completed');
Arranging
The next step of building the funnel is arranging each user's events in chronological order, because we are interested in the users who performed these events one after the other in the order mentioned in the funnel. So we create a string of space-separated events by applying the aggregation below on top of our earlier query:
SELECT user_id, string_agg(event_name, ' ' order by timestamp) as user_journey
FROM (
  -- output of previous query
)
GROUP BY user_id
Now, the query will look something like this
SELECT user_id, string_agg(event_name, ' ' order by timestamp) as user_journey
FROM (
SELECT event_name, timestamp, user_id
FROM funnel_table
WHERE event_date between "2023-06-10" and "2023-06-20"
and event_name in
('product_searched',
'product_viewed',
'product_added_to_cart',
'product_purchase_completed')
)
GROUP BY user_id
and the processed data will have one row per user, with user_journey holding that user's chronologically ordered, space-separated events — e.g. `product_searched product_viewed product_added_to_cart product_purchase_completed` for a user who converted.
Regex filtering
With the user journeys in place, all the hard work is done. Anyone can look at this table and say how many users went from
product_searched → product_viewed
and from
product_searched → product_viewed → product_added_to_cart
and so on.
But how is our code going to do the same? That's right: using the title of this section — a regex. BigQuery is really fast at matching over long strings, so a regex search becomes the go-to approach for us.
Still, full event names make for a lengthy string and a complex regex search. To improve matching performance on the user journey string, we replace the event names with aliases in the query, such that all events of the first funnel level are represented as E1, those of the second level as E2, and so on.
Below is the modified query using these level ids as aliases instead of the event names:
SELECT user_id, string_agg(level_id, ' ' order by timestamp) as user_journey
FROM (
SELECT user_id, timestamp,
case event_name
when 'product_searched' then 'E1'
when 'product_viewed' then 'E2'
when 'product_added_to_cart' then 'E3'
when 'product_purchase_completed' then 'E4'
end as level_id
FROM funnel_table
WHERE event_date between "2023-06-10" and "2023-06-20"
and event_name in
('product_searched',
'product_viewed',
'product_added_to_cart',
'product_purchase_completed')
)
GROUP BY user_id
Putting the pieces of the puzzle together
Finally, we aggregate the result, and this query returns the count of users at every step of the funnel:
SELECT countif(regexp_contains(user_journey, r"E1")) as step1_count,
countif(regexp_contains(user_journey, r".*?E1 .*?E2")) as step2_count,
countif(regexp_contains(user_journey, r".*?E1 .*?E2 .*?E3")) as step3_count,
countif(regexp_contains(user_journey, r".*?E1 .*?E2 .*?E3 .*?E4")) as step4_count -- aka converted
FROM (
SELECT user_id, string_agg(level_id, ' ' order by timestamp) as user_journey
FROM (
SELECT user_id, timestamp,
case event_name
when 'product_searched' then 'E1'
when 'product_viewed' then 'E2'
when 'product_added_to_cart' then 'E3'
when 'product_purchase_completed' then 'E4'
end as level_id
FROM funnel_table
WHERE event_date between "2023-06-10" and "2023-06-20"
and event_name in
('product_searched',
'product_viewed',
'product_added_to_cart',
'product_purchase_completed')
)
GROUP BY user_id
)
The regular expression matches strings in which E1, E2, E3, and E4 appear sequentially, since we are interested in the user journeys where the user performs the 2nd event after the 1st, the 3rd event after the 2nd, and so on. This preserves the chronology of events. Reluctant (lazy) quantifiers are used in the regular expression to keep the scanning of the user journey string to a minimum, since the string can get really long when the date range grows to six months, a year, or more.
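As a quick sanity check of the step regexes (the journey strings here are made up for illustration), you can match them against literal values:

```sql
-- Three hypothetical journeys: a converted user, a user who stopped
-- after adding to cart, and a user whose events are out of order.
SELECT
  journey,
  regexp_contains(journey, r".*?E1 .*?E2 .*?E3") as reached_step3,
  regexp_contains(journey, r".*?E1 .*?E2 .*?E3 .*?E4") as converted
FROM UNNEST(['E1 E2 E3 E4', 'E1 E2 E3', 'E1 E3 E2']) as journey;
```

The third journey contains every level id but fails both checks, which is exactly the point: the regex enforces the order of the steps, not just their presence.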
These are the basic steps to create a funnel for your own business problems and dig deeper into consumer usability issues. This is not the end, though; the scope of funnel analytics goes far beyond what we covered in this blog, which is just a tiny fraction of it. Concepts like the unordered funnel, the exact-order funnel, completed-in windows, and many more are yet to be explored.
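To give a taste of one of those variants, here is one possible take on an unordered funnel — a sketch under the assumption that "unordered" means the user performed all four events in any order (`user_journeys` is a hypothetical table or subquery holding the journey strings built above):

```sql
-- Unordered-funnel sketch: count users who performed all four events,
-- regardless of the order in which they occurred.
SELECT countif(
  regexp_contains(user_journey, r"E1")
  and regexp_contains(user_journey, r"E2")
  and regexp_contains(user_journey, r"E3")
  and regexp_contains(user_journey, r"E4")
) as unordered_converted
FROM user_journeys;
```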
Lastly, a funnel doesn't improve user engagement by itself; it only points out what degrades it. It is the diagnosis of the issue, not the fix.
With the hope of better conversions… Until next time…
Keep coding!