What is a funnel analysis?
Funnel analysis is the practice of mapping and analyzing a series of events that leads to a defined goal. For example, consider a user who lands on an e-commerce website and searches for a product, clicks on one of the results to view its details, adds it to the cart, and finally completes the purchase.
As product managers, we might want to see how many users who search for a product end up buying it, and if they don't buy it, at which step they abandon it. Funnel analysis is the simple answer to this problem. So how do we build a funnel analysis tool? This blog covers all the necessary ingredients, plus some tips and hacks.
Data store and query engine
When we talk about analytics data, the volume is enormous, and the chosen platform should be able to read and write data at the scale of GBs within seconds. While evaluating different data stores and query engines, we found Druid to be fast, but it is best suited for aggregated data, report building, etc., and could not provide the flexibility we expected for funnel analysis in our case. We chose BigQuery for its high speed and the flexibility it offers to organize and query data. On top of that, BigQuery provides two magic ingredients, partitioning and clustering, which made our lives easier: they help whenever we filter our data. For an organization with mutually exclusive use cases, i.e. multiple business verticals such as an e-commerce site, a video streaming platform, or a payments platform, it is also advisable to create multiple smaller tables instead of one large table that holds data for all your projects. This keeps concerns separate and also makes analytics faster.
So for our use case, we will start making maximum use of these magic ingredients. As opposed to the most common practice of keeping all the analytics data in one place, we will first divide our larger dataset into smaller mutually exclusive project tables like one table for the e-commerce platform analytics data, one for the payments platform, one for the video streaming platform, and so on.
Partitioning and clustering are the keys to improving the performance of any query. We need to identify the most commonly used filters to choose the partitioning columns optimally. In this case, we will partition the tables on the date field, since a date range is a mandatory filter for any analytics query. The next most important column for us is the event name — product_searched, product_viewed, etc. — on which we will be building our funnel, so we cluster the table on the event field.
Note: Clustering can be done on multiple columns based on the use case.
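As a sketch of this layout (the dataset name is illustrative; the table and column names simply match the queries used later in this post), the BigQuery DDL could look like:

```sql
-- Illustrative table for the e-commerce vertical:
-- partitioned on the date column, clustered on the event name.
CREATE TABLE ecommerce_analytics.funnel_table (
  user_id    STRING,
  event_name STRING,
  timestamp  TIMESTAMP,
  event_date DATE
)
PARTITION BY event_date
CLUSTER BY event_name;
```

With this layout, a query that filters on event_date only scans the matching partitions, and the clustering on event_name further prunes the blocks that are read.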
How to formulate the query
Let’s say we want to see how many users complete product purchases. The relevant steps for the same are
- product_searched — user searches for a product using keywords
- product_viewed — user views the details of the product
- product_added_to_cart — user decides to buy the item
- product_purchase_completed — user makes the payment
This is a simple 4-step funnel. Here, we will count the number of unique users who completed this journey between the 10th and the 20th of June. We will build the query using the following steps.
Steps to build the funnel
Filtering
First, we want to filter the relevant data. We add a date filter so that only events between the 10th and the 20th of June are considered, along with filters for the relevant events:
SELECT event_name, timestamp, user_id
FROM funnel_table
WHERE event_date between "2023-06-10" and "2023-06-20"
and event_name in
('product_searched',
'product_viewed',
'product_added_to_cart',
'product_purchase_completed');
Arranging
The next step of building the funnel is arranging each user's events in chronological order, because we are interested in the users who performed these events one after the other in the order mentioned in the funnel. So we create a string of space-separated events by applying the aggregation below on top of our earlier query:
SELECT user_id, string_agg(event_name, ' ' order by timestamp) as user_journey
FROM (
  -- output of previous query
)
GROUP BY user_id
Now, the query will look something like this
SELECT user_id, string_agg(event_name, ' ' order by timestamp) as user_journey
FROM (
SELECT event_name, timestamp, user_id
FROM funnel_table
WHERE event_date between "2023-06-10" and "2023-06-20"
and event_name in
('product_searched',
'product_viewed',
'product_added_to_cart',
'product_purchase_completed')
)
GROUP BY user_id
and the processed data will have one row per user, with user_journey holding that user's chronologically ordered, space-separated events — e.g. `product_searched product_viewed product_added_to_cart product_purchase_completed` for a user who converted.
Regex filtering
With the user journeys in place, all the hard work is done. Anyone can look at this table and say how many users went from
product_searched → product_viewed
and from
product_searched → product_viewed → product_added_to_cart
and so on.
But how is our code going to do the same? That's right: using the title of this section — a regex. BigQuery is really fast at matching over long strings, so a regex search becomes the go-to approach for us.
Still, full event names make for a lengthy string and a complex regex search. To improve matching performance on the user journey string, we replace the event names with aliases in the query, such that all events of the first funnel level are represented as E1, those of the second level as E2, and so on.
Below is the modified query using these level ids as aliases instead of the event names:
SELECT user_id, string_agg(level_id, ' ' order by timestamp) as user_journey
FROM (
SELECT user_id, timestamp,
case event_name
when 'product_searched' then 'E1'
when 'product_viewed' then 'E2'
when 'product_added_to_cart' then 'E3'
when 'product_purchase_completed' then 'E4'
end as level_id
FROM funnel_table
WHERE event_date between "2023-06-10" and "2023-06-20"
and event_name in
('product_searched',
'product_viewed',
'product_added_to_cart',
'product_purchase_completed')
)
GROUP BY user_id
Putting the pieces of the puzzle together
Finally, we aggregate the result, and this query returns the count of users at every step of the funnel:
SELECT countif(regexp_contains(user_journey, r"E1")) as step1_count,
countif(regexp_contains(user_journey, r".*?E1 .*?E2")) as step2_count,
countif(regexp_contains(user_journey, r".*?E1 .*?E2 .*?E3")) as step3_count,
countif(regexp_contains(user_journey, r".*?E1 .*?E2 .*?E3 .*?E4")) as step4_count -- aka converted
FROM (
SELECT user_id, string_agg(level_id, ' ' order by timestamp) as user_journey
FROM (
SELECT user_id, timestamp,
case event_name
when 'product_searched' then 'E1'
when 'product_viewed' then 'E2'
when 'product_added_to_cart' then 'E3'
when 'product_purchase_completed' then 'E4'
end as level_id
FROM funnel_table
WHERE event_date between "2023-06-10" and "2023-06-20"
and event_name in
('product_searched',
'product_viewed',
'product_added_to_cart',
'product_purchase_completed')
)
GROUP BY user_id
)
The regular expression matches strings in which E1, E2, E3, and E4 appear sequentially, since we are interested in the user journeys where the user performs the 2nd event after the 1st, the 3rd event after the 2nd, and so on. This preserves the chronology of events. Reluctant (lazy) quantifiers are used in the regular expression to keep the scanning of the user journey string to a minimum, since the string can get really long when the date range grows to six months, a year, or more.
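As a quick sanity check of the step regexes (the journey strings here are made up for illustration), you can match them against literal values:

```sql
-- Three hypothetical journeys: a converted user, a user who stopped
-- after adding to cart, and a user whose events are out of order.
SELECT
  journey,
  regexp_contains(journey, r".*?E1 .*?E2 .*?E3") as reached_step3,
  regexp_contains(journey, r".*?E1 .*?E2 .*?E3 .*?E4") as converted
FROM UNNEST(['E1 E2 E3 E4', 'E1 E2 E3', 'E1 E3 E2']) as journey;
```

The third journey contains every level id but fails both checks, which is exactly the point: the regex enforces the order of the steps, not just their presence.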
These are the basic steps to create a funnel for your own business problems and dig deeper into consumer usability issues. This is not the end, though; the scope of funnel analytics goes far beyond what we covered in this blog, which is just a tiny fraction of it. Concepts like the unordered funnel, the exact-order funnel, completed-in windows, and many more are yet to be explored.
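To give a taste of one of those variants, here is one possible take on an unordered funnel — a sketch under the assumption that "unordered" means the user performed all four events in any order (`user_journeys` is a hypothetical table or subquery holding the journey strings built above):

```sql
-- Unordered-funnel sketch: count users who performed all four events,
-- regardless of the order in which they occurred.
SELECT countif(
  regexp_contains(user_journey, r"E1")
  and regexp_contains(user_journey, r"E2")
  and regexp_contains(user_journey, r"E3")
  and regexp_contains(user_journey, r"E4")
) as unordered_converted
FROM user_journeys;
```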
Lastly, a funnel doesn't improve user engagement by itself; it only points out what degrades it. It is the diagnosis of the issue, not the fix.
With the hope of better conversions… Until next time…
Keep coding!