- Fri 03 July 2020
- datascience
- #pivot table, #behavior analytics, #behavioral economics
Project Purpose¶
I wanted to learn more about the field of behavioral analytics, which I believe to be an interesting and promising field at the intersection of behavioral science and data science. Cohort analysis seemed like a good place to start.
Cohort analysis is a subset of behavioral analytics that takes data from a given platform and breaks users into related groups for analysis. There groups, or cohorts, usually share similar characteristics within a given time frame.
Cohort analysis can be valuable because it separates growth and engagement metrics since growth can mask engagement problems. Lack of activity in older users can be masked by growth from new users.
About the Data¶
I used the “Online Retail Data Set” from the UCI Machine Learning Repository. This is a data set containing all the transaction between 1/12/2010 and 9/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
The data has 8 columns:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
Steps Taken¶
- Explore and clean dataset
- Create 3 cohorts and conduct a cohort analysis on each
- Visualize the results using heatmaps
Step 1) Exploring and Cleaning¶
import numpy as np
import pandas as pd
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
df = pd.read_excel('Online Retail.xlsx')
df.head()
#size of data
print('rows: {}\ncolumns: {}'.format(df.shape[0],df.shape[1]))
#count of missing values for each column
df.isna().sum()
I dropped the missing CustomerID rows because there is no real way to impute these and there are still plenty of rows left for analysis.
df.dropna(subset=['CustomerID'],inplace=True)
df.info()
Steps 2 & 3) Cohort Analysis & Visualization¶
# Sort invoices by month and year
df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')
# first month a customer made a purchase
df['CohortMonth'] = df.groupby(
'CustomerID')['InvoiceMonth'].transform('min')
df.info()
df['year_dif'] = df['InvoiceMonth'].dt.year - \
df['CohortMonth'].dt.year
df['month_dif'] = df['InvoiceMonth'].dt.month - \
df['CohortMonth'].dt.month
df['CohortIndex'] = df['year_dif'] * 12 + df['month_dif']
df['CohortIndex'].value_counts().plot(kind='bar')
Most purchases seem to be from customers whose first order was in the same month. This means that most purchases are coming from new customers.
Cohort 1: User Retention Rate¶
To see the number of active users from each cohort, I can create a pivot table of CohortMonth and the CohortIndex.
# take each unique CustomerID from a given month and index
# reset_index() creates a new index for each combination of CohortMonth
# and CohortIndex so that the data is not multi-indexing
cohort1_df = df.groupby(['CohortMonth', 'CohortIndex'])[
'CustomerID'].apply(pd.Series.nunique).reset_index()
cohort1_df.head()
cohort1_pivot = pd.pivot_table(cohort1_df,
index='CohortMonth',
columns='CohortIndex',
values='CustomerID')
cohort1_pivot
In the table, the first column shows the total number of active users during that month. Each row shows the number of people that were still buying things after having made a purchase in that month. For example, in CohortIndex 12 and month 2010-12 the 260 represents 260 out of 948 users were still buying things 12 months after 2010-12.
I divide the pivot table by the total number of new users for a given month to get the percentage of people that continue buying things each month.
customer_count = cohort1_pivot.iloc[:, 0]
cohort1_pivot = cohort1_pivot.divide(customer_count, axis=0)
cohort1_pivot.round(3) * 100
All of the first column is 100% because if a user bought something that month, they are by nature a customer for that month. To visualize, I put the table into a seaborn heatmap.
plt.figure(figsize=(12, 10))
plt.title('Retention Rate')
sns.heatmap(data=cohort1_pivot,
annot=True,
fmt='.0%',
cmap='coolwarm')
The first cohort seems to have a much higher retention rate than other cohorts, with 27% still making purchases after a year. In addition, 50% of users in the first cohort returned 11 months later in 2011-11, an anomaly that should be investigated through looking at the marketing strategies or other differences that occured in that month.
Cohort 2: Average Quantity Sold¶
Other methods exist to see how the store is performing. Another one of these metrics is quantity sold. There could be a substantial difference between the number of users retained and the quantity they buy.
cohort2_df = df.groupby(['CohortMonth', 'CohortIndex'])[
'Quantity'].mean().reset_index()
cohort2_pivot = pd.pivot_table(cohort2_df,
index='CohortMonth',
columns='CohortIndex',
values='Quantity')
cohort2_pivot.round(1)
cohort2_pivot
And now into visualized form:
plt.figure(figsize=(12, 10))
plt.title('Average Quantity Sold')
sns.heatmap(data=cohort2_pivot,
annot=True,
cmap='coolwarm')
Over time it seems that each cohort tends to buy a lower quantity over time. It makes sense that the top row buys the most as they are early adopters, but more investigation is needed as to the cause of the overall decline.
Cohort 3: Average Sales¶
Lastly, user retention and average quantity sold don't cover the full picture of the store's success. Sales is needed as well because people could be buying less but more expensive items and vice versa. The process for the cohort analysis is the same across all 3 of the cohort analyses.
df['Sales'] = df['Quantity'] * df['UnitPrice']
cohort3_df = df.groupby(['CohortMonth', 'CohortIndex'])[
'Sales'].mean().reset_index()
cohort3_pivot = pd.pivot_table(cohort3_df,
index='CohortMonth',
columns='CohortIndex',
values='Sales')
cohort3_pivot.round(1)
cohort3_pivot
plt.figure(figsize=(12, 10))
plt.title('Average Sales')
sns.heatmap(data=cohort3_pivot,
annot=True,
cmap='coolwarm')
It seems that the average sales seems to follow the overall trend of quantity. This means that the overall decrease in quantity sold over later cohorts is most likely not due to sold items being more expensive on average.
Conclusion¶
In this post, I demonstated an application of cohort analysis. It is a simple and effective tool for tracking the behavior of different groups of users. It is also highly versatile as you can look at many different time frames and different metrics for analysis. Overall, it is useful for understanding user behavior better to drive smarter decisions for a given product or service.