image.png

Instacart Market Basket Analysis

Catie Fuentes

Introduction

The goal is to analyze the Instacart Online Grocery Shopping Dataset 2017. (https://www.instacart.com/datasets/grocery-shopping-2017) using the market basket analysis technique in order to find common patterns in the the customer's behaviors.

Market basket analysis consists of analyzing large datasets that include purchase history and products that are likely to be purchased together. Retailers use this to understand their customer's needs & behaviors in order to increase sales.

The key question in the market basket analysis is what products are most frequently purchased together.

https://runestone.academy/ns/books/published/DSA501_2022_HTTLADS/Instacart/market_basket.html

Objective

Find patterns based on the products ordered to understand the customer's needs and behaviors.

Determine if there are patterns in the kinds of products customers order and if they are telling of the lifestlye/diet-type (plant-based/vegetarian) of the customer- in order to market similar products to them.

Overview of the Datasets

The Instacart data consists of five datasets including: aisles, departments, order_products_prior, order_products_train, orders, and products.

There are 134 aisles, 21 departments, 3421083 orders, 206209 users, and 49688 products.

Features of Aisles Dataframe:

aisle_id: the aisle identification number

aisle: the corresponding name of the aisle

Features of Departments Dataframe:

department_id: the department identification number

department: the corresponding name of the department

Features of Products Dataframe:

product_id: product identification number

product_name: the corresponding product name

aisle_id: aisle identification number where the product is located

department_id: department identification number that the product belongs to

Features of Orders Dataframe:

order_id: order identification number

user_id: user identification number

eval_set: whether the order was a prior order or a train order

order_number: order number

order_dow: day of the week the order was placed

order_hour_of_day: hour of day the order was placed

days_since_prior_order: days since the users last order

Features of Prior Order Dataframe:

order_id: order identification number

product_id: product identification number

add_to_cart_order: the order in which each product was added to the cart

reordered: whether or not the product was reordered

Features of Train Order Dataframe:

order_id: order identification number

product_id: product identification number

add_to_cart_order: the order in which each product was added to the cart

reordered: whether or not the product was reordered

Importing Libraries & Data

In [1]:
%matplotlib inline

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import requests
matplotlib.style.use('ggplot')
sns.set_style("whitegrid")
import json
import pickle
import scipy

alt.data_transformers.enable('json')
Out[1]:
DataTransformerRegistry.enable('json')
In [2]:
#!pip install kaggle # <- This is only needed the first time, I believe

# Install needed libraries
import json

# API credentials

token={"username":"catiefuentes","key":"4583b186134810a0769888ac7c442d00"}

# Create a folder in your Google account using the unix command "mkdir"
!mkdir ~/.kaggle

# Place the API credentials in the right place
with open('/root/.kaggle/kaggle.json','w') as file:
  json.dump(token, file)

# Change the permissions of the file kaggle.json
!chmod 600 /root/.kaggle/kaggle.json 

# Get a list of the datasets in kaggle with "instacart" in its name
!kaggle datasets list -s instacart
ref                                                       title                                              size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
psparks/instacart-market-basket-analysis                  Instacart Market Basket Analysis                  197MB  2017-11-20 03:08:26           9400        132  0.4117647        
viswajithkn/instacart-predict-shopping-time               Instacart Predict Shopping Time                    12MB  2018-10-29 17:24:48            517         11  0.64705884       
yasserh/instacart-online-grocery-basket-analysis-dataset  InstaCart Online Grocery Basket Analysis Dataset  197MB  2022-01-25 16:43:37           1085         36  0.9411765        
samwhitehill/instacart-training-sample                    InstaCart training sample                           1MB  2017-07-01 20:24:11            182          2  0.5882353        
brendanartley/simplifiedinstacartdata                     Simplified Instacart Market Basket Dataset        184MB  2022-04-03 14:53:59             86          1  0.8235294        
samwhitehill/instacart-sample-labels                      Instacart sample labels                             9KB  2017-07-01 20:26:16             32          1  0.4117647        
streptkinase/instacart-data-cache                         Instacart_data_cache                              409MB  2020-07-31 05:35:52              9          1  0.1764706        
rajashrideka/instacart-analysis-dataset                   Instacart_Analysis_Dataset                        238MB  2022-07-22 19:28:26              1          1  0.1764706        
emmettdoolan/instacart-mba                                Instacart_MBA                                     198MB  2020-04-08 13:46:01             69          1  0.11764706       
tungbinhthuong/instacart-train-test                       instacart_train_test                              303MB  2022-07-16 06:18:45              2          1  0.11764706       
despoinapatsavoudi/instacat2assignment1                   instacat-2-assignment.1                             4KB  2021-11-21 03:26:28              2          2  0.0              
streptkinase/user-features-instacart                      user_features_instacart                             4MB  2020-08-06 05:56:24             10          1  0.11764706       
In [3]:
# Download the appropriate dataset and put it in the content folder
!kaggle datasets download -d psparks/instacart-market-basket-analysis -p /content
Downloading instacart-market-basket-analysis.zip to /content
 98% 193M/197M [00:06<00:00, 36.2MB/s]
100% 197M/197M [00:06<00:00, 32.1MB/s]
In [4]:
#Decript (unzip) all the encrypted files (with the ending ".zip") 
!unzip \*.zip
Archive:  instacart-market-basket-analysis.zip
  inflating: aisles.csv              
  inflating: departments.csv         
  inflating: order_products__prior.csv  
  inflating: order_products__train.csv  
  inflating: orders.csv              
  inflating: products.csv            
In [23]:
#### COMMENTED OUT FOR USE ONLY IF KAGGLE API IS NOT WORKING ###

#url='http://136.183.139.117/dsa501data/instacart/'

#aisles=pd.read_csv(url+'aisles.csv')
#departments=pd.read_csv(url+'departments.csv')
#order_products_prior=pd.read_csv(url+'order_products__prior.csv')
#order_products_train=pd.read_csv(url+'order_products__train.csv')
#orders=pd.read_csv(url+'orders.csv')
#products=pd.read_csv(url+'products.csv')

#files=['aisles','departments','order_products__prior','order_products__train','orders','products']
#for file in files:
  #globals()[file]=pd.read_csv(url+file+'.csv')

Instacart Datasets Displayed Side-by-side

In [5]:
# created dataframes for each dataset
orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')
order_products_prior = pd.read_csv('order_products__prior.csv')
order_products_train = pd.read_csv('order_products__train.csv')
aisles = pd.read_csv('aisles.csv')
departments = pd.read_csv('departments.csv')

files=['aisles','departments','order_products__prior','order_products__train','orders','products']
for file in files:
  globals()[file]=pd.read_csv(file+'.csv')
In [6]:
# Got this code from JVP book. It displays frames side by side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
In [7]:
display(files[0],files[1],files[2])
Out[7]:

aisles

aisle_id aisle
0 1 prepared soups salads
1 2 specialty cheeses
2 3 energy granola bars
3 4 instant foods
4 5 marinades meat preparation
... ... ...
129 130 hot cereal pancake mixes
130 131 dry pasta
131 132 beauty
132 133 muscles joints pain relief
133 134 specialty wines champagnes

134 rows × 2 columns

departments

department_id department
0 1 frozen
1 2 other
2 3 bakery
3 4 produce
4 5 alcohol
5 6 international
6 7 beverages
7 8 pets
8 9 dry goods pasta
9 10 bulk
10 11 personal care
11 12 meat seafood
12 13 pantry
13 14 breakfast
14 15 canned goods
15 16 dairy eggs
16 17 household
17 18 babies
18 19 snacks
19 20 deli
20 21 missing

order_products__prior

order_id product_id add_to_cart_order reordered
0 2 33120 1 1
1 2 28985 2 1
2 2 9327 3 0
3 2 45918 4 1
4 2 30035 5 0
... ... ... ... ...
32434484 3421083 39678 6 1
32434485 3421083 11352 7 0
32434486 3421083 4600 8 0
32434487 3421083 24852 9 1
32434488 3421083 5020 10 1

32434489 rows × 4 columns

In [8]:
display(files[3],files[4],files[5])
Out[8]:

order_products__train

order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1
... ... ... ... ...
1384612 3421063 14233 3 1
1384613 3421063 35548 4 1
1384614 3421070 35951 1 1
1384615 3421070 16953 2 1
1384616 3421070 4724 3 1

1384617 rows × 4 columns

orders

order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0
... ... ... ... ... ... ... ...
3421078 2266710 206209 prior 10 5 18 29.0
3421079 1854736 206209 prior 11 4 10 30.0
3421080 626363 206209 prior 12 1 12 18.0
3421081 2977660 206209 prior 13 1 12 7.0
3421082 272231 206209 train 14 6 14 30.0

3421083 rows × 7 columns

products

product_id product_name aisle_id department_id
0 1 Chocolate Sandwich Cookies 61 19
1 2 All-Seasons Salt 104 13
2 3 Robust Golden Unsweetened Oolong Tea 94 7
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
4 5 Green Chile Anytime Sauce 5 13
... ... ... ... ...
49683 49684 Vodka, Triple Distilled, Twist of Vanilla 124 5
49684 49685 En Croute Roast Hazelnut Cranberry 42 1
49685 49686 Artisan Baguette 112 3
49686 49687 Smartblend Healthy Metabolism Dry Cat Food 41 8
49687 49688 Fresh Foaming Cleanser 73 11

49688 rows × 4 columns

The Features of the Dataset

In [9]:
features_dict={}
for file in files:
  features_dict[file]=[globals()[file].columns.values]
df=pd.DataFrame(features_dict).T 
df.rename(columns = {0:'Features'}, inplace=True)
df.index.name = "File"
df = df.style.set_properties(**{'text-align': 'left'}) 
df
Out[9]:
  Features
File  
aisles ['aisle_id' 'aisle']
departments ['department_id' 'department']
order_products__prior ['order_id' 'product_id' 'add_to_cart_order' 'reordered']
order_products__train ['order_id' 'product_id' 'add_to_cart_order' 'reordered']
orders ['order_id' 'user_id' 'eval_set' 'order_number' 'order_dow' 'order_hour_of_day' 'days_since_prior_order']
products ['product_id' 'product_name' 'aisle_id' 'department_id']

Head() for each dataset

In [10]:
aisles.head()
Out[10]:
aisle_id aisle
0 1 prepared soups salads
1 2 specialty cheeses
2 3 energy granola bars
3 4 instant foods
4 5 marinades meat preparation
In [11]:
departments.head()
Out[11]:
department_id department
0 1 frozen
1 2 other
2 3 bakery
3 4 produce
4 5 alcohol
In [12]:
orders.head()
Out[12]:
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0
In [40]:
orders.describe()
Out[40]:
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
count 3.421083e+06 3.421083e+06 3.421083e+06 3.421083e+06 3.421083e+06 3.214874e+06
mean 1.710542e+06 1.029782e+05 1.715486e+01 2.776219e+00 1.345202e+01 1.111484e+01
std 9.875817e+05 5.953372e+04 1.773316e+01 2.046829e+00 4.226088e+00 9.206737e+00
min 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 8.552715e+05 5.139400e+04 5.000000e+00 1.000000e+00 1.000000e+01 4.000000e+00
50% 1.710542e+06 1.026890e+05 1.100000e+01 3.000000e+00 1.300000e+01 7.000000e+00
75% 2.565812e+06 1.543850e+05 2.300000e+01 5.000000e+00 1.600000e+01 1.500000e+01
max 3.421083e+06 2.062090e+05 1.000000e+02 6.000000e+00 2.300000e+01 3.000000e+01
In [63]:
orders['user_id'].value_counts()
Out[63]:
152340    100
185641    100
185524    100
81678     100
70922     100
         ... 
13046       4
70208       4
111440      4
31762       4
47980       4
Name: user_id, Length: 206209, dtype: int64
In [13]:
products.set_index('product_id', inplace=True)
products.head()
Out[13]:
product_name aisle_id department_id
product_id
1 Chocolate Sandwich Cookies 61 19
2 All-Seasons Salt 104 13
3 Robust Golden Unsweetened Oolong Tea 94 7
4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
5 Green Chile Anytime Sauce 5 13
In [41]:
products.describe()
Out[41]:
aisle_id department_id
count 49688.000000 49688.000000
mean 67.769582 11.728687
std 38.316162 5.850410
min 1.000000 1.000000
25% 35.000000 7.000000
50% 69.000000 13.000000
75% 100.000000 17.000000
max 134.000000 21.000000
In [15]:
order_products__prior.head()
Out[15]:
order_id product_id add_to_cart_order reordered
0 2 33120 1 1
1 2 28985 2 1
2 2 9327 3 0
3 2 45918 4 1
4 2 30035 5 0
In [16]:
order_products__train.head()
Out[16]:
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1

Data Description Visualizations

The plot below shows the frequency of orders placed throughout the week, with Saturday labeled as 0 and Friday labeled as 6. Most orders were placed during the weekend on Saturdays & Sundays and the least amount of orders were placed in the middle of the week on Tuesdays & Wednesdays. This is most likely due to customers wanting to prepare for the week ahead & avoid busy grocery stores on the weekend.

In [17]:
plt.figure(figsize=(12,8))
sns.countplot(x="order_dow", data=orders)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of orders by day of week", fontsize=15)
plt.show()

The plot below shows the frequency of orders by hour of day. It is labeled based on 24 hours in a day, with 0 being 12am and 23, being 11pm. Most orders were placed from 10am to 4pm. This is likely because customers may be busier during this time, since most people work from 9am-5pm.

In [18]:
plt.figure(figsize=(12,8))
sns.countplot(x="order_hour_of_day", data=orders)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of orders by hour of day", fontsize=15)
plt.show()

Exploratory Data Analysis

In [19]:
order_products = order_products_prior.append(order_products_train)
order_products.shape
Out[19]:
(33819106, 4)
In [20]:
order_products.head()
Out[20]:
order_id product_id add_to_cart_order reordered
0 2 33120 1 1
1 2 28985 2 1
2 2 9327 3 0
3 2 45918 4 1
4 2 30035 5 0
In [21]:
order_products.product_id.nunique()
Out[21]:
49685
In [22]:
product_counts = order_products.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})
product_counts = product_counts.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)
product_counts = product_counts.merge(products, on = 'product_id', how = 'left')
product_counts.head(10)
Out[22]:
product_id frequency product_name aisle_id department_id
0 24852 491291 Banana 24 4
1 13176 394930 Bag of Organic Bananas 24 4
2 21137 275577 Organic Strawberries 24 4
3 21903 251705 Organic Baby Spinach 123 4
4 47209 220877 Organic Hass Avocado 24 4
5 47766 184224 Organic Avocado 24 4
6 47626 160792 Large Lemon 24 4
7 16797 149445 Strawberries 24 4
8 26209 146660 Limes 24 4
9 27845 142813 Organic Whole Milk 84 16

The top 10 products that were most frequently ordered were: Banana, Bag of Organic Bananas, Organic Strawberries, Organic Baby Spinach, Organic Hass Avocado, Organic Avocado, Large Lemon, Strawberries, Limes, and Organic Whole Milk. 9 out of the 10 products come from the produce section. The high frequency of bananas may be skewed due to customers ordering one individual banana at a time, rather than a bundle.

In [23]:
ax = sns.barplot(x="frequency", y="product_name", data=(product_counts).head(10))
plt.title("Frequency of top 10 ordered products", fontsize=15)
Out[23]:
Text(0.5, 1.0, 'Frequency of top 10 ordered products')
In [24]:
product_departments = product_counts.merge(departments)
product_departments
Out[24]:
product_id frequency product_name aisle_id department_id department
0 24852 491291 Banana 24 4 produce
1 13176 394930 Bag of Organic Bananas 24 4 produce
2 21137 275577 Organic Strawberries 24 4 produce
3 21903 251705 Organic Baby Spinach 123 4 produce
4 47209 220877 Organic Hass Avocado 24 4 produce
... ... ... ... ... ... ...
95 9076 58269 Blueberries 116 1 frozen
96 20995 34248 Organic Broccoli Florets 116 1 frozen
97 25890 52369 Boneless Skinless Chicken Breasts 49 12 meat seafood
98 31506 52323 Extra Virgin Olive Oil 19 13 pantry
99 27156 39577 Organic Black Beans 59 15 canned goods

100 rows × 6 columns

The highest frequency of products ordered came from the produce department and the lowest came from the canned goods department. This could be due to produce going bad quickly & needing to be replaced more frequently. And contrastly, canned goods may not be as in-demand because they have a longer shelf-life and do not need to be replaced often.

In [25]:
ax = sns.barplot(x="frequency", y="department", data=product_departments)
plt.title("Frequency of products ordered from each department", fontsize=15)
Out[25]:
Text(0.5, 1.0, 'Frequency of products ordered from each department')

Exploring Patterns in Plant-based/Vegetarian Options

I merged the datasets into one table.

To look for plant-based/vegetarian-friendly products:

I dropped any departments that contained: dairy eggs/meat seafood/deli & any aisles that contained: meat/dairy/poultry/chicken/beef/fish/jerky.

I filtered the dataframe to only show plant-based/vegetarian products that users added to cart 1st & reordered.

In [26]:
#merge tables: orders and order_products__prior

df=orders.merge(order_products__prior)
In [27]:
veg_df=df.merge(product_departments)
In [28]:
veg_df
Out[28]:
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered frequency product_name aisle_id department_id department
0 2539329 1 prior 1 2 8 NaN 196 1 0 37298 Soda 77 7 beverages
1 2398795 1 prior 2 3 7 15.0 196 1 1 37298 Soda 77 7 beverages
2 473747 1 prior 3 3 12 21.0 196 1 1 37298 Soda 77 7 beverages
3 2254736 1 prior 4 4 7 29.0 196 1 1 37298 Soda 77 7 beverages
4 431534 1 prior 5 4 15 28.0 196 1 1 37298 Soda 77 7 beverages
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7483876 861106 206183 prior 3 6 15 7.0 43122 6 1 38686 Organic Bartlett Pear 24 4 produce
7483877 3122161 206183 prior 10 6 12 28.0 43122 20 1 38686 Organic Bartlett Pear 24 4 produce
7483878 2233071 206183 prior 14 6 11 13.0 43122 11 1 38686 Organic Bartlett Pear 24 4 produce
7483879 2554068 206202 prior 2 4 19 16.0 43122 5 0 38686 Organic Bartlett Pear 24 4 produce
7483880 3392839 206208 prior 3 2 15 7.0 43122 10 0 38686 Organic Bartlett Pear 24 4 produce

7483881 rows × 15 columns

In [29]:
#dropping any departments that would not be vegetarian-friendly

vegetarian = veg_df[veg_df["department"].str.contains("dairy eggs|meat seafood|deli") == False]
In [30]:
vegetarian = vegetarian.merge(aisles)
In [31]:
plant_df = vegetarian[vegetarian["aisle"].str.contains("meat|dairy|poultry|chicken|beef|fish|jerky") == False]
In [32]:
#filter to see only add_to_cart_order 1 & reorders

plant_df[(veg_df.add_to_cart_order==1) & (veg_df.reordered==1)]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  This is separate from the ipykernel package so we can avoid doing imports until
Out[32]:
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered frequency product_name aisle_id department_id department aisle
1 2398795 1 prior 2 3 7 15.0 196 1 1 37298 Soda 77 7 beverages soft drinks
2 473747 1 prior 3 3 12 21.0 196 1 1 37298 Soda 77 7 beverages soft drinks
3 2254736 1 prior 4 4 7 29.0 196 1 1 37298 Soda 77 7 beverages soft drinks
4 431534 1 prior 5 4 15 28.0 196 1 1 37298 Soda 77 7 beverages soft drinks
5 3367565 1 prior 6 2 7 19.0 196 1 1 37298 Soda 77 7 beverages soft drinks
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6413544 2349455 205908 prior 60 1 16 9.0 3957 5 1 39271 100% Raw Coconut Water 31 7 beverages refrigerated
6413545 503230 205908 prior 66 6 13 5.0 3957 3 1 39271 100% Raw Coconut Water 31 7 beverages refrigerated
6413546 2588366 205960 prior 2 3 11 18.0 3957 10 0 39271 100% Raw Coconut Water 31 7 beverages refrigerated
6413549 2297298 205984 prior 6 5 15 8.0 3957 2 1 39271 100% Raw Coconut Water 31 7 beverages refrigerated
6413557 2320180 206061 prior 4 4 13 17.0 3957 3 1 39271 100% Raw Coconut Water 31 7 beverages refrigerated

565962 rows × 16 columns

In [33]:
plant_df.fillna(0, inplace = True)

plant_df.isnull().sum()
Out[33]:
order_id                  0
user_id                   0
eval_set                  0
order_number              0
order_dow                 0
order_hour_of_day         0
days_since_prior_order    0
product_id                0
add_to_cart_order         0
reordered                 0
frequency                 0
product_name              0
aisle_id                  0
department_id             0
department                0
aisle                     0
dtype: int64
In [34]:
len(plant_df['user_id'].unique())
Out[34]:
189421

The plot below shows the frequency of the most popular plant-based products. Most products are from the Produce department, with the exception of: soda, coconut water, extra virgin olive oil, and wheat bread.

In [119]:
plt.rcParams['figure.figsize'] = (18, 7)
sns.countplot(plant_df['product_name'])

plt.title('Frequency of most popular products',fontsize = 30)
plt.xticks(rotation = 90 )
plt.show()
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
In [50]:
plant_df2 = plant_df[["order_hour_of_day","order_dow"]].mean()
plant_df2
Out[50]:
order_hour_of_day    13.416873
order_dow             2.668837
dtype: float64

The plot below shows the frequency of plant-based orders by day of week. It is similar to the frequency of orders for the original dataset (above), with weekends (day 0 & 1) being the most popular days to order. However, there is a more significant decrease in frequency of orders from Monday-Thursday.

In [58]:
plt.figure(figsize=(12,8))
sns.countplot(x="order_dow", data=plant_df)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of plant-based orders by day of week", fontsize=15)
plt.show()

The plot below shows the frequency of plant-based orders by hour of day. It looks almost identical to the original plot (above) containing all of the products.

In [59]:
plt.figure(figsize=(12,8))
sns.countplot(x="order_hour_of_day", data=plant_df)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of plant-based orders by hour of day", fontsize=15)
plt.show()
In [115]:
plant_products = plant_df['product_name'].value_counts()
plant_products
Out[115]:
Banana                      472565
Bag of Organic Bananas      379450
Organic Strawberries        264683
Organic Baby Spinach        241921
Organic Hass Avocado        213584
                             ...  
Soda                         35791
Bartlett Pears               35413
Organic Romaine Lettuce      34874
Organic Broccoli Florets     32887
Sparkling Lemon Water        32788
Name: product_name, Length: 79, dtype: int64

This word cloud shows the most popular plant-based products:

Banana 472565

Bag of Organic Bananas 379450

Organic Strawberries 264683

Organic Baby Spinach 241921

Organic Hass Avocado 213584

In [117]:
from wordcloud import WordCloud, STOPWORDS

text = plant_products
wordcloud = WordCloud(
    width = 3000,
    height = 2000,
    background_color = 'white',
    stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(
    figsize = (20, 20),
    facecolor = 'k',
    edgecolor = 'k')
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Association Rule Mining

What are the most popular product pairings for plant-based products?

In [65]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
In [66]:
pb_counts = plant_df.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})
pb_counts = plant_df.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)
pb_counts = plant_df.merge(products, on = 'product_id', how = 'left')
pb_counts.head(10)
Out[66]:
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered frequency product_name_x aisle_id_x department_id_x department aisle product_name_y aisle_id_y department_id_y
0 2539329 1 prior 1 2 8 0.0 196 1 0 37298 Soda 77 7 beverages soft drinks Soda 77 7
1 2398795 1 prior 2 3 7 15.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
2 473747 1 prior 3 3 12 21.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
3 2254736 1 prior 4 4 7 29.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
4 431534 1 prior 5 4 15 28.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
5 3367565 1 prior 6 2 7 19.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
6 550135 1 prior 7 1 9 20.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
7 3108588 1 prior 8 1 14 14.0 196 2 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
8 2295261 1 prior 9 1 16 0.0 196 4 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
9 2550362 1 prior 10 4 8 30.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
In [67]:
freq_products = list(product_counts.product_id)
freq_products[1:10]
Out[67]:
[13176, 21137, 21903, 47209, 47766, 47626, 16797, 26209, 27845]
In [68]:
pb_products = plant_df[plant_df.product_id.isin(freq_products)]
pb_products.shape
Out[68]:
(6413574, 16)
In [69]:
pb_products = pb_products.merge(products, on = 'product_id', how='left')
pb_products.head()
Out[69]:
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered frequency product_name_x aisle_id_x department_id_x department aisle product_name_y aisle_id_y department_id_y
0 2539329 1 prior 1 2 8 0.0 196 1 0 37298 Soda 77 7 beverages soft drinks Soda 77 7
1 2398795 1 prior 2 3 7 15.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
2 473747 1 prior 3 3 12 21.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
3 2254736 1 prior 4 4 7 29.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
4 431534 1 prior 5 4 15 28.0 196 1 1 37298 Soda 77 7 beverages soft drinks Soda 77 7
In [70]:
basket_analysis = plant_df.groupby(['order_id', 'product_name'])['reordered'].count().unstack().reset_index().fillna(0).set_index('order_id')
basket_analysis.head()
Out[70]:
product_name 100% Raw Coconut Water 100% Whole Wheat Bread Apple Honeycrisp Organic Asparagus Bag of Organic Bananas Banana Bartlett Pears Blueberries Broccoli Crown Bunched Cilantro ... Roma Tomato Seedless Red Grapes Small Hass Avocado Soda Sparkling Lemon Water Sparkling Natural Mineral Water Sparkling Water Grapefruit Spring Water Strawberries Yellow Onions
order_id
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

5 rows × 79 columns

In [71]:
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1 
    
basket_analysis = basket_analysis.applymap(encode_units)
basket_analysis.head()
Out[71]:
product_name 100% Raw Coconut Water 100% Whole Wheat Bread Apple Honeycrisp Organic Asparagus Bag of Organic Bananas Banana Bartlett Pears Blueberries Broccoli Crown Bunched Cilantro ... Roma Tomato Seedless Red Grapes Small Hass Avocado Soda Sparkling Lemon Water Sparkling Natural Mineral Water Sparkling Water Grapefruit Spring Water Strawberries Yellow Onions
order_id
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 79 columns

In [72]:
basket_analysis.size
Out[72]:
174016539
In [73]:
basket_analysis.shape
Out[73]:
(2202741, 79)
In [74]:
frequent_items = apriori(basket_analysis, min_support=0.01, use_colnames=True)
frequent_items.head()
Out[74]:
support itemsets
0 0.017239 (100% Raw Coconut Water)
1 0.027609 (100% Whole Wheat Bread)
2 0.038597 (Apple Honeycrisp Organic)
3 0.030545 (Asparagus)
4 0.172263 (Bag of Organic Bananas)
In [75]:
frequent_items.tail()
Out[75]:
support itemsets
108 0.010785 (Organic Strawberries, Organic Blueberries)
109 0.010021 (Organic Yellow Onion, Organic Garlic)
110 0.011710 (Organic Hass Avocado, Organic Raspberries)
111 0.018520 (Organic Hass Avocado, Organic Strawberries)
112 0.015373 (Organic Raspberries, Organic Strawberries)
In [76]:
frequent_items.shape
Out[76]:
(113, 2)

I used the Mlxtend python library and found out associations between frequent plant-based product pairings:

  • Apriori Algorithm:

  • Apriori algorithm assumes that any subset of a frequent itemset must be frequent.

  • Association Rule Mining:

    • a technique to identify underlying relations between different items

    • used when we want to find an associations between different items or find frequent patterns in a dataset

  • Matrices

  • Support: the number of transactions containing product X divided by the total number of transactions

  • Confidence: Measures how often items in Y appear in transactions that contain X

  • Conviction: A high conviction value means that the consequent is highly depending on the antecedent

In [78]:
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.sort_values('lift', ascending=False)
Out[78]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
50 (Organic Yellow Onion) (Organic Garlic) 0.051493 0.049837 0.010021 0.194603 3.904783 0.007454 1.179744
51 (Organic Garlic) (Organic Yellow Onion) 0.049837 0.051493 0.010021 0.201069 3.904783 0.007454 1.187221
35 (Large Lemon) (Limes) 0.069303 0.063842 0.012440 0.179507 2.811746 0.008016 1.140970
34 (Limes) (Large Lemon) 0.063842 0.069303 0.012440 0.194863 2.811746 0.008016 1.155948
56 (Organic Raspberries) (Organic Strawberries) 0.062221 0.120161 0.015373 0.247072 2.056182 0.007897 1.168558
57 (Organic Strawberries) (Organic Raspberries) 0.120161 0.062221 0.015373 0.127938 2.056182 0.007897 1.075358
37 (Large Lemon) (Organic Avocado) 0.069303 0.080270 0.011085 0.159947 1.992599 0.005522 1.094847
36 (Organic Avocado) (Large Lemon) 0.080270 0.069303 0.011085 0.138093 1.992599 0.005522 1.079812
49 (Organic Blueberries) (Organic Strawberries) 0.045425 0.120161 0.010785 0.237418 1.975833 0.005326 1.153763
48 (Organic Strawberries) (Organic Blueberries) 0.120161 0.045425 0.010785 0.089753 1.975833 0.005326 1.048698
53 (Organic Raspberries) (Organic Hass Avocado) 0.062221 0.096963 0.011710 0.188192 1.940865 0.005676 1.112377
52 (Organic Hass Avocado) (Organic Raspberries) 0.096963 0.062221 0.011710 0.120763 1.940865 0.005676 1.066582
6 (Organic Large Extra Fancy Fuji Apple) (Bag of Organic Bananas) 0.034123 0.172263 0.010607 0.310836 1.804432 0.004729 1.201075
7 (Bag of Organic Bananas) (Organic Large Extra Fancy Fuji Apple) 0.172263 0.034123 0.010607 0.061573 1.804432 0.004729 1.029251
28 (Banana) (Organic Fuji Apple) 0.214535 0.040691 0.015409 0.071827 1.765180 0.006680 1.033546
29 (Organic Fuji Apple) (Banana) 0.040691 0.214535 0.015409 0.378693 1.765180 0.006680 1.264214
11 (Organic Raspberries) (Bag of Organic Bananas) 0.062221 0.172263 0.018388 0.295519 1.715516 0.007669 1.174961
10 (Bag of Organic Bananas) (Organic Raspberries) 0.172263 0.062221 0.018388 0.106741 1.715516 0.007669 1.049840
4 (Bag of Organic Bananas) (Organic Hass Avocado) 0.172263 0.096963 0.028302 0.164293 1.694392 0.011598 1.080567
5 (Organic Hass Avocado) (Bag of Organic Bananas) 0.096963 0.172263 0.028302 0.291880 1.694392 0.011598 1.168923
20 (Honeycrisp Apple) (Banana) 0.036214 0.214535 0.012897 0.356128 1.660001 0.005128 1.219909
21 (Banana) (Honeycrisp Apple) 0.214535 0.036214 0.012897 0.060114 1.660001 0.005128 1.025430
1 (Apple Honeycrisp Organic) (Bag of Organic Bananas) 0.038597 0.172263 0.010758 0.278711 1.617942 0.004109 1.147581
0 (Bag of Organic Bananas) (Apple Honeycrisp Organic) 0.172263 0.038597 0.010758 0.062448 1.617942 0.004109 1.025440
41 (Organic Baby Spinach) (Organic Avocado) 0.109827 0.080270 0.014023 0.127682 1.590650 0.005207 1.054351
40 (Organic Avocado) (Organic Baby Spinach) 0.080270 0.109827 0.014023 0.174697 1.590650 0.005207 1.078601
54 (Organic Hass Avocado) (Organic Strawberries) 0.096963 0.120161 0.018520 0.190997 1.589516 0.006869 1.087561
55 (Organic Strawberries) (Organic Hass Avocado) 0.120161 0.096963 0.018520 0.154124 1.589516 0.006869 1.067576
19 (Cucumber Kirby) (Banana) 0.044179 0.214535 0.014571 0.329826 1.537399 0.005093 1.172031
18 (Banana) (Cucumber Kirby) 0.214535 0.044179 0.014571 0.067921 1.537399 0.005093 1.025472
45 (Organic Hass Avocado) (Organic Baby Spinach) 0.096963 0.109827 0.015844 0.163406 1.487850 0.005195 1.064044
44 (Organic Baby Spinach) (Organic Hass Avocado) 0.109827 0.096963 0.015844 0.144266 1.487850 0.005195 1.055278
8 (Organic Lemon) (Bag of Organic Bananas) 0.039835 0.172263 0.010161 0.255089 1.480812 0.003299 1.111189
9 (Bag of Organic Bananas) (Organic Lemon) 0.172263 0.039835 0.010161 0.058988 1.480812 0.003299 1.020354
26 (Organic Avocado) (Banana) 0.080270 0.214535 0.024240 0.301982 1.407613 0.007019 1.125280
27 (Banana) (Organic Avocado) 0.214535 0.080270 0.024240 0.112990 1.407613 0.007019 1.036887
30 (Seedless Red Grapes) (Banana) 0.037539 0.214535 0.011165 0.297428 1.386383 0.003112 1.117985
31 (Banana) (Seedless Red Grapes) 0.214535 0.037539 0.011165 0.052044 1.386383 0.003112 1.015301
38 (Organic Baby Spinach) (Large Lemon) 0.109827 0.069303 0.010354 0.094279 1.360380 0.002743 1.027575
39 (Large Lemon) (Organic Baby Spinach) 0.069303 0.109827 0.010354 0.149407 1.360380 0.002743 1.046532
12 (Bag of Organic Bananas) (Organic Strawberries) 0.172263 0.120161 0.027978 0.162414 1.351640 0.007279 1.050446
13 (Organic Strawberries) (Bag of Organic Bananas) 0.120161 0.172263 0.027978 0.232837 1.351640 0.007279 1.078959
33 (Strawberries) (Banana) 0.064897 0.214535 0.018718 0.288434 1.344464 0.004796 1.103855
32 (Banana) (Strawberries) 0.214535 0.064897 0.018718 0.087251 1.344464 0.004796 1.024492
46 (Organic Baby Spinach) (Organic Strawberries) 0.109827 0.120161 0.017312 0.157630 1.311826 0.004115 1.044481
47 (Organic Strawberries) (Organic Baby Spinach) 0.120161 0.109827 0.017312 0.144074 1.311826 0.004115 1.040012
15 (Organic Yellow Onion) (Bag of Organic Bananas) 0.051493 0.172263 0.011156 0.216652 1.257686 0.002286 1.056667
14 (Bag of Organic Bananas) (Organic Yellow Onion) 0.172263 0.051493 0.011156 0.064762 1.257686 0.002286 1.014188
16 (Bag of Organic Bananas) (Organic Zucchini) 0.172263 0.047588 0.010274 0.059644 1.253358 0.002077 1.012821
17 (Organic Zucchini) (Bag of Organic Bananas) 0.047588 0.172263 0.010274 0.215907 1.253358 0.002077 1.055662
23 (Large Lemon) (Banana) 0.069303 0.214535 0.018559 0.267790 1.248234 0.003691 1.072732
22 (Banana) (Large Lemon) 0.214535 0.069303 0.018559 0.086507 1.248234 0.003691 1.018833
3 (Bag of Organic Bananas) (Organic Baby Spinach) 0.172263 0.109827 0.022868 0.132750 1.208717 0.003949 1.026432
2 (Organic Baby Spinach) (Bag of Organic Bananas) 0.109827 0.172263 0.022868 0.208217 1.208717 0.003949 1.045409
43 (Organic Strawberries) (Organic Avocado) 0.120161 0.080270 0.010888 0.090614 1.128860 0.001243 1.011374
42 (Organic Avocado) (Organic Strawberries) 0.080270 0.120161 0.010888 0.135645 1.128860 0.001243 1.017914
25 (Limes) (Banana) 0.063842 0.214535 0.014424 0.225931 1.053120 0.000728 1.014722
24 (Banana) (Limes) 0.214535 0.063842 0.014424 0.067233 1.053120 0.000728 1.003636

The plot below shows the support & confidence for the plant-based product associations.

In [118]:
import random
 
support = rules['support']

confidence = rules['confidence']

for i in range (len(support)):
   support[i] = support[i] + 0.0025 * (random.randint(1,10) - 5) 
   confidence[i] = confidence[i] + 0.0025 * (random.randint(1,10) - 5)
 
plt.scatter(support, confidence,   alpha=0.5, marker="*")
plt.title("Association of plant-based products")
plt.xlabel('support')
plt.ylabel('confidence')
plt.show()
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':

Conclusion

Plant-based Diet Type Pattern

Most common products purchased:

Banana 472565

Bag of Organic Bananas 379450

Organic Strawberries 264683

Organic Baby Spinach 241921

Organic Hass Avocado 213584

Most common product pairings:

(Organic Strawberries, Organic Blueberries)

(Organic Garlic, Organic Yellow Onion)

(Organic Raspberries, Organic Hass Avocado)

(Organic Strawberries, Organic Hass Avocado)

(Organic Strawberries, Organic Raspberries)

The time of day and day of week that customers with a plant-based/vegetarian diet typically order is about the same as the general population: 10am-4pm/Saturdays & Sundays.

The most popular department ordered from was Produce.

In [ ]:
%%shell
jupyter nbconvert --to html /content/Fuentes_Final_Project.ipynb