Instacart Market Basket Analysis¶

Catie Fuentes

Introduction¶

The goal is to analyze the Instacart Online Grocery Shopping Dataset 2017. (https://www.instacart.com/datasets/grocery-shopping-2017) using the market basket analysis technique in order to find common patterns in the the customer's behaviors.¶

Market basket analysis consists of analyzing large datasets that include purchase history and products that are likely to be purchased together. Retailers use this to understand their customer's needs & behaviors in order to increase sales.¶

The key question in the market basket analysis is what products are most frequently purchased together.¶

https://runestone.academy/ns/books/published/DSA501_2022_HTTLADS/Instacart/market_basket.html ¶

Objective¶

Find patterns based on the products ordered to understand the customer's needs and behaviors.¶

Determine if there are patterns in the kinds of products customers order and if they are telling of the lifestlye/diet-type (plant-based/vegetarian) of the customer- in order to market similar products to them.¶

Overview of the Datasets¶

The Instacart data consists of five datasets including: aisles, departments, order_products_prior, order_products_train, orders, and products.¶

There are 134 aisles, 21 departments, 3421083 orders, 206209 users, and 49688 products.¶

Features of Aisles Dataframe:

aisle_id: the aisle identification number

aisle: the corresponding name of the aisle

Features of Departments Dataframe:

department_id: the department identification number

department: the corresponding name of the department

Features of Products Dataframe:

product_id: product identification number

product_name: the corresponding product name

aisle_id: aisle identification number where the product is located

department_id: department identification number that the product belongs to

Features of Orders Dataframe:

order_id: order identification number

user_id: user identification number

eval_set: whether the order was a prior order or a train order

order_number: order number

order_dow: day of the week the order was placed

order_hour_of_day: hour of day the order was placed

days_since_prior_order: days since the users last order

Features of Prior Order Dataframe:

order_id: order identification number

product_id: product identification number

add_to_cart_order: the order in which each product was added to the cart

reordered: whether or not the product was reordered

Features of Train Order Dataframe:

order_id: order identification number

product_id: product identification number

add_to_cart_order: the order in which each product was added to the cart

reordered: whether or not the product was reordered

Importing Libraries & Data¶

%matplotlib inline

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import altair as alt
import requests
matplotlib.style.use('ggplot')
sns.set_style("whitegrid")
import json
import pickle
import scipy

alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

#!pip install kaggle # <- This is only needed the first time, I believe

# Install needed libraries
import json

# API credentials

token={"username":"catiefuentes","key":"4583b186134810a0769888ac7c442d00"}

# Create a folder in your Google account using the unix command "mkdir"
!mkdir ~/.kaggle

# Place the API credentials in the right place
with open('/root/.kaggle/kaggle.json','w') as file:
  json.dump(token, file)

# Change the permissions of the file kaggle.json
!chmod 600 /root/.kaggle/kaggle.json 

# Get a list of the datasets in kaggle with "instacart" in its name
!kaggle datasets list -s instacart

ref                                                       title                                              size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
psparks/instacart-market-basket-analysis                  Instacart Market Basket Analysis                  197MB  2017-11-20 03:08:26           9400        132  0.4117647        
viswajithkn/instacart-predict-shopping-time               Instacart Predict Shopping Time                    12MB  2018-10-29 17:24:48            517         11  0.64705884       
yasserh/instacart-online-grocery-basket-analysis-dataset  InstaCart Online Grocery Basket Analysis Dataset  197MB  2022-01-25 16:43:37           1085         36  0.9411765        
samwhitehill/instacart-training-sample                    InstaCart training sample                           1MB  2017-07-01 20:24:11            182          2  0.5882353        
brendanartley/simplifiedinstacartdata                     Simplified Instacart Market Basket Dataset        184MB  2022-04-03 14:53:59             86          1  0.8235294        
samwhitehill/instacart-sample-labels                      Instacart sample labels                             9KB  2017-07-01 20:26:16             32          1  0.4117647        
streptkinase/instacart-data-cache                         Instacart_data_cache                              409MB  2020-07-31 05:35:52              9          1  0.1764706        
rajashrideka/instacart-analysis-dataset                   Instacart_Analysis_Dataset                        238MB  2022-07-22 19:28:26              1          1  0.1764706        
emmettdoolan/instacart-mba                                Instacart_MBA                                     198MB  2020-04-08 13:46:01             69          1  0.11764706       
tungbinhthuong/instacart-train-test                       instacart_train_test                              303MB  2022-07-16 06:18:45              2          1  0.11764706       
despoinapatsavoudi/instacat2assignment1                   instacat-2-assignment.1                             4KB  2021-11-21 03:26:28              2          2  0.0              
streptkinase/user-features-instacart                      user_features_instacart                             4MB  2020-08-06 05:56:24             10          1  0.11764706

# Download the appropriate dataset and put it in the content folder
!kaggle datasets download -d psparks/instacart-market-basket-analysis -p /content

Downloading instacart-market-basket-analysis.zip to /content
 98% 193M/197M [00:06<00:00, 36.2MB/s]
100% 197M/197M [00:06<00:00, 32.1MB/s]

#Decript (unzip) all the encrypted files (with the ending ".zip") 
!unzip \*.zip

Archive:  instacart-market-basket-analysis.zip
  inflating: aisles.csv              
  inflating: departments.csv         
  inflating: order_products__prior.csv  
  inflating: order_products__train.csv  
  inflating: orders.csv              
  inflating: products.csv

#### COMMENTED OUT FOR USE ONLY IF KAGGLE API IS NOT WORKING ###

#url='http://136.183.139.117/dsa501data/instacart/'

#aisles=pd.read_csv(url+'aisles.csv')
#departments=pd.read_csv(url+'departments.csv')
#order_products_prior=pd.read_csv(url+'order_products__prior.csv')
#order_products_train=pd.read_csv(url+'order_products__train.csv')
#orders=pd.read_csv(url+'orders.csv')
#products=pd.read_csv(url+'products.csv')

#files=['aisles','departments','order_products__prior','order_products__train','orders','products']
#for file in files:
  #globals()[file]=pd.read_csv(url+file+'.csv')

Instacart Datasets Displayed Side-by-side¶

# created dataframes for each dataset
orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')
order_products_prior = pd.read_csv('order_products__prior.csv')
order_products_train = pd.read_csv('order_products__train.csv')
aisles = pd.read_csv('aisles.csv')
departments = pd.read_csv('departments.csv')

files=['aisles','departments','order_products__prior','order_products__train','orders','products']
for file in files:
  globals()[file]=pd.read_csv(file+'.csv')

# Got this code from JVP book. It displays frames side by side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

display(files[0],files[1],files[2])

display(files[3],files[4],files[5])

The Features of the Dataset¶

features_dict={}
for file in files:
  features_dict[file]=[globals()[file].columns.values]
df=pd.DataFrame(features_dict).T 
df.rename(columns = {0:'Features'}, inplace=True)
df.index.name = "File"
df = df.style.set_properties(**{'text-align': 'left'}) 
df

Head() for each dataset¶

aisles.head()

departments.head()

orders.head()

orders.describe()

orders['user_id'].value_counts()

152340    100
185641    100
185524    100
81678     100
70922     100
         ... 
13046       4
70208       4
111440      4
31762       4
47980       4
Name: user_id, Length: 206209, dtype: int64

products.set_index('product_id', inplace=True)
products.head()

products.describe()

order_products__prior.head()

order_products__train.head()

Data Description Visualizations¶

The plot below shows the frequency of orders placed throughout the week, with Saturday labeled as 0 and Friday labeled as 6. Most orders were placed during the weekend on Saturdays & Sundays and the least amount of orders were placed in the middle of the week on Tuesdays & Wednesdays. This is most likely due to customers wanting to prepare for the week ahead & avoid busy grocery stores on the weekend.¶

plt.figure(figsize=(12,8))
sns.countplot(x="order_dow", data=orders)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of orders by day of week", fontsize=15)
plt.show()

The plot below shows the frequency of orders by hour of day. It is labeled based on 24 hours in a day, with 0 being 12am and 23, being 11pm. Most orders were placed from 10am to 4pm. This is likely because customers may be busier during this time, since most people work from 9am-5pm.¶

plt.figure(figsize=(12,8))
sns.countplot(x="order_hour_of_day", data=orders)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of orders by hour of day", fontsize=15)
plt.show()

Exploratory Data Analysis¶

order_products = order_products_prior.append(order_products_train)
order_products.shape

(33819106, 4)

order_products.head()

order_products.product_id.nunique()

49685

product_counts = order_products.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})
product_counts = product_counts.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)
product_counts = product_counts.merge(products, on = 'product_id', how = 'left')
product_counts.head(10)

The top 10 products that were most frequently ordered were: Banana, Bag of Organic Bananas, Organic Strawberries, Organic Baby Spinach, Organic Hass Avocado, Organic Avocado, Large Lemon, Strawberries, Limes, and Organic Whole Milk. 9 out of the 10 products come from the produce section. The high frequency of bananas may be skewed due to customers ordering one individual banana at a time, rather than a bundle.¶

ax = sns.barplot(x="frequency", y="product_name", data=(product_counts).head(10))
plt.title("Frequency of top 10 ordered products", fontsize=15)

Text(0.5, 1.0, 'Frequency of top 10 ordered products')

product_departments = product_counts.merge(departments)
product_departments

The highest frequency of products ordered came from the produce department and the lowest came from the canned goods department. This could be due to produce going bad quickly & needing to be replaced more frequently. And contrastly, canned goods may not be as in-demand because they have a longer shelf-life and do not need to be replaced often.¶

ax = sns.barplot(x="frequency", y="department", data=product_departments)
plt.title("Frequency of products ordered from each department", fontsize=15)

Text(0.5, 1.0, 'Frequency of products ordered from each department')

Exploring Patterns in Plant-based/Vegetarian Options¶

I merged the datasets into one table.¶

To look for plant-based/vegetarian-friendly products:¶

I dropped any departments that contained: dairy eggs/meat seafood/deli & any aisles that contained: meat/dairy/poultry/chicken/beef/fish/jerky.¶

I filtered the dataframe to only show plant-based/vegetarian products that users added to cart 1st & reordered.¶

#merge tables: orders and order_products__prior

df=orders.merge(order_products__prior)

veg_df=df.merge(product_departments)

veg_df

#dropping any departments that would not be vegetarian-friendly

vegetarian = veg_df[veg_df["department"].str.contains("dairy eggs|meat seafood|deli") == False]

vegetarian = vegetarian.merge(aisles)

plant_df = vegetarian[vegetarian["aisle"].str.contains("meat|dairy|poultry|chicken|beef|fish|jerky") == False]

#filter to see only add_to_cart_order 1 & reorders

plant_df[(veg_df.add_to_cart_order==1) & (veg_df.reordered==1)]

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:3: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  This is separate from the ipykernel package so we can avoid doing imports until

plant_df.fillna(0, inplace = True)

plant_df.isnull().sum()

order_id                  0
user_id                   0
eval_set                  0
order_number              0
order_dow                 0
order_hour_of_day         0
days_since_prior_order    0
product_id                0
add_to_cart_order         0
reordered                 0
frequency                 0
product_name              0
aisle_id                  0
department_id             0
department                0
aisle                     0
dtype: int64

len(plant_df['user_id'].unique())

189421

The plot below shows the frequency of the most popular plant-based products. Most products are from the Produce department, with the exception of: soda, coconut water, extra virgin olive oil, and wheat bread.¶

plt.rcParams['figure.figsize'] = (18, 7)
sns.countplot(plant_df['product_name'])

plt.title('Frequency of most popular products',fontsize = 30)
plt.xticks(rotation = 90 )
plt.show()

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

plant_df2 = plant_df[["order_hour_of_day","order_dow"]].mean()
plant_df2

order_hour_of_day    13.416873
order_dow             2.668837
dtype: float64

The plot below shows the frequency of plant-based orders by day of week. It is similar to the frequency of orders for the original dataset (above), with weekends (day 0 & 1) being the most popular days to order. However, there is a more significant decrease in frequency of orders from Monday-Thursday.¶

plt.figure(figsize=(12,8))
sns.countplot(x="order_dow", data=plant_df)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of plant-based orders by day of week", fontsize=15)
plt.show()

The plot below shows the frequency of plant-based orders by hour of day. It looks almost identical to the original plot (above) containing all of the products.¶

plt.figure(figsize=(12,8))
sns.countplot(x="order_hour_of_day", data=plant_df)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of plant-based orders by hour of day", fontsize=15)
plt.show()

plant_products = plant_df['product_name'].value_counts()
plant_products

Banana                      472565
Bag of Organic Bananas      379450
Organic Strawberries        264683
Organic Baby Spinach        241921
Organic Hass Avocado        213584
                             ...  
Soda                         35791
Bartlett Pears               35413
Organic Romaine Lettuce      34874
Organic Broccoli Florets     32887
Sparkling Lemon Water        32788
Name: product_name, Length: 79, dtype: int64

This word cloud shows the most popular plant-based products:¶

Banana 472565

Bag of Organic Bananas 379450

Organic Strawberries 264683

Organic Baby Spinach 241921

Organic Hass Avocado 213584

from wordcloud import WordCloud, STOPWORDS

text = plant_products
wordcloud = WordCloud(
    width = 3000,
    height = 2000,
    background_color = 'white',
    stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(
    figsize = (20, 20),
    facecolor = 'k',
    edgecolor = 'k')
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Association Rule Mining¶

What are the most popular product pairings for plant-based products?¶

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

pb_counts = plant_df.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})
pb_counts = plant_df.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)
pb_counts = plant_df.merge(products, on = 'product_id', how = 'left')
pb_counts.head(10)

freq_products = list(product_counts.product_id)
freq_products[1:10]

[13176, 21137, 21903, 47209, 47766, 47626, 16797, 26209, 27845]

pb_products = plant_df[plant_df.product_id.isin(freq_products)]
pb_products.shape

(6413574, 16)

pb_products = pb_products.merge(products, on = 'product_id', how='left')
pb_products.head()

basket_analysis = plant_df.groupby(['order_id', 'product_name'])['reordered'].count().unstack().reset_index().fillna(0).set_index('order_id')
basket_analysis.head()

def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1 
    
basket_analysis = basket_analysis.applymap(encode_units)
basket_analysis.head()

basket_analysis.size

174016539

basket_analysis.shape

(2202741, 79)

frequent_items = apriori(basket_analysis, min_support=0.01, use_colnames=True)
frequent_items.head()

frequent_items.tail()

frequent_items.shape

(113, 2)

I used the Mlxtend python library and found out associations between frequent plant-based product pairings:¶

Apriori Algorithm:
Apriori algorithm assumes that any subset of a frequent itemset must be frequent.
Association Rule Mining:
- a technique to identify underlying relations between different items
- used when we want to find an associations between different items or find frequent patterns in a dataset
Matrices
Support: the number of transactions containing product X divided by the total number of transactions
Confidence: Measures how often items in Y appear in transactions that contain X
Conviction: A high conviction value means that the consequent is highly depending on the antecedent

rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.sort_values('lift', ascending=False)

The plot below shows the support & confidence for the plant-based product associations.¶

import random
 
support = rules['support']

confidence = rules['confidence']

for i in range (len(support)):
   support[i] = support[i] + 0.0025 * (random.randint(1,10) - 5) 
   confidence[i] = confidence[i] + 0.0025 * (random.randint(1,10) - 5)
 
plt.scatter(support, confidence,   alpha=0.5, marker="*")
plt.title("Association of plant-based products")
plt.xlabel('support')
plt.ylabel('confidence')
plt.show()

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':

Conclusion¶

Plant-based Diet Type Pattern¶

Most common products purchased:¶

Banana 472565

Bag of Organic Bananas 379450

Organic Strawberries 264683

Organic Baby Spinach 241921

Organic Hass Avocado 213584

Most common product pairings:¶

(Organic Strawberries, Organic Blueberries)

(Organic Garlic, Organic Yellow Onion)

(Organic Raspberries, Organic Hass Avocado)

(Organic Strawberries, Organic Hass Avocado)

(Organic Strawberries, Organic Raspberries)

The time of day and day of week that customers with a plant-based/vegetarian diet typically order is about the same as the general population: 10am-4pm/Saturdays & Sundays.¶

The most popular department ordered from was Produce.¶

%%shell
jupyter nbconvert --to html /content/Fuentes_Final_Project.ipynb

	aisle_id	aisle
0	1	prepared soups salads
1	2	specialty cheeses
2	3	energy granola bars
3	4	instant foods
4	5	marinades meat preparation
...	...	...
129	130	hot cereal pancake mixes
130	131	dry pasta
131	132	beauty
132	133	muscles joints pain relief
133	134	specialty wines champagnes

	order_id	product_id	add_to_cart_order	reordered
0	1	49302	1	1
1	1	11109	2	1
2	1	10246	3	0
3	1	49683	4	0
4	1	43633	5	1
...	...	...	...	...
1384612	3421063	14233	3	1
1384613	3421063	35548	4	1
1384614	3421070	35951	1	1
1384615	3421070	16953	2	1
1384616	3421070	4724	3	1

	aisle_id	aisle
0	1	prepared soups salads
1	2	specialty cheeses
2	3	energy granola bars
3	4	instant foods
4	5	marinades meat preparation

	order_id	user_id	order_number	order_dow	order_hour_of_day	days_since_prior_order
count	3.421083e+06	3.421083e+06	3.421083e+06	3.421083e+06	3.421083e+06	3.214874e+06
mean	1.710542e+06	1.029782e+05	1.715486e+01	2.776219e+00	1.345202e+01	1.111484e+01
std	9.875817e+05	5.953372e+04	1.773316e+01	2.046829e+00	4.226088e+00	9.206737e+00
min	1.000000e+00	1.000000e+00	1.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00
25%	8.552715e+05	5.139400e+04	5.000000e+00	1.000000e+00	1.000000e+01	4.000000e+00
50%	1.710542e+06	1.026890e+05	1.100000e+01	3.000000e+00	1.300000e+01	7.000000e+00
75%	2.565812e+06	1.543850e+05	2.300000e+01	5.000000e+00	1.600000e+01	1.500000e+01
max	3.421083e+06	2.062090e+05	1.000000e+02	6.000000e+00	2.300000e+01	3.000000e+01

	aisle_id	department_id
count	49688.000000	49688.000000
mean	67.769582	11.728687
std	38.316162	5.850410
min	1.000000	1.000000
25%	35.000000	7.000000
50%	69.000000	13.000000
75%	100.000000	17.000000
max	134.000000	21.000000

	department_id	department
0	1	frozen
1	2	other
2	3	bakery
3	4	produce
4	5	alcohol
5	6	international
6	7	beverages
7	8	pets
8	9	dry goods pasta
9	10	bulk
10	11	personal care
11	12	meat seafood
12	13	pantry
13	14	breakfast
14	15	canned goods
15	16	dairy eggs
16	17	household
17	18	babies
18	19	snacks
19	20	deli
20	21	missing

	order_id	product_id	add_to_cart_order	reordered
0	2	33120	1	1
1	2	28985	2	1
2	2	9327	3	0
3	2	45918	4	1
4	2	30035	5	0
...	...	...	...	...
32434484	3421083	39678	6	1
32434485	3421083	11352	7	0
32434486	3421083	4600	8	0
32434487	3421083	24852	9	1
32434488	3421083	5020	10	1

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
0	2539329	1	prior	1	2	8	NaN
1	2398795	1	prior	2	3	7	15.0
2	473747	1	prior	3	3	12	21.0
3	2254736	1	prior	4	4	7	29.0
4	431534	1	prior	5	4	15	28.0
...	...	...	...	...	...	...	...
3421078	2266710	206209	prior	10	5	18	29.0
3421079	1854736	206209	prior	11	4	10	30.0
3421080	626363	206209	prior	12	1	12	18.0
3421081	2977660	206209	prior	13	1	12	7.0
3421082	272231	206209	train	14	6	14	30.0

	product_id	product_name	aisle_id	department_id
0	1	Chocolate Sandwich Cookies	61	19
1	2	All-Seasons Salt	104	13
2	3	Robust Golden Unsweetened Oolong Tea	94	7
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	38	1
4	5	Green Chile Anytime Sauce	5	13
...	...	...	...	...
49683	49684	Vodka, Triple Distilled, Twist of Vanilla	124	5
49684	49685	En Croute Roast Hazelnut Cranberry	42	1
49685	49686	Artisan Baguette	112	3
49686	49687	Smartblend Healthy Metabolism Dry Cat Food	41	8
49687	49688	Fresh Foaming Cleanser	73	11

	Features
File
aisles	['aisle_id' 'aisle']
departments	['department_id' 'department']
order_products__prior	['order_id' 'product_id' 'add_to_cart_order' 'reordered']
order_products__train	['order_id' 'product_id' 'add_to_cart_order' 'reordered']
orders	['order_id' 'user_id' 'eval_set' 'order_number' 'order_dow' 'order_hour_of_day' 'days_since_prior_order']
products	['product_id' 'product_name' 'aisle_id' 'department_id']

	product_id	frequency	product_name	aisle_id	department_id
0	24852	491291	Banana	24	4
1	13176	394930	Bag of Organic Bananas	24	4
2	21137	275577	Organic Strawberries	24	4
3	21903	251705	Organic Baby Spinach	123	4
4	47209	220877	Organic Hass Avocado	24	4
5	47766	184224	Organic Avocado	24	4
6	47626	160792	Large Lemon	24	4
7	16797	149445	Strawberries	24	4
8	26209	146660	Limes	24	4
9	27845	142813	Organic Whole Milk	84	16

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order	product_id	add_to_cart_order	reordered	frequency	product_name	aisle_id	department_id	department
0	2539329	1	prior	1	2	8	NaN	196	1	0	37298	Soda	77	7	beverages
1	2398795	1	prior	2	3	7	15.0	196	1	1	37298	Soda	77	7	beverages
2	473747	1	prior	3	3	12	21.0	196	1	1	37298	Soda	77	7	beverages
3	2254736	1	prior	4	4	7	29.0	196	1	1	37298	Soda	77	7	beverages
4	431534	1	prior	5	4	15	28.0	196	1	1	37298	Soda	77	7	beverages
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7483876	861106	206183	prior	3	6	15	7.0	43122	6	1	38686	Organic Bartlett Pear	24	4	produce
7483877	3122161	206183	prior	10	6	12	28.0	43122	20	1	38686	Organic Bartlett Pear	24	4	produce
7483878	2233071	206183	prior	14	6	11	13.0	43122	11	1	38686	Organic Bartlett Pear	24	4	produce
7483879	2554068	206202	prior	2	4	19	16.0	43122	5	0	38686	Organic Bartlett Pear	24	4	produce
7483880	3392839	206208	prior	3	2	15	7.0	43122	10	0	38686	Organic Bartlett Pear	24	4	produce

	support	itemsets
0	0.017239	(100% Raw Coconut Water)
1	0.027609	(100% Whole Wheat Bread)
2	0.038597	(Apple Honeycrisp Organic)
3	0.030545	(Asparagus)
4	0.172263	(Bag of Organic Bananas)

	support	itemsets
108	0.010785	(Organic Strawberries, Organic Blueberries)
109	0.010021	(Organic Yellow Onion, Organic Garlic)
110	0.011710	(Organic Hass Avocado, Organic Raspberries)
111	0.018520	(Organic Hass Avocado, Organic Strawberries)
112	0.015373	(Organic Raspberries, Organic Strawberries)

	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
50	(Organic Yellow Onion)	(Organic Garlic)	0.051493	0.049837	0.010021	0.194603	3.904783	0.007454	1.179744
51	(Organic Garlic)	(Organic Yellow Onion)	0.049837	0.051493	0.010021	0.201069	3.904783	0.007454	1.187221
35	(Large Lemon)	(Limes)	0.069303	0.063842	0.012440	0.179507	2.811746	0.008016	1.140970
34	(Limes)	(Large Lemon)	0.063842	0.069303	0.012440	0.194863	2.811746	0.008016	1.155948
56	(Organic Raspberries)	(Organic Strawberries)	0.062221	0.120161	0.015373	0.247072	2.056182	0.007897	1.168558
57	(Organic Strawberries)	(Organic Raspberries)	0.120161	0.062221	0.015373	0.127938	2.056182	0.007897	1.075358
37	(Large Lemon)	(Organic Avocado)	0.069303	0.080270	0.011085	0.159947	1.992599	0.005522	1.094847
36	(Organic Avocado)	(Large Lemon)	0.080270	0.069303	0.011085	0.138093	1.992599	0.005522	1.079812
49	(Organic Blueberries)	(Organic Strawberries)	0.045425	0.120161	0.010785	0.237418	1.975833	0.005326	1.153763
48	(Organic Strawberries)	(Organic Blueberries)	0.120161	0.045425	0.010785	0.089753	1.975833	0.005326	1.048698
53	(Organic Raspberries)	(Organic Hass Avocado)	0.062221	0.096963	0.011710	0.188192	1.940865	0.005676	1.112377
52	(Organic Hass Avocado)	(Organic Raspberries)	0.096963	0.062221	0.011710	0.120763	1.940865	0.005676	1.066582
6	(Organic Large Extra Fancy Fuji Apple)	(Bag of Organic Bananas)	0.034123	0.172263	0.010607	0.310836	1.804432	0.004729	1.201075
7	(Bag of Organic Bananas)	(Organic Large Extra Fancy Fuji Apple)	0.172263	0.034123	0.010607	0.061573	1.804432	0.004729	1.029251
28	(Banana)	(Organic Fuji Apple)	0.214535	0.040691	0.015409	0.071827	1.765180	0.006680	1.033546
29	(Organic Fuji Apple)	(Banana)	0.040691	0.214535	0.015409	0.378693	1.765180	0.006680	1.264214
11	(Organic Raspberries)	(Bag of Organic Bananas)	0.062221	0.172263	0.018388	0.295519	1.715516	0.007669	1.174961
10	(Bag of Organic Bananas)	(Organic Raspberries)	0.172263	0.062221	0.018388	0.106741	1.715516	0.007669	1.049840
4	(Bag of Organic Bananas)	(Organic Hass Avocado)	0.172263	0.096963	0.028302	0.164293	1.694392	0.011598	1.080567
5	(Organic Hass Avocado)	(Bag of Organic Bananas)	0.096963	0.172263	0.028302	0.291880	1.694392	0.011598	1.168923
20	(Honeycrisp Apple)	(Banana)	0.036214	0.214535	0.012897	0.356128	1.660001	0.005128	1.219909
21	(Banana)	(Honeycrisp Apple)	0.214535	0.036214	0.012897	0.060114	1.660001	0.005128	1.025430
1	(Apple Honeycrisp Organic)	(Bag of Organic Bananas)	0.038597	0.172263	0.010758	0.278711	1.617942	0.004109	1.147581
0	(Bag of Organic Bananas)	(Apple Honeycrisp Organic)	0.172263	0.038597	0.010758	0.062448	1.617942	0.004109	1.025440
41	(Organic Baby Spinach)	(Organic Avocado)	0.109827	0.080270	0.014023	0.127682	1.590650	0.005207	1.054351
40	(Organic Avocado)	(Organic Baby Spinach)	0.080270	0.109827	0.014023	0.174697	1.590650	0.005207	1.078601
54	(Organic Hass Avocado)	(Organic Strawberries)	0.096963	0.120161	0.018520	0.190997	1.589516	0.006869	1.087561
55	(Organic Strawberries)	(Organic Hass Avocado)	0.120161	0.096963	0.018520	0.154124	1.589516	0.006869	1.067576
19	(Cucumber Kirby)	(Banana)	0.044179	0.214535	0.014571	0.329826	1.537399	0.005093	1.172031
18	(Banana)	(Cucumber Kirby)	0.214535	0.044179	0.014571	0.067921	1.537399	0.005093	1.025472
45	(Organic Hass Avocado)	(Organic Baby Spinach)	0.096963	0.109827	0.015844	0.163406	1.487850	0.005195	1.064044
44	(Organic Baby Spinach)	(Organic Hass Avocado)	0.109827	0.096963	0.015844	0.144266	1.487850	0.005195	1.055278
8	(Organic Lemon)	(Bag of Organic Bananas)	0.039835	0.172263	0.010161	0.255089	1.480812	0.003299	1.111189
9	(Bag of Organic Bananas)	(Organic Lemon)	0.172263	0.039835	0.010161	0.058988	1.480812	0.003299	1.020354
26	(Organic Avocado)	(Banana)	0.080270	0.214535	0.024240	0.301982	1.407613	0.007019	1.125280
27	(Banana)	(Organic Avocado)	0.214535	0.080270	0.024240	0.112990	1.407613	0.007019	1.036887
30	(Seedless Red Grapes)	(Banana)	0.037539	0.214535	0.011165	0.297428	1.386383	0.003112	1.117985
31	(Banana)	(Seedless Red Grapes)	0.214535	0.037539	0.011165	0.052044	1.386383	0.003112	1.015301
38	(Organic Baby Spinach)	(Large Lemon)	0.109827	0.069303	0.010354	0.094279	1.360380	0.002743	1.027575
39	(Large Lemon)	(Organic Baby Spinach)	0.069303	0.109827	0.010354	0.149407	1.360380	0.002743	1.046532
12	(Bag of Organic Bananas)	(Organic Strawberries)	0.172263	0.120161	0.027978	0.162414	1.351640	0.007279	1.050446
13	(Organic Strawberries)	(Bag of Organic Bananas)	0.120161	0.172263	0.027978	0.232837	1.351640	0.007279	1.078959
33	(Strawberries)	(Banana)	0.064897	0.214535	0.018718	0.288434	1.344464	0.004796	1.103855
32	(Banana)	(Strawberries)	0.214535	0.064897	0.018718	0.087251	1.344464	0.004796	1.024492
46	(Organic Baby Spinach)	(Organic Strawberries)	0.109827	0.120161	0.017312	0.157630	1.311826	0.004115	1.044481
47	(Organic Strawberries)	(Organic Baby Spinach)	0.120161	0.109827	0.017312	0.144074	1.311826	0.004115	1.040012
15	(Organic Yellow Onion)	(Bag of Organic Bananas)	0.051493	0.172263	0.011156	0.216652	1.257686	0.002286	1.056667
14	(Bag of Organic Bananas)	(Organic Yellow Onion)	0.172263	0.051493	0.011156	0.064762	1.257686	0.002286	1.014188
16	(Bag of Organic Bananas)	(Organic Zucchini)	0.172263	0.047588	0.010274	0.059644	1.253358	0.002077	1.012821
17	(Organic Zucchini)	(Bag of Organic Bananas)	0.047588	0.172263	0.010274	0.215907	1.253358	0.002077	1.055662
23	(Large Lemon)	(Banana)	0.069303	0.214535	0.018559	0.267790	1.248234	0.003691	1.072732
22	(Banana)	(Large Lemon)	0.214535	0.069303	0.018559	0.086507	1.248234	0.003691	1.018833
3	(Bag of Organic Bananas)	(Organic Baby Spinach)	0.172263	0.109827	0.022868	0.132750	1.208717	0.003949	1.026432
2	(Organic Baby Spinach)	(Bag of Organic Bananas)	0.109827	0.172263	0.022868	0.208217	1.208717	0.003949	1.045409
43	(Organic Strawberries)	(Organic Avocado)	0.120161	0.080270	0.010888	0.090614	1.128860	0.001243	1.011374
42	(Organic Avocado)	(Organic Strawberries)	0.080270	0.120161	0.010888	0.135645	1.128860	0.001243	1.017914
25	(Limes)	(Banana)	0.063842	0.214535	0.014424	0.225931	1.053120	0.000728	1.014722
24	(Banana)	(Limes)	0.214535	0.063842	0.014424	0.067233	1.053120	0.000728	1.003636

Instacart Market Basket Analysis¶

Introduction¶

The goal is to analyze the Instacart Online Grocery Shopping Dataset 2017. (https://www.instacart.com/datasets/grocery-shopping-2017) using the market basket analysis technique in order to find common patterns in the the customer's behaviors.¶

Market basket analysis consists of analyzing large datasets that include purchase history and products that are likely to be purchased together. Retailers use this to understand their customer's needs & behaviors in order to increase sales.¶

The key question in the market basket analysis is what products are most frequently purchased together.¶

https://runestone.academy/ns/books/published/DSA501_2022_HTTLADS/Instacart/market_basket.html¶

Objective¶

Find patterns based on the products ordered to understand the customer's needs and behaviors.¶

Determine if there are patterns in the kinds of products customers order and if they are telling of the lifestlye/diet-type (plant-based/vegetarian) of the customer- in order to market similar products to them.¶

Overview of the Datasets¶

The Instacart data consists of five datasets including: aisles, departments, order_products_prior, order_products_train, orders, and products.¶

There are 134 aisles, 21 departments, 3421083 orders, 206209 users, and 49688 products.¶

Importing Libraries & Data¶

Instacart Datasets Displayed Side-by-side¶

The Features of the Dataset¶

Head() for each dataset¶

Data Description Visualizations¶

The plot below shows the frequency of orders by hour of day. It is labeled based on 24 hours in a day, with 0 being 12am and 23, being 11pm. Most orders were placed from 10am to 4pm. This is likely because customers may be busier during this time, since most people work from 9am-5pm.¶

Exploratory Data Analysis¶

Exploring Patterns in Plant-based/Vegetarian Options¶

I merged the datasets into one table.¶

To look for plant-based/vegetarian-friendly products:¶

I dropped any departments that contained: dairy eggs/meat seafood/deli & any aisles that contained: meat/dairy/poultry/chicken/beef/fish/jerky.¶

I filtered the dataframe to only show plant-based/vegetarian products that users added to cart 1st & reordered.¶

The plot below shows the frequency of the most popular plant-based products. Most products are from the Produce department, with the exception of: soda, coconut water, extra virgin olive oil, and wheat bread.¶

The plot below shows the frequency of plant-based orders by hour of day. It looks almost identical to the original plot (above) containing all of the products.¶

This word cloud shows the most popular plant-based products:¶

Association Rule Mining¶

What are the most popular product pairings for plant-based products?¶

I used the Mlxtend python library and found out associations between frequent plant-based product pairings:¶

The plot below shows the support & confidence for the plant-based product associations.¶

Conclusion¶

Plant-based Diet Type Pattern¶

Most common products purchased:¶

Most common product pairings:¶

The time of day and day of week that customers with a plant-based/vegetarian diet typically order is about the same as the general population: 10am-4pm/Saturdays & Sundays.¶

The most popular department ordered from was Produce.¶

https://runestone.academy/ns/books/published/DSA501_2022_HTTLADS/Instacart/market_basket.html ¶