Analyzing over 3 million Instacart grocery orders to uncover reorder patterns and forecast cart sizes using exploratory data analysis, feature engineering, and hierarchical Bayesian modeling.
Author

Sailaxman Kotha

Published

October 15, 2025

Overview

Note👨‍💻 Source Code & Models

The complete Jupyter notebooks, Stan scripts, and raw data are available in the public repository

View Project on GitHub

Instacart is an American grocery delivery platform with millions of users. This project analyzes a sample of over 3 million grocery orders from more than 200,000 users to answer two questions: Which products will a customer reorder? and How many items will they put in their next cart?

3.4M+

Orders Analyzed

200K+

Unique Users

50K+

Products

21

Departments

Tech Stack

Python Pandas Seaborn Plotly CmdStanPy Bayesian Statistics Stan

Rather than treating every user and product identically, two Bayesian models tackle these questions from different angles. A Hierarchical Bayesian Logistic Regression predicts whether a specific product will be reordered by capturing per-user and per-product variation through random effects. A Bayesian Poisson Regression forecasts how many total items a customer will add to their cart, producing full predictive distributions with credible intervals. Both models are implemented in Stan and sampled via MCMC, enabling principled uncertainty quantification that flat machine learning models cannot provide.

Dataset & Schema

The dataset originates from Instacart’s public release and spans six relational tables. For each user, between 4 and 100 of their orders are available, along with the exact sequence of products added to each cart.

Instacart database schema showing six interconnected tables

Database schema showing relationships between orders, products, aisles, and departments
NoteKey Tables
  • orders — All orders (prior, train, test) with timestamps and user info
  • order_products_prior — Products in each prior order with reorder flags
  • order_products_train — Products in the most recent order (training target)
  • products — Product names linked to aisles and departments
  • aisles & departments — Categorical hierarchies for products
Show code
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

aisles = pd.read_csv('data/aisles.csv')
departments = pd.read_csv('data/departments.csv')
order_products_prior = pd.read_csv('data/order_products_prior.csv')
order_products_train = pd.read_csv('data/order_products_train.csv')
orders = pd.read_csv('data/orders.csv')
products = pd.read_csv('data/products.csv')

Data Cleaning & Validation

Before any analysis begins, each of the six tables needs validation for missing values, duplicates, and inconsistent entries.

Missing Values

The days_since_prior_order column contains NaN values — these are not errors. They represent each user’s first ever order, which naturally has no prior order to reference.

Show code
orders.isna().sum()

Duplicate Check & Garbage Values

All tables pass the duplicate check cleanly. Product names, however, contain special characters — accents, hyphens, and symbols — that need normalization.

Show code
import re
special_char = set(re.findall("[^a-zA-Z0-9 ]", ''.join(products.loc[:,'product_name'])))
print(f'Found {len(special_char)} distinct special characters in product names')

products['product_name'] = products['product_name'].apply(
    lambda x: re.sub(r'[^a-zA-Z0-9 ]', '', x).strip()
)

Exploratory Data Analysis

Order Distribution Across Evaluation Sets

The dataset splits into three evaluation sets. The prior set (3.2M orders) forms the bulk of historical data, while train (~131K) and test (~75K) each represent one user’s most recent order.

Bar chart showing distribution of orders in train and test evaluation sets

Distribution of orders across train and test sets

Bar chart showing the massive prior set dwarfing train and test sets

Distribution across all evaluation sets including prior orders

Reorder vs. First-Time Purchases

The first critical insight: approximately 59% of all products in orders are reorders. Customers overwhelmingly stick with products they already know.

Pie chart showing the split between reordered and first-time purchased products

59% reorders vs 41% first-time purchases

Nearly 6 out of 10 products in any given cart are items the customer has purchased before. This strong reorder tendency makes predicting reorder behavior both feasible and commercially valuable.

Formally, defining \(y_i \in \{0, 1\}\) as the reorder indicator for item \(i\), the overall reorder rate across the dataset is:

\[ \hat{p}_{\text{reorder}} = \frac{1}{N}\sum_{i=1}^{N} y_i \approx 0.59 \]

Department-Level Analysis

Products span 21 departments. Personal care, snacks, and pantry lead in product count, while categories like bulk and alcohol are smaller and more specialized.

Horizontal bar chart showing product count across departments

Number of products per department

Treemap: Aisles Within Each Department

The hierarchical structure of departments and aisles is best visualized as a treemap. Each rectangle represents an aisle, nested inside its parent department, with area proportional to the number of products. This makes it immediately clear which departments are product-dense and which aisles dominate within them.

Show code
fig = px.treemap(departments_info, path=['department', 'aisle'],
                 title='Aisles in each department')
fig.update_layout(autosize=False, width=1050, height=1000)
fig.show()

Treemap visualization of aisles within each of the 21 departments

Treemap showing aisles nested within each department — larger rectangles indicate more aisles

Top 15 Departments by Reorder Volume

Ranking departments by total reorder volume (not just rate) reveals which categories generate the most repeat business in absolute terms. This interactive chart highlights the top 15 departments driving habitual purchases.

Top Aisles by Product Count

Within departments, the top 10 aisles by product count include candy/chocolate, ice cream, and vitamins/supplements — categories with high product variety.

Bar chart showing the top 10 aisles by product count

Top 10 aisles by number of products

Reorder Rate Distribution

The distribution across orders reveals an interesting shape — most orders cluster around a moderate reorder rate (0.4–0.7), but there is a long tail of orders composed almost entirely of repeat purchases.

Histogram showing the distribution of reorder rates

Distribution of reorder rates across orders

Deep Dive: Reorder Behavior

Understanding reorder patterns requires slicing the data at multiple levels — department, aisle, time gap, and product.

Average Reorders Per Department

Computing the average number of reorders per distinct product in each department highlights which categories drive habitual buying:

\[ \text{avg\_reorders}_d = \frac{\sum_{p \in d} \text{reorders}_p}{|\{p \in d\}|} \]

Bar chart showing average reorder count per product by department

Average reorders per product across departments
Show code
dept_stats = departments_info.groupby('department').agg(
    total_reorders=('reordered', 'sum'),
    distinct_products=('product_id', 'nunique')
).assign(
    avg_reorders_per_product=lambda d: d['total_reorders'] / d['distinct_products']
)

Probability of Reorder by Department

A more informative metric is the conditional probability: “If a product comes from department \(d\), how likely is it to be reordered?”

\[ P(\text{reorder} \mid \text{dept} = d) = \frac{\#\text{reorders from } d}{\#\text{purchases from } d} \]

Horizontal bar chart of reorder probabilities by department

Probability of reorder from each department
Show code
dept_probs = departments_info.groupby('department')['reordered'].mean() \
    .sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(12, 8))
bars = ax.barh(dept_probs.index, dept_probs.values, color='green', edgecolor='black')
ax.invert_yaxis()
ax.set_xlabel('P(reorder | department)')
ax.set_title('Reorder Probability by Department')
plt.tight_layout()
plt.show()
TipHighest Reorder Rates

Dairy & eggs, produce, and beverages show the highest reorder probabilities — staple categories that customers replenish on a regular cycle. The lowest reorder rates appear in personal care and babies, where products tend to be one-time or infrequent purchases.

Probability of Reorder by Aisle

Drilling down to the aisle level (~134 aisles) shows even finer granularity:

\[ P(\text{reorder} \mid \text{aisle} = a) = \frac{\#\text{reorders from } a}{\#\text{purchases from } a} \]

Full horizontal bar chart of reorder probabilities across all aisles

Probability of reorder from each aisle

Impact of Order Gap on Reorder Behavior

The time between consecutive orders (days_since_prior_order) significantly influences reorder behavior. Data shows clear spikes at 7 days (weekly shoppers), with additional modes at 14 and 30 days.

Count plot showing reorder vs first-time purchase by days since prior order

Gap between orders and reorder behavior

Cart Position Analysis

The median add-to-cart position distribution shows most products are added early — customers start with habitual items and then explore.

Histogram showing when products are typically added to cart

Distribution of median cart position across products

Feature Engineering

Several features are engineered at the user, product, and order level to encode the patterns uncovered above.

User-Level Features

Each user’s reorder tendency is quantified as the fraction of their total purchases that are reorders:

\[ P(\text{reorder} \mid \text{user} = u) = \frac{\sum_{i \in \mathcal{O}_u} y_i}{|\mathcal{O}_u|} \]

where \(\mathcal{O}_u\) is the set of all items across all prior orders for user \(u\).

Show code
user_activity = prior_order.groupby('user_id')['reordered'].agg(
    ['sum', 'count']
).rename(columns={'sum': 'user_reorders', 'count': 'user_purchases'}).reset_index()

user_activity['P(reorder | user)'] = (
    user_activity['user_reorders'] / user_activity['user_purchases']
)

Per-Order User Activity

The reorder rate is also computed at the per-order level to capture how behavior evolves across successive orders:

Show code
user_activity_per_order = prior_order.groupby(
    ['user_id', 'order_id']
)['reordered'].agg(
    ['sum', 'count']
).rename(columns={'sum': 'user_reorders', 'count': 'user_purchases'}).reset_index()

user_activity_per_order['P(reorder | user)'] = (
    user_activity_per_order['user_reorders'] / user_activity_per_order['user_purchases']
)

Department-Level & Cart Size Features

Show code
# Department reorder rate
dep_info = departments_info.groupby('department_id', as_index=False).agg(
    {'reordered': 'mean'}
).rename(columns={'reordered': 'reorder_dept_rate'})

# Cart size and user average
cart_size = order_products_train.groupby('order_id').size().reset_index(name='cart_size')

cart_df = order_products_train.merge(
    orders[['order_id', 'order_number', 'order_dow', 'order_hour_of_day',
            'days_since_prior_order', 'user_id']], on='order_id', how='left'
).merge(cart_size, on='order_id', how='left').merge(
    products[['product_id', 'aisle_id', 'department_id']], on='product_id', how='left'
)

user_avg = cart_df.groupby('user_id')['cart_size'].mean().reset_index(name='avg_cart_size')
cart_df = cart_df.merge(user_avg, on='user_id', how='left')
NoteComplete Feature Set
Feature Description
user_reorders Total reorders by the user
user_purchases Total products purchased by the user
P(reorder \| user) User-level reorder probability
reorder_dept_rate Department-level reorder rate
cart_size Number of items in the current order
avg_cart_size User’s average cart size across all orders
order_number Sequential order count for the user
order_dow Day of week of the order
order_hour_of_day Hour of day of the order
days_since_prior_order Gap in days since last order

Feature Signal Strength

Before modeling, a quick signal assessment identifies which predictors carry meaningful information:

Show code
# Pearson correlation for numeric features
num_feats = ['order_number', 'order_dow', 'order_hour_of_day',
             'days_since_prior_order', 'cart_size']

numeric_corr = {
    feat: cart_df[feat].corr(cart_df['reordered'])
    for feat in num_feats
}

# Signal strength for categorical features
cat_feats = ['aisle_id', 'department_id']
cat_signal = []
for feat in cat_feats:
    rates = df.groupby(feat)['reordered'].mean()
    cat_signal.append({
        'feature': feat,
        'num_categories': rates.size,
        'std_of_rates': rates.std(),
        'range_of_rates': rates.max() - rates.min()
    })

Reorder Rate vs Order Number

As users place more orders, their reorder rate climbs initially and then plateaus — confirming that customers develop stable purchasing habits over time.

Line plot showing how reorder probability changes with order count

Reorder rate as a function of order number

The Math Behind the Models

Problem 1: Will This Product Be Reordered?

For each observation \(n\) (a user–product pair in an order), define \(y_n \in \{0,1\}\) as the binary reorder indicator. A standard logistic regression models the probability as:

\[ P(y_n = 1) = \text{logistic}(\alpha + \mathbf{x}_n^T \boldsymbol{\beta}) = \frac{1}{1 + e^{-(\alpha + \mathbf{x}_n^T \boldsymbol{\beta})}} \]

This assumes a single intercept \(\alpha\) and shared slopes \(\boldsymbol{\beta}\) across all users and products. The problem: User A (who reorders 90% of the time) and User B (who reorders 20% of the time) are forced to share the same baseline. Similarly, bananas and birthday candles get no product-specific adjustment.

The Hierarchical Extension

A hierarchical model introduces random effects — per-user and per-product intercept offsets drawn from population-level distributions:

\[ \text{logit}\big(P(y_n = 1)\big) = \underbrace{\alpha}_{\text{global baseline}} + \underbrace{\gamma_{u[n]}}_{\text{user effect}} + \underbrace{\delta_{p[n]}}_{\text{product effect}} + \underbrace{\beta_{\text{num}} x_{\text{ord\_num}} + \beta_{\text{dow}} x_{\text{dow}} + \beta_{\text{hour}} x_{\text{hour}}}_{\text{fixed effects for order context}} \]

The random effects are not free parameters — they are constrained by group-level priors that create partial pooling:

\[ \gamma_u \sim \mathcal{N}(0, \sigma_u^2), \quad \delta_p \sim \mathcal{N}(0, \sigma_p^2) \]

Each user’s and product’s effect is pulled toward the global mean, with the degree of shrinkage controlled by \(\sigma_u^2\) and \(\sigma_p^2\). Users with abundant data get estimates closer to their individual rates; users with sparse data borrow strength from the population.

Problem 2: How Many Items in the Cart?

For each order \(i\), the cart size \(c_i\) (a non-negative integer) is modeled with a Poisson distribution and log-link:

\[ c_i \sim \text{Poisson}(\lambda_i), \quad \log(\lambda_i) = \beta_0 + \beta_1 x_{\text{ord\_num}} + \beta_2 x_{\text{dow}} + \beta_3 x_{\text{hour}} + \beta_4 x_{\text{avg\_cart}} \]

where \(\lambda_i\) is the expected cart size, determined by order-level features. By Bayes’ rule, the joint posterior over all coefficients is:

\[ p(\boldsymbol{\beta} \mid \mathbf{c}, \mathbf{X}) \propto \prod_{i=1}^{N} \text{Poisson}(c_i \mid \lambda_i(\boldsymbol{\beta})) \cdot \prod_{j=0}^{4} \mathcal{N}(\beta_j \mid 0, 1) \]

The posterior mean and expected cart size for a new order are computed from \(S\) posterior samples:

\[ \hat{\lambda}_{\text{new}} = \frac{1}{S}\sum_{s=1}^{S} \exp\!\big(\beta_0^{(s)} + \beta_1^{(s)} x_1 + \cdots + \beta_4^{(s)} x_4\big) \]

And the full predictive distribution is:

\[ p(c_{\text{new}} = k \mid \mathbf{X}_{\text{new}}, \text{data}) = \frac{1}{S}\sum_{s=1}^{S} \frac{e^{-\lambda^{(s)}} (\lambda^{(s)})^k}{k!} \]

This gives not just a point estimate but a complete probability distribution over cart sizes, enabling credible intervals for inventory planning and staffing decisions.

Prior Justification

Given the logistic scale and standardized predictors, weakly informative priors are used for all parameters:

Hierarchical Logistic Regression

\[ \alpha \sim \mathcal{N}(0, 1) \qquad \text{(global intercept)} \] \[ \sigma_u, \sigma_p \sim \text{Half-}\mathcal{N}(0, 1) \qquad \text{(controls how far users/products stray from global intercept)} \] \[ \tilde{u}, \tilde{p} \sim \mathcal{N}(0, 1) \qquad \text{(raw offsets for non-centered form)} \] \[ \beta_{\text{num}}, \beta_{\text{dow}}, \beta_{\text{hour}} \sim \mathcal{N}(0, 1) \qquad \text{(fixed slopes)} \]

On the logit scale, a \(\mathcal{N}(0,1)\) prior places most mass within \(\pm 2\), corresponding to probabilities between roughly 12% and 88%. This is broad enough to accommodate diverse effects without letting the model converge on extreme or implausible values. The half-normal priors on \(\sigma_u\) and \(\sigma_p\) regularize the group-level variation, shrinking sparse groups toward zero.

Poisson Regression

\[ \beta_0, \beta_1, \beta_2, \beta_3, \beta_4 \sim \mathcal{N}(0, 1) \qquad \text{(all coefficients on the log scale)} \]

Non-Centered Parameterization

For computational efficiency, the hierarchical model uses a non-centered parameterization. Instead of sampling \(\gamma_u \sim \mathcal{N}(0, \sigma_u)\) directly, a standard normal auxiliary variable \(\tilde{u}\) is sampled and then scaled:

\[ \tilde{u} \sim \mathcal{N}(0, 1), \quad \gamma_u = \tilde{u} \cdot \sigma_u \]

This is mathematically equivalent but breaks the correlation between \(\sigma_u\) and \(\gamma_u\) in the posterior — dramatically improving sampling efficiency when the number of groups (users, products) is large.

Model 1: Hierarchical Logistic Regression in Stan

Preparing the Data

Show code
df = order_products_train.merge(
    orders[['order_id', 'user_id', 'order_number', 'order_dow', 'order_hour_of_day']],
    on='order_id'
).merge(
    products[['product_id', 'aisle_id', 'department_id']], on='product_id'
)

# Sample for tractable MCMC
df = df.sample(n=5000, random_state=42)

# Integer indices for random effects (Stan 1-based)
model_l = df[['user_id', 'product_id', 'order_number',
              'order_dow', 'order_hour_of_day', 'reordered']].copy()
model_l['user_idx'] = model_l['user_id'].astype('category').cat.codes
model_l['product_idx'] = model_l['product_id'].astype('category').cat.codes

Building the Stan Data

Show code
stan_data = {
    "N": len(model_l),
    "U": int(model_l.user_idx.max()) + 1,
    "P": int(model_l.product_idx.max()) + 1,
    "user_idx": (model_l.user_idx.values + 1).astype(int),
    "prod_idx": (model_l.product_idx.values + 1).astype(int),
    "ord_num": model_l.order_number.values.astype(float),
    "ord_dow": model_l.order_dow.values.astype(float),
    "ord_hour": model_l.order_hour_of_day.values.astype(float),
    "y": model_l.reordered.values.astype(int),
}

# Standardize predictors for better sampling
for name in ['ord_num', 'ord_dow', 'ord_hour']:
    arr = stan_data[name]
    stan_data[name] = (arr - np.mean(arr)) / np.std(arr)

The Stan Model — hier_logit.stan

data {
  int<lower=1> N;                 // # observations
  int<lower=1> U;                 // # users
  int<lower=1> P;                 // # products
  array[N] int<lower=1,upper=U> user_idx;
  array[N] int<lower=1,upper=P> prod_idx;
  vector[N] ord_num;
  vector[N] ord_dow;
  vector[N] ord_hour;
  array[N] int<lower=0,upper=1> y;
}
parameters {
  real alpha;
  real<lower=0> sigma_u;
  real<lower=0> sigma_p;
  vector[U] u_raw;
  vector[P] p_raw;
  real beta_num;
  real beta_dow;
  real beta_hour;
}
transformed parameters {
  vector[U] gamma_u = u_raw * sigma_u;
  vector[P] delta_p = p_raw * sigma_p;
}
model {
  // Priors
  alpha     ~ normal(0, 1);
  sigma_u   ~ normal(0, 1);
  sigma_p   ~ normal(0, 1);
  u_raw     ~ normal(0, 1);
  p_raw     ~ normal(0, 1);
  beta_num  ~ normal(0, 1);
  beta_dow  ~ normal(0, 1);
  beta_hour ~ normal(0, 1);

  // Likelihood
  for (n in 1:N) {
    real eta = alpha
               + gamma_u[user_idx[n]]
               + delta_p[prod_idx[n]]
               + beta_num  * ord_num[n]
               + beta_dow  * ord_dow[n]
               + beta_hour * ord_hour[n];
    y[n] ~ bernoulli_logit(eta);
  }
}

Sampling

Show code
from cmdstanpy import CmdStanModel, install_cmdstan

install_cmdstan()
model = CmdStanModel(stan_file='hier_logit.stan')

fit = model.sample(
    data=stan_data,
    seed=42,
    chains=4,
    iter_warmup=1000,
    iter_sampling=1000,
    adapt_delta=0.99,
    max_treedepth=15
)

print(fit.summary())
print(fit.diagnose())

Posterior Distributions

The posterior draws for the global parameters and fixed slopes reveal how much of the reorder variation is attributable to users, products, and order context:

Six posterior histograms showing the distribution of key model parameters

Posterior distributions of α, σ_u, σ_p, β_num, β_dow, β_hour
Show code
df_draws = fit.draws_pd()

params = ['alpha', 'sigma_u', 'sigma_p', 'beta_num', 'beta_dow', 'beta_hour']
fig, axes = plt.subplots(len(params), 1, figsize=(6, len(params)*2.2))
for ax, name in zip(axes, params):
    ax.hist(df_draws[name], bins=40, density=True, alpha=0.7)
    ax.set_title(f"Posterior of {name}")
    ax.axvline(df_draws[name].mean(), color='k', linestyle='--', label='mean')
    ax.legend()
plt.tight_layout()
plt.show()

Interpreting the Posteriors

Each posterior distribution tells a specific story about how Instacart customers behave. The histograms above show the density of 4,000 MCMC draws (4 chains × 1,000 samples) for each parameter. The dashed vertical line marks the posterior mean.

Global Intercept — \(\alpha \approx 0.44\)

The global intercept represents the baseline log-odds of reorder when all predictors are at their mean and user/product random effects are zero. Converting to probability through the logistic function:

\[ p_0 = \frac{1}{1 + e^{-\alpha}} = \frac{1}{1 + e^{-0.44}} \approx 0.61 \]

This means an average user buying an average product has roughly a 61% chance of reordering it — closely matching the 59% overall reorder rate observed in the raw data. The tight posterior (narrow histogram) indicates high confidence in this estimate.

User-Effect Standard Deviation — \(\sigma_u \approx 0.62\)

This parameter controls how much individual users deviate from the global intercept. A value of 0.62 means users’ log-odds intercepts are drawn from \(\mathcal{N}(0, 0.62^2)\). To understand the practical impact, convert to odds ratios:

\[ \text{95% of users:} \quad \exp(\pm 1.96 \times 0.62) = [0.30, \; 3.39] \]

This means the most habitual users have reorder odds 3.4× the baseline, while the most exploratory users have odds just 0.3× the baseline. The wide spread confirms massive user heterogeneity — exactly the kind of variation that a flat model would miss.

Product-Effect Standard Deviation — \(\sigma_p \approx 0.75\)

Products vary even more than users. The 95% interval of product odds ratios spans:

\[ \text{95% of products:} \quad \exp(\pm 1.96 \times 0.75) = [0.23, \; 4.35] \]

A staple like whole milk has a strongly positive \(\delta_p\), pushing its reorder odds to 4× the baseline. A novelty item like a seasonal candle has a strongly negative offset, dropping to 0.23× the baseline. This asymmetry between products is the main reason hierarchical modeling outperforms flat approaches.

Order Number Slope — \(\beta_{\text{num}} \approx 0.75\)

One standard deviation increase in the user’s order count multiplies the odds of reorder by:

\[ \text{OR} = \exp(\beta_{\text{num}}) = \exp(0.75) \approx 2.12 \]

Experienced users are about twice as likely to reorder. This aligns with the EDA finding that reorder rates climb with order count before plateauing — the model has learned this relationship directly from the data.

Day-of-Week Slope — \(\beta_{\text{dow}} \approx -0.05\)

The posterior is centered just below zero. Converting: \(\exp(-0.05) = 0.951\), indicating a 5% drop in odds per SD increase in day-of-week. This is a negligible effect — whether someone orders on a Monday or Saturday barely changes their reorder tendency.

Hour-of-Day Slope — \(\beta_{\text{hour}} \approx 0\)

The posterior density straddles zero almost symmetrically, confirming that hour of day has no measurable effect on reorder probability. Morning shoppers and late-night browsers reorder at the same rates.

User and Product Random Effects

The distributions of per-user and per-product random intercepts show how the model captures individual-level variation:

Histogram of user random intercepts showing population spread

Posterior distribution of all user random intercepts γᵤ

Histogram of product random intercepts showing population spread

Posterior distribution of all product random intercepts δₚ
Show code
# User intercepts
gamma_cols = [c for c in df_draws.columns if c.startswith('gamma_u[')]
all_gamma = df_draws[gamma_cols].values.flatten()
plt.figure(figsize=(6, 4))
plt.hist(all_gamma, bins=60, density=True, alpha=0.7)
plt.title("Posterior distribution of all γᵤ (user intercepts)")
plt.xlabel("γᵤ value")
plt.show()

# Product intercepts
delta_cols = [c for c in df_draws.columns if c.startswith('delta_p[')]
all_delta = df_draws[delta_cols].values.flatten()
plt.figure(figsize=(6, 4))
plt.hist(all_delta, bins=60, density=True, alpha=0.7)
plt.title("Posterior distribution of all δₚ (product intercepts)")
plt.xlabel("δₚ value")
plt.show()

Prediction: Reorder Probability for a Specific User–Product Pair

To compute the probability that a specific user will reorder a specific product on their next order, the posterior draws are used directly:

\[ \pi^{(s)} = \text{logistic}\!\big(\alpha^{(s)} + \gamma_u^{(s)} + \delta_p^{(s)} + \beta_{\text{num}}^{(s)} x_{\text{num}} + \beta_{\text{dow}}^{(s)} x_{\text{dow}} + \beta_{\text{hour}}^{(s)} x_{\text{hour}}\big) \]

\[ \hat{\pi} = \frac{1}{S}\sum_{s=1}^{S} \pi^{(s)}, \quad \text{95% CI} = \big[\pi_{0.025}, \pi_{0.975}\big] \]

Show code
u, p, num, dow, hour = 1235, 563, 5, 2, 15

alpha_draws  = fit.stan_variable('alpha')
beta_num_d   = fit.stan_variable('beta_num')
beta_dow_d   = fit.stan_variable('beta_dow')
beta_hour_d  = fit.stan_variable('beta_hour')
gamma_u_mat  = fit.stan_variable('gamma_u')   # shape (n_draws, U)
delta_p_mat  = fit.stan_variable('delta_p')   # shape (n_draws, P)

gamma_u_draws = gamma_u_mat[:, u]
delta_p_draws = delta_p_mat[:, p]

eta = (alpha_draws + gamma_u_draws + delta_p_draws
       + beta_num_d * num + beta_dow_d * dow + beta_hour_d * hour)

pi_draws = 1 / (1 + np.exp(-eta))
pi_mean = pi_draws.mean()
pi_ci = np.percentile(pi_draws, [2.5, 97.5])

print(f"Reorder probability for user={u}, product={p}:")
print(f"  Posterior mean = {pi_mean:.3f}")
print(f"  95% CI         = [{pi_ci[0]:.3f}, {pi_ci[1]:.3f}]")

Example result: For user 1235 and product 563, the posterior mean reorder probability is 0.977 with a 95% credible interval of [0.899, 0.999]. The model is highly confident this user will reorder this product.

Model 2: Bayesian Poisson Regression for Cart Size

The second model shifts focus from what goes into the cart to how many items go in. For each order \(i\), the cart size \(c_i\) is modeled as:

\[ c_i \sim \text{Poisson}(\lambda_i), \quad \log(\lambda_i) = \beta_0 + \beta_1 x_{\text{ord\_num}} + \beta_2 x_{\text{dow}} + \beta_3 x_{\text{hour}} + \beta_4 x_{\text{avg\_cart}} \]

The Stan Model — poisson_reg.stan

data {
  int<lower=1> N;            // # of orders
  array[N] int<lower=0> y;   // cart_size (count)
  vector[N] x1;              // order_number
  vector[N] x2;              // order_dow
  vector[N] x3;              // order_hour_of_day
  vector[N] x4;              // avg_cart_size
}
parameters {
  real beta0;
  real beta1;
  real beta2;
  real beta3;
  real beta4;
}
model {
  // Priors (on log-scale)
  beta0 ~ normal(0, 1);
  beta1 ~ normal(0, 1);
  beta2 ~ normal(0, 1);
  beta3 ~ normal(0, 1);
  beta4 ~ normal(0, 1);

  // Likelihood
  for (n in 1:N) {
    real eta = beta0
               + beta1 * x1[n]
               + beta2 * x2[n]
               + beta3 * x3[n]
               + beta4 * x4[n];
    y[n] ~ poisson_log(eta);
  }
}

Preparing Data & Sampling

Show code
stan_d_pois = {
    "N": len(cart_df),
    "y": cart_df["cart_size"].values.astype(int),
    "x1": cart_df["order_number"].values.astype(float),
    "x2": cart_df["order_dow"].values.astype(float),
    "x3": cart_df["order_hour_of_day"].values.astype(float),
    "x4": cart_df["avg_cart_size"].values.astype(float),
}

pois_model = CmdStanModel(stan_file="poisson_reg.stan")
fit_pois = pois_model.sample(data=stan_d_pois, seed=42, chains=4,
                              iter_warmup=1000, iter_sampling=1000)

Posterior Predictive Check

The posterior predictive distribution is compared against observed cart sizes to validate model fit:

Overlay of predicted and observed cart size distributions

Posterior predictive distribution vs observed cart sizes
Show code
coef_names = ['beta0', 'beta1', 'beta2', 'beta3', 'beta4']
beta_means = {name: np.mean(fit_pois.stan_variable(name)) for name in coef_names}

x1 = stan_d_pois['x1']
x2 = stan_d_pois['x2']
x3 = stan_d_pois['x3']
x4 = stan_d_pois['x4']

eta = (beta_means['beta0']
       + beta_means['beta1'] * x1
       + beta_means['beta2'] * x2
       + beta_means['beta3'] * x3
       + beta_means['beta4'] * x4)

lam = np.exp(eta)

Prediction: How Many Items in the Cart?

For a new order with known features, the full predictive distribution over cart sizes is computed:

\[ P(c_{\text{new}} = k) = \frac{1}{S}\sum_{s=1}^{S} \frac{e^{-\lambda^{(s)}} (\lambda^{(s)})^k}{k!} \]

Show code
import math

order_number, order_dow, order_hour, avg_cart_size = 5, 2, 15, 12.3

# Standardize inputs
x1 = (order_number - stan_d_pois['x1'].mean()) / stan_d_pois['x1'].std()
x2 = (order_dow    - stan_d_pois['x2'].mean()) / stan_d_pois['x2'].std()
x3 = (order_hour   - stan_d_pois['x3'].mean()) / stan_d_pois['x3'].std()
x4 = (avg_cart_size - stan_d_pois['x4'].mean()) / stan_d_pois['x4'].std()

post = fit_pois.draws_pd()
b0, b1, b2, b3, b4 = [post[f'beta{i}'].values for i in range(5)]

eta_draws = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
lambda_draws = np.exp(eta_draws)

expected_size = lambda_draws.mean()
y_pred_draws = np.random.poisson(lambda_draws)
ci_lower, ci_upper = np.percentile(y_pred_draws, [2.5, 97.5])

k = 10
pmf_k = np.exp(-lambda_draws) * (lambda_draws**k) / math.factorial(k)
prob_k = pmf_k.mean()

print(f"Expected cart size  = {expected_size:.2f} items")
print(f"95% predictive interval = [{ci_lower:.0f}, {ci_upper:.0f}] items")
print(f"P(cart_size = {k})  = {prob_k:.3f}")

Example result: For a user with average cart size 12.3, placing their 5th order on a Wednesday at 3 PM: expected cart size = 12.01 items, 95% predictive interval = [6, 19] items, and the probability of exactly 10 items is 0.105.

Evaluation Metrics

Model accuracy is measured with RMSE and MAE:

\[ \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(c_i - \hat{\lambda}_i)^2}, \qquad \text{MAE} = \frac{1}{N}\sum_{i=1}^{N}|c_i - \hat{\lambda}_i| \]

Model Comparison: Bayesian vs. Classical

Both Bayesian models are compared against their frequentist counterparts to understand where the Bayesian framework adds value:

Show code
from sklearn.linear_model import LogisticRegression, PoissonRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss
from sklearn.metrics import mean_squared_error, mean_absolute_error

# -- Reorder prediction: Classical Logistic Regression --
X = model_l[['order_number', 'order_dow', 'order_hour_of_day']]
y = model_l['reordered']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model_logistic = LogisticRegression(max_iter=300)
model_logistic.fit(X_train, y_train)
y_prob = model_logistic.predict_proba(X_test)[:, 1]

print('===== Logistic Regression =====')
print(f'Accuracy: {accuracy_score(y_test, y_prob > 0.5):.3f}')
print(f'AUC:      {roc_auc_score(y_test, y_prob):.3f}')
print(f'Log-loss: {log_loss(y_test, y_prob):.3f}')

# -- Cart size: Classical Poisson Regression --
X_reg = cart_df[['order_number', 'order_dow', 'order_hour_of_day', 'avg_cart_size']]
y_reg = cart_df['cart_size']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg,
                                                          test_size=0.2, random_state=42)

model_poisson = PoissonRegressor(alpha=1e-6, max_iter=300)
model_poisson.fit(Xr_train, yr_train)
y_pred_poisson = model_poisson.predict(Xr_test)

print("===== Poisson Regression =====")
print(f"RMSE: {mean_squared_error(yr_test, y_pred_poisson, squared=False):.2f}")
print(f"MAE:  {mean_absolute_error(yr_test, y_pred_poisson):.2f}")
NoteKey Comparison Insight

While classical models achieve competitive point metrics, only the Bayesian models deliver full posterior distributions over predictions. This means calibrated uncertainty estimates — credible intervals for reorder probability and cart size — that are essential for downstream decisions like inventory stocking, staffing, and personalized promotions.

Key Takeaways

1. Reorder behavior dominates the Instacart ecosystem. With 59% of all cart items being reorders, customer behavior is deeply habitual. The hierarchical model’s global intercept of \(\alpha \approx 0.44\) translates to a baseline reorder probability of 61%, confirming that the average user–product interaction leans toward repurchase. This makes reorder prediction a high-signal problem with direct commercial value for recommendation engines and inventory systems.

2. User and product heterogeneity demand a hierarchical approach. The posterior estimates of \(\sigma_u \approx 0.62\) and \(\sigma_p \approx 0.75\) reveal that both users and products contribute substantial variation beyond what fixed effects can capture. 95% of users fall within an odds-ratio range of \([0.30, 3.39]\), while products span \([0.23, 4.35]\). A flat logistic regression ignores this structure entirely, producing miscalibrated predictions for anyone who deviates from the population average.

3. Experience drives loyalty, not timing. The order number slope \(\beta_{\text{num}} \approx 0.75\) shows that experienced users are about twice as likely to reorder (\(\text{OR} = 2.12\)), while day-of-week (\(\beta_{\text{dow}} \approx -0.05\)) and hour-of-day (\(\beta_{\text{hour}} \approx 0\)) have negligible effects. This suggests that marketing efforts should focus on retention and building purchase history rather than optimizing delivery time slots.

4. Full predictive distributions enable risk-aware operations. The Bayesian Poisson model produces not just a point estimate for cart size but a complete probability distribution — enabling credible intervals like \([6, 19]\) items for a typical order. This tail-risk information is critical for warehouse staffing, delivery capacity planning, and inventory allocation, where underestimating demand is costly.

5. Partial pooling rescues sparse data. Many users and products have limited interaction histories. The hierarchical structure pools information across similar groups, producing stable estimates even for rare items and cold-start users. This is the practical advantage of the Bayesian framework over tree-based or flat classifiers — more reliable predictions exactly where they matter most.

6. Non-centered parameterization is essential at scale. With thousands of users and products, the naive centered parameterization creates a funnel geometry in the posterior that Stan’s NUTS sampler struggles with. The non-centered reparameterization — sampling \(\tilde{u} \sim \mathcal{N}(0,1)\) and computing \(\gamma_u = \tilde{u} \cdot \sigma_u\) — eliminates divergent transitions and yields clean, well-mixed chains.


NoteFull Project Report

For a complete writeup covering methodology, convergence diagnostics, model comparison tables, and future directions, download the full report:

📄 Download Project Report (PDF)

Built with Python, Stan, and curiosity.

Back to top