Hierarchical Bayesian Modeling for Customer Purchase Behavior

Overview

👨‍💻 Source Code & Models

The complete Jupyter notebooks, Stan scripts, and raw data are available in the public repository

Instacart is an American grocery delivery platform with millions of users. This project analyzes a sample of over 3 million grocery orders from more than 200,000 users to answer two questions: Which products will a customer reorder? and How many items will they put in their next cart?

3.4M+

Orders Analyzed

200K+

Unique Users

50K+

Products

Departments

Tech Stack

Python Pandas Seaborn Plotly CmdStanPy Bayesian Statistics Stan

Rather than treating every user and product identically, two Bayesian models tackle these questions from different angles. A Hierarchical Bayesian Logistic Regression predicts whether a specific product will be reordered by capturing per-user and per-product variation through random effects. A Bayesian Poisson Regression forecasts how many total items a customer will add to their cart, producing full predictive distributions with credible intervals. Both models are implemented in Stan and sampled via MCMC, enabling principled uncertainty quantification that flat machine learning models cannot provide.

Dataset & Schema

The dataset originates from Instacart’s public release and spans six relational tables. For each user, between 4 and 100 of their orders are available, along with the exact sequence of products added to each cart.

Instacart database schema showing six interconnected tables — Database schema showing relationships between orders, products, aisles, and departments

Key Tables

orders — All orders (prior, train, test) with timestamps and user info
order_products_prior — Products in each prior order with reorder flags
order_products_train — Products in the most recent order (training target)
products — Product names linked to aisles and departments
aisles & departments — Categorical hierarchies for products

Show code

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

aisles = pd.read_csv('data/aisles.csv')
departments = pd.read_csv('data/departments.csv')
order_products_prior = pd.read_csv('data/order_products_prior.csv')
order_products_train = pd.read_csv('data/order_products_train.csv')
orders = pd.read_csv('data/orders.csv')
products = pd.read_csv('data/products.csv')

Data Cleaning & Validation

Before any analysis begins, each of the six tables needs validation for missing values, duplicates, and inconsistent entries.

Missing Values

The days_since_prior_order column contains NaN values — these are not errors. They represent each user’s first ever order, which naturally has no prior order to reference.

Show code

orders.isna().sum()

Duplicate Check & Garbage Values

All tables pass the duplicate check cleanly. Product names, however, contain special characters — accents, hyphens, and symbols — that need normalization.

Show code

import re
special_char = set(re.findall("[^a-zA-Z0-9 ]", ''.join(products.loc[:,'product_name'])))
print(f'Found {len(special_char)} distinct special characters in product names')

products['product_name'] = products['product_name'].apply(
    lambda x: re.sub(r'[^a-zA-Z0-9 ]', '', x).strip()
)

Exploratory Data Analysis

Order Distribution Across Evaluation Sets

The dataset splits into three evaluation sets. The prior set (3.2M orders) forms the bulk of historical data, while train (~131K) and test (~75K) each represent one user’s most recent order.

Bar chart showing distribution of orders in train and test evaluation sets — Distribution of orders across train and test sets

Bar chart showing the massive prior set dwarfing train and test sets — Distribution across all evaluation sets including prior orders

Reorder vs. First-Time Purchases

The first critical insight: approximately 59% of all products in orders are reorders. Customers overwhelmingly stick with products they already know.

Pie chart showing the split between reordered and first-time purchased products — 59% reorders vs 41% first-time purchases

Nearly 6 out of 10 products in any given cart are items the customer has purchased before. This strong reorder tendency makes predicting reorder behavior both feasible and commercially valuable.

Formally, defining \(y_i \in \{0, 1\}\) as the reorder indicator for item \(i\), the overall reorder rate across the dataset is:

\hat{p}_{\text{reorder}} = \frac{1}{N}\sum_{i=1}^{N} y_i \approx 0.59

Department-Level Analysis

Products span 21 departments. Personal care, snacks, and pantry lead in product count, while categories like bulk and alcohol are smaller and more specialized.

Horizontal bar chart showing product count across departments — Number of products per department

Treemap: Aisles Within Each Department

The hierarchical structure of departments and aisles is best visualized as a treemap. Each rectangle represents an aisle, nested inside its parent department, with area proportional to the number of products. This makes it immediately clear which departments are product-dense and which aisles dominate within them.

Show code

fig = px.treemap(departments_info, path=['department', 'aisle'],
                 title='Aisles in each department')
fig.update_layout(autosize=False, width=1050, height=1000)
fig.show()

Treemap visualization of aisles within each of the 21 departments — Treemap showing aisles nested within each department — larger rectangles indicate more aisles

Top 15 Departments by Reorder Volume

Ranking departments by total reorder volume (not just rate) reveals which categories generate the most repeat business in absolute terms. This interactive chart highlights the top 15 departments driving habitual purchases.

Top Aisles by Product Count

Within departments, the top 10 aisles by product count include candy/chocolate, ice cream, and vitamins/supplements — categories with high product variety.

Bar chart showing the top 10 aisles by product count — Top 10 aisles by number of products

Reorder Rate Distribution

The distribution across orders reveals an interesting shape — most orders cluster around a moderate reorder rate (0.4–0.7), but there is a long tail of orders composed almost entirely of repeat purchases.

Histogram showing the distribution of reorder rates — Distribution of reorder rates across orders

Deep Dive: Reorder Behavior

Understanding reorder patterns requires slicing the data at multiple levels — department, aisle, time gap, and product.

Average Reorders Per Department

Computing the average number of reorders per distinct product in each department highlights which categories drive habitual buying:

\text{avg\_reorders}_d = \frac{\sum_{p \in d} \text{reorders}_p}{|\{p \in d\}|}

Bar chart showing average reorder count per product by department — Average reorders per product across departments

Show code

dept_stats = departments_info.groupby('department').agg(
    total_reorders=('reordered', 'sum'),
    distinct_products=('product_id', 'nunique')
).assign(
    avg_reorders_per_product=lambda d: d['total_reorders'] / d['distinct_products']
)

Probability of Reorder by Department

A more informative metric is the conditional probability: “If a product comes from department \(d\), how likely is it to be reordered?”

P(\text{reorder} \mid \text{dept} = d) = \frac{\#\text{reorders from } d}{\#\text{purchases from } d}

Horizontal bar chart of reorder probabilities by department — Probability of reorder from each department

Show code

dept_probs = departments_info.groupby('department')['reordered'].mean() \
    .sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(12, 8))
bars = ax.barh(dept_probs.index, dept_probs.values, color='green', edgecolor='black')
ax.invert_yaxis()
ax.set_xlabel('P(reorder | department)')
ax.set_title('Reorder Probability by Department')
plt.tight_layout()
plt.show()

Highest Reorder Rates

Dairy & eggs, produce, and beverages show the highest reorder probabilities — staple categories that customers replenish on a regular cycle. The lowest reorder rates appear in personal care and babies, where products tend to be one-time or infrequent purchases.

Probability of Reorder by Aisle

Drilling down to the aisle level (~134 aisles) shows even finer granularity:

P(\text{reorder} \mid \text{aisle} = a) = \frac{\#\text{reorders from } a}{\#\text{purchases from } a}

Full horizontal bar chart of reorder probabilities across all aisles — Probability of reorder from each aisle

Impact of Order Gap on Reorder Behavior

The time between consecutive orders (days_since_prior_order) significantly influences reorder behavior. Data shows clear spikes at 7 days (weekly shoppers), with additional modes at 14 and 30 days.

Count plot showing reorder vs first-time purchase by days since prior order — Gap between orders and reorder behavior

Most Popular Products

The top 20 most-ordered products are dominated by fresh produce — bananas, organic strawberries, and baby spinach lead by a wide margin. The heavy-tailed distribution means the top 10% of products account for roughly 50% of all order lines.

Bar chart of the 20 most ordered products on Instacart — Top 20 most frequently ordered products

Cart Position Analysis

The median add-to-cart position distribution shows most products are added early — customers start with habitual items and then explore.

Histogram showing when products are typically added to cart — Distribution of median cart position across products

Feature Engineering

Several features are engineered at the user, product, and order level to encode the patterns uncovered above.

User-Level Features

Each user’s reorder tendency is quantified as the fraction of their total purchases that are reorders:

\[ P(\text{reorder} \mid \text{user} = u) = \frac{\sum_{i \in \mathcal{O}_u} y_i}{|\mathcal{O}_u|} \] where \(\mathcal{O}_u\) is the set of all items across all prior orders for user \(u\).

Show code

user_activity = prior_order.groupby('user_id')['reordered'].agg(
    ['sum', 'count']
).rename(columns={'sum': 'user_reorders', 'count': 'user_purchases'}).reset_index()

user_activity['P(reorder | user)'] = (
    user_activity['user_reorders'] / user_activity['user_purchases']
)

Per-Order User Activity

The reorder rate is also computed at the per-order level to capture how behavior evolves across successive orders:

Show code

user_activity_per_order = prior_order.groupby(
    ['user_id', 'order_id']
)['reordered'].agg(
    ['sum', 'count']
).rename(columns={'sum': 'user_reorders', 'count': 'user_purchases'}).reset_index()

user_activity_per_order['P(reorder | user)'] = (
    user_activity_per_order['user_reorders'] / user_activity_per_order['user_purchases']
)

Department-Level & Cart Size Features

Show code

# Department reorder rate
dep_info = departments_info.groupby('department_id', as_index=False).agg(
    {'reordered': 'mean'}
).rename(columns={'reordered': 'reorder_dept_rate'})

# Cart size and user average
cart_size = order_products_train.groupby('order_id').size().reset_index(name='cart_size')

cart_df = order_products_train.merge(
    orders[['order_id', 'order_number', 'order_dow', 'order_hour_of_day',
            'days_since_prior_order', 'user_id']], on='order_id', how='left'
).merge(cart_size, on='order_id', how='left').merge(
    products[['product_id', 'aisle_id', 'department_id']], on='product_id', how='left'
)

user_avg = cart_df.groupby('user_id')['cart_size'].mean().reset_index(name='avg_cart_size')
cart_df = cart_df.merge(user_avg, on='user_id', how='left')

Complete Feature Set

Feature	Description
`user_reorders`	Total reorders by the user
`user_purchases`	Total products purchased by the user
`P(reorder \\| user)`	User-level reorder probability
`reorder_dept_rate`	Department-level reorder rate
`cart_size`	Number of items in the current order
`avg_cart_size`	User’s average cart size across all orders
`order_number`	Sequential order count for the user
`order_dow`	Day of week of the order
`order_hour_of_day`	Hour of day of the order
`days_since_prior_order`	Gap in days since last order

Feature Signal Strength

Before modeling, a quick signal assessment identifies which predictors carry meaningful information:

Show code

# Pearson correlation for numeric features
num_feats = ['order_number', 'order_dow', 'order_hour_of_day',
             'days_since_prior_order', 'cart_size']

numeric_corr = {
    feat: cart_df[feat].corr(cart_df['reordered'])
    for feat in num_feats
}

# Signal strength for categorical features
cat_feats = ['aisle_id', 'department_id']
cat_signal = []
for feat in cat_feats:
    rates = df.groupby(feat)['reordered'].mean()
    cat_signal.append({
        'feature': feat,
        'num_categories': rates.size,
        'std_of_rates': rates.std(),
        'range_of_rates': rates.max() - rates.min()
    })

Reorder Rate vs Order Number

As users place more orders, their reorder rate climbs initially and then plateaus — confirming that customers develop stable purchasing habits over time.

Line plot showing how reorder probability changes with order count — Reorder rate as a function of order number

The Math Behind the Models

Problem 1: Will This Product Be Reordered?

For each observation \(n\) (a user–product pair in an order), define \(y_n \in \{0,1\}\) as the binary reorder indicator. A standard logistic regression models the probability as:

P(y_n = 1) = \text{logistic}(\alpha + \mathbf{x}_n^T \boldsymbol{\beta}) = \frac{1}{1 + e^{-(\alpha + \mathbf{x}_n^T \boldsymbol{\beta})}}

This assumes a single intercept \(\alpha\) and shared slopes \(\boldsymbol{\beta}\) across all users and products. The problem: User A (who reorders 90% of the time) and User B (who reorders 20% of the time) are forced to share the same baseline. Similarly, bananas and birthday candles get no product-specific adjustment.

The Hierarchical Extension

A hierarchical model introduces random effects — per-user and per-product intercept offsets drawn from population-level distributions:

\text{logit}\big(P(y_n = 1)\big) = \underbrace{\alpha}_{\text{global baseline}} + \underbrace{\gamma_{u[n]}}_{\text{user effect}} + \underbrace{\delta_{p[n]}}_{\text{product effect}} + \underbrace{\beta_{\text{num}} x_{\text{ord\_num}} + \beta_{\text{dow}} x_{\text{dow}} + \beta_{\text{hour}} x_{\text{hour}}}_{\text{fixed effects for order context}}

The random effects are not free parameters — they are constrained by group-level priors that create partial pooling:

\gamma_u \sim \mathcal{N}(0, \sigma_u^2), \quad \delta_p \sim \mathcal{N}(0, \sigma_p^2)

Each user’s and product’s effect is pulled toward the global mean, with the degree of shrinkage controlled by \(\sigma_u^2\) and \(\sigma_p^2\). Users with abundant data get estimates closer to their individual rates; users with sparse data borrow strength from the population.

Problem 2: How Many Items in the Cart?

For each order \(i\), the cart size \(c_i\) (a non-negative integer) is modeled with a Poisson distribution and log-link:

c_i \sim \text{Poisson}(\lambda_i), \quad \log(\lambda_i) = \beta_0 + \beta_1 x_{\text{ord\_num}} + \beta_2 x_{\text{dow}} + \beta_3 x_{\text{hour}} + \beta_4 x_{\text{avg\_cart}}

where \(\lambda_i\) is the expected cart size, determined by order-level features. By Bayes’ rule, the joint posterior over all coefficients is:

p(\boldsymbol{\beta} \mid \mathbf{c}, \mathbf{X}) \propto \prod_{i=1}^{N} \text{Poisson}(c_i \mid \lambda_i(\boldsymbol{\beta})) \cdot \prod_{j=0}^{4} \mathcal{N}(\beta_j \mid 0, 1)

The posterior mean and expected cart size for a new order are computed from \(S\) posterior samples:

\hat{\lambda}_{\text{new}} = \frac{1}{S}\sum_{s=1}^{S} \exp\!\big(\beta_0^{(s)} + \beta_1^{(s)} x_1 + \cdots + \beta_4^{(s)} x_4\big)

And the full predictive distribution is:

p(c_{\text{new}} = k \mid \mathbf{X}_{\text{new}}, \text{data}) = \frac{1}{S}\sum_{s=1}^{S} \frac{e^{-\lambda^{(s)}} (\lambda^{(s)})^k}{k!}

This gives not just a point estimate but a complete probability distribution over cart sizes, enabling credible intervals for inventory planning and staffing decisions.

Prior Justification

Given the logistic scale and standardized predictors, weakly informative priors are used for all parameters:

Hierarchical Logistic Regression

\alpha \sim \mathcal{N}(0, 1) \qquad \text{(global intercept)} \] \[ \sigma_u, \sigma_p \sim \text{Half-}\mathcal{N}(0, 1) \qquad \text{(controls how far users/products stray from global intercept)} \] \[ \tilde{u}, \tilde{p} \sim \mathcal{N}(0, 1) \qquad \text{(raw offsets for non-centered form)} \] \[ \beta_{\text{num}}, \beta_{\text{dow}}, \beta_{\text{hour}} \sim \mathcal{N}(0, 1) \qquad \text{(fixed slopes)}

On the logit scale, a \(\mathcal{N}(0,1)\) prior places most mass within \(\pm 2\), corresponding to probabilities between roughly 12% and 88%. This is broad enough to accommodate diverse effects without letting the model converge on extreme or implausible values. The half-normal priors on \(\sigma_u\) and \(\sigma_p\) regularize the group-level variation, shrinking sparse groups toward zero.

Poisson Regression

\beta_0, \beta_1, \beta_2, \beta_3, \beta_4 \sim \mathcal{N}(0, 1) \qquad \text{(all coefficients on the log scale)}

Non-Centered Parameterization

For computational efficiency, the hierarchical model uses a non-centered parameterization. Instead of sampling \(\gamma_u \sim \mathcal{N}(0, \sigma_u)\) directly, a standard normal auxiliary variable \(\tilde{u}\) is sampled and then scaled:

\tilde{u} \sim \mathcal{N}(0, 1), \quad \gamma_u = \tilde{u} \cdot \sigma_u

This is mathematically equivalent but breaks the correlation between \(\sigma_u\) and \(\gamma_u\) in the posterior — dramatically improving sampling efficiency when the number of groups (users, products) is large.

Model 1: Hierarchical Logistic Regression in Stan

Preparing the Data

Show code

df = order_products_train.merge(
    orders[['order_id', 'user_id', 'order_number', 'order_dow', 'order_hour_of_day']],
    on='order_id'
).merge(
    products[['product_id', 'aisle_id', 'department_id']], on='product_id'
)

# Sample for tractable MCMC
df = df.sample(n=5000, random_state=42)

# Integer indices for random effects (Stan 1-based)
model_l = df[['user_id', 'product_id', 'order_number',
              'order_dow', 'order_hour_of_day', 'reordered']].copy()
model_l['user_idx'] = model_l['user_id'].astype('category').cat.codes
model_l['product_idx'] = model_l['product_id'].astype('category').cat.codes

Building the Stan Data

Show code

stan_data = {
    "N": len(model_l),
    "U": int(model_l.user_idx.max()) + 1,
    "P": int(model_l.product_idx.max()) + 1,
    "user_idx": (model_l.user_idx.values + 1).astype(int),
    "prod_idx": (model_l.product_idx.values + 1).astype(int),
    "ord_num": model_l.order_number.values.astype(float),
    "ord_dow": model_l.order_dow.values.astype(float),
    "ord_hour": model_l.order_hour_of_day.values.astype(float),
    "y": model_l.reordered.values.astype(int),
}

# Standardize predictors for better sampling
for name in ['ord_num', 'ord_dow', 'ord_hour']:
    arr = stan_data[name]
    stan_data[name] = (arr - np.mean(arr)) / np.std(arr)

The Stan Model — `hier_logit.stan`

data {
  int<lower=1> N;                 // # observations
  int<lower=1> U;                 // # users
  int<lower=1> P;                 // # products
  array[N] int<lower=1,upper=U> user_idx;
  array[N] int<lower=1,upper=P> prod_idx;
  vector[N] ord_num;
  vector[N] ord_dow;
  vector[N] ord_hour;
  array[N] int<lower=0,upper=1> y;
}
parameters {
  real alpha;
  real<lower=0> sigma_u;
  real<lower=0> sigma_p;
  vector[U] u_raw;
  vector[P] p_raw;
  real beta_num;
  real beta_dow;
  real beta_hour;
}
transformed parameters {
  vector[U] gamma_u = u_raw * sigma_u;
  vector[P] delta_p = p_raw * sigma_p;
}
model {
  // Priors
  alpha     ~ normal(0, 1);
  sigma_u   ~ normal(0, 1);
  sigma_p   ~ normal(0, 1);
  u_raw     ~ normal(0, 1);
  p_raw     ~ normal(0, 1);
  beta_num  ~ normal(0, 1);
  beta_dow  ~ normal(0, 1);
  beta_hour ~ normal(0, 1);

  // Likelihood
  for (n in 1:N) {
    real eta = alpha
               + gamma_u[user_idx[n]]
               + delta_p[prod_idx[n]]
               + beta_num  * ord_num[n]
               + beta_dow  * ord_dow[n]
               + beta_hour * ord_hour[n];
    y[n] ~ bernoulli_logit(eta);
  }
}

Sampling

Show code

from cmdstanpy import CmdStanModel, install_cmdstan

install_cmdstan()
model = CmdStanModel(stan_file='hier_logit.stan')

fit = model.sample(
    data=stan_data,
    seed=42,
    chains=4,
    iter_warmup=1000,
    iter_sampling=1000,
    adapt_delta=0.99,
    max_treedepth=15
)

print(fit.summary())
print(fit.diagnose())

Posterior Distributions

The posterior draws for the global parameters and fixed slopes reveal how much of the reorder variation is attributable to users, products, and order context:

Six posterior histograms showing the distribution of key model parameters — Posterior distributions of α, σ_u, σ_p, β_num, β_dow, β_hour

Show code

df_draws = fit.draws_pd()

params = ['alpha', 'sigma_u', 'sigma_p', 'beta_num', 'beta_dow', 'beta_hour']
fig, axes = plt.subplots(len(params), 1, figsize=(6, len(params)*2.2))
for ax, name in zip(axes, params):
    ax.hist(df_draws[name], bins=40, density=True, alpha=0.7)
    ax.set_title(f"Posterior of {name}")
    ax.axvline(df_draws[name].mean(), color='k', linestyle='--', label='mean')
    ax.legend()
plt.tight_layout()
plt.show()

Interpreting the Posteriors

Each posterior distribution tells a specific story about how Instacart customers behave. The histograms above show the density of 4,000 MCMC draws (4 chains × 1,000 samples) for each parameter. The dashed vertical line marks the posterior mean.

Global Intercept — \(\alpha \approx 0.44\)

The global intercept represents the baseline log-odds of reorder when all predictors are at their mean and user/product random effects are zero. Converting to probability through the logistic function:

p_0 = \frac{1}{1 + e^{-\alpha}} = \frac{1}{1 + e^{-0.44}} \approx 0.61

This means an average user buying an average product has roughly a 61% chance of reordering it — closely matching the 59% overall reorder rate observed in the raw data. The tight posterior (narrow histogram) indicates high confidence in this estimate.

User-Effect Standard Deviation — \(\sigma_u \approx 0.62\)

This parameter controls how much individual users deviate from the global intercept. A value of 0.62 means users’ log-odds intercepts are drawn from \(\mathcal{N}(0, 0.62^2)\). To understand the practical impact, convert to odds ratios:

\text{95% of users:} \quad \exp(\pm 1.96 \times 0.62) = [0.30, \; 3.39]

This means the most habitual users have reorder odds 3.4× the baseline, while the most exploratory users have odds just 0.3× the baseline. The wide spread confirms massive user heterogeneity — exactly the kind of variation that a flat model would miss.

Product-Effect Standard Deviation — \(\sigma_p \approx 0.75\)

Products vary even more than users. The 95% interval of product odds ratios spans:

\text{95% of products:} \quad \exp(\pm 1.96 \times 0.75) = [0.23, \; 4.35]

A staple like whole milk has a strongly positive \(\delta_p\), pushing its reorder odds to 4× the baseline. A novelty item like a seasonal candle has a strongly negative offset, dropping to 0.23× the baseline. This asymmetry between products is the main reason hierarchical modeling outperforms flat approaches.

Order Number Slope — \(\beta_{\text{num}} \approx 0.75\)

One standard deviation increase in the user’s order count multiplies the odds of reorder by:

\text{OR} = \exp(\beta_{\text{num}}) = \exp(0.75) \approx 2.12

Experienced users are about twice as likely to reorder. This aligns with the EDA finding that reorder rates climb with order count before plateauing — the model has learned this relationship directly from the data.

Day-of-Week Slope — \(\beta_{\text{dow}} \approx -0.05\)

The posterior is centered just below zero. Converting: \(\exp(-0.05) = 0.951\), indicating a 5% drop in odds per SD increase in day-of-week. This is a negligible effect — whether someone orders on a Monday or Saturday barely changes their reorder tendency.

Hour-of-Day Slope — \(\beta_{\text{hour}} \approx 0\)

The posterior density straddles zero almost symmetrically, confirming that hour of day has no measurable effect on reorder probability. Morning shoppers and late-night browsers reorder at the same rates.

User and Product Random Effects

The distributions of per-user and per-product random intercepts show how the model captures individual-level variation:

Histogram of user random intercepts showing population spread — Posterior distribution of all user random intercepts γᵤ

Histogram of product random intercepts showing population spread — Posterior distribution of all product random intercepts δₚ

Show code

# User intercepts
gamma_cols = [c for c in df_draws.columns if c.startswith('gamma_u[')]
all_gamma = df_draws[gamma_cols].values.flatten()
plt.figure(figsize=(6, 4))
plt.hist(all_gamma, bins=60, density=True, alpha=0.7)
plt.title("Posterior distribution of all γᵤ (user intercepts)")
plt.xlabel("γᵤ value")
plt.show()

# Product intercepts
delta_cols = [c for c in df_draws.columns if c.startswith('delta_p[')]
all_delta = df_draws[delta_cols].values.flatten()
plt.figure(figsize=(6, 4))
plt.hist(all_delta, bins=60, density=True, alpha=0.7)
plt.title("Posterior distribution of all δₚ (product intercepts)")
plt.xlabel("δₚ value")
plt.show()

Prediction: Reorder Probability for a Specific User–Product Pair

To compute the probability that a specific user will reorder a specific product on their next order, the posterior draws are used directly:

\pi^{(s)} = \text{logistic}\!\big(\alpha^{(s)} + \gamma_u^{(s)} + \delta_p^{(s)} + \beta_{\text{num}}^{(s)} x_{\text{num}} + \beta_{\text{dow}}^{(s)} x_{\text{dow}} + \beta_{\text{hour}}^{(s)} x_{\text{hour}}\big) \] \[ \hat{\pi} = \frac{1}{S}\sum_{s=1}^{S} \pi^{(s)}, \quad \text{95% CI} = \big[\pi_{0.025}, \pi_{0.975}\big]

Show code

u, p, num, dow, hour = 1235, 563, 5, 2, 15

alpha_draws  = fit.stan_variable('alpha')
beta_num_d   = fit.stan_variable('beta_num')
beta_dow_d   = fit.stan_variable('beta_dow')
beta_hour_d  = fit.stan_variable('beta_hour')
gamma_u_mat  = fit.stan_variable('gamma_u')   # shape (n_draws, U)
delta_p_mat  = fit.stan_variable('delta_p')   # shape (n_draws, P)

gamma_u_draws = gamma_u_mat[:, u]
delta_p_draws = delta_p_mat[:, p]

eta = (alpha_draws + gamma_u_draws + delta_p_draws
       + beta_num_d * num + beta_dow_d * dow + beta_hour_d * hour)

pi_draws = 1 / (1 + np.exp(-eta))
pi_mean = pi_draws.mean()
pi_ci = np.percentile(pi_draws, [2.5, 97.5])

print(f"Reorder probability for user={u}, product={p}:")
print(f"  Posterior mean = {pi_mean:.3f}")
print(f"  95% CI         = [{pi_ci[0]:.3f}, {pi_ci[1]:.3f}]")

Example result: For user 1235 and product 563, the posterior mean reorder probability is 0.977 with a 95% credible interval of [0.899, 0.999]. The model is highly confident this user will reorder this product.

Model 2: Bayesian Poisson Regression for Cart Size

The second model shifts focus from what goes into the cart to how many items go in. For each order \(i\), the cart size \(c_i\) is modeled as:

c_i \sim \text{Poisson}(\lambda_i), \quad \log(\lambda_i) = \beta_0 + \beta_1 x_{\text{ord\_num}} + \beta_2 x_{\text{dow}} + \beta_3 x_{\text{hour}} + \beta_4 x_{\text{avg\_cart}}

The Stan Model — `poisson_reg.stan`

data {
  int<lower=1> N;            // # of orders
  array[N] int<lower=0> y;   // cart_size (count)
  vector[N] x1;              // order_number
  vector[N] x2;              // order_dow
  vector[N] x3;              // order_hour_of_day
  vector[N] x4;              // avg_cart_size
}
parameters {
  real beta0;
  real beta1;
  real beta2;
  real beta3;
  real beta4;
}
model {
  // Priors (on log-scale)
  beta0 ~ normal(0, 1);
  beta1 ~ normal(0, 1);
  beta2 ~ normal(0, 1);
  beta3 ~ normal(0, 1);
  beta4 ~ normal(0, 1);

  // Likelihood
  for (n in 1:N) {
    real eta = beta0
               + beta1 * x1[n]
               + beta2 * x2[n]
               + beta3 * x3[n]
               + beta4 * x4[n];
    y[n] ~ poisson_log(eta);
  }
}

Preparing Data & Sampling

Show code

stan_d_pois = {
    "N": len(cart_df),
    "y": cart_df["cart_size"].values.astype(int),
    "x1": cart_df["order_number"].values.astype(float),
    "x2": cart_df["order_dow"].values.astype(float),
    "x3": cart_df["order_hour_of_day"].values.astype(float),
    "x4": cart_df["avg_cart_size"].values.astype(float),
}

pois_model = CmdStanModel(stan_file="poisson_reg.stan")
fit_pois = pois_model.sample(data=stan_d_pois, seed=42, chains=4,
                              iter_warmup=1000, iter_sampling=1000)

Posterior Predictive Check

The posterior predictive distribution is compared against observed cart sizes to validate model fit:

Overlay of predicted and observed cart size distributions — Posterior predictive distribution vs observed cart sizes

Show code

coef_names = ['beta0', 'beta1', 'beta2', 'beta3', 'beta4']
beta_means = {name: np.mean(fit_pois.stan_variable(name)) for name in coef_names}

x1 = stan_d_pois['x1']
x2 = stan_d_pois['x2']
x3 = stan_d_pois['x3']
x4 = stan_d_pois['x4']

eta = (beta_means['beta0']
       + beta_means['beta1'] * x1
       + beta_means['beta2'] * x2
       + beta_means['beta3'] * x3
       + beta_means['beta4'] * x4)

lam = np.exp(eta)

Prediction: How Many Items in the Cart?

For a new order with known features, the full predictive distribution over cart sizes is computed:

P(c_{\text{new}} = k) = \frac{1}{S}\sum_{s=1}^{S} \frac{e^{-\lambda^{(s)}} (\lambda^{(s)})^k}{k!}

Show code

import math

order_number, order_dow, order_hour, avg_cart_size = 5, 2, 15, 12.3

# Standardize inputs
x1 = (order_number - stan_d_pois['x1'].mean()) / stan_d_pois['x1'].std()
x2 = (order_dow    - stan_d_pois['x2'].mean()) / stan_d_pois['x2'].std()
x3 = (order_hour   - stan_d_pois['x3'].mean()) / stan_d_pois['x3'].std()
x4 = (avg_cart_size - stan_d_pois['x4'].mean()) / stan_d_pois['x4'].std()

post = fit_pois.draws_pd()
b0, b1, b2, b3, b4 = [post[f'beta{i}'].values for i in range(5)]

eta_draws = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
lambda_draws = np.exp(eta_draws)

expected_size = lambda_draws.mean()
y_pred_draws = np.random.poisson(lambda_draws)
ci_lower, ci_upper = np.percentile(y_pred_draws, [2.5, 97.5])

k = 10
pmf_k = np.exp(-lambda_draws) * (lambda_draws**k) / math.factorial(k)
prob_k = pmf_k.mean()

print(f"Expected cart size  = {expected_size:.2f} items")
print(f"95% predictive interval = [{ci_lower:.0f}, {ci_upper:.0f}] items")
print(f"P(cart_size = {k})  = {prob_k:.3f}")

Example result: For a user with average cart size 12.3, placing their 5th order on a Wednesday at 3 PM: expected cart size = 12.01 items, 95% predictive interval = [6, 19] items, and the probability of exactly 10 items is 0.105.

Evaluation Metrics

Model accuracy is measured with RMSE and MAE:

\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(c_i - \hat{\lambda}_i)^2}, \qquad \text{MAE} = \frac{1}{N}\sum_{i=1}^{N}|c_i - \hat{\lambda}_i|

Model Comparison: Bayesian vs. Classical

Both Bayesian models are compared against their frequentist counterparts to understand where the Bayesian framework adds value:

Show code

from sklearn.linear_model import LogisticRegression, PoissonRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss
from sklearn.metrics import mean_squared_error, mean_absolute_error

# -- Reorder prediction: Classical Logistic Regression --
X = model_l[['order_number', 'order_dow', 'order_hour_of_day']]
y = model_l['reordered']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model_logistic = LogisticRegression(max_iter=300)
model_logistic.fit(X_train, y_train)
y_prob = model_logistic.predict_proba(X_test)[:, 1]

print('===== Logistic Regression =====')
print(f'Accuracy: {accuracy_score(y_test, y_prob > 0.5):.3f}')
print(f'AUC:      {roc_auc_score(y_test, y_prob):.3f}')
print(f'Log-loss: {log_loss(y_test, y_prob):.3f}')

# -- Cart size: Classical Poisson Regression --
X_reg = cart_df[['order_number', 'order_dow', 'order_hour_of_day', 'avg_cart_size']]
y_reg = cart_df['cart_size']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg,
                                                          test_size=0.2, random_state=42)

model_poisson = PoissonRegressor(alpha=1e-6, max_iter=300)
model_poisson.fit(Xr_train, yr_train)
y_pred_poisson = model_poisson.predict(Xr_test)

print("===== Poisson Regression =====")
print(f"RMSE: {mean_squared_error(yr_test, y_pred_poisson, squared=False):.2f}")
print(f"MAE:  {mean_absolute_error(yr_test, y_pred_poisson):.2f}")

Key Comparison Insight

While classical models achieve competitive point metrics, only the Bayesian models deliver full posterior distributions over predictions. This means calibrated uncertainty estimates — credible intervals for reorder probability and cart size — that are essential for downstream decisions like inventory stocking, staffing, and personalized promotions.

Key Takeaways

1. Reorder behavior dominates the Instacart ecosystem. With 59% of all cart items being reorders, customer behavior is deeply habitual. The hierarchical model’s global intercept of \(\alpha \approx 0.44\) translates to a baseline reorder probability of 61%, confirming that the average user–product interaction leans toward repurchase. This makes reorder prediction a high-signal problem with direct commercial value for recommendation engines and inventory systems.

2. User and product heterogeneity demand a hierarchical approach. The posterior estimates of \(\sigma_u \approx 0.62\) and \(\sigma_p \approx 0.75\) reveal that both users and products contribute substantial variation beyond what fixed effects can capture. 95% of users fall within an odds-ratio range of \([0.30, 3.39]\), while products span \([0.23, 4.35]\). A flat logistic regression ignores this structure entirely, producing miscalibrated predictions for anyone who deviates from the population average.

3. Experience drives loyalty, not timing. The order number slope \(\beta_{\text{num}} \approx 0.75\) shows that experienced users are about twice as likely to reorder (\(\text{OR} = 2.12\)), while day-of-week (\(\beta_{\text{dow}} \approx -0.05\)) and hour-of-day (\(\beta_{\text{hour}} \approx 0\)) have negligible effects. This suggests that marketing efforts should focus on retention and building purchase history rather than optimizing delivery time slots.

4. Full predictive distributions enable risk-aware operations. The Bayesian Poisson model produces not just a point estimate for cart size but a complete probability distribution — enabling credible intervals like \([6, 19]\) items for a typical order. This tail-risk information is critical for warehouse staffing, delivery capacity planning, and inventory allocation, where underestimating demand is costly.

5. Partial pooling rescues sparse data. Many users and products have limited interaction histories. The hierarchical structure pools information across similar groups, producing stable estimates even for rare items and cold-start users. This is the practical advantage of the Bayesian framework over tree-based or flat classifiers — more reliable predictions exactly where they matter most.

6. Non-centered parameterization is essential at scale. With thousands of users and products, the naive centered parameterization creates a funnel geometry in the posterior that Stan’s NUTS sampler struggles with. The non-centered reparameterization — sampling \(\tilde{u} \sim \mathcal{N}(0,1)\) and computing \(\gamma_u = \tilde{u} \cdot \sigma_u\) — eliminates divergent transitions and yields clean, well-mixed chains.

Full Project Report

For a complete writeup covering methodology, convergence diagnostics, model comparison tables, and future directions, download the full report:

📄 Download Project Report (PDF)

Built with Python, Stan, and curiosity.

Overview

Tech Stack

Dataset & Schema

Data Cleaning & Validation

Missing Values

Duplicate Check & Garbage Values

Exploratory Data Analysis

Order Distribution Across Evaluation Sets

Reorder vs. First-Time Purchases

Department-Level Analysis

Treemap: Aisles Within Each Department

Top 15 Departments by Reorder Volume

Top Aisles by Product Count

Reorder Rate Distribution

Deep Dive: Reorder Behavior

Average Reorders Per Department

Probability of Reorder by Department

Probability of Reorder by Aisle

Impact of Order Gap on Reorder Behavior

Most Popular Products

Cart Position Analysis

Feature Engineering

User-Level Features

Per-Order User Activity

Department-Level & Cart Size Features

Feature Signal Strength

Reorder Rate vs Order Number

The Math Behind the Models

Problem 1: Will This Product Be Reordered?

The Hierarchical Extension

Problem 2: How Many Items in the Cart?

Prior Justification

Hierarchical Logistic Regression

Poisson Regression

Non-Centered Parameterization

Model 1: Hierarchical Logistic Regression in Stan

Preparing the Data

Building the Stan Data

The Stan Model — hier_logit.stan

Sampling

Posterior Distributions

Interpreting the Posteriors

Global Intercept — \(\alpha \approx 0.44\)

User-Effect Standard Deviation — \(\sigma_u \approx 0.62\)

Product-Effect Standard Deviation — \(\sigma_p \approx 0.75\)

Order Number Slope — \(\beta_{\text{num}} \approx 0.75\)

Day-of-Week Slope — \(\beta_{\text{dow}} \approx -0.05\)

Hour-of-Day Slope — \(\beta_{\text{hour}} \approx 0\)

User and Product Random Effects

Prediction: Reorder Probability for a Specific User–Product Pair

Model 2: Bayesian Poisson Regression for Cart Size

The Stan Model — poisson_reg.stan

Preparing Data & Sampling

Posterior Predictive Check

Prediction: How Many Items in the Cart?

Evaluation Metrics

Model Comparison: Bayesian vs. Classical

Key Takeaways

The Stan Model — `hier_logit.stan`

The Stan Model — `poisson_reg.stan`