Sequential Learning in Recommendation Systems: From Markov Chains to Transformers
A comprehensive guide to sequence-based recommendation techniques and key research papers
Introduction
Traditional recommendation systems treat user interactions as independent events, missing the crucial temporal patterns in user behavior. Sequential recommendation systems address this limitation by modeling the order and timing of user interactions to predict what users will engage with next.
This guide traces the evolution of sequential recommendation from early statistical methods to modern transformer-based architectures, highlighting key papers and techniques that have shaped the field.
Problem Definition
Sequential Recommendation Task: Given a user’s historical interaction sequence $S_u = \{s_1, s_2, \ldots, s_t\}$ ordered by timestamp, predict the next item $s_{t+1}$ that the user will interact with.
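To make the task definition concrete, the sketch below turns one ordered interaction sequence into (history, next item) training pairs. The function name `next_item_examples` and the `max_len` truncation are illustrative assumptions, not part of any particular paper's pipeline.

```python
def next_item_examples(sequence, max_len=50):
    """Turn one ordered interaction sequence into (history, target) pairs."""
    examples = []
    for t in range(1, len(sequence)):
        # Keep only the most recent max_len items as the history
        history = sequence[max(0, t - max_len):t]
        examples.append((history, sequence[t]))
    return examples


s_u = ["shoes", "socks", "laces", "polish"]
pairs = next_item_examples(s_u)
for history, target in pairs:
    print(history, "->", target)
```

Each prefix of the sequence becomes one training example, so a sequence of length t yields t - 1 pairs.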
Key Challenges:
- Capturing both short-term and long-term user preferences
- Handling sparse and variable-length sequences
- Modeling temporal dynamics and seasonal patterns
- Scaling to millions of users and items
Era 1: Statistical Foundations (2000-2010)
Markov Chain Models
Core Idea: Model sequential dependencies using probabilistic state transitions.
First-order Markov Chain:
$$P(s_{t+1} \mid s_1, s_2, \ldots, s_t) = P(s_{t+1} \mid s_t)$$
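A first-order chain can be estimated directly from transition counts. The following is a minimal sketch, assuming maximum-likelihood estimates with no smoothing; the helper names `fit_first_order_markov` and `recommend_next` are made up for illustration.

```python
from collections import Counter, defaultdict


def fit_first_order_markov(sequences):
    """Estimate P(next | current) by maximum likelihood from transition counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for current, nexts in counts.items()
    }


def recommend_next(transitions, last_item, k=1):
    """Rank candidate next items by transition probability from last_item."""
    candidates = transitions.get(last_item, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:k]


sessions = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]
transitions = fit_first_order_markov(sessions)
print(recommend_next(transitions, "b"))  # "c" follows "b" in 2 of 3 transitions
```

Items never seen as a current state return no recommendations, which is one face of the sparsity problem that higher-order chains make worse.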
Key Papers:
- “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data” (Srivastava et al., 2000) - Foundational work on web usage patterns
- “Factorizing personalized Markov chains for next-basket recommendation” (Rendle et al., WWW 2010) - Combined matrix factorization with Markov chains
Limitations:
- Strong independence assumptions
- Limited capacity for long-term dependencies
- Sparsity issues with higher-order models
Sequential Pattern Mining
Approach: Discover frequent sequential patterns in user behavior data.
Key Papers:
- “A Survey of Sequential Pattern Mining” (Fournier-Viger et al., 2017) - Comprehensive survey of pattern mining techniques
Advantages: Interpretable patterns, efficient for specific domains
Limitations: Hard to generalize, difficulty with continuous features
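As a rough illustration of the idea, the sketch below counts contiguous patterns of a fixed length and keeps those that appear in enough sessions. This is a deliberate simplification: full sequential pattern mining algorithms also handle gapped subsequences and variable pattern lengths. The function name and the `min_support` threshold are assumptions for the example.

```python
from collections import Counter


def frequent_contiguous_patterns(sequences, length=2, min_support=2):
    """Count how many sequences contain each contiguous pattern of the given length."""
    support = Counter()
    for seq in sequences:
        # A set, so each sequence contributes at most once per pattern
        seen = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        support.update(seen)
    return {p: s for p, s in support.items() if s >= min_support}


sessions = [
    ["phone", "case", "charger"],
    ["phone", "case", "screen_protector"],
    ["laptop", "mouse"],
    ["phone", "case", "phone", "case"],
]
patterns = frequent_contiguous_patterns(sessions)
print(patterns)  # only ("phone", "case") reaches the support threshold
```

The resulting patterns are directly readable ("phone is followed by case"), which is the interpretability advantage noted above.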
Era 2: Neural Network Revolution (2010-2017)
RNN-Based Methods
GRU4Rec (2016) - The Pioneer
Paper: “Session-based Recommendations with Recurrent Neural Networks” (Hidasi et al., ICLR 2016)
Code: GitHub
Key Contributions:
- First successful application of RNNs to session-based recommendation
- Session-parallel mini-batches for efficient training
- Ranking-based loss functions (BPR, TOP1)
- Reported gains of roughly 20-30% over the item-KNN (item-to-item) baseline
Technical Innovation:
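    import torch.nn as nn

    class GRU4Rec(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super().__init__()  # required before registering submodules
            self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
            self.linear = nn.Linear(hidden_size, output_size)

        def forward(self, input_seq, hidden=None):
            # input_seq: (batch, seq_len, input_size); hidden carries the session state
            output, hidden = self.gru(input_seq, hidden)
            # Score every item from the hidden state at the last time step
            prediction = self.linear(output[:, -1, :])
            return prediction, hidden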
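The two ranking losses named in the contributions can be written down in a few lines. The sketch below uses plain Python on scalar scores rather than tensors, and assumes one positive (target) score compared against a list of sampled negative scores; function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bpr_loss(pos_score, neg_scores):
    """BPR: -mean(log sigmoid(r_pos - r_neg)) over the sampled negatives."""
    return -sum(math.log(sigmoid(pos_score - r)) for r in neg_scores) / len(neg_scores)


def top1_loss(pos_score, neg_scores):
    """TOP1: mean(sigmoid(r_neg - r_pos) + sigmoid(r_neg ** 2)) over the negatives."""
    return sum(sigmoid(r - pos_score) + sigmoid(r * r) for r in neg_scores) / len(neg_scores)


negatives = [0.1, -0.3, 0.2]
good = bpr_loss(2.0, negatives)   # target scored well above the negatives
bad = bpr_loss(-1.0, negatives)   # target scored below the negatives
print(good < bad)
```

Both losses only care about the relative order of the target and the negatives, which is why they suit top-k recommendation better than a pointwise objective. The second term in TOP1 acts as a regularizer that pushes negative scores toward zero.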
Improvements on GRU4Rec
“Improved Recurrent Neural Networks for Session-based Recommendation” (Tan et al., 2016)
- Data augmentation techniques
- Better loss function comparisons
- Improved handling of data sparsity
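“Recurrent Neural Networks with Top-k Gains for Session-based Recommendations” (Hidasi & Karatzoglou, 2018)
- Up to 35% improvement in MRR and Recall@20 over previous session-based RNN solutions, achieved through improved ranking loss functions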
- Better negative sampling strategies
Attention Mechanisms
NARM (2017) - Neural Attentive Recommendation Machine
Paper: “Neural Attentive Session-based Recommendation” (Li et al., CIKM 2017)
Code: GitHub
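Key Innovation: A hybrid encoder that pairs a global and a local view of the session
- Global encoder: the final GRU hidden state summarizes the session's overall sequential behavior
- Local encoder: attention over all hidden states highlights the user's main purpose in the session
- Decoder: the two representations are concatenated and scored against candidate items with a bilinear similarity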
STAMP (2018) - Short-Term Attention/Memory Priority
Paper: “STAMP: Short-Term Attention/Memory Priority Model for Session-based Recommendation” (Liu et al., KDD 2018)
Code: GitHub
Key Features:
- Separates long-term and short-term interests
- Attention mechanism for current session intent
- Memory network for general user preferences
- Addresses interest drift in long sessions
Era 3: Transformer Era (2017-2020)
Self-Attention Revolution
SASRec (2018) - Self-Attentive Sequential Recommendation
Paper: “Self-Attentive Sequential Recommendation” (Kang & McAuley, ICDM 2018)
Code: GitHub
Breakthrough Contributions:
- First successful application of self-attention to sequential recommendation
- Unidirectional attention to prevent information leakage
- Position embeddings for temporal information
- Parallelizable training unlike RNN-based methods
Architecture Highlights:
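    import torch
    import torch.nn as nn

    class SASRec(nn.Module):
        def __init__(self, item_size, hidden_size, num_heads, num_layers, max_len=50):
            super().__init__()
            self.item_embedding = nn.Embedding(item_size, hidden_size)
            self.pos_embedding = nn.Embedding(max_len, hidden_size)
            # PyTorch's built-in encoder layer stands in for a custom TransformerBlock
            self.transformer_blocks = nn.ModuleList([
                nn.TransformerEncoderLayer(hidden_size, num_heads,
                                           dim_feedforward=hidden_size, batch_first=True)
                for _ in range(num_layers)
            ])
            self.output_layer = nn.Linear(hidden_size, item_size)

        def forward(self, seq):
            # seq: (batch, seq_len) item IDs, with seq_len <= max_len
            seq_len = seq.size(1)
            positions = torch.arange(seq_len, device=seq.device)
            x = self.item_embedding(seq) + self.pos_embedding(positions)
            # Causal mask: True marks future positions that must not be attended to
            causal_mask = torch.triu(
                torch.ones(seq_len, seq_len, dtype=torch.bool, device=seq.device), diagonal=1)
            for transformer in self.transformer_blocks:
                x = transformer(x, src_mask=causal_mask)
            # Predict the next item from the representation of the last position
            return self.output_layer(x[:, -1, :])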
Performance: Significant improvements over RNN-based methods across multiple datasets
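The unidirectional attention that prevents information leakage comes down to masking future positions before the softmax. The following dependency-free sketch shows that step alone on a small score matrix; it assumes raw attention scores are already computed, and the function name is illustrative.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][:i + 1]                # positions 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]  # subtract the max for stability
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 5.0],
          [0.0, 0.0, 0.0]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Even though position 0 has large scores toward positions 1 and 2, all of its weight stays on itself, so a prediction at step t never sees items after t.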
BERT4Rec (2019) - Bidirectional Sequential Recommendation
Paper: “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer” (Sun et al., CIKM 2019)
Code: GitHub
Key Innovations:
- Bidirectional self-attention (unlike SASRec’s unidirectional approach)
- Cloze task training: predicting masked items instead of next-item
- Better representation learning through bidirectional context
- More training samples from masking strategy
Training Objective:
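    import torch.nn.functional as F

    def cloze_loss(model, sequence):
        # random_mask is a helper (not shown) that swaps random items for a mask token
        # and returns the masked positions together with the original items there
        masked_seq, masked_positions, targets = random_mask(sequence)
        predictions = model(masked_seq)  # one row of item logits per position
        loss = F.cross_entropy(predictions[masked_positions], targets)
        return loss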
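The masking step that the loss relies on can be sketched without any deep learning library. This is a hypothetical `random_mask` helper, assuming that item ID 0 is reserved as the mask token and that the helper returns the masked positions alongside the masked sequence and the targets; the 0.2 masking probability is an arbitrary example value.

```python
import random

MASK_TOKEN = 0  # assumed ID reserved for [MASK]; real item IDs start at 1


def random_mask(sequence, mask_prob=0.2, rng=random):
    """Return (masked_seq, masked_positions, targets) for Cloze-style training."""
    masked_seq = list(sequence)
    masked_positions = [i for i in range(len(sequence)) if rng.random() < mask_prob]
    if not masked_positions:  # always mask at least one item
        masked_positions = [rng.randrange(len(sequence))]
    targets = [sequence[i] for i in masked_positions]
    for i in masked_positions:
        masked_seq[i] = MASK_TOKEN
    return masked_seq, masked_positions, targets


rng = random.Random(0)
original = [11, 12, 13, 14, 15]
masked_seq, masked_positions, targets = random_mask(original, rng=rng)
print(masked_seq, masked_positions, targets)
```

Because a different subset is masked on every pass, one sequence yields many distinct training samples, which is the source of the extra training signal mentioned above.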
Performance: Superior performance on dense datasets with longer sequences
Era 4: Advanced Sequential Learning (2020-Present)
Industrial Scale Deployment
BST (2019) - Behavior Sequence Transformer at Alibaba
Paper: “Behavior Sequence Transformer for E-commerce Recommendation in Alibaba” (Chen et al., DLP-KDD 2019)
Code: GitHub
Industrial Innovation:
- Successfully deployed at Taobao with significant CTR improvements
- Applies a Transformer layer to the user's click sequence together with the target item
- Incorporates side information beyond item IDs
- Optimized for large-scale production deployment
Key Features:
- Sequence and non-sequence features combined in a single CTR prediction model
- Integration with existing recommendation pipelines
- Real-time inference capabilities
- Significant business impact in production
Multi-Interest Modeling
MIND (2019) - Multi-Interest Network with Dynamic Routing
Paper: “Multi-Interest Network with Dynamic Routing for Recommendation at Tmall” (Li et al., CIKM 2019)
Core Insight: Users have multiple concurrent interests that evolve independently
Technical Approach:
- Dynamic routing mechanism to separate different interests
- Multiple representation vectors per user
- Capsule network architecture for interest extraction
ComiRec (2020) - Controllable Multi-Interest Framework
Paper: “Controllable Multi-Interest Framework for Recommendation” (Cen et al., KDD 2020)
Features:
- Controllable interest extraction with explicit constraints
- Multi-interest sequential modeling with attention
- Better interpretability through interest visualization
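Once a model produces several interest vectors per user, a candidate item has to be scored against all of them. The sketch below takes the maximum dot product over the interest vectors, which is one common choice in multi-interest retrieval; the interest extraction itself (dynamic routing or attention) is not shown, and the 2-d embeddings and item names are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def multi_interest_score(interest_vectors, item_vector):
    """Score an item by its best-matching interest vector."""
    return max(dot(interest, item_vector) for interest in interest_vectors)


# Hypothetical embeddings: one "sports" interest and one "cooking" interest
user_interests = [[1.0, 0.0], [0.0, 1.0]]
single_vector = [0.5, 0.5]  # the same user collapsed into one averaged vector
items = {"running_shoes": [0.9, 0.0], "chef_knife": [0.0, 0.8], "generic": [0.5, 0.5]}

ranked_multi = sorted(
    items, key=lambda i: multi_interest_score(user_interests, items[i]), reverse=True)
ranked_single = sorted(items, key=lambda i: dot(single_vector, items[i]), reverse=True)
print(ranked_multi)
print(ranked_single)
```

With a single averaged vector the bland "generic" item ranks first, while the multi-interest score surfaces the items that strongly match one specific interest.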
Graph-Enhanced Sequential Models
SR-GNN (2019) - Session-based Recommendation with Graph Neural Networks
Paper: “Session-based Recommendation with Graph Neural Networks” (Wu et al., AAAI 2019)
Innovation: Combines sequential patterns with graph structures
- Models sessions as directed graphs
- Graph neural networks capture complex item transitions
- Integrates global graph information with local sequential patterns
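Modeling a session as a directed graph can be illustrated in a few lines. The sketch below builds the edges from consecutive clicks and normalizes each edge weight by the out-degree of its source node; the function name and the dictionary-based representation are assumptions for readability, where a real implementation would build adjacency matrices.

```python
def session_graph(session):
    """Build a directed graph from consecutive clicks with out-degree-normalized weights."""
    nodes = list(dict.fromkeys(session))  # unique items in first-seen order
    counts = {}
    for src, dst in zip(session, session[1:]):
        counts[(src, dst)] = counts.get((src, dst), 0) + 1
    out_degree = {}
    for (src, _), c in counts.items():
        out_degree[src] = out_degree.get(src, 0) + c
    weights = {edge: c / out_degree[edge[0]] for edge, c in counts.items()}
    return nodes, weights


nodes, weights = session_graph(["v1", "v2", "v3", "v2", "v4"])
print(nodes)
print(weights)
```

Note that the repeated item v2 becomes a single node with two outgoing edges, so the graph captures the v2 -> v3 -> v2 loop that a plain sequence encoder would see only as four separate steps.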
GC-SAN (2019) - Graph Contextualized Self-Attention Network
Paper: “Graph Contextualized Self-Attention Network for Session-based Recommendation” (Xu et al., IJCAI 2019)
Approach: Combines self-attention with graph convolution for session modeling
Contrastive Learning Approaches
CL4SRec (2022) - Contrastive Learning for Sequential Recommendation
Paper: “Contrastive Learning for Sequential Recommendation” (Xie et al., ICDE 2022)
Key Techniques:
- Data augmentation strategies: crop, mask, reorder operations
- Contrastive learning objectives for better representations
- Self-supervised learning without additional labels
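A contrastive objective pulls two augmented views of the same sequence together and pushes views of other sequences away. The following is a minimal InfoNCE-style sketch on plain Python vectors; the use of cosine similarity and a temperature of 0.5 are assumptions for the example rather than settings taken from the paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor, positive, negatives, temperature=0.5):
    """-log of the positive pair's share of the total exponentiated similarity."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
aligned = info_nce(anchor, [0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(anchor, [0.0, 1.0], negatives=[[0.9, 0.1], [-1.0, 0.0]])
print(aligned < misaligned)
```

The loss is small when the positive view is the most similar vector to the anchor and grows when a negative is closer, which needs no labels beyond knowing which views came from the same sequence.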
Augmentation Example:
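    import random

    class SequenceAugmentation:
        def __init__(self, crop_ratio=0.6, mask_ratio=0.3, mask_token=0):
            # Ratios and mask token are illustrative defaults, not values from the paper
            self.crop_ratio = crop_ratio
            self.mask_ratio = mask_ratio
            self.mask_token = mask_token

        def crop(self, sequence):
            # Keep one random contiguous sub-sequence (at least one item)
            length = len(sequence)
            crop_length = max(1, int(length * self.crop_ratio))
            start_idx = random.randint(0, length - crop_length)
            return sequence[start_idx:start_idx + crop_length]

        def mask(self, sequence):
            # Replace a random subset of items with the mask token
            masked_seq = list(sequence)
            mask_num = int(len(sequence) * self.mask_ratio)
            mask_indices = random.sample(range(len(sequence)), mask_num)
            for idx in mask_indices:
                masked_seq[idx] = self.mask_token
            return masked_seq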
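The third operation listed among the key techniques, reorder, can be sketched in the same spirit: shuffle one contiguous sub-sequence and leave everything else in place. The function name, the `reorder_ratio` default, and the injectable `rng` argument are assumptions made for this example.

```python
import random


def reorder(sequence, reorder_ratio=0.5, rng=random):
    """Shuffle one contiguous sub-sequence; items outside it keep their positions."""
    length = len(sequence)
    span = int(length * reorder_ratio)
    if span < 2:  # nothing to shuffle
        return list(sequence)
    start = rng.randint(0, length - span)
    segment = list(sequence[start:start + span])
    rng.shuffle(segment)
    return list(sequence[:start]) + segment + list(sequence[start + span:])


rng = random.Random(42)
original = [1, 2, 3, 4, 5, 6]
augmented = reorder(original, rng=rng)
print(augmented)
```

The augmented view contains exactly the same items, so it encodes the assumption that small changes in local order should not change the user's overall intent.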
S3-Rec (2020) - Self-Supervised Learning for Sequential Recommendation
Paper: “S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization” (Zhou et al., CIKM 2020)
Self-supervised Tasks:
- Associated Attribute Prediction (AAP)
- Masked Item Prediction (MIP)
- Masked Attribute Prediction (MAP)
- Segment Prediction (SP)
Time-Aware Sequential Models
TiSASRec (2020) - Time Interval Aware SASRec
Paper: “Time Interval Aware Self-Attention for Sequential Recommendation” (Li et al., WSDM 2020)
Temporal Innovation:
- Explicit time interval modeling beyond positional encoding
- Time-aware self-attention with interval dependencies
- Better handling of irregular time gaps
Time-aware Attention:
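    import torch.nn as nn

    class TimeAwareAttention(nn.Module):
        def __init__(self, config):
            super().__init__()
            self.time_embedding = nn.Embedding(config.max_time_interval, config.hidden_size)
            # batch_first=True so inputs are (batch, seq_len, hidden_size)
            self.attention = nn.MultiheadAttention(
                config.hidden_size, config.num_heads, batch_first=True
            )

        def forward(self, sequence_embeddings, time_intervals):
            # Simplification: one clipped interval ID per position is added to the item
            # embedding; TiSASRec itself models pairwise intervals inside the attention
            time_embs = self.time_embedding(time_intervals)
            temporal_sequence = sequence_embeddings + time_embs
            attended_sequence, _ = self.attention(
                temporal_sequence, temporal_sequence, temporal_sequence
            )
            return attended_sequence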
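Explicit time interval modeling starts from a matrix of pairwise gaps between interactions. The sketch below computes such a matrix, scaling each gap by the user's smallest non-zero gap and clipping at a maximum value so the result can index an embedding table. The function name and the clipping value are assumptions for illustration.

```python
def relative_time_matrix(timestamps, max_interval=256):
    """Pairwise |t_i - t_j|, scaled by the smallest non-zero gap and clipped."""
    gaps = [abs(a - b) for a in timestamps for b in timestamps if a != b]
    unit = min(gaps) if gaps else 1
    return [
        [min(max_interval, abs(ti - tj) // unit) for tj in timestamps]
        for ti in timestamps
    ]


# Timestamps in seconds: three quick clicks, then a return roughly a day later
matrix = relative_time_matrix([0, 60, 180, 86580], max_interval=100)
for row in matrix:
    print(row)
```

Scaling by the user's own smallest gap makes intervals comparable across fast and slow users, and clipping keeps a single very long absence from dominating. A table indexed by these values needs `max_interval + 1` rows, since the clipped value itself is a valid index.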
Industry Applications
E-commerce Success Stories
- Amazon: Product-to-product recommendations, session-based real-time prediction
- Alibaba: DIEN and BST deployed at scale with significant CTR improvements
- Taobao: Multi-behavior sequential modeling in production
Streaming Platforms
- Netflix: Next episode prediction, binge-watching behavior modeling
- Spotify: Playlist continuation, skip prediction, session-based radio
- YouTube: Video sequence modeling for personalized recommendations
Key Takeaways
Evolution Summary:
- Statistical Era: Markov chains provided foundational concepts but had limited expressiveness
- Neural Era: RNNs and attention mechanisms enabled complex pattern learning
- Transformer Era: Self-attention achieved state-of-the-art performance with parallelizable training
- Modern Era: Multi-interest, graph-enhanced, and contrastive learning approaches address real-world complexity
Current State:
- Transformer-based architectures dominate the field
- Published industrial deployments, such as BST at Taobao, report gains in online metrics such as CTR
- Multi-interest and contrastive learning are active research areas
- Large language model integration is emerging
Future Outlook: The field is moving toward more sophisticated understanding of user behavior, better handling of multi-modal data, and increased focus on fairness and privacy. The integration of causal reasoning and federated learning approaches will likely define the next phase of sequential recommendation research.
Essential Papers to Read
Foundational (Must Read)
- GRU4Rec (Hidasi et al., 2016) - Started the deep learning revolution
- SASRec (Kang & McAuley, 2018) - Introduced self-attention to sequential recommendation
- BERT4Rec (Sun et al., 2019) - Bidirectional context learning breakthrough
Advanced Techniques
- BST (Chen et al., 2019) - Industrial scale deployment insights
- CL4SRec (Xie et al., 2022) - Contrastive learning for better representations
- MIND (Li et al., 2019) - Multi-interest modeling pioneer
Foundational Papers
- Srivastava, J., et al. (2000). “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data.” SIGKDD Explorations.
- Rendle, S., et al. (2010). “Factorizing personalized Markov chains for next-basket recommendation.” WWW 2010.
- Fournier-Viger, P., et al. (2017). “A Survey of Sequential Pattern Mining.” Data Science and Pattern Recognition.
Neural Network Revolution
- Hidasi, B., et al. (2016). “Session-based Recommendations with Recurrent Neural Networks.” ICLR 2016. arXiv:1511.06939
- Tan, Y. K., et al. (2016). “Improved Recurrent Neural Networks for Session-based Recommendation.” DLRS 2016 (workshop at RecSys 2016).
- Hidasi, B., & Karatzoglou, A. (2018). “Recurrent Neural Networks with Top-k Gains for Session-based Recommendations.” CIKM 2018.
- Li, J., et al. (2017). “Neural Attentive Session-based Recommendation.” CIKM 2017. arXiv:1711.04725
- Liu, Q., et al. (2018). “STAMP: Short-Term Attention/Memory Priority Model for Session-based Recommendation.” KDD 2018.
Transformer Era
- Kang, W. C., & McAuley, J. (2018). “Self-Attentive Sequential Recommendation.” ICDM 2018. arXiv:1808.09781
- Sun, F., et al. (2019). “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer.” CIKM 2019. arXiv:1904.06690
Industrial Applications
- Chen, Q., et al. (2019). “Behavior Sequence Transformer for E-commerce Recommendation in Alibaba.” DLP-KDD 2019. arXiv:1905.06874
- Li, C., et al. (2019). “Multi-Interest Network with Dynamic Routing for Recommendation at Tmall.” CIKM 2019. arXiv:1904.08030
- Cen, Y., et al. (2020). “Controllable Multi-Interest Framework for Recommendation.” KDD 2020. arXiv:2005.09347
Graph-Enhanced Models
- Wu, S., et al. (2019). “Session-based Recommendation with Graph Neural Networks.” AAAI 2019. arXiv:1811.00855
- Xu, C., et al. (2019). “Graph Contextualized Self-Attention Network for Session-based Recommendation.” IJCAI 2019.
Contrastive Learning
- Xie, X., et al. (2022). “Contrastive Learning for Sequential Recommendation.” ICDE 2022. arXiv:2010.14395
- Zhou, K., et al. (2020). “S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization.” CIKM 2020. arXiv:2008.07873
Time-Aware Models
- Li, J., et al. (2020). “Time Interval Aware Self-Attention for Sequential Recommendation.” WSDM 2020. arXiv:2005.07683
Recent Advances
- Hou, Y., et al. (2022). “CORE: Simple and Effective Session-based Recommendation within Consistent Representation Space.” SIGIR 2022. arXiv:2204.11067
Recent Surveys
- Wang, S., et al. (2021). “A Survey on Session-based Recommender Systems.” ACM Computing Surveys.
- Fang, H., et al. (2020). “Deep Learning for Sequential Recommendation: Algorithms, Influential Factors, and Evaluations.” ACM Transactions on Information Systems.
Conclusion
Sequential recommendation has evolved from simple Markov chains to sophisticated transformer-based systems that power today’s largest digital platforms. The field continues to advance rapidly, with new architectures, evaluation methods, and applications emerging regularly. Understanding this progression provides crucial insights for both researchers developing new techniques and practitioners implementing production systems.
The next frontier lies in developing more interpretable, fair, and efficient sequential models that can handle the complexity of real-world user behavior while respecting privacy and ensuring equitable recommendations across diverse user populations.
Cited as:
@article{reneejia2025sequential-learning-in-recommendation-systems-from-markov-chains-to-transformers,
title = "Sequential Learning in Recommendation Systems: From Markov Chains to Transformers",
author = "Renee Jia",
journal = "renee-jia.github.io",
year = "2025",
url = "https://renee-jia.github.io/ai-learning-guide/sequential-learning-recommendation-systems/"
}