Sequential Learning in Recommendation Systems: From Markov Chains to Transformers

20-30 min read

Sequential Learning in Recommendation Systems: From Markov Chains to Transformers

A comprehensive guide to sequence-based recommendation techniques and key research papers

Introduction

Traditional recommendation systems treat user interactions as independent events, missing the crucial temporal patterns in user behavior. Sequential recommendation systems address this limitation by modeling the order and timing of user interactions to predict what users will engage with next.

This guide traces the evolution of sequential recommendation from early statistical methods to modern transformer-based architectures, highlighting key papers and techniques that have shaped the field.

Problem Definition

Sequential Recommendation Task: Given a user’s historical interaction sequence $S_u = {s_1, s_2, …, s_t}$ ordered by timestamp, predict the next item $s_{t+1}$ that the user will interact with.

Key Challenges:

Capturing both short-term and long-term user preferences
Handling sparse and variable-length sequences
Modeling temporal dynamics and seasonal patterns
Scaling to millions of users and items

Era 1: Statistical Foundations (2000-2010)

Markov Chain Models

Core Idea: Model sequential dependencies using probabilistic state transitions.

First-order Markov Chain:

P(s_{t+1} | s_1, s_2, ..., s_t) = P(s_{t+1} | s_t)

Key Papers:

“Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data” (Srivastava et al., 2000) - Foundational work on web usage patterns
“Factorizing personalized Markov chains for next-basket recommendation” (Rendle et al., WWW 2010) - Combined matrix factorization with Markov chains

Limitations:

Strong independence assumptions
Limited capacity for long-term dependencies
Sparsity issues with higher-order models

Sequential Pattern Mining

Approach: Discover frequent sequential patterns in user behavior data.

Key Papers:

“Sequential Pattern Mining: A Survey” (Fournier-Viger et al., 2017) - Comprehensive survey of pattern mining techniques

Advantages: Interpretable patterns, efficient for specific domains Limitations: Hard to generalize, difficulty with continuous features

Era 2: Neural Network Revolution (2010-2017)

RNN-Based Methods

GRU4Rec (2016) - The Pioneer

Paper: “Session-based Recommendations with Recurrent Neural Networks” (Hidasi et al., ICLR 2016) Code: GitHub

Key Contributions:

First successful application of RNNs to session-based recommendation
Session-parallel mini-batches for efficient training
Ranking-based loss functions (BPR, TOP1)
35% improvement over item-to-item recommendations

Technical Innovation:

class GRU4Rec(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)
    
    def forward(self, input_seq, hidden):
        output, hidden = self.gru(input_seq, hidden)
        prediction = self.linear(output[:, -1, :])
        return prediction, hidden

Improvements on GRU4Rec

“Improved Recurrent Neural Networks for Session-based Recommendation” (Tan et al., 2016)

Data augmentation techniques
Better loss function comparisons
Improved handling of data sparsity

“Recurrent Neural Networks with Top-k Gains” (Hidasi & Karatzoglou, 2018)

35% better performance through improved loss functions
Better negative sampling strategies

Attention Mechanisms

NARM (2017) - Neural Attentive Recommendation Machine

Paper: “Neural Attentive Session-based Recommendation” (Li et al., CIKM 2017) Code: GitHub

Key Innovation: Combined global and local attention mechanisms

Global encoder: captures overall session context
Local encoder: focuses on recent interactions
Attention mechanism: dynamically weighs global vs local information

STAMP (2018) - Short-Term Attention/Memory Priority

Paper: “STAMP: Short-Term Attention/Memory Priority Model for Session-based Recommendation” (Liu et al., KDD 2018) Code: GitHub

Key Features:

Separates long-term and short-term interests
Attention mechanism for current session intent
Memory network for general user preferences
Addresses interest drift in long sessions

Era 3: Transformer Era (2017-2020)

Self-Attention Revolution

SASRec (2018) - Self-Attentive Sequential Recommendation

Paper: “Self-Attentive Sequential Recommendation” (Kang & McAuley, ICDM 2018) Code: GitHub

Breakthrough Contributions:

First successful application of self-attention to sequential recommendation
Unidirectional attention to prevent information leakage
Position embeddings for temporal information
Parallelizable training unlike RNN-based methods

Architecture Highlights:

class SASRec(nn.Module):
    def __init__(self, item_size, hidden_size, num_heads, num_layers):
        self.item_embedding = nn.Embedding(item_size, hidden_size)
        self.pos_embedding = nn.Embedding(max_len, hidden_size)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(hidden_size, num_heads)
            for _ in range(num_layers)
        ])
        self.output_layer = nn.Linear(hidden_size, item_size)
    
    def forward(self, seq):
        positions = torch.arange(len(seq))
        x = self.item_embedding(seq) + self.pos_embedding(positions)
        
        for transformer in self.transformer_blocks:
            x = transformer(x)
        
        return self.output_layer(x[:, -1, :])

Performance: Significant improvements over RNN-based methods across multiple datasets

BERT4Rec (2019) - Bidirectional Sequential Recommendation

Paper: “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer” (Sun et al., CIKM 2019) Code: GitHub

Key Innovations:

Bidirectional self-attention (unlike SASRec’s unidirectional approach)
Cloze task training: predicting masked items instead of next-item
Better representation learning through bidirectional context
More training samples from masking strategy

Training Objective:

def cloze_loss(model, sequence):
    masked_seq, targets = random_mask(sequence)
    predictions = model(masked_seq)
    loss = F.cross_entropy(predictions[masked_positions], targets)
    return loss

Performance: Superior performance on dense datasets with longer sequences

Era 4: Advanced Sequential Learning (2020-Present)

Industrial Scale Deployment

BST (2019) - Behavior Sequence Transformer at Alibaba

Paper: “Behavior Sequence Transformer for E-commerce Recommendation in Alibaba” (Chen et al., DLP-KDD 2019) Code: GitHub

Industrial Innovation:

Successfully deployed at Taobao with significant CTR improvements
Handles heterogeneous behavior types (clicks, purchases, cart additions)
Incorporates side information beyond item IDs
Optimized for large-scale production deployment

Key Features:

Multi-behavior modeling in unified framework
Integration with existing recommendation pipelines
Real-time inference capabilities
Significant business impact in production

Multi-Interest Modeling

MIND (2019) - Multi-Interest Network with Dynamic Routing

Paper: “Multi-Interest Network with Dynamic Routing for Recommendation at Tmall” (Li et al., KDD 2019)

Core Insight: Users have multiple concurrent interests that evolve independently

Technical Approach:

Dynamic routing mechanism to separate different interests
Multiple representation vectors per user
Capsule network architecture for interest extraction

ComiRec (2020) - Controllable Multi-Interest Framework

Paper: “Controllable Multi-Interest Framework for Recommendation” (Cen et al., KDD 2020)

Features:

Controllable interest extraction with explicit constraints
Multi-interest sequential modeling with attention
Better interpretability through interest visualization

Graph-Enhanced Sequential Models

SR-GNN (2019) - Session-based Recommendation with Graph Neural Networks

Paper: “Session-based Recommendation with Graph Neural Networks” (Wu et al., AAAI 2019)

Innovation: Combines sequential patterns with graph structures

Models sessions as directed graphs
Graph neural networks capture complex item transitions
Integrates global graph information with local sequential patterns

GC-SAN (2019) - Graph Contextualized Self-Attention Network

Paper: “Graph Contextualized Self-Attention Network for Session-based Recommendation” (Xu et al., IJCAI 2019)

Approach: Combines self-attention with graph convolution for session modeling

Contrastive Learning Approaches

CL4SRec (2022) - Contrastive Learning for Sequential Recommendation

Paper: “Contrastive Learning for Sequential Recommendation” (Xie et al., ICDE 2022)

Key Techniques:

Data augmentation strategies: crop, mask, reorder operations
Contrastive learning objectives for better representations
Self-supervised learning without additional labels

Augmentation Example:

class SequenceAugmentation:
    def crop(self, sequence):
        length = len(sequence)
        crop_length = int(length * self.crop_ratio)
        start_idx = random.randint(0, length - crop_length)
        return sequence[start_idx:start_idx + crop_length]
    
    def mask(self, sequence):
        masked_seq = sequence.copy()
        mask_num = int(len(sequence) * self.mask_ratio)
        mask_indices = random.sample(range(len(sequence)), mask_num)
        for idx in mask_indices:
            masked_seq[idx] = 0  # mask token
        return masked_seq

S3-Rec (2020) - Self-Supervised Learning for Sequential Recommendation

Paper: “S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization” (Zhou et al., CIKM 2020)

Self-supervised Tasks:

Masked Item Prediction
Masked Attribute Prediction
Subsequence Prediction
Sequence Order Prediction

Time-Aware Sequential Models

TiSASRec (2020) - Time Interval Aware SASRec

Paper: “Time Interval Aware Self-Attention for Sequential Recommendation” (Li et al., WSDM 2020)

Temporal Innovation:

Explicit time interval modeling beyond positional encoding
Time-aware self-attention with interval dependencies
Better handling of irregular time gaps

Time-aware Attention:

class TimeAwareAttention(nn.Module):
    def __init__(self, config):
        self.time_embedding = nn.Embedding(config.max_time_interval, config.hidden_size)
        self.attention = nn.MultiheadAttention(config.hidden_size, config.num_heads)
    
    def forward(self, sequence_embeddings, time_intervals):
        time_embs = self.time_embedding(time_intervals)
        temporal_sequence = sequence_embeddings + time_embs
        attended_sequence, _ = self.attention(
            temporal_sequence, temporal_sequence, temporal_sequence
        )
        return attended_sequence

Industry Applications

E-commerce Success Stories

Amazon: Product-to-product recommendations, session-based real-time prediction
Alibaba: DIEN and BST deployed at scale with significant CTR improvements
Taobao: Multi-behavior sequential modeling in production

Streaming Platforms

Netflix: Next episode prediction, binge-watching behavior modeling
Spotify: Playlist continuation, skip prediction, session-based radio
YouTube: Video sequence modeling for personalized recommendations

Key Takeaways

Evolution Summary:

Statistical Era: Markov chains provided foundational concepts but had limited expressiveness
Neural Era: RNNs and attention mechanisms enabled complex pattern learning
Transformer Era: Self-attention achieved state-of-the-art performance with parallelizable training
Modern Era: Multi-interest, graph-enhanced, and contrastive learning approaches address real-world complexity

Current State:

Transformer-based architectures dominate the field
Industrial deployments show 10-30% improvements in key metrics
Multi-interest and contrastive learning are active research areas
Large language model integration is emerging

Future Outlook: The field is moving toward more sophisticated understanding of user behavior, better handling of multi-modal data, and increased focus on fairness and privacy. The integration of causal reasoning and federated learning approaches will likely define the next phase of sequential recommendation research.

Essential Papers to Read

Foundational (Must Read)

GRU4Rec (Hidasi et al., 2016) - Started the deep learning revolution
SASRec (Kang & McAuley, 2018) - Introduced self-attention to sequential recommendation
BERT4Rec (Sun et al., 2019) - Bidirectional context learning breakthrough

Advanced Techniques

BST (Chen et al., 2019) - Industrial scale deployment insights
CL4SRec (Xie et al., 2022) - Contrastive learning for better representations
MIND (Li et al., 2019) - Multi-interest modeling pioneer

Foundational Papers

Srivastava, J., et al. (2000). “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data.” ACM Computing Surveys.
Rendle, S., et al. (2010). “Factorizing personalized Markov chains for next-basket recommendation.” WWW 2010.
Fournier-Viger, P., et al. (2017). “Sequential Pattern Mining: A Survey.” ACM Computing Surveys.

Neural Network Revolution

Hidasi, B., et al. (2016). “Session-based Recommendations with Recurrent Neural Networks.” ICLR 2016. arXiv:1511.06939
Tan, Y. K., et al. (2016). “Improved Recurrent Neural Networks for Session-based Recommendation.” RecSys 2016.
Hidasi, B., & Karatzoglou, A. (2018). “Recurrent Neural Networks with Top-k Gains.” RecSys 2018.
Li, J., et al. (2017). “Neural Attentive Session-based Recommendation.” CIKM 2017. arXiv:1711.04725
Liu, Q., et al. (2018). “STAMP: Short-Term Attention/Memory Priority Model for Session-based Recommendation.” KDD 2018.

Transformer Era

Kang, W. C., & McAuley, J. (2018). “Self-Attentive Sequential Recommendation.” ICDM 2018. arXiv:1808.09781
Sun, F., et al. (2019). “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer.” CIKM 2019. arXiv:1904.06690

Industrial Applications

Chen, Q., et al. (2019). “Behavior Sequence Transformer for E-commerce Recommendation in Alibaba.” DLP-KDD 2019. arXiv:1905.06874
Li, C., et al. (2019). “Multi-Interest Network with Dynamic Routing for Recommendation at Tmall.” KDD 2019. arXiv:1904.08030
Cen, Y., et al. (2020). “Controllable Multi-Interest Framework for Recommendation.” KDD 2020. arXiv:2005.09347

Graph-Enhanced Models

Wu, S., et al. (2019). “Session-based Recommendation with Graph Neural Networks.” AAAI 2019. arXiv:1811.00855
Xu, C., et al. (2019). “Graph Contextualized Self-Attention Network for Session-based Recommendation.” IJCAI 2019.

Contrastive Learning

Xie, X., et al. (2022). “Contrastive Learning for Sequential Recommendation.” ICDE 2022. arXiv:2010.14395
Zhou, K., et al. (2020). “S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization.” CIKM 2020. arXiv:2008.07873

Time-Aware Models

Li, J., et al. (2020). “Time Interval Aware Self-Attention for Sequential Recommendation.” WSDM 2020. arXiv:2005.07683

Recent Advances

Hou, Y., et al. (2022). “CORE: Simple and Effective Session-based Recommendation within Consistent Representation Space.” SIGIR 2022. arXiv:2204.11067

Recent Surveys

Wang, S., et al. (2021). “A Survey on Session-based Recommender Systems.” ACM Computing Surveys.
Fang, H., et al. (2020). “Deep Learning for Sequential Recommendation.” IEEE TKDE.

Conclusion

Sequential recommendation has evolved from simple Markov chains to sophisticated transformer-based systems that power today’s largest digital platforms. The field continues to advance rapidly, with new architectures, evaluation methods, and applications emerging regularly. Understanding this progression provides crucial insights for both researchers developing new techniques and practitioners implementing production systems.

The next frontier lies in developing more interpretable, fair, and efficient sequential models that can handle the complexity of real-world user behavior while respecting privacy and ensuring equitable recommendations across diverse user populations.

Cited as:


@article{reneejia2025sequential-learning-in-recommendation-systems-from-markov-chains-to-transformers,
  title   = "Sequential Learning in Recommendation Systems: From Markov Chains to Transformers",
  author  = "Renee Jia",
  journal = "renee-jia.github.io",
  year    = "2025",
  url     = "https://renee-jia.github.io/ai-learning-guide/sequential-learning-recommendation-systems/"

}

View Article

Share on

X Facebook LinkedIn Bluesky

Sequential Learning in Recommendation Systems: From Markov Chains to Transformers

Introduction

Problem Definition

Era 1: Statistical Foundations (2000-2010)

Markov Chain Models

Sequential Pattern Mining

Era 2: Neural Network Revolution (2010-2017)

RNN-Based Methods

GRU4Rec (2016) - The Pioneer

Improvements on GRU4Rec

Attention Mechanisms

NARM (2017) - Neural Attentive Recommendation Machine

STAMP (2018) - Short-Term Attention/Memory Priority

Era 3: Transformer Era (2017-2020)

Self-Attention Revolution

SASRec (2018) - Self-Attentive Sequential Recommendation

BERT4Rec (2019) - Bidirectional Sequential Recommendation

Era 4: Advanced Sequential Learning (2020-Present)

Industrial Scale Deployment

BST (2019) - Behavior Sequence Transformer at Alibaba

Multi-Interest Modeling

MIND (2019) - Multi-Interest Network with Dynamic Routing

ComiRec (2020) - Controllable Multi-Interest Framework

Graph-Enhanced Sequential Models

SR-GNN (2019) - Session-based Recommendation with Graph Neural Networks

GC-SAN (2019) - Graph Contextualized Self-Attention Network

Contrastive Learning Approaches

CL4SRec (2022) - Contrastive Learning for Sequential Recommendation

S3-Rec (2020) - Self-Supervised Learning for Sequential Recommendation

Time-Aware Sequential Models

TiSASRec (2020) - Time Interval Aware SASRec

Industry Applications

E-commerce Success Stories

Streaming Platforms

Key Takeaways

Essential Papers to Read

Foundational (Must Read)

Advanced Techniques

Foundational Papers

Neural Network Revolution

Transformer Era

Industrial Applications

Graph-Enhanced Models

Contrastive Learning

Time-Aware Models

Recent Advances

Recent Surveys

Conclusion

Cited as:

Share on

You may also enjoy

Attention Residuals: A Comprehensive Understanding

The Web Is Not a Neutral Environment for Agents

Modeling Long User Histories for Ads Ranking

The Evolution of Reward Hacking and Jailbreak Research in AI