20-30 min read

Sequential Learning in Recommendation Systems: From Markov Chains to Transformers

A comprehensive guide to sequence-based recommendation techniques and key research papers

Introduction

Traditional recommendation systems treat user interactions as independent events, missing the crucial temporal patterns in user behavior. Sequential recommendation systems address this limitation by modeling the order and timing of user interactions to predict what users will engage with next.

This guide traces the evolution of sequential recommendation from early statistical methods to modern transformer-based architectures, highlighting key papers and techniques that have shaped the field.

Problem Definition

Sequential Recommendation Task: Given a user’s historical interaction sequence $S_u = {s_1, s_2, …, s_t}$ ordered by timestamp, predict the next item $s_{t+1}$ that the user will interact with.

Key Challenges:

  • Capturing both short-term and long-term user preferences
  • Handling sparse and variable-length sequences
  • Modeling temporal dynamics and seasonal patterns
  • Scaling to millions of users and items

Era 1: Statistical Foundations (2000-2010)

Markov Chain Models

Core Idea: Model sequential dependencies using probabilistic state transitions.

First-order Markov Chain:

P(s_{t+1} | s_1, s_2, ..., s_t) = P(s_{t+1} | s_t)

Key Papers:

  • “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data” (Srivastava et al., 2000) - Foundational work on web usage patterns
  • “Factorizing personalized Markov chains for next-basket recommendation” (Rendle et al., WWW 2010) - Combined matrix factorization with Markov chains

Limitations:

  • Strong independence assumptions
  • Limited capacity for long-term dependencies
  • Sparsity issues with higher-order models

Sequential Pattern Mining

Approach: Discover frequent sequential patterns in user behavior data.

Key Papers:

  • “Sequential Pattern Mining: A Survey” (Fournier-Viger et al., 2017) - Comprehensive survey of pattern mining techniques

Advantages: Interpretable patterns, efficient for specific domains Limitations: Hard to generalize, difficulty with continuous features

Era 2: Neural Network Revolution (2010-2017)

RNN-Based Methods

GRU4Rec (2016) - The Pioneer

Paper: “Session-based Recommendations with Recurrent Neural Networks” (Hidasi et al., ICLR 2016) Code: GitHub

Key Contributions:

  • First successful application of RNNs to session-based recommendation
  • Session-parallel mini-batches for efficient training
  • Ranking-based loss functions (BPR, TOP1)
  • 35% improvement over item-to-item recommendations

Technical Innovation:

class GRU4Rec(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)
    
    def forward(self, input_seq, hidden):
        output, hidden = self.gru(input_seq, hidden)
        prediction = self.linear(output[:, -1, :])
        return prediction, hidden

Improvements on GRU4Rec

“Improved Recurrent Neural Networks for Session-based Recommendation” (Tan et al., 2016)

  • Data augmentation techniques
  • Better loss function comparisons
  • Improved handling of data sparsity

“Recurrent Neural Networks with Top-k Gains” (Hidasi & Karatzoglou, 2018)

  • 35% better performance through improved loss functions
  • Better negative sampling strategies

Attention Mechanisms

NARM (2017) - Neural Attentive Recommendation Machine

Paper: “Neural Attentive Session-based Recommendation” (Li et al., CIKM 2017) Code: GitHub

Key Innovation: Combined global and local attention mechanisms

  • Global encoder: captures overall session context
  • Local encoder: focuses on recent interactions
  • Attention mechanism: dynamically weighs global vs local information

STAMP (2018) - Short-Term Attention/Memory Priority

Paper: “STAMP: Short-Term Attention/Memory Priority Model for Session-based Recommendation” (Liu et al., KDD 2018) Code: GitHub

Key Features:

  • Separates long-term and short-term interests
  • Attention mechanism for current session intent
  • Memory network for general user preferences
  • Addresses interest drift in long sessions

Era 3: Transformer Era (2017-2020)

Self-Attention Revolution

SASRec (2018) - Self-Attentive Sequential Recommendation

Paper: “Self-Attentive Sequential Recommendation” (Kang & McAuley, ICDM 2018) Code: GitHub

Breakthrough Contributions:

  • First successful application of self-attention to sequential recommendation
  • Unidirectional attention to prevent information leakage
  • Position embeddings for temporal information
  • Parallelizable training unlike RNN-based methods

Architecture Highlights:

class SASRec(nn.Module):
    def __init__(self, item_size, hidden_size, num_heads, num_layers):
        self.item_embedding = nn.Embedding(item_size, hidden_size)
        self.pos_embedding = nn.Embedding(max_len, hidden_size)
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(hidden_size, num_heads)
            for _ in range(num_layers)
        ])
        self.output_layer = nn.Linear(hidden_size, item_size)
    
    def forward(self, seq):
        positions = torch.arange(len(seq))
        x = self.item_embedding(seq) + self.pos_embedding(positions)
        
        for transformer in self.transformer_blocks:
            x = transformer(x)
        
        return self.output_layer(x[:, -1, :])

Performance: Significant improvements over RNN-based methods across multiple datasets

BERT4Rec (2019) - Bidirectional Sequential Recommendation

Paper: “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer” (Sun et al., CIKM 2019) Code: GitHub

Key Innovations:

  • Bidirectional self-attention (unlike SASRec’s unidirectional approach)
  • Cloze task training: predicting masked items instead of next-item
  • Better representation learning through bidirectional context
  • More training samples from masking strategy

Training Objective:

def cloze_loss(model, sequence):
    masked_seq, targets = random_mask(sequence)
    predictions = model(masked_seq)
    loss = F.cross_entropy(predictions[masked_positions], targets)
    return loss

Performance: Superior performance on dense datasets with longer sequences

Era 4: Advanced Sequential Learning (2020-Present)

Industrial Scale Deployment

BST (2019) - Behavior Sequence Transformer at Alibaba

Paper: “Behavior Sequence Transformer for E-commerce Recommendation in Alibaba” (Chen et al., DLP-KDD 2019) Code: GitHub

Industrial Innovation:

  • Successfully deployed at Taobao with significant CTR improvements
  • Handles heterogeneous behavior types (clicks, purchases, cart additions)
  • Incorporates side information beyond item IDs
  • Optimized for large-scale production deployment

Key Features:

  • Multi-behavior modeling in unified framework
  • Integration with existing recommendation pipelines
  • Real-time inference capabilities
  • Significant business impact in production

Multi-Interest Modeling

MIND (2019) - Multi-Interest Network with Dynamic Routing

Paper: “Multi-Interest Network with Dynamic Routing for Recommendation at Tmall” (Li et al., KDD 2019)

Core Insight: Users have multiple concurrent interests that evolve independently

Technical Approach:

  • Dynamic routing mechanism to separate different interests
  • Multiple representation vectors per user
  • Capsule network architecture for interest extraction

ComiRec (2020) - Controllable Multi-Interest Framework

Paper: “Controllable Multi-Interest Framework for Recommendation” (Cen et al., KDD 2020)

Features:

  • Controllable interest extraction with explicit constraints
  • Multi-interest sequential modeling with attention
  • Better interpretability through interest visualization

Graph-Enhanced Sequential Models

SR-GNN (2019) - Session-based Recommendation with Graph Neural Networks

Paper: “Session-based Recommendation with Graph Neural Networks” (Wu et al., AAAI 2019)

Innovation: Combines sequential patterns with graph structures

  • Models sessions as directed graphs
  • Graph neural networks capture complex item transitions
  • Integrates global graph information with local sequential patterns

GC-SAN (2019) - Graph Contextualized Self-Attention Network

Paper: “Graph Contextualized Self-Attention Network for Session-based Recommendation” (Xu et al., IJCAI 2019)

Approach: Combines self-attention with graph convolution for session modeling

Contrastive Learning Approaches

CL4SRec (2022) - Contrastive Learning for Sequential Recommendation

Paper: “Contrastive Learning for Sequential Recommendation” (Xie et al., ICDE 2022)

Key Techniques:

  • Data augmentation strategies: crop, mask, reorder operations
  • Contrastive learning objectives for better representations
  • Self-supervised learning without additional labels

Augmentation Example:

class SequenceAugmentation:
    def crop(self, sequence):
        length = len(sequence)
        crop_length = int(length * self.crop_ratio)
        start_idx = random.randint(0, length - crop_length)
        return sequence[start_idx:start_idx + crop_length]
    
    def mask(self, sequence):
        masked_seq = sequence.copy()
        mask_num = int(len(sequence) * self.mask_ratio)
        mask_indices = random.sample(range(len(sequence)), mask_num)
        for idx in mask_indices:
            masked_seq[idx] = 0  # mask token
        return masked_seq

S3-Rec (2020) - Self-Supervised Learning for Sequential Recommendation

Paper: “S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization” (Zhou et al., CIKM 2020)

Self-supervised Tasks:

  • Masked Item Prediction
  • Masked Attribute Prediction
  • Subsequence Prediction
  • Sequence Order Prediction

Time-Aware Sequential Models

TiSASRec (2020) - Time Interval Aware SASRec

Paper: “Time Interval Aware Self-Attention for Sequential Recommendation” (Li et al., WSDM 2020)

Temporal Innovation:

  • Explicit time interval modeling beyond positional encoding
  • Time-aware self-attention with interval dependencies
  • Better handling of irregular time gaps

Time-aware Attention:

class TimeAwareAttention(nn.Module):
    def __init__(self, config):
        self.time_embedding = nn.Embedding(config.max_time_interval, config.hidden_size)
        self.attention = nn.MultiheadAttention(config.hidden_size, config.num_heads)
    
    def forward(self, sequence_embeddings, time_intervals):
        time_embs = self.time_embedding(time_intervals)
        temporal_sequence = sequence_embeddings + time_embs
        attended_sequence, _ = self.attention(
            temporal_sequence, temporal_sequence, temporal_sequence
        )
        return attended_sequence

Industry Applications

E-commerce Success Stories

  • Amazon: Product-to-product recommendations, session-based real-time prediction
  • Alibaba: DIEN and BST deployed at scale with significant CTR improvements
  • Taobao: Multi-behavior sequential modeling in production

Streaming Platforms

  • Netflix: Next episode prediction, binge-watching behavior modeling
  • Spotify: Playlist continuation, skip prediction, session-based radio
  • YouTube: Video sequence modeling for personalized recommendations

Key Takeaways

Evolution Summary:

  1. Statistical Era: Markov chains provided foundational concepts but had limited expressiveness
  2. Neural Era: RNNs and attention mechanisms enabled complex pattern learning
  3. Transformer Era: Self-attention achieved state-of-the-art performance with parallelizable training
  4. Modern Era: Multi-interest, graph-enhanced, and contrastive learning approaches address real-world complexity

Current State:

  • Transformer-based architectures dominate the field
  • Industrial deployments show 10-30% improvements in key metrics
  • Multi-interest and contrastive learning are active research areas
  • Large language model integration is emerging

Future Outlook: The field is moving toward more sophisticated understanding of user behavior, better handling of multi-modal data, and increased focus on fairness and privacy. The integration of causal reasoning and federated learning approaches will likely define the next phase of sequential recommendation research.

Essential Papers to Read

Foundational (Must Read)

  1. GRU4Rec (Hidasi et al., 2016) - Started the deep learning revolution
  2. SASRec (Kang & McAuley, 2018) - Introduced self-attention to sequential recommendation
  3. BERT4Rec (Sun et al., 2019) - Bidirectional context learning breakthrough

Advanced Techniques

  1. BST (Chen et al., 2019) - Industrial scale deployment insights
  2. CL4SRec (Xie et al., 2022) - Contrastive learning for better representations
  3. MIND (Li et al., 2019) - Multi-interest modeling pioneer

Foundational Papers

  • Srivastava, J., et al. (2000). “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data.” ACM Computing Surveys.
  • Rendle, S., et al. (2010). “Factorizing personalized Markov chains for next-basket recommendation.” WWW 2010.
  • Fournier-Viger, P., et al. (2017). “Sequential Pattern Mining: A Survey.” ACM Computing Surveys.

Neural Network Revolution

  • Hidasi, B., et al. (2016). “Session-based Recommendations with Recurrent Neural Networks.” ICLR 2016. arXiv:1511.06939
  • Tan, Y. K., et al. (2016). “Improved Recurrent Neural Networks for Session-based Recommendation.” RecSys 2016.
  • Hidasi, B., & Karatzoglou, A. (2018). “Recurrent Neural Networks with Top-k Gains.” RecSys 2018.
  • Li, J., et al. (2017). “Neural Attentive Session-based Recommendation.” CIKM 2017. arXiv:1711.04725
  • Liu, Q., et al. (2018). “STAMP: Short-Term Attention/Memory Priority Model for Session-based Recommendation.” KDD 2018.

Transformer Era

  • Kang, W. C., & McAuley, J. (2018). “Self-Attentive Sequential Recommendation.” ICDM 2018. arXiv:1808.09781
  • Sun, F., et al. (2019). “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer.” CIKM 2019. arXiv:1904.06690

Industrial Applications

  • Chen, Q., et al. (2019). “Behavior Sequence Transformer for E-commerce Recommendation in Alibaba.” DLP-KDD 2019. arXiv:1905.06874
  • Li, C., et al. (2019). “Multi-Interest Network with Dynamic Routing for Recommendation at Tmall.” KDD 2019. arXiv:1904.08030
  • Cen, Y., et al. (2020). “Controllable Multi-Interest Framework for Recommendation.” KDD 2020. arXiv:2005.09347

Graph-Enhanced Models

  • Wu, S., et al. (2019). “Session-based Recommendation with Graph Neural Networks.” AAAI 2019. arXiv:1811.00855
  • Xu, C., et al. (2019). “Graph Contextualized Self-Attention Network for Session-based Recommendation.” IJCAI 2019.

Contrastive Learning

  • Xie, X., et al. (2022). “Contrastive Learning for Sequential Recommendation.” ICDE 2022. arXiv:2010.14395
  • Zhou, K., et al. (2020). “S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization.” CIKM 2020. arXiv:2008.07873

Time-Aware Models

  • Li, J., et al. (2020). “Time Interval Aware Self-Attention for Sequential Recommendation.” WSDM 2020. arXiv:2005.07683

Recent Advances

  • Hou, Y., et al. (2022). “CORE: Simple and Effective Session-based Recommendation within Consistent Representation Space.” SIGIR 2022. arXiv:2204.11067

Recent Surveys

  • Wang, S., et al. (2021). “A Survey on Session-based Recommender Systems.” ACM Computing Surveys.
  • Fang, H., et al. (2020). “Deep Learning for Sequential Recommendation.” IEEE TKDE.

Conclusion

Sequential recommendation has evolved from simple Markov chains to sophisticated transformer-based systems that power today’s largest digital platforms. The field continues to advance rapidly, with new architectures, evaluation methods, and applications emerging regularly. Understanding this progression provides crucial insights for both researchers developing new techniques and practitioners implementing production systems.

The next frontier lies in developing more interpretable, fair, and efficient sequential models that can handle the complexity of real-world user behavior while respecting privacy and ensuring equitable recommendations across diverse user populations.


Cited as:

@article{reneejia2025sequential-learning-in-recommendation-systems-from-markov-chains-to-transformers, title = "Sequential Learning in Recommendation Systems: From Markov Chains to Transformers", author = "Renee Jia", journal = "renee-jia.github.io", year = "2025", url = "https://renee-jia.github.io/ai-learning-guide/sequential-learning-recommendation-systems/" }
View Article