Advanced Transformer Architectures

Efficiency

In this session, our readings cover the following:

Required Readings:

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Jamba: A Hybrid Transformer-Mamba Language Model

More Readings:

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Efficient Memory Management for Large Language Model Serving with PagedAttention

Attention Mechanisms in Computer Vision: A Survey