Bonus session on KV Cache, Tooling and WMDP

Efficiency Safety

KV Caching in LLMs:

Retentive Network: A Successor to Transformer for Large Language Models

RWKV: Reinventing RNNs for the Transformer Era

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Must-know tools for training, fine-tuning, and serving LLMs:

  1. torchtune - Built on top of PyTorch, for training and fine-tuning LLMs. Uses YAML-based configs to make experiments easy to run. GitHub -

  2. axolotl - Built on top of the Hugging Face peft and transformers libraries; supports fine-tuning a large number of models such as Mistral and Llama. Provides support for techniques like RLHF, DPO, LoRA, and QLoRA. GitHub

  3. LitGPT - Built on nanoGPT and Megatron; supports pre-training and fine-tuning, with examples such as StarCoder and TinyLlama. GitHub -

  4. MaxText - JAX-based library for training LLMs on Google TPUs, with configs for models such as Gemma, Mistral, and Llama 2. GitHub

  5. LangChain - https://python.langchain.com/docs/get_started/introduction

  6. Haystack (haystack.deepset.ai)
    • https://github.com/deepset-ai/haystack
    • LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it’s best suited for building RAG, question answering, semantic search or conversational agent chatbots.
  7. LlamaIndex
    • https://docs.llamaindex.ai/en/stable/
    • LlamaIndex supports Retrieval-Augmented Generation (RAG). Instead of asking the LLM to generate an answer immediately, LlamaIndex retrieves information from your data sources first, adds it to your question as context, and asks the LLM to answer based on the enriched prompt.
  8. Making Retrieval Augmented Generation Fast
    • https://www.pinecone.io/learn/fast-retrieval-augmented-generation/
  9. OpenMoE
    • https://github.com/XueFuzhao/OpenMoE
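
The three RAG steps described under LlamaIndex above (retrieve, add context, ask) can be illustrated with a toy sketch in plain Python. This is not LlamaIndex's API: the helper names (`retrieve`, `build_prompt`, `answer`), the bag-of-words cosine scoring, the sample documents, and the stub `echo_llm` are all assumptions made for illustration. A real pipeline would use embeddings, a vector store, and an actual model call.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, and strip simple punctuation.
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    # Step 1: rank documents by similarity to the question, keep the best top_k.
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_docs):
    # Step 2: add the retrieved text to the question as context.
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )


def answer(question, documents, llm, top_k=1):
    # Step 3: ask the LLM (any callable taking a prompt string) with the enriched prompt.
    return llm(build_prompt(question, retrieve(question, documents, top_k)))


# Hypothetical sample data and a stub "LLM" that just echoes its prompt.
documents = [
    "A KV cache stores attention keys and values from earlier tokens.",
    "MaxText is a JAX-based library for training LLMs on TPUs.",
    "WMDP is a benchmark for measuring hazardous knowledge in LLMs.",
]


def echo_llm(prompt):
    return prompt


question = "What does a KV cache store?"
top_docs = retrieve(question, documents, top_k=1)
result = answer(question, documents, echo_llm)
print(result)
```

Swapping `echo_llm` for a function that calls a real model, and `retrieve` for an embedding-based search, gives the general shape of what the orchestration frameworks in this list automate.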