The fourth version of the Efficient Natural Language and Speech Processing (ENLSP-IV) workshop will focus on how to make large language and foundation models more efficient in terms of Architecture, Training, and Inference in their real-world applications. This year, following the trend in industry and academia, we place more emphasis on investigating new architectures to make future language and foundation models more efficient. Moreover, we highlight the importance of comprehensively evaluating and benchmarking new efficient models from different practical aspects.
The workshop program offers an interactive platform that gathers experts and talent from academia and industry through invited talks, a panel discussion, paper submissions and reviews, interactive poster sessions, oral presentations, and mentorship sessions for new researchers.
This will be a unique opportunity to discuss and share challenging problems, build connections, exchange ideas, brainstorm, and foster future collaborations. The topics of this workshop can be of interest to people working on general machine learning, deep learning, hardware, optimization, theory, and applications.
Overview
As large language models (e.g. GPT-3, GPT-4, Llama 3, PaLM, Gemini, and PanGu-Σ), pre-trained speech models (e.g. wav2vec, HuBERT, WavLM, Whisper, Conformer-1 and Conformer-2) and other foundation models (e.g. GPT-4o and Stable Diffusion) advance rapidly and become more prominent and widespread, improving their efficiency becomes increasingly crucial.
While it is true that computational power and GPU resources have played a significant role in the success of these models, we should also be aware that using more computational resources (a) increases the cost of training and deploying such models, (b) makes the models less accessible, (c) limits contributions from the broader research community, and (d) increases the environmental footprint of the models. Moreover, most of these pre-trained models are heavily over-parameterized, and their efficiency is under question. This lack of efficiency can severely limit the application of these advanced models in practice.
Building upon the framework of our previous three editions, this workshop remains dedicated to investigating solutions for enhancing the efficiency of pre-trained language and foundation models, while introducing fresh and important new topics to the community and encouraging contributions on them.
Just to highlight a few: (1) Despite the ubiquitous use of Transformers, they suffer from quadratic computational complexity, which limits their efficiency, especially for longer sequence lengths. Should we improve the efficiency of Transformers (e.g. as in Hedgehog and Gated Linear Attention) or look for other architectures (e.g. Mamba, Jamba, RWKV, xLSTM, and SSMs)? (2) For accelerating training, we have seen the significant impact of hardware-efficient implementations such as FlashAttention. Should we focus more on these hardware-aware solutions or more on new/improved architectures?
(3) For efficient inference, there are solutions such as: Speculative Decoding [Link1] [Link2], whose performance is strongly model- and task-dependent and which requires the draft and target models to share the same vocabulary (tokenizer); improved KV-caching (e.g. [Link]), which offers only a limited speed-up; and many-in-one models such as SortedNet, MatFormer, and LayerSkip, whose sub-models underperform their individually trained counterparts (a minimal sketch of speculative decoding is given after this list).
(4) While there are many so-called efficient solutions in the literature, there is no fair, comprehensive, and practical evaluation of these models or comparison among them. For example, we do not know the extent of hallucination in the new architectures versus Transformers (e.g. in [Link]).
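To make item (3) above concrete, below is a minimal, self-contained Python sketch of greedy speculative decoding. The `draft_model` and `target_model` callables are hypothetical stand-ins for real language models sharing one tokenizer, and the greedy accept/replace rule shown here is a simplification of the sampling-based verification used in practice.

```python
# Minimal sketch of greedy speculative decoding, assuming toy `draft_model` and
# `target_model` callables that map a token sequence to the next token.
def speculative_decode(prefix, draft_model, target_model, num_draft=4, max_new=32):
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # 1) Draft: the cheap model proposes `num_draft` tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(num_draft):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify: the target model checks each drafted position (one batched
        #    pass in practice; a loop here for clarity), keeps the longest prefix
        #    it agrees with, and appends its own correction token on a mismatch.
        accepted, ctx = [], list(out)
        for t in draft:
            expected = target_model(ctx)
            if expected != t:
                accepted.append(expected)   # target's own token replaces the miss
                break
            accepted.append(t)
            ctx.append(t)
        out.extend(accepted)
    return out

# Toy usage: a "bigram" target model and a draft model that agrees most of the time.
target = lambda seq: (seq[-1] + 1) % 50
draft  = lambda seq: (seq[-1] + 1) % 50 if len(seq) % 5 else 0
print(speculative_decode([1, 2], draft, target, max_new=10))
```

The speed-up comes from the target model validating several drafted tokens per call instead of generating one token at a time; the cost of a mismatch is bounded because the target's own token is always kept.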
Call for Papers
Investing in the future of language and foundation models requires a concrete effort to enhance their efficiency across multiple dimensions (including architecture, training, and inference) and to establish a comprehensive evaluation framework.
To encourage engagement from the NeurIPS community, we present several active research topics in this field that invite participation and contributions. The scope of this workshop includes, but is not limited to, the following topics:
Efficient Architectures: Proposing alternative architectures that are more efficient than Transformers (in terms of computational complexity, memory footprint, or handling longer sequence lengths) or modifying Transformer architectures to make them more efficient
- Linear and sub-quadratic Transformers, sparse-attention Transformers
- New architectures for LLMs and foundation models and their scalability
- Evaluation and benchmarking of new architectures (fair comparison of different models)
- Long sequence modeling
- Dense vs. sparse architectures (MoEs)
Efficient Training: How can we reduce the cost of pre-training or fine-tuning new models?
- More efficient pre-training solutions, from better initialization and hyper-parameter tuning to better optimization which lowers the cost of pre-training
- Parameter efficient fine-tuning (PEFT) solutions for large pre-trained models
- Efficient instruction tuning, prompt engineering and in-context learning
- Hardware-aware solutions (e.g. better CUDA kernels), memory read/write aware solutions
- Data-efficient training, reducing the requirement for labeled data, data compression and distillation
Efficient Inference: How can we reduce the cost of inference for LLMs and foundation models?
- Improved speculative sampling for LLMs, self-speculative sampling, selecting among multiple drafts, one draft model for different heterogeneous target models
- Neural model compression techniques such as quantization, pruning, and knowledge distillation
- Improved KV-caching solutions for Transformers
- Distributed inference of large pre-trained models
- Serving many target devices with one model, many-in-one models, early exiting, elastic networks
Evaluation and Benchmarking of Efficient Models: Introducing new efficient solutions underscores the need for comprehensive benchmarks to accurately evaluate their efficacy and performance.
- Datasets, benchmarks, leaderboards for evaluating efficient models
- Benchmarking the performance of efficient models from different perspectives such as reasoning, hallucination, understanding, and generation quality
- Benchmarking efficiency of models in terms of their memory footprint, training time, inference time, different target hardware devices and inference platforms (e.g. GPU vs. CPU)
Efficient Solutions in other Modalities and Applications
- Efficiency of foundation or pre-trained models in multi-modal setups and other modalities (beyond NLP and speech) such as biology, chemistry, computer vision, and time series
- Efficient representations (e.g. Matryoshka representation) and models in dense retrieval and search
- Efficient Federated learning, lower communication costs, tackling heterogeneous data and models
- Efficient graph and LLM joint learning
Submission Instructions
You are invited to submit your papers through our CMT submission portal (Link). All submitted papers must be anonymized for double-blind review. We expect each paper to be reviewed by at least three reviewers. The content of the paper (excluding the references and supplementary materials) should not exceed 8 pages for Long Papers and 4 pages for Short Papers, strictly following the NeurIPS template style (Link). Please be advised that the NeurIPS submission checklist is not needed for our workshop submissions.
Authors can submit up to 100 MB of supplementary materials separately. Authors are highly encouraged to submit their code for reproducibility purposes. According to the NeurIPS workshop guidelines, already published papers are discouraged from submission, but you may submit arXiv papers or papers currently under review (for example, NeurIPS submissions can be submitted concurrently to workshops). Moreover, a work that is presented at the main NeurIPS conference should not appear in a workshop. Please make sure to indicate the complete list of conflicts of interest for all the authors of your paper. To encourage higher-quality submissions, our sponsors are offering Best Paper and Best Poster Awards to outstanding original oral and poster presentations (upon nomination by the reviewers). Bear in mind that our workshop is not archival, but the accepted papers will be hosted on the workshop website. Moreover, we are currently negotiating with a publisher to host opt-in accepted papers in a special proceedings issue for our workshop.
Important Dates:
- Special NeurIPS Fast Track Submission Deadline: September 30, 2024, Anywhere on Earth (AOE)
- Submission Deadline: September 15, 2024, Anywhere on Earth (AOE)
- Acceptance Notification: October 09, 2024 (AOE)
- Camera-Ready Submission: October 25, 2024 (AOE)
- Workshop Date: December 14, 2024
Keynote Speakers
Danqi Chen
Princeton
Bhavana Dalvi
Allen Institute for AI
Weizhu Chen
Microsoft
Tri Dao
Princeton/Together AI
Hannaneh Hajishirzi
University of Washington
Navdeep Jaitly
Apple
Lili Mou
University of Alberta
Panelists
Marjan Ghazvininejad
Meta
Joel Hestness
Cerebras
Navdeep Jaitly
Apple
Katie Derthick
Microsoft
Schedule
Title: (KeyNote Talk) Efficiency through Learning from Experience
Presenter: Dr. Bhavana Dalvi Mishra
Bio: Dr. Bhavana Dalvi Mishra is a Lead Research Scientist at the Allen Institute for AI (Ai2). Her research focuses on NLP, interactive reasoning, and scientific discovery. She obtained her Ph.D. in Computer Science from Carnegie Mellon University in 2015 and earned her Master's in Computer Science from the Indian Institute of Technology, Bombay in 2007. She has received several awards, including two Best Paper runner-up awards, a Google Ph.D. Fellowship, and the Barbara Lazarus Women@IT Fellowship from CMU for her contributions to NLP and AI.
Abstract: Despite the physiological limitations of the human brain, humans are remarkably efficient thinkers, in large part because they can learn from experience, allowing them to avoid prior reasoning errors and quickly jump to conclusions that previously took substantial effort. Similarly, language models (LMs) can rapidly improve their inference-time efficiency through inference-time learning, supplementing lower-level methods like fast decoding and caching. I'll describe two agent-based systems (CLIN and SSO) that do this, using an external RAG (retrieval-augmented generation) memory to help the agent navigate a complex, virtual environment. Unlike typical RAG systems, the memory is dynamic and updated after each task (including forgetting unhelpful learnings). In addition, unlike reinforcement-based continual learning techniques, these systems rapidly learn from just a handful of examples by exploiting LMs to conjecture useful generalizations of past experiences. I'll outline three critical activities in this process - what to remember, how to index those memories, and how to retrieve from that index - and how those choices impact the effectiveness of the resulting agent. While this notion of efficiency is a little different from foundational architectural considerations, I'll show that it is nonetheless powerful, and an important additional tool in the toolbox for efficient future applications.
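As a rough illustration of the dynamic-memory idea described in the abstract (and not the actual CLIN/SSO implementation), the toy Python class below stores learnings after each task, retrieves them by simple word overlap, and supports forgetting entries that later prove unhelpful.

```python
# A toy sketch of a dynamic retrieval memory: learnings are added after each
# task, retrieved by word overlap, and unhelpful ones can be forgotten.
# The class name and scoring rule are illustrative placeholders only.
class ExperienceMemory:
    def __init__(self):
        self.entries = []  # list of (task_description, learning) pairs

    def remember(self, task, learning):
        self.entries.append((task, learning))

    def forget(self, predicate):
        # e.g. drop learnings that later proved unhelpful
        self.entries = [(t, l) for t, l in self.entries if not predicate(t, l)]

    def retrieve(self, query, k=3):
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), t, l) for t, l in self.entries]
        scored.sort(reverse=True)
        return [l for s, t, l in scored[:k] if s > 0]

mem = ExperienceMemory()
mem.remember("open the locked door", "look for a key in drawers first")
mem.remember("boil water", "the stove must be turned on before heating")
print(mem.retrieve("how to open a door that is locked"))
```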
Title: (KeyNote Talk) Multi-Teacher Distillation: An Ensemble-Then-Distill Approach
Presenter: Prof. Lili Mou
Bio: Dr. Lili Mou is an Assistant Professor in the Department of Computing Science, University of Alberta. He is also an Alberta Machine Intelligence Institute (Amii) Fellow and a Canada CIFAR AI (CCAI) Chair. Lili received his BS and PhD degrees in 2012 and 2017, respectively, from the School of EECS, Peking University. After that, he worked as a postdoctoral fellow at the University of Waterloo. His research interests mainly lie in designing novel machine learning algorithms and frameworks for NLP. He has publications at top conferences and journals, including ACL, EMNLP, TACL, ICML, ICLR, and NeurIPS. He also presented tutorials at EMNLP'19 and ACL'20. He received an AAAI New Faculty Highlight Award in 2021.
Abstract: Knowledge distillation (KD) aims to transfer the knowledge in a large model (called a teacher) into a small one (called a student), and has become an emerging research topic as the sizes of deep learning models keep growing. Today, there are abundant readily available large models, such as ChatGPT, LLaMA, and T5. It then becomes natural to ask: Can we distill the knowledge from multiple teachers? At first glance, it appears easy to perform multi-teacher KD, as we can simply train the student from the union of teachers' predictions. However, I would argue that such a naïve attempt may not work well for multi-teacher KD. This is because traditional KD adopts the cross-entropy loss, which tends to yield a smooth distribution. In this talk, I will present a novel ensemble-then-distill approach, which builds an ensemble of teacher models to train the student. I will also discuss applications to text generation and syntactic parsing.
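A minimal numpy sketch of the ensemble-then-distill idea, under the assumption that the teachers' next-token distributions are simply averaged into one target (the actual ensembling rule in the talk may differ): the student is trained against a single ensemble distribution rather than the union of teacher predictions.

```python
# Combine several teachers' next-token distributions into one ensemble target,
# then compute the student's distillation loss against that single target.
# Uniform averaging is a placeholder for the ensembling rule.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_then_distill_loss(teacher_logits, student_logits):
    # teacher_logits: (num_teachers, vocab), student_logits: (vocab,)
    ensemble = softmax(teacher_logits).mean(axis=0)       # ensemble of teachers
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -(ensemble * log_p_student).sum()              # cross-entropy to the ensemble

teachers = np.random.randn(3, 10)   # three toy teachers, vocabulary of 10
student = np.random.randn(10)
print(ensemble_then_distill_loss(teachers, student))
```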
Title: (KeyNote Talk) Hardware-aware Algorithms for Language Modeling
Presenter: Prof. Tri Dao
Bio: Tri Dao is an Assistant Professor at Princeton University and chief scientist of Together AI. He completed his PhD in Computer Science at Stanford. He works at the intersection of machine learning and systems, and his research interests include sequence models with long-range memory and structured matrices for compact deep learning models. His work has received the COLM 2024 Outstanding Paper Award and the ICML 2022 Outstanding Paper runner-up award.
Abstract: Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We describe recent progress on subquadratic-time architectures such as structured state space models (SSMs). We identify that a key weakness of such models is their inability to perform content-based reasoning, and propose a selection mechanism to address this shortcoming. Though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks. The resulting architecture (Mamba and Mamba-2) matches or exceeds the performance of strong modern Transformers on language modeling, validated at 3B scales on both pretraining and downstream evaluation, while enjoying 5x higher inference throughput and linear scaling in sequence length. Hybridizing Mamba layers with 2-4 attention layers leads to state-of-the-art models, excelling at long context and fast inference.
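For intuition, here is a toy numpy caricature of a selective state-space recurrence, where the transition and input gates depend on the current input; it is a diagonal, sequential sketch for illustration only, not the Mamba architecture or its hardware-aware scan.

```python
# A selective state-space recurrence: unlike a fixed linear time-invariant SSM,
# the gates (a_t, b_t) are functions of the input, which is the "selection" idea.
import numpy as np

def selective_ssm_scan(x, w_a, w_b, w_c):
    d = x.shape[1]
    h = np.zeros(d)
    ys = []
    for x_t in x:                                    # sequential scan: O(L) time, O(1) state
        a_t = 1.0 / (1.0 + np.exp(-(x_t * w_a)))     # input-dependent decay in (0, 1)
        b_t = x_t * w_b                              # input-dependent input gate
        h = a_t * h + b_t * x_t                      # recurrent state update
        ys.append(w_c * h)                           # readout
    return np.stack(ys)

L, d = 6, 4
x = np.random.randn(L, d)
w_a, w_b, w_c = np.random.randn(d), np.random.randn(d), np.random.randn(d)
print(selective_ssm_scan(x, w_a, w_b, w_c).shape)    # (6, 4)
```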
Title: (KeyNote Talk) Speech generative modeling with little tokenization
Presenter: Dr. Navdeep Jaitly
Bio: Navdeep Jaitly is a Research Scientist at Apple Machine Learning Research (MLR), where he leads a team of researchers working on fundamental techniques in machine learning with an emphasis on speech and language. He received his PhD from the University of Toronto under the supervision of Geoffrey Hinton in the foundational days of deep learning. During a PhD internship at Google in 2011, he demonstrated how deep neural networks would revolutionize speech recognition, replacing the HMM systems in use at the time. After his PhD, he joined Google Brain, working on sequence models and techniques such as Listen, Attend and Spell, Adversarial Autoencoders, and Pointer Networks. He has also held machine learning research positions at Nvidia, Google Brain Robotics (initiating robotic ping pong), D. E. Shaw, and the National Labs.
Abstract: It is well accepted now that speech needs to be tokenized before it can be modeled with transformer-based generative models. In fact, there is a rich body of intricate work using semantic and other acoustic tokens for speech modeling. In this talk we show how tokenization may not be necessary and that, indeed, a simple way of discretizing Mel-spectrograms (which we call d-Mel) is enough to build generative models with transformers. We show how we can build conditional generative models of speech (text-to-speech) using d-Mel and transformer-based models. We also demonstrate that the same technique can be applied to multi-modal generation of speech conditioned on text and video. It is our hope that this leads to more exploration on minimal preprocessing of speech for use in generative modeling.
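A hedged numpy sketch of the "discretize the Mel-spectrogram directly" idea: each mel channel value is bucketed into a small number of uniform bins so that a frame becomes a tuple of integers a transformer can model. The bin count and dB range below are arbitrary placeholders, not the d-Mel specification.

```python
# Bucket log-mel values into uniform integer bins per channel.
import numpy as np

def discretize_mel(mel, num_bins=16, lo=-80.0, hi=0.0):
    # mel: (frames, channels) log-mel values in dB, clipped to [lo, hi]
    clipped = np.clip(mel, lo, hi)
    scaled = (clipped - lo) / (hi - lo)                  # -> [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)

mel = np.random.uniform(-80, 0, size=(100, 80))          # 100 frames, 80 mel channels
tokens = discretize_mel(mel)
print(tokens.shape, tokens.min(), tokens.max())          # (100, 80) 0 15
```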
Title: (KeyNote Talk) Optimizing Data Use for Efficient Pre-training
Presenter: Prof. Danqi Chen
Bio: Danqi Chen is an assistant professor of Computer Science at Princeton University and co-leads the Princeton NLP group. She is also an Associate Director of Princeton Language and Intelligence. Her recent research focuses on training, adapting and understanding large language models, especially with the goal of making them more accessible to academia. Before joining Princeton, Danqi was a visiting scientist at Facebook AI Research. She received her Ph.D. from Stanford University (2018) and her B.E. from Tsinghua University (2012), both in Computer Science. Her research was recognized by a Sloan Fellowship, an NSF CAREER award, a Samsung AI Researcher of the Year award, and outstanding paper awards from ACL and EMNLP.
Abstract: Training large language models relies heavily on the quality and composition of data, yet optimizing data selection and utilization remains a significant challenge in the field. In this talk, I will outline several key ideas to enhance training efficiency through better data use and cover several findings from my lab on selecting high-quality datasets and optimizing data compositions. I will also introduce a simple yet powerful pre-training approach that conditions on meta-data information associated with training data. This approach is remarkably straightforward to implement, incurs minimal computational overhead, and yields significant efficiency gains.
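One plausible reading of metadata-conditioned pre-training, sketched below with an invented tag format (the talk's actual recipe may differ): each training document is simply prefixed with a short metadata string, such as its source domain, so the model can associate style and quality with that context.

```python
# Prefix each training example with a metadata tag; the tag format is made up
# for illustration and is not taken from the talk.
def build_example(doc_text, source_domain, use_metadata=True):
    prefix = f"<meta source={source_domain}> " if use_metadata else ""
    return prefix + doc_text

batch = [
    build_example("The mitochondria is the powerhouse of the cell.", "en.wikipedia.org"),
    build_example("u wont believe this trick!!!", "spamsite.example"),
]
print(batch[0])
```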
Title: (Spotlight 1) Sparsified State-Space Models are Efficient Highway Networks
Presenter: Woomin Song
Authors: Woomin Song (KAIST), Jihoon Tack (KAIST), Sangwoo Mo (University of Michigan), Seunghyuk Oh (KAIST), Jinwoo Shin (KAIST)
Abstract: State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences.
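For intuition only, the toy numpy sketch below prunes a shrinking fraction of tokens at higher layers using a placeholder importance score (an L2 norm); Simba's actual criterion accumulates local recurrences to measure each token's global impact on the output.

```python
# Hierarchical token pruning: higher layers keep a smaller fraction of tokens.
import numpy as np

def prune_tokens(hidden, layer_idx, num_layers, min_keep=0.25):
    # hidden: (tokens, dim); keep ratio shrinks linearly with depth
    keep_ratio = 1.0 - (1.0 - min_keep) * layer_idx / max(num_layers - 1, 1)
    k = max(1, int(round(keep_ratio * hidden.shape[0])))
    scores = np.linalg.norm(hidden, axis=1)          # placeholder importance score
    keep = np.sort(np.argsort(-scores)[:k])          # top-k tokens, original order
    return hidden[keep], keep

h = np.random.randn(16, 8)
for layer in range(4):
    h, kept = prune_tokens(h, layer, num_layers=4)
    print(f"layer {layer}: kept {h.shape[0]} tokens")
```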
Title: (Spotlight 2) Longhorn: State Space Models are Amortized Online Learners
Presenter: Bo Liu
Authors: Bo Liu (University of Texas at Austin), Rui Wang (Helixon), Lemeng Wu (University of Texas at Austin), Yihao Feng (University of Texas at Austin), Peter Stone (University of Texas at Austin and Sony AI), Qiang Liu (UT Austin)
Abstract: Modern large language models are built on sequence modeling via next-token prediction. While the Transformer remains the dominant architecture for sequence modeling, its quadratic decoding complexity in sequence length poses a major limitation. State-space models (SSMs) present a competitive alternative, offering linear decoding efficiency while maintaining parallelism during training. However, most existing SSMs rely on linear recurrence designs that appear somewhat ad hoc. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from solving these objectives. Based on this insight, we introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem. Our experimental results show that Longhorn outperforms state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks, language modeling, and vision tasks. Specifically, Longhorn achieves a 1.8x improvement in sample efficiency compared to Mamba, and can extrapolate over contexts that are up to 16x longer during inference.
Title: (Spotlight 3) GEAR: An Efficient Error Reduction Framework for KV Cache Compression in LLM Inference
Presenter: Hao Kang
Authors: Hao Kang (Georgia Institute of Technology), Qingru Zhang (Georgia Institute of Technology)*, Souvik Kundu (Intel Labs), Geonhwa Jeong (Georgia Institute of Technology), Zaoxing Liu (University of Maryland), Tushar Krishna (Georgia Institute of Technology), Tuo Zhao (Georgia Tech)
Abstract: Key-value (KV) caching has become the de-facto technique to accelerate generation speed for large language model (LLM) inference. However, the growing cache demand with increasing sequence length has transformed LLM inference into a memory-bound problem, significantly constraining system throughput. Existing methods rely on dropping unimportant tokens or quantizing entries group-wise. Such methods, however, often incur high approximation errors to represent the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviations in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient error-reduction framework that augments a quantization scheme with two error-reduction components and achieves near-lossless performance at high compression ratios. GEAR first quantizes the majority of entries of similar magnitudes to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating the three techniques, GEAR is able to fully exploit their synergistic potential. Our experiments show that GEAR can maintain accuracy similar to that of an FP16 cache, with improvements of up to 24.42% over the SOTA baselines at 2-bit compression. Additionally, compared to LLM inference with an FP16 KV cache, GEAR can reduce peak memory by up to 2.39x, bringing 2.1x-5.07x throughput improvement. Our code will be publicly available.
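A simplified numpy sketch of the three-part recipe described above, with arbitrary bit-width, rank, and outlier fraction (not GEAR's exact scheme): uniform low-bit quantization, a low-rank SVD approximation of the quantization error, and a sparse correction for the largest remaining outliers.

```python
# Quantize, approximate the error with a low-rank term, then fix outliers sparsely.
import numpy as np

def gear_like_compress(kv, bits=2, rank=4, outlier_frac=0.01):
    lo, hi = kv.min(), kv.max()
    levels = 2 ** bits - 1
    q = np.round((kv - lo) / (hi - lo) * levels)              # quantized codes
    deq = q / levels * (hi - lo) + lo                         # dequantized values
    err = kv - deq                                            # quantization error
    U, S, Vt = np.linalg.svd(err, full_matrices=False)
    low_rank = (U[:, :rank] * S[:rank]) @ Vt[:rank]           # low-rank error term
    resid = err - low_rank
    k = max(1, int(outlier_frac * resid.size))
    thresh = np.partition(np.abs(resid).ravel(), -k)[-k]
    sparse = np.where(np.abs(resid) >= thresh, resid, 0.0)    # sparse outlier term
    return deq + low_rank + sparse                            # reconstructed KV

kv = np.random.randn(256, 128)
rec = gear_like_compress(kv)
print("mean abs error:", np.abs(kv - rec).mean())
```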
Title: (Spotlight 4) An Evolved Universal Transformer Memory
Presenter: Edoardo Cetin
Authors: Edoardo Cetin (Sakana AI)*, Qi Sun (Tokyo Institute of Technology), Tianyu Zhao (Sakana AI), Yujin Tang (Sakana AI)
Abstract: We introduce Neural Attention Memory Models (NAMMs) to improve the performance and efficiency of transformer foundation models. NAMMs are evolved atop pre-trained transformers to provide different latent contexts containing the most relevant information for individual layers and attention heads. NAMMs are universally applicable to any model using self-attention as they condition exclusively on the attention matrices produced in each layer. NAMMs learned on a relatively small set of problems substantially improve performance across multiple unseen long-context language tasks while cutting the model's input contexts to a fraction of the original sizes, setting them apart from prior hand-designed KV cache eviction strategies that only aim to preserve model behavior. We show the generality of our conditioning enables zero-shot transfer of NAMMs trained only on language to entirely new transformer architectures even across input modalities, with their benefits carrying over to vision and reinforcement learning.
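As a rough illustration of the interface NAMMs operate on, the toy numpy function below scores cached tokens from recent attention weights and keeps only the highest-scoring entries; the hand-written mean-attention score is only a placeholder for the evolved NAMM scorer.

```python
# Score cached tokens by the attention they receive and keep the top entries.
import numpy as np

def evict_kv(attn, keep):
    # attn: (queries, cached_tokens) recent attention weights
    score = attn.mean(axis=0)                    # average attention received per token
    keep_idx = np.sort(np.argsort(-score)[:keep])
    return keep_idx                              # indices of KV entries to retain

attn = np.random.rand(8, 32)
attn /= attn.sum(axis=1, keepdims=True)
print(evict_kv(attn, keep=16))
```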
Title: (Spotlight 5) OLMoE: Open Mixture-of-Experts Language Models
Presenter: Luca Soldaini
Authors: Niklas Muennighoff (Contextual AI/Allen Institute for Artificial Intelligence)*, Luca Soldaini (Allen Institute for Artificial Intelligence), Dirk Groeneveld (Allen Institute for Artificial Intelligence), Kyle Lo (Allen Institute for Artificial Intelligence), Jacob Morrison (Allen Institute for AI), Sewon Min (University of Washington), Weijia Shi (University of Washington), Pete Walsh (Allen Institute for Artificial Intelligence), Oyvind Tafjord (AI2), Nathan Lambert (Allen Institute for Artificial Intelligence), Yuling Gu (Allen Institute for Artificial Intelligence), Shane Arora (Allen Institute for Artificial Intelligence), Akshita Bhagia (Allen Institute for Artificial Intelligence), Dustin Schwenk (Allen Institute for Artificial Intelligence), David Wadden (Allen Institute for Artificial Intelligence), Alexander Wettig (Princeton University), Binyuan Hui (Alibaba Group), Tim Dettmers (Allen Institute for Artificial Intelligence), Douwe Kiela (Contextual AI), Noah Smith (Allen Institute for AI, University of Washington), Pang Wei Koh (Allen Institute for AI, University of Washington), Amanpreet Singh (Contextual AI), Hannaneh Hajishirzi (Allen Institute for AI, University of Washington)
Abstract: We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present novel findings on MoE training, define and analyze new routing properties showing high specialization in our model, and open-source all our work: model weights, training data, code, and logs.
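A minimal numpy sketch of the sparse Mixture-of-Experts mechanism behind the "many parameters, few active per token" property; the shapes, gating, and top-k value here are illustrative and not the OLMoE configuration.

```python
# A router scores all experts per token, but only the top-k experts are run.
import numpy as np

def moe_layer(x, router_w, expert_ws, k=2):
    # x: (tokens, dim), router_w: (dim, num_experts), expert_ws: (num_experts, dim, dim)
    logits = x @ router_w
    topk = np.argsort(-logits, axis=1)[:, :k]                 # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = np.exp(logits[t, topk[t]])
        gates /= gates.sum()                                  # renormalized gate weights
        for g, e in zip(gates, topk[t]):
            out[t] += g * (x[t] @ expert_ws[e])               # only k experts execute
    return out

tokens, dim, num_experts = 4, 8, 16
x = np.random.randn(tokens, dim)
print(moe_layer(x, np.random.randn(dim, num_experts),
                np.random.randn(num_experts, dim, dim)).shape)
```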
Title: (Spotlight 6) RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Presenter: Huiqiang Jiang
Authors: Di Liu (Shanghai Jiao Tong University), Meng Chen (Fudan University), Baotong Lu (Microsoft Research)*, Huiqiang Jiang (Microsoft Research Asia), Zhenhua Han (Microsoft), Qianxi Zhang (MSRA), Qi Chen (Microsoft Research Asia), Chengruidong Zhang (MSFT), Bailu Ding (Microsoft Research), Kai Zhang (Fudan University), Chen Chen (Shanghai Jiao Tong University), Fan Yang (MSRA), Yuqing Yang (Microsoft), Lili Qiu (Microsoft Research Asia)
Abstract: Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation and reduce GPU memory consumption. By leveraging the dynamic sparsity of the attention mechanism, RetrievalAttention proposes to use approximate nearest neighbor search (ANNS) indexes for KV vectors in CPU memory and retrieves the most relevant ones with vector search during generation. Unfortunately, we observe that off-the-shelf ANNS indexes are often ineffective for such retrieval tasks due to the out-of-distribution (OOD) nature of query vectors relative to key vectors in the attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm that can adapt to the distribution of query vectors. Our evaluation shows that RetrievalAttention only needs to access 1-3% of the data while maintaining high model accuracy. This leads to a significant reduction in the inference cost of long-context LLMs with a much lower GPU memory footprint. In particular, RetrievalAttention only needs a single NVIDIA RTX 4090 (24GB) to serve 128K tokens in LLMs with 8B parameters, and is capable of generating one token in 0.188 seconds.
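The brute-force numpy sketch below illustrates the sparse-attention arithmetic: each query attends only over its retrieved top-k keys. A real system would replace the exact top-k search with an ANNS index held in CPU memory, which is the part this toy example omits.

```python
# Attend only over the keys "retrieved" for a query (exact top-k as a stand-in).
import numpy as np

def retrieval_attention(q, K, V, k=8):
    # q: (dim,), K/V: (cached_tokens, dim)
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argsort(-scores)[:k]                 # retrieved keys (exact top-k here)
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ V[idx]                             # attention over the retrieved subset only

dim, cache = 64, 4096
q = np.random.randn(dim)
K, V = np.random.randn(cache, dim), np.random.randn(cache, dim)
print(retrieval_attention(q, K, V).shape)         # (64,)
```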
Title: (Spotlight 7) Post-Training Statistical Calibration for Higher Activation Sparsity
Presenter: Vui Seng Chua
Authors: Vui Seng Chua (Intel Corporation), Yujie Pan (Intel)*, Nilesh Jain (Intel)
Abstract: We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification of the input activations of fully-connected layers for generic and flexible application across Transformers, and (2) features a simple mode-centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5x additional LLM decoding speedup against CATS [12] at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer decoders, MoE, Mamba2, encoding Transformers, and pre-quantized models, highlighting its practicality and scalability. The code is available at https://github.com/IntelLabs/SCAP.
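A simplified numpy sketch of post-training activation sparsification with mode-centering, using a median as a cheap stand-in for the per-channel mode and an arbitrary threshold (not the calibrated SCAP procedure): activations are shifted, small-magnitude entries are dropped, and the shift is folded back into the layer as a bias.

```python
# Center input activations, prune small entries, and recover the shift as a bias.
import numpy as np

def sparse_fc(x, W, threshold=0.1):
    # x: (batch, d_in), W: (d_in, d_out)
    mode = np.median(x, axis=0)                 # cheap stand-in for the per-channel mode
    centered = x - mode
    pruned = np.where(np.abs(centered) > threshold, centered, 0.0)
    bias = mode @ W                             # fold the shift back in exactly
    return pruned @ W + bias, 1.0 - np.count_nonzero(pruned) / pruned.size

x = np.random.randn(32, 256) * 0.2 + 0.5
W = np.random.randn(256, 128) / 16
y, sparsity = sparse_fc(x, W)
print(y.shape, f"activation sparsity: {sparsity:.2f}")
```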
Title: (Spotlight 8) Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences
Presenter: Niklas Schmidinger
Authors: Niklas Schmidinger (Johannes Kepler University Linz)*, Lisa Schneckenreiter (Johannes Kepler University, Linz), Philipp Seidl (JKU Linz), Johannes Schimunek (Johannes Kepler University Linz), Pieter-Jan Hoedt (LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria), Johannes Brandstetter (LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria), Andreas Mayr (Johannes Kepler University Linz), Sohvi Luukkonen (Johannes Kepler University), Sepp Hochreiter (LIT AI Lab, Institute for Machine Learning, Johannes Kepler University, NXAI GmbH, Linz, Austria), Guenter Klambauer (LIT AI Lab)
Abstract: Language models for biological and chemical sequences enable crucial applications such as drug discovery, protein engineering, and precision medicine. Currently, these language models are predominantly based on Transformer architectures. While Transformers have yielded impressive results, their quadratic runtime dependency on sequence length complicates their use for long genomic sequences and in-context learning on proteins and chemical sequences. Recently, the recurrent xLSTM architecture has been shown to perform favorably compared to Transformers and modern state-space models (SSMs) in the natural language domain. Similar to SSMs, xLSTMs have linear runtime dependency and allow for constant-memory decoding at inference time, which makes them prime candidates for modeling long-range dependencies in biological and chemical sequences. In this work, we tailor xLSTM towards these domains and we propose a suite of language models called Bio-xLSTM. Extensive experiments in three large domains, genomics, proteins, and chemistry, were performed to assess xLSTM's ability to model biological and chemical sequences. The results show that Bio-xLSTM is a highly proficient generative model for DNA, protein, and chemical sequences, learns rich representations, and can perform in-context learning for proteins and small molecules.
Title: (Spotlight 9) Inference-Friendly Models With MixAttention
Presenter: Shashank Rajput
Authors: Shashank Rajput (Databricks)*, Ying Sheng (NA), Sean Owen (Databricks), Vitaliy Chiley (Cerebras)
Abstract: The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog published by Character.AI. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short- and long-context tasks. We also explore various configurations of this architecture, identifying those that maintain quality across evaluation metrics while optimizing resource efficiency.
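A toy Python sketch of the two ingredients described above: sliding-window KV caches that keep only recent tokens for some layers, and a single cache object shared by a group of layers so its memory is paid once. The layer grouping and window size are invented for illustration.

```python
# Sliding-window KV caches shared across layer groups.
import numpy as np

class SlidingKVCache:
    def __init__(self, window):
        self.window, self.keys, self.values = window, [], []

    def append(self, k, v):
        self.keys.append(k); self.values.append(v)
        if self.window and len(self.keys) > self.window:   # evict the oldest token
            self.keys.pop(0); self.values.pop(0)

# Layers 0-3 share one full cache; layers 4-7 share one sliding-window cache.
shared_full = SlidingKVCache(window=None)
shared_window = SlidingKVCache(window=4)
layer_cache = [shared_full] * 4 + [shared_window] * 4

for step in range(10):                                     # simulate decoding 10 tokens
    k, v = np.random.randn(8), np.random.randn(8)
    for cache in {id(c): c for c in layer_cache}.values(): # write each shared cache once
        cache.append(k, v)
print(len(shared_full.keys), len(shared_window.keys))      # 10 4
```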
Title: (Spotlight 10) One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
Presenter: Fabian Paischer
Authors: Fabian Paischer (ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria)*, Lukas Hauzenberger (ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz), Thomas Schmied (ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz), Benedikt Alkin (Institute for Machine Learning), Marc Deisenroth (University College London), Sepp Hochreiter (LIT AI Lab, In
Abstract: TBD
Title: (KeyNote Talk) The LoRA Journey and Learnings: from Creation to Industrial-Scale Adoption
Presenter: Dr. Weizhu Chen
Bio: Weizhu Chen is the Vice President leading the Microsoft GenAI modeling team, driving innovation in large-scale AI model training, including pre-training, post-training, and evaluation for both Microsoft and OpenAI. Under his leadership, the team has pioneered groundbreaking advancements such as LoRA, DeBERTa, and the Phi-3 models. With over 19 years at Microsoft, Weizhu has held pivotal roles in shaping AI and machine learning technologies. Previously, he served as Partner Science Manager at Microsoft Azure AI and led teams in the Business Applications Group and Research divisions, focusing on deep learning, NLP, and distributed machine learning at cloud scale. Before joining Microsoft, he contributed to research on information retrieval at IBM Research. Weizhu's career reflects a deep commitment to advancing the state of AI, making a significant impact on the field and enabling transformative technologies.
Abstract: TBD
Title: (KeyNote Talk) How to build fully open language models: from pre-training to post-training
Presenter: Prof. Hannaneh Hajishirzi
Bio: Hannaneh Hajishirzi is the Torode Family Associate Professor in the Allen School of Computer Science and Engineering at the University of Washington and a Senior Director of NLP at AI2. Her current research delves into various domains within Natural Language Processing (NLP) and Artificial Intelligence (AI), with a particular emphasis on accelerating the science of language modeling, broadening its scope, and enhancing its applicability and usefulness for human lives. She has published over 140 scientific articles in prestigious journals and conferences across ML, AI, NLP, and Computer Vision. She is the recipient of numerous awards, including the Sloan Fellowship, NSF CAREER Award, Intel Rising Star Award, Allen Distinguished Investigator Award, Academic Achievement UIUC Alumni Award, and Innovator of the Year Award by GeekWire. The work from her lab has been nominated for or has received best paper awards at various conferences and has been featured in numerous magazines and newspapers.
Abstract: Language models (LMs) have become ubiquitous in both AI research and commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. In this talk, I present our OLMo project, aimed at building strong language models and making them fully accessible to researchers, along with open-source code for data, training, and inference. Training language models is expensive, so we optimize for quality versus compute cost. I focus on how data, architecture, and training improvements advance models at the pre-training and post-training stages with less compute cost.
Time | Title | Presenter
8:00AM - 8:15AM | Breakfast |
8:15AM - 8:30AM | Opening Remarks |
8:30AM - 9:00AM | (KeyNote Talk) Efficiency through Learning from Experience | Dr. Bhavana Dalvi Mishra
9:00AM - 9:30AM | (KeyNote Talk) Multi-Teacher Distillation: An Ensemble-Then-Distill Approach | Prof. Lili Mou
9:30AM - 10:00AM | Morning Break |
10:00AM - 10:30AM | (KeyNote Talk) Hardware-aware Algorithms for Language Modeling | Prof. Tri Dao
10:30AM - 11:00AM | (KeyNote Talk) Speech generative modeling with little tokenization | Dr. Navdeep Jaitly
11:00AM - 11:30AM | (KeyNote Talk) Optimizing Data Use for Efficient Pre-training | Prof. Danqi Chen
11:30AM - 11:36AM | (Spotlight 1) Sparsified State-Space Models are Efficient Highway Networks | Woomin Song
11:36AM - 11:42AM | (Spotlight 2) Longhorn: State Space Models are Amortized Online Learners | Bo Liu
11:42AM - 11:48AM | (Spotlight 3) GEAR: An Efficient Error Reduction Framework for KV Cache Compression in LLM Inference | Hao Kang
11:48AM - 11:54AM | (Spotlight 4) An Evolved Universal Transformer Memory | Edoardo Cetin
11:54AM - 12:00PM | (Spotlight 5) OLMoE: Open Mixture-of-Experts Language Models | Luca Soldaini
12:00PM - 1:30PM | Lunch Break |
12:30PM - 1:30PM | Poster Session I (Paper IDs #1 - #50) [Link to Posters] |
1:30PM - 1:36PM | (Spotlight 6) RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Huiqiang Jiang
1:36PM - 1:42PM | (Spotlight 7) Post-Training Statistical Calibration for Higher Activation Sparsity | Vui Seng Chua
1:42PM - 1:48PM | (Spotlight 8) Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences | Niklas Schmidinger
1:48PM - 1:54PM | (Spotlight 9) Inference-Friendly Models With MixAttention | Shashank Rajput
1:54PM - 2:00PM | (Spotlight 10) One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation | Fabian Paischer
2:00PM - 2:30PM | (KeyNote Talk) The LoRA Journey and Learnings: from Creation to Industrial-Scale Adoption | Dr. Weizhu Chen
2:30PM - 3:00PM | (KeyNote Talk) How to build fully open language models: from pre-training to post-training | Prof. Hannaneh Hajishirzi
3:00PM - 3:30PM | Afternoon Break |
3:30PM - 4:20PM | Interactive Panel Discussion | Marjan Ghazvininejad, Joel Hestness, Navdeep Jaitly, Katie Derthick
4:20PM - 4:30PM | Best Paper Awards and Closing Remarks |
4:30PM - 5:30PM | Poster Session II (Paper IDs #51 - #105) [Link to Posters] |
Organizers
Mehdi Rezagholizadeh
Huawei Noah's Ark Lab
Yu Cheng
Chinese University of Hong Kong
Yue Dong
University of California, Riverside
Vahid Partovi Nia
Ecole Polytechnique Montreal & Huawei
Qun Liu
Huawei Noah's Ark Lab
Boxing Chen
Huawei Noah's Ark Lab
Volunteers
David Alfonso-Hermelo
Huawei Noah's Ark Lab
Khalil Bibi
Haven Studios
Mahsa Ghazvini Nejad
Huawei Noah's Ark Lab
Ali Edalati
Huawei Noah's Ark Lab
Technical Committee
- Dasgupta Sabyasachi (Sanofi)
- Dan Alistarh (ISTA)
- Vahid Partovi Nia (Ecole Polytechnique Montreal & Huawei)
- Tanya Roosta (Amazon)
- Peyman Passban (Sanofi)
- Ehsaneddin Asgari (QCRI)
- Hamidreza Saghir (Microsoft)
- Yue Dong (University of California, Riverside)
- Ruijiang Li (Sanofi)
- Abbas Ghaddar (Huawei Noah's Ark Lab)
- Alireza Ghaffari (McGill University)
- Yu Cheng (Chinese University of Hong Kong)
- Jahangir Alam (CRIM-Montreal)
- Hamidreza Mahyar (McMaster University)
- Yufei Cui (Huawei Noah's Ark Lab)
- Mahdi Biparva (Huawei Noah's Ark Lab)
- Soheila Samiee (BASF)
- Walid Ahmed (Huawei Technologies Canada)
- Ehsan Kamalloo (Service Now Research)
- Anderson Avila (INRS-EMT)
- Abbas Rahimi (IBM)
- David Alfonso Hermelo (Huawei Noah's Ark Lab)
- Makesh Narsimhan Sreedhar (NVIDIA)
- Ahmad Rashid (University of Waterloo & Vector Institute)
- Suyuchen Wang (Universite de Montreal & Mila)
- Tianyu Jiang (University of Cincinnati)
- Peilin Yu (Brown University)
- Khalil Bibi
- Aysegul Bumin (Amazon)
- Abderrahim Fathan (CRIM-Montreal)
- Aref Jafari (University of Waterloo)
- Dan Fu (Stanford University)
- Anusha Sabbineni (Amazon)
- Parsa Omidi (Huawei Technologies Canada)
- Young Jin Kim (Microsoft)
- Giovanni Monea (EPFL)
- Mofetoluwa Adeyemi (University of Waterloo)
- Xindi Wang (University of Western Ontario)
- Alessio Brutti (Fondazione Bruno Kessler)
- Saleh Ashkboos (ETH Zurich)
- Parsa Kavehzadeh (Huawei Noah's Ark Lab)
- Hossein Rajabzadeh (University of Waterloo)
- Mohammadreza Tayaranian (McGill University)
- Varun Gangal (ASAPP Inc.)
- Sebastian Jaszczur (IDEAS NCBR, University of Warsaw)
- Ali Edalati (Huawei Noah's Ark Lab)
- Mojtaba Valipour (University of Waterloo)
- Heitor Guimarães (INRS University)
- Jing Li (Mitsubishi Electric Research Laboratories)
- Mohammad Ruhul Amin (Fordham University)
- Mohammad Dehghan (Autodesk)
- Raffy Fahim (Microsoft)
- Feiyang Kang (Virginia Tech University)
- Ning Shi (University of Alberta)
- Daria Soboleva (Cerebras Systems)
- Qingru Zhang (Georgia Institute of Technology)
- Lilly Kumari (University of Washington)
- Thomas Ortner (IBM Research Zurich - Europe)
- Dominik Wagner (Technische Hochschule Nuernberg)
- Benyamin Jamialahmadi (University of Waterloo)
- Tianshu Zhu (Huawei Noah's Ark Lab)
- Haoran Zhao (Drexel University & University of Washington)
- Satya Sai Srinath Namburi (Amazon)
- Mouloud Belbahri (Layer 6 AI)
- Abhishek Panigrahi (Princeton University)
- Arthur Pimentel (INRS)
- Mahsa Salmani (Huawei Technologies Canada)
- Mohammad Ali Alomrani (Huawei Noah's Ark Lab)
- Abdul Hameed Azeemi (Lahore University)
- Mohammadreza Pourreza (Google Research)
- Yunan Zhang (Microsoft)
- MohammadAli SadraeiJavaheri (Sharif University)
- Omid Ghahroodi (Sharif University)
- Adam Lee (UC Berkeley)
Platinum Sponsor
Gold Sponsors