The fourth edition of the Efficient Natural Language and Speech Processing (ENLSP-IV) workshop focuses on making large language and foundation models more efficient in terms of architecture, training, and inference in real-world applications. This year, following trends in industry and academia, we put more emphasis on investigating new architectures that make future language and foundation models more efficient. We also highlight the importance of comprehensive evaluation and benchmarking of new efficient models from different practical perspectives. The workshop program offers an interactive platform that brings together experts and talent from academia and industry through invited talks, a panel discussion, paper submissions, reviews, interactive poster sessions, oral presentations, and a couple of mentorship sessions for new researchers. This will be a unique opportunity to discuss and share challenging problems, build connections, exchange ideas, brainstorm, and foster future collaborations. The topics of this workshop will be of interest to people working on general machine learning, deep learning, hardware, optimization, theory, and applications.

Overview

As large language models (e.g. GPT-3, GPT-4, Llama 3, PaLM, Gemini, and PanGu-Σ), pre-trained speech models (e.g. wav2vec, HuBERT, WavLM, Whisper, Conformer-1, and Conformer-2), and other foundation models (e.g. GPT-4o and Stable Diffusion) have advanced rapidly and become more prominent and widespread, improving their efficiency has become more crucial. While computational power and GPU resources have played a significant role in the success of these models, we also need to be aware that using more computational resources can: (a) increase the cost of training and deploying such models, (b) make the models less accessible, (c) reduce contributions from the research community, and (d) increase the environmental cost of the models. Moreover, most of these pre-trained models are largely over-parameterized, and their efficiency is in question. A lack of efficiency can severely limit the application of these advanced models in practice.

Building upon the framework of our previous three editions, this workshop remains dedicated to investigating solutions for enhancing the efficiency of pre-trained language and foundation models, while introducing some fresh and important topics to the community and encouraging contributions on them. To highlight a few: (1) Despite the ubiquitous use of Transformers, they suffer from quadratic computational complexity in sequence length, which limits their efficiency, especially for longer sequences. Should we improve the efficiency of Transformers (e.g. Hedgehog, Gated Linear Attention) or look for other architectures (e.g. Mamba, Jamba, RWKV, xLSTM, and SSMs)? (2) For accelerating training, we have seen the significant impact of hardware-efficient implementations such as Flash Attention. Should we focus more on these hardware-aware solutions or on new and improved architectures? (3) For efficient inference, there are solutions such as: Speculative Decoding [Link1] [Link2], whose performance is strongly model- and task-dependent and which requires the draft and target models to share the same vocabulary (tokenizer); improved KV-caching (e.g. [Link]), which yields limited speed-up; and many-in-one models such as SortedNet, MatFormer, and LayerSkip, whose sub-models perform worse than their corresponding individual models. (4) While there are many so-called efficient solutions in the literature, there is no fair, comprehensive, and practical evaluation of these models or comparison among them. For example, we do not know the extent of hallucination in the new architectures vs. the Transformer model (e.g. in [Link]).
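To make the speculative decoding constraint in point (3) concrete, here is a minimal sketch of the greedy variant. It is an illustration under stated assumptions, not the method of any linked paper: the "models" are hypothetical toy functions that map a prefix of token ids to the next token id, and the names `speculative_greedy_decode`, `toy_target`, and `toy_draft` are invented for this example. Because the draft's proposals are compared id by id against the target's choices, both models must share one vocabulary.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token-id prefix to its greedy next token id.
# Draft and target must use the same vocabulary, otherwise ids are not comparable.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_greedy_decode(target: NextToken, draft: NextToken,
                              prompt: List[int], max_new_tokens: int,
                              draft_len: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) The cheap draft model proposes up to draft_len tokens.
        k = min(draft_len, goal - len(tokens))
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2) The target checks the proposals in order. A real system would score
        #    all k positions in one batched forward pass; this toy calls the
        #    target once per position for clarity, so it shows no actual speed-up.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every proposal matched, so the target contributes one extra token.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens


# Toy stand-ins over a shared 50-token vocabulary (assumptions for illustration).
def toy_target(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 3.
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50
```

With greedy verification the output is identical to decoding with the target alone; how many draft tokens get accepted per step depends on how often the two models agree, which is one way to see why the speed-up is model- and task-dependent.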

Call for Papers

Investing in the future of language and foundation models requires a concerted effort to enhance their efficiency across multiple dimensions (including architecture, training, and inference) and a comprehensive evaluation framework. To encourage engagement from the NeurIPS community, we present several active research topics in this field that invite participation and contributions. The scope of this workshop includes, but is not limited to, the following topics:

Efficient Architectures: Proposing alternative architectures that are more efficient than Transformers (in terms of computational complexity, memory footprint, handling longer sequence lengths) or modifying Transformer architectures to make them more efficient

  • Linear and sub-quadratic Transformers, sparse attention Transformers
  • New architectures for LLMs and foundation models and their scalability
  • Evaluation and benchmarking of new architectures (fair comparison of different models)
  • Long sequence modeling
  • Dense vs. sparse architectures (MoEs)
Efficient Training: How can we reduce the cost of pre-training or fine-tuning new models?
  • More efficient pre-training solutions, from better initialization and hyper-parameter tuning to better optimization which lowers the cost of pre-training
  • Parameter efficient fine-tuning (PEFT) solutions for large pre-trained models
  • Efficient instruction tuning, prompt engineering and in-context learning
  • Hardware-aware solutions (e.g. better CUDA kernels), memory read/write aware solutions
  • Data-efficient training, reducing the requirement for labeled data, data compression and distillation
Efficient Inference: How can we reduce the cost of inference for LLMs and foundation models?
  • Improved speculative sampling for LLMs, self-speculative sampling, selecting among multiple drafts, one draft model for different heterogeneous target models
  • Neural model compression techniques such as quantization, pruning, and knowledge distillation
  • Improved KV-caching solutions for Transformers
  • Distributed inference of large pre-trained models
  • Serving many target devices with one model, many-in-one models, early exiting, elastic networks
Evaluation and Benchmarking of Efficient Models: Introducing new efficient solutions underscores the need for comprehensive benchmarks to accurately evaluate their efficacy and performance.
  • Datasets, benchmarks, leaderboards for evaluating efficient models
  • Benchmarking the performance of efficient models from different perspectives such as reasoning, hallucination, understanding, and generation quality
  • Benchmarking efficiency of models in terms of their memory footprint, training time, inference time, different target hardware devices and inference platforms (e.g. GPU vs. CPU)
Efficient Solutions in other Modalities and Applications
  • Efficiency of foundational or pre-trained models in multi-modal set-up and other modalities (beyond NLP and Speech) such as biology, chemistry, computer vision, and time series
  • Efficient representations (e.g. Matryoshka representation) and models in dense retrieval and search
  • Efficient federated learning, lower communication costs, tackling heterogeneous data and models
  • Efficient graph and LLM joint learning
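
As a short illustration of the "linear and sub-quadratic Transformers" topic listed under Efficient Architectures, the cost difference can be sketched as follows. This is a generic, non-causal kernelized formulation; the feature map $\phi$ and the assumption that it maps into $\mathbb{R}^{d}$ are choices made for this sketch, not the exact formulation of any specific model named on this page.

```latex
% Softmax attention over n tokens with Q, K, V \in \mathbb{R}^{n \times d}:
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
% Forming Q K^{\top} costs O(n^{2} d) time and O(n^{2}) memory.

% Kernelized (linear) attention replaces the softmax similarity with
% \phi(q_i)^{\top} \phi(k_j), where \phi : \mathbb{R}^{d} \to \mathbb{R}^{d} is an assumed feature map:
o_i = \frac{\phi(q_i)^{\top} \sum_{j=1}^{n} \phi(k_j)\, v_j^{\top}}
           {\phi(q_i)^{\top} \sum_{j=1}^{n} \phi(k_j)}
% The sums S = \sum_j \phi(k_j) v_j^{\top} \in \mathbb{R}^{d \times d} and
% z = \sum_j \phi(k_j) \in \mathbb{R}^{d} are computed once and shared by all queries,
% giving O(n d^{2}) time and O(d^{2}) extra memory.
```

For a fixed head dimension $d$, the second form grows linearly in the sequence length $n$ rather than quadratically, which is why it is attractive for long sequences.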

Submission Instructions

You are invited to submit your papers through our CMT submission portal (Link). All submitted papers must be anonymous for double-blind review. We expect each paper to be reviewed by at least three reviewers. The content of the paper (excluding references and supplementary materials) should not exceed 8 pages for Long Papers and 4 pages for Short Papers, strictly following the NeurIPS template style (Link). Please note that the NeurIPS submission checklist is not needed for our workshop submissions.
Authors can submit up to 100 MB of supplementary materials separately. Authors are highly encouraged to submit their code for reproducibility purposes. According to the NeurIPS workshop guidelines, already published papers are not encouraged for submission, but you are allowed to submit your arXiv papers or papers that are under submission (for example, any NeurIPS submission can be submitted concurrently to workshops). Moreover, work that is presented at the main NeurIPS conference should not appear in a workshop. Please make sure to indicate the complete list of conflicts of interest for all the authors of your paper. To encourage higher-quality submissions, our sponsors are offering Best Paper and Best Poster Awards to qualified outstanding original oral and poster presentations (upon nomination by the reviewers). Bear in mind that our workshop is not archival, but the accepted papers will be hosted on the workshop website. Moreover, we are currently negotiating with a publisher to host opt-in accepted papers in a special issue of proceedings for our workshop.

Important Dates:

  • Special NeurIPS Fast Track Submission Deadline: September 30, 2024 Anywhere on Earth (AOE)
  • Submission Deadline: September 15, 2024 Anywhere on Earth (AOE)
  • Acceptance Notification: October 09, 2024 AOE
  • Camera-Ready Submission: October 25, 2024 AOE
  • Workshop Date: December 14, 2024

Keynote Speakers

  • Danqi Chen (Princeton)
  • Bhavana Dalvi (Allen Institute for AI)
  • Weizhu Chen (Microsoft)
  • Tri Dao (Princeton/Together AI)
  • Hananeh Hajishirzi (University of Washington)
  • Navdeep Jaitly (Apple)
  • Lili Mou (University of Alberta)

Panelists

  • Marjan Ghazvini Nejad (Meta)
  • Joel Hestness (Cerebras)
  • Navdeep Jaitly (Apple)
  • Katie Derthick (Microsoft)

Schedule

Time | Title | Presenter
8:00AM - 8:15AM | Breakfast |
8:15AM - 8:30AM | Opening Remarks |
8:30AM - 9:00AM | (Keynote Talk) Efficiency through Learning from Experience | Dr. Bhavana Dalvi Mishra
9:00AM - 9:30AM | (Keynote Talk) Multi-Teacher Distillation: An Ensemble-Then-Distill Approach | Prof. Lili Mou
9:30AM - 10:00AM | Morning Break |
10:00AM - 10:30AM | (Keynote Talk) Hardware-aware Algorithms for Language Modeling | Prof. Tri Dao
10:30AM - 11:00AM | (Keynote Talk) Speech generative modeling with little tokenization | Dr. Navdeep Jaitly
11:00AM - 11:30AM | (Keynote Talk) Optimizing Data Use for Efficient Pre-training | Prof. Danqi Chen
11:30AM - 11:36AM | (Spotlight 1) Sparsified State-Space Models are Efficient Highway Networks | Woomin Song
11:36AM - 11:42AM | (Spotlight 2) Longhorn: State Space Models are Amortized Online Learners | Bo Liu
11:42AM - 11:48AM | (Spotlight 3) GEAR: An Efficient Error Reduction Framework for KV Cache Compression in LLM Inference | Hao Kang
11:48AM - 11:54AM | (Spotlight 4) An Evolved Universal Transformer Memory | Edoardo Cetin
11:54AM - 12:00PM | (Spotlight 5) OLMoE: Open Mixture-of-Experts Language Models | Luca Soldaini
12:00PM - 1:30PM | Lunch Break |
12:30PM - 1:30PM | Poster Session I (Paper IDs #1 - #50) [Link to Posters] |
1:30PM - 1:36PM | (Spotlight 6) RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Huiqiang Jiang
1:36PM - 1:42PM | (Spotlight 7) Post-Training Statistical Calibration for Higher Activation Sparsity | Vui Seng Chua
1:42PM - 1:48PM | (Spotlight 8) Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences | Niklas Schmidinger
1:48PM - 1:54PM | (Spotlight 9) Inference-Friendly Models With MixAttention | Shashank Rajput
1:54PM - 2:00PM | (Spotlight 10) One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation | Fabian Paischer
2:00PM - 2:30PM | (Keynote Talk) The LoRA Journey and Learnings: from Creation to Industrial-Scale Adoption | Dr. Weizhu Chen
2:30PM - 3:00PM | (Keynote Talk) How to build fully open language models: from pre-training to post-training | Prof. Hananeh Hajishirzi
3:00PM - 3:30PM | Afternoon Break |
3:30PM - 4:20PM | Interactive Panel Discussion | Marjan Ghazvini Nejad, Joel Hestness, Navdeep Jaitly, Katie Derthick
4:20PM - 4:30PM | Best Paper Awards and Closing Remarks |
4:30PM - 5:30PM | Poster Session II (Paper IDs #51 - #105) [Link to Posters] |

Organizers

  • Mehdi Rezagholizadeh (Huawei Noah's Ark Lab)
  • Peyman Passban (Sanofi)
  • Yu Cheng (Chinese University of Hong Kong)
  • Soheila Samiee (BASF)
  • Yue Dong (University of California, Riverside)
  • Vahid Partovi Nia (Ecole Polytechnique Montreal & Huawei)
  • Qun Liu (Huawei Noah's Ark Lab)
  • Boxing Chen (Huawei Noah's Ark Lab)

Volunteers

  • David Alfonso-Hermelo (Huawei Noah's Ark Lab)
  • Khalil Bibi (Haven Studios)
  • Mahsa Ghazvini Nejad (Huawei Noah's Ark Lab)
  • Ali Edalati (Huawei Noah's Ark Lab)


Technical Committee

  • Dasgupta Sabyasachi (Sanofi)
  • Dan Alistarh (ISTA)
  • Vahid Partovi Nia (Ecole Polytechnique Montreal & Huawei)
  • Tanya Roosta (Amazon)
  • Peyman Passban (Sanofi)
  • Ehsaneddin Asgari (QCRI)
  • Hamidreza Saghir (Microsoft)
  • Yue Dong (University of California, Riverside)
  • Ruijiang Li (Sanofi)
  • Abbas Ghaddar (Huawei Noah's Ark Lab)
  • Alireza Ghaffari (McGill University)
  • Yu Cheng (Chinese University of Hong Kong)
  • Jahangir Alam (CRIM-Montreal)
  • Hamidreza Mahyar (McMaster University)
  • Yufei Cui (Huawei Noah's Ark Lab)
  • Mahdi Biparva (Huawei Noah's Ark Lab)
  • Soheila Samiee (BASF)
  • Walid Ahmed (Huawei Technologies Canada)
  • Ehsan Kamalloo (ServiceNow Research)
  • Anderson Avila (INRS-EMT)
  • Abbas Rahimi (IBM)
  • David Alfonso-Hermelo (Huawei Noah's Ark Lab)
  • Makesh Narsimhan Sreedhar (NVIDIA)
  • Ahmad Rashid (University of Waterloo & Vector Institute)
  • Suyuchen Wang (Universite de Montreal & Mila)
  • Tianyu Jiang (University of Cincinnati)
  • Peilin Yu (Brown University)
  • Khalil Bibi (Haven Studios)
  • Aysegul Bumin (Amazon)
  • Abderrahim Fathan (CRIM-Montreal)
  • Aref Jafari (University of Waterloo)
  • Dan Fu (Stanford University)
  • Anusha Sabbineni (Amazon)
  • Parsa Omidi (Huawei Technologies Canada)
  • Young Jin Kim (Microsoft)
  • Giovanni Monea (EPFL)
  • Mofetoluwa Adeyemi (University of Waterloo)
  • Xindi Wang (University of Western Ontario)
  • Alessio Brutti (Fondazione Bruno Kessler)
  • Saleh Ashkboos (ETH Zurich)
  • Parsa Kavehzadeh (Huawei Noah's Ark Lab)
  • Hossein Rajabzadeh (University of Waterloo)
  • Mohammadreza Tayaranian (McGill University)
  • Varun Gangal (ASAPP Inc.)
  • Sebastian Jaszczur (IDEAS NCBR, University of Warsaw)
  • Ali Edalati (Huawei Noah's Ark Lab)
  • Mojtaba Valipour (University of Waterloo)
  • Heitor Guimarães (INRS University)
  • Jing Li (Mitsubishi Electric Research Laboratories)
  • Mohammad Ruhul Amin (Fordham University)
  • Mohammad Dehghan (Autodesk)
  • Raffy Fahim (Microsoft)
  • Feiyang Kang (Virginia Tech University)
  • Ning Shi (University of Alberta)
  • Daria Soboleva (Cerebras Systems)
  • Qingru Zhang (Georgia Institute of Technology)
  • Lilly Kumari (University of Washington)
  • Thomas Ortner (IBM Research Zurich - Europe)
  • Dominik Wagner (Technische Hochschule Nuernberg)
  • Benyamin Jamialahmadi (University of Waterloo)
  • Tianshu Zhu (Huawei Noah's Ark Lab)
  • Haoran Zhao (Drexel University & University of Washington)
  • Satya Sai Srinath Namburi (Amazon)
  • Mouloud Belbahri (Layer 6 AI)
  • Abhishek Panigrahi (Princeton University)
  • Arthur Pimentel (INRS)
  • Mahsa Salmani (Huawei Technologies Canada)
  • Mohammad Ali Alomrani (Huawei Noah's Ark Lab)
  • Abdul Hameed Azeemi (Lahore University)
  • Mohammadreza Pourreza (Google Research)
  • Yunan Zhang (Microsoft)
  • MohammadAli SadraeiJavaheri (Sharif University)
  • Omid Ghahroodi (Sharif University)
  • Adam Lee (UC Berkeley)

