The fourth version of the Efficient Natural Language and Speech Processing (ENLSP-IV) workshop will focus on how to make large language and foundation models more efficient in terms of architecture, training, and inference for their real-world applications. This year, following the trend in industry and academia, we put more emphasis on investigating new architectures to make future language and foundation models more efficient. Moreover, we highlight the importance of comprehensively evaluating and benchmarking new efficient models from different practical perspectives.
The workshop program offers an interactive platform for bringing together experts and talent from academia and industry through invited talks, a panel discussion, paper submissions, reviews, interactive poster sessions, oral presentations, and mentorship sessions for new researchers.
This will be a unique opportunity to discuss and share challenging problems, build connections, exchange ideas, brainstorm, and foster future collaborations. The topics of this workshop will be of interest to people working on general machine learning, deep learning, hardware, optimization, theory, and applications.
Overview
As large language models (e.g. GPT-3, GPT-4, Llama 3, PaLM, Gemini, and PanGu-Σ), pre-trained speech models (e.g. wav2vec, HuBERT, WavLM, Whisper, Conformer-1, and Conformer-2), and other foundation models (e.g. GPT-4o and Stable Diffusion) advance rapidly and become more prominent and widespread, improving their efficiency becomes increasingly crucial.
While computational power and GPU resources have played a significant role in the success of these models, we also need to be aware that using more computational resources (a) increases the cost of training and deploying such models, (b) makes the models less accessible, (c) limits contributions from the broader research community, and (d) increases the environmental cost of the models. Moreover, most of these pre-trained models are largely over-parameterized, and their efficiency is under question. This lack of efficiency can severely limit the application of these advanced models in practice.
Building upon the framework of our previous three editions, this workshop remains dedicated to investigating solutions for enhancing the efficiency of pre-trained language and foundation models, while introducing fresh and important new topics to the community and encouraging contributions on them.
Just to highlight a few: (1) Despite the ubiquitous use of Transformers, they suffer from quadratic computational complexity, which limits their efficiency especially for longer sequence lengths. Should we improve the efficiency of Transformers (e.g. as in Hedgehog and Gated Linear Attention) or look for other architectures (e.g. Mamba, Jamba, RWKV, xLSTM, and other SSMs)? (2) For accelerating training, we have seen the significant impact of hardware-efficient implementations such as FlashAttention. Should we focus more on these hardware-aware solutions or more on new and improved architectures?
(3) For efficient inference, existing solutions include speculative decoding [Link1] [Link2], whose performance is strongly model- and task-dependent and which requires the draft and target models to share the same vocabulary (tokenizer); improved KV-caching (e.g. [Link]), which offers only a limited speed-up; and many-in-one models such as SortedNet, MatFormer, and LayerSkip, whose sub-models lose performance compared to their corresponding individually trained models (a minimal sketch of speculative decoding follows these highlights).
(4) While there are many so-called efficient solutions in the literature, there is no fair, comprehensive, and practical evaluation of these models or comparison of them to each other. For example, we do not know to what extent the new architectures hallucinate compared with the Transformer model (e.g. [Link]).
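To make the constraints of speculative decoding in point (3) concrete, below is a minimal, self-contained sketch of the draft-then-verify loop with toy next-token distributions standing in for real models. The names `draft_model`, `target_model`, and `speculative_step` are illustrative assumptions rather than any library's API, and a real implementation would verify all drafted positions in a single batched forward pass of the target model.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy shared vocabulary: draft and target must use the same tokenizer


def toy_model(temperature):
    """Return a toy LM: maps a token prefix to a next-token distribution."""
    def next_token_probs(prefix):
        logits = np.cos(0.1 * np.arange(VOCAB) * (1 + len(prefix) + sum(prefix)))
        z = np.exp(logits / temperature)
        return z / z.sum()
    return next_token_probs


draft_model = toy_model(temperature=2.0)   # small, cheap, less accurate
target_model = toy_model(temperature=1.0)  # large, expensive, authoritative


def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then verify them with the target model."""
    drafted, q_probs = [], []
    for _ in range(k):
        q = draft_model(prefix + drafted)
        drafted.append(int(rng.choice(VOCAB, p=q)))
        q_probs.append(q)

    accepted = []
    for i, t in enumerate(drafted):
        # In practice all k positions are scored in one batched target forward pass.
        p = target_model(prefix + accepted)
        q = q_probs[i]
        if rng.random() < min(1.0, p[t] / q[t]):   # accept the draft token
            accepted.append(t)
        else:                                      # reject: resample from the residual
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            break
    # (A real implementation also samples one bonus token from the target
    #  when every drafted token is accepted.)
    return accepted


print(speculative_step(prefix=[1, 2, 3]))
```

Because acceptance compares the target probability with the draft probability token by token, both models must define distributions over the same vocabulary, which is exactly the shared-tokenizer requirement mentioned above.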
Call for Papers
Investing in the future of language and foundation models requires a concrete effort to enhance their efficiency across multiple dimensions (including architecture, training, and inference) and to build a comprehensive evaluation framework.
To encourage engagement from the NeurIPS community, we present several active research topics in this field that invite participation and contributions. The scope of this workshop includes, but is not limited to, the following topics:
Efficient Architectures: Proposing alternative architectures that are more efficient than Transformers (in terms of computational complexity, memory footprint, and handling of longer sequence lengths), or modifying Transformer architectures to make them more efficient (a minimal sketch of linear attention follows this list)
- Linear and sub-quadratic Transformers, sparse-attention Transformers
- New architectures for LLMs and foundation models and their scalability
- Evaluation and benchmarking of new architectures (fair comparison of different models)
- Long sequence modeling
- Dense vs. sparse architectures (MoEs)
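As a minimal illustration of the first bullet above, the sketch below contrasts standard softmax attention, which materializes an n × n score matrix, with a kernelized linear-attention variant that reorders the computation as φ(Q)(φ(K)ᵀV) and thereby avoids the quadratic term. The feature map, shapes, and non-causal setting are simplifying assumptions; published methods such as Hedgehog and Gated Linear Attention use learned feature maps, gating, and causal masking.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 64                      # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))


def softmax_attention(Q, K, V):
    """Standard attention: materializes an n x n score matrix (O(n^2) time and memory)."""
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V


def feature_map(x):
    """A simple positive feature map, elu(x) + 1, as used in some linear-attention variants."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention(Q, K, V):
    """Kernelized attention: reorders the computation to O(n * d^2), with no n x n matrix."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                    # (d, d) summary of keys and values
    z = Kf.sum(axis=0)               # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]


print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

Computing the d × d summary φ(K)ᵀV first is what brings the cost from O(n²d) down to O(nd²), which is the main appeal for long sequences.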
Efficient Training: How can we reduce the cost of pre-training or fine-tuning new models? (A minimal sketch of a low-rank adapter, one common PEFT technique, follows this list.)
- More efficient pre-training solutions, from better initialization and hyper-parameter tuning to better optimization methods that lower the cost of pre-training
- Parameter efficient fine-tuning (PEFT) solutions for large pre-trained models
- Efficient instruction tuning, prompt engineering and in-context learning
- Hardware-aware solutions (e.g. better CUDA kernels), memory read/write-aware solutions
- Data-efficient training, reducing the requirement for labeled data, data compression and distillation
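As one concrete PEFT example for the list above, here is a minimal sketch of a LoRA-style low-rank adapter: the pre-trained weight stays frozen and only two small matrices are trained. The shapes, scaling factor, and variable names are illustrative assumptions, not any particular library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 1024, 1024, 8, 16

# Frozen pre-trained weight: never updated during fine-tuning.
W_frozen = rng.standard_normal((d_out, d_in)) * 0.02

# Trainable low-rank adapter (LoRA-style): B starts at zero so the
# adapted layer initially matches the frozen model exactly.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))


def adapted_forward(x):
    """y = W x + (alpha / rank) * B A x  --  only A and B would receive gradients."""
    return W_frozen @ x + (alpha / rank) * (B @ (A @ x))


x = rng.standard_normal(d_in)
print(adapted_forward(x).shape)

# The adapter is a tiny fraction of the full weight's parameter count.
print(f"trainable fraction: {(A.size + B.size) / W_frozen.size:.4%}")
```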
Efficient Inference: How can we reduce the cost of inference for LLMs and foundation models? (A minimal sketch of weight quantization follows this list.)
- Improved speculative sampling for LLMs, self-speculative sampling, selecting among multiple drafts, one draft model for different heterogeneous target models
- Neural model compression techniques such as quantization, pruning, and knowledge distillation
- Improved KV-caching solutions for Transformers
- Distributed inference of large pre-trained models
- Serving many target devices with one model, many-in-one models, early exiting, elastic networks
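To ground the compression bullet above, here is a minimal sketch of symmetric, per-channel round-to-nearest int8 weight quantization. Practical methods add calibration and error compensation, so this is only meant to show the basic storage-versus-accuracy trade-off; the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)  # a stand-in pre-trained weight matrix


def quantize_int8(W):
    """Symmetric per-output-channel round-to-nearest quantization to int8."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0  # one scale per output row
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale


def dequantize(W_q, scale):
    """Recover an approximate float32 weight for computation or inspection."""
    return W_q.astype(np.float32) * scale


W_q, scale = quantize_int8(W)
err = np.abs(W - dequantize(W_q, scale)).mean()
print(f"int8 storage: {W_q.nbytes / W.nbytes:.2f}x of fp32, mean abs error: {err:.4f}")
```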
Evaluation and Benchmarking of Efficient Models: Introducing new efficient solutions underscores the need for comprehensive benchmarks to accurately evaluate their efficacy and performance. (A minimal latency/memory measurement sketch follows this list.)
- Datasets, benchmarks, leaderboards for evaluating efficient models
- Benchmarking the performance of efficient models from different perspectives such as reasoning, hallucination, understanding, and generation quality
- Benchmarking the efficiency of models in terms of their memory footprint, training time, and inference time across different target hardware devices and inference platforms (e.g. GPU vs. CPU)
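As a minimal sketch of the kind of measurement the last bullet calls for, the snippet below times an arbitrary model callable and reports host-side peak memory. The helper name and toy workload are assumptions for illustration; a real benchmark would also track device memory, throughput, energy, and output quality.

```python
import time
import tracemalloc
import numpy as np


def benchmark(model_fn, inputs, warmup=3, runs=10):
    """Report mean latency and peak Python-side memory for a model callable."""
    for _ in range(warmup):                 # warm-up runs are excluded from timing
        model_fn(inputs)
    tracemalloc.start()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model_fn(inputs)
        times.append(time.perf_counter() - t0)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"mean_latency_s": float(np.mean(times)), "peak_mem_mb": peak / 1e6}


# Toy stand-in for a model forward pass.
W = np.random.default_rng(0).standard_normal((512, 512))
print(benchmark(lambda x: np.tanh(x @ W), np.ones((64, 512))))
```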
Efficient Solutions in Other Modalities and Applications
- Efficiency of foundation or pre-trained models in multi-modal setups and other modalities (beyond NLP and speech) such as biology, chemistry, computer vision, and time series
- Efficient representations (e.g. Matryoshka representations) and models in dense retrieval and search (a minimal sketch follows this list)
- Efficient federated learning, lower communication costs, tackling heterogeneous data and models
- Efficient graph and LLM joint learning
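The sketch below illustrates the Matryoshka-representation idea mentioned above: embeddings are organized so that a prefix of the dimensions already forms a usable, cheaper representation for retrieval. The random vectors here are only stand-ins; actual Matryoshka models are trained so that truncated prefixes remain accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
full_dim = 256
docs = rng.standard_normal((1000, full_dim))   # stand-in document embeddings
query = rng.standard_normal(full_dim)


def truncate(x, dim):
    """Keep only the first `dim` coordinates and re-normalize (Matryoshka-style prefix)."""
    x = x[..., :dim]
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k(query, docs, dim, k=5):
    """Cosine-similarity retrieval using only a low-dimensional prefix of the embeddings."""
    scores = truncate(docs, dim) @ truncate(query, dim)
    return np.argsort(-scores)[:k]


# Coarse search over 64 dims costs roughly a quarter of the full 256-dim search;
# with Matryoshka-trained embeddings the two rankings would stay close.
print(top_k(query, docs, dim=64))
print(top_k(query, docs, dim=full_dim))
```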
Submission Instructions
You are invited to submit your papers via our CMT submission portal (Link). All submitted papers must be anonymized for double-blind review. We expect each paper to be reviewed by at least three reviewers. The content of the paper (excluding the references and supplementary materials) should be no more than 8 pages for Long Papers and 4 pages for Short Papers, strictly following the NeurIPS template style (Link). Please be advised that the NeurIPS submission checklist is not needed for our workshop submissions.
Authors can submit up to 100 MB of supplementary materials separately. Authors are highly encouraged to submit their code for reproducibility purposes. In line with the NeurIPS workshop guidelines, already published papers are not encouraged for submission, but you are allowed to submit arXiv papers or papers that are currently under review (for example, any NeurIPS submission can be submitted concurrently to workshops). However, a work that is presented at the main NeurIPS conference should not appear in a workshop. Please make sure to indicate the complete list of conflicts of interest for all the authors of your paper. To encourage higher-quality submissions, our sponsors are offering Best Paper and Best Poster Awards to outstanding original oral and poster presentations (upon nomination by the reviewers). Please note that our workshop is non-archival, but the accepted papers will be hosted on the workshop website. Moreover, we are currently negotiating with a publisher to host opt-in accepted papers in a special-issue proceedings for our workshop.
Important Dates:
- Special NeurIPS Fast Track Submission Deadline: September 30, 2024 Anywhere on Earth (AOE)
- Submission Deadline: September 15, 2024 Anywhere on Earth (AOE)
- Acceptance Notification: October 09, 2024 AOE
- Camera-Ready Submission: October 25, 2024 AOE
- Workshop Date: December 14, 2024
Keynote Speakers
Danqi Chen
Princeton
Peter Clark
Allen Institute for AI
Weizhu Chen
Microsoft
Tri Dao
Princeton/Together AI
Hannaneh Hajishirzi
University of Washington
Navdeep Jaitly
Apple
Lili Mou
University of Alberta
Panelists
Marjan Ghazvininejad
Meta
Lu Hou
Huawei
Joel Hestness
Cerebras
Katie Derthick
Microsoft
Tentative Schedule
Title: (Keynote Talk) Multi-Teacher Distillation: An Ensemble-Then-Distill Approach
Presenter: Prof. Lili Mou
Bio: Dr. Lili Mou is an Assistant Professor in the Department of Computing Science, University of Alberta. He is also an Alberta Machine Intelligence Institute (Amii) Fellow and a Canada CIFAR AI (CCAI) Chair. Lili received his BS and PhD degrees in 2012 and 2017, respectively, from the School of EECS, Peking University. After that, he worked as a postdoctoral fellow at the University of Waterloo. His research interests mainly lie in designing novel machine learning algorithms and frameworks for NLP. He has publications at top conferences and journals, including ACL, EMNLP, TACL, ICML, ICLR, and NeurIPS. He also presented tutorials at EMNLP'19 and ACL'20. He received an AAAI New Faculty Highlight Award in 2021.
Abstract: Knowledge distillation (KD) aims to transfer the knowledge in a large model (called a teacher) into a small one (called a student), and has become an emerging research topic as the sizes of deep learning models keep growing. Today, there are abundant readily available large models, such as ChatGPT, LLaMA, and T5. It then becomes natural to ask: Can we distill the knowledge from multiple teachers? At first glance, it appears easy to perform multi-teacher KD, as we can simply train the student from the union of teachers' predictions. However, I would argue that such a naïve attempt may not work well for multi-teacher KD. This is because traditional KD adopts the cross-entropy loss, which tends to yield a smooth distribution. In this talk, I will present a novel ensemble-then-distill approach, which builds an ensemble of teacher models to train the student. I will also discuss applications to text generation and syntactic parsing.
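The toy numpy sketch below contrasts the two strategies described in this abstract: fitting the student to the averaged union of teacher predictions versus first combining the teachers into a single ensemble target and distilling from that. The combination rule and all names are illustrative assumptions for exposition, not the method presented in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_teachers = 10, 3

# Toy next-token distributions from several teachers for one training example.
teacher_logits = rng.standard_normal((n_teachers, vocab))
teacher_probs = np.exp(teacher_logits) / np.exp(teacher_logits).sum(-1, keepdims=True)

# Naive multi-teacher KD: fit the student to the union (here, the average) of
# teacher predictions, which tends to yield an overly smooth target.
union_target = teacher_probs.mean(axis=0)

# Ensemble-then-distill (as sketched in the abstract): first combine the teachers
# into a single ensemble prediction, then distill from it. Here the "ensemble" is
# a product-of-experts style combination; the actual rule in the talk may differ.
log_ens = np.log(teacher_probs).mean(axis=0)
ensemble_target = np.exp(log_ens) / np.exp(log_ens).sum()

student_logits = rng.standard_normal(vocab)
student_probs = np.exp(student_logits) / np.exp(student_logits).sum()


def kd_loss(target, student):
    """Cross-entropy of the student against a soft teacher target."""
    return float(-(target * np.log(student)).sum())


print("union target entropy   :", float(-(union_target * np.log(union_target)).sum()))
print("ensemble target entropy:", float(-(ensemble_target * np.log(ensemble_target)).sum()))
print("KD loss vs. ensemble   :", kd_loss(ensemble_target, student_probs))
```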
Title: (Keynote Talk) Efficiency through Learning from Experience
Presenter: Dr. Peter Clark
Bio: Peter Clark is a Senior Research Director and founding member of the Allen Institute for AI (Ai2). He also served as Interim CEO from 2022 to 2023. He leads Ai2's Aristo Project, developing AI agents that can systematically reason, explain, and continually improve over time. He received his Ph.D. in 1991, has published over 250 papers, and has received several awards, including four Best Paper awards, a Boeing Associate Technical Fellowship, and Senior Membership of AAAI.
Abstract: Despite the physiological limitations of the human brain, humans are remarkably efficient thinkers, in large part because they can learn from experience, allowing them to avoid prior reasoning errors and quickly jump to conclusions that previously took substantial effort. Similarly, language models (LMs) can rapidly improve their inference-time efficiency through inference-time learning, supplementing lower-level methods like fast decoding and caching. I'll describe two agent-based systems (CLIN and SSO) that do this, using an external RAG (retrieval-augmented generation) memory to help the agent navigate a complex, virtual environment. Unlike typical RAG systems, the memory is dynamic and updated after each task (including forgetting unhelpful learnings). In addition, unlike reinforcement-based continual learning techniques, these systems rapidly learn from just a handful of examples by exploiting LMs to conjecture useful generalizations of past experiences. I'll outline three critical activities in this process - what to remember, how to index those memories, and how to retrieve from that index - and how those choices impact the effectiveness of the resulting agent. While this concept of efficiency is a little different from foundational architectural considerations, I'll show that it is nonetheless powerful, and an important additional tool in the toolbox for efficient future applications.
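Purely to illustrate the three activities named in this abstract (what to remember, how to index, how to retrieve) plus forgetting unhelpful learnings, here is a toy dynamic-memory sketch. The class, the embedding function, and the forgetting rule are invented for this example and are not the CLIN or SSO implementations.

```python
import numpy as np


def embed(text, dim=64):
    """Toy text embedding (hash-seeded random unit vector), used only for illustration."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


class DynamicMemory:
    """A tiny dynamic RAG-style memory: store learnings after each task, retrieve
    the most relevant ones for a new task, and forget entries that repeatedly
    failed to help (illustrative only)."""

    def __init__(self):
        self.entries = []  # each entry: {text, vector, uses, helped}

    def remember(self, learning):                    # what to remember
        self.entries.append({"text": learning, "vector": embed(learning),
                             "uses": 0, "helped": 0})

    def retrieve(self, task, k=2):                   # how to index and retrieve
        q = embed(task)                              # (brute-force cosine search here)
        return sorted(self.entries, key=lambda e: -float(e["vector"] @ q))[:k]

    def feedback(self, entry, helped):               # update the memory after each task
        entry["uses"] += 1
        entry["helped"] += int(helped)

    def forget_unhelpful(self, min_uses=3, min_rate=0.3):
        self.entries = [e for e in self.entries
                        if e["uses"] < min_uses or e["helped"] / e["uses"] >= min_rate]


mem = DynamicMemory()
mem.remember("Opening the fridge before searching it saves a step.")
mem.remember("The blue key never opens the lab door.")
print([e["text"] for e in mem.retrieve("find the key to the lab")])
```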
Title: Accepted Oral Presentations
Presenter: TBD
Authors: TBD
Abstract: TBD
Title: (Keynote Talk) Title TBD
Presenter: Prof. Christopher Ré
Bio: Christopher Ré is an associate professor in the Stanford AI Lab. He leads research on foundational AI with a focus on weak supervision and the interplay between AI and systems design.
Abstract: TBD
Title: (Keynote Talk) Title TBD
Presenter: Dr. Navdeep Jaitly
Bio: Navdeep Jaitly worked under Geoffrey Hinton at the University of Toronto. His interests lie in pushing the frontier of deep learning research at Apple, following his work on Google's Brain team.
Abstract: TBD
Title: (Keynote Talk) Title TBD
Presenter: Prof. Danqi Chen
Bio: Danqi Chen co-leads Princeton's NLP Group. Her research focuses on large language models and emphasizes practicality and accessibility in AI development.
Abstract: TBD
Title: Accepted Oral Presentations
Presenter: TBD
Authors: TBD
Abstract: TBD
Title: (Keynote Talk) Title TBD
Presenter: Dr. Weizhu Chen
Bio: Weizhu Chen leads a modeling team in Microsoft GenAI, working on large-scale (OpenAI and Microsoft) model training.
Abstract: TBD
Title: (Keynote Talk) Title TBD
Presenter: Prof. Hannaneh Hajishirzi
Bio: Hannaneh Hajishirzi is a leading NLP expert focusing on large language models; her research explores how AI can reason about and understand complex information from various sources.
Abstract: TBD
| Time | Title | Presenter |
| --- | --- | --- |
| 8:10AM - 8:15AM | Opening Speech | |
| 8:15AM - 8:45AM | (Keynote Talk) Multi-Teacher Distillation: An Ensemble-Then-Distill Approach | Prof. Lili Mou |
| 8:45AM - 9:15AM | (Keynote Talk) Efficiency through Learning from Experience | Dr. Peter Clark |
| 9:15AM - 10:00AM | Accepted Oral Presentations | TBD |
| 10:00AM - 10:30AM | Morning Break | |
| 10:30AM - 11:00AM | (Keynote Talk) Title TBD | Prof. Christopher Ré |
| 11:00AM - 11:30AM | (Keynote Talk) Title TBD | Dr. Navdeep Jaitly |
| 11:30AM - 12:00PM | (Keynote Talk) Title TBD | Prof. Danqi Chen |
| 12:00PM - 12:30PM | Accepted Oral Presentations | TBD |
| 12:30PM - 1:15PM | Lunch Break | |
| 1:15PM - 2:00PM | Poster Session I & Free Discussion | |
| 2:00PM - 2:30PM | (Keynote Talk) Title TBD | Dr. Weizhu Chen |
| 2:30PM - 3:00PM | (Keynote Talk) Title TBD | Prof. Hannaneh Hajishirzi |
| 3:00PM - 3:15PM | Afternoon Break | |
| 3:20PM - 4:10PM | Interactive Panel Discussion | Marjan Ghazvininejad, Joel Hestness, Katie Derthick, Lu Hou |
| 4:10PM - 4:15PM | Best Paper and Poster Awards | |
| 4:15PM - 5:00PM | Poster Session II & Free Discussion | |
Organizers
Mehdi Rezagholizadeh
Huawei Noah's Ark Lab
Yu Cheng
Chinese University of Hong Kong
Yue Dong
University of California, Riverside
Vahid Partovi Nia
Ecole Polytechnique Montreal & Huawei
Qun Liu
Huawei Noah's Ark Lab
Boxing Chen
Huawei Noah's Ark Lab
Volunteers
David Alfonso-Hermelo
Huawei Noah's Ark Lab
Khalil Bibi
Haven Studios
Mahsa Ghazvini Nejad
Huawei Noah's Ark Lab
Ali Edalati
Huawei Noah's Ark Lab
Technical Committee
- Dasgupta Sabyasachi (Sanofi)
- Dan Alistarh (ISTA)
- Vahid Partovi Nia (Ecole Polytechnique Montreal & Huawei)
- Tanya Roosta (Amazon)
- Peyman Passban (Sanofi)
- Ehsaneddin Asgari (QCRI)
- Hamidreza Saghir (Microsoft)
- Yue Dong (University of California, Riverside)
- Ruijiang Li (Sanofi)
- Abbas Ghaddar (Huawei Noah's Ark Lab)
- Alireza Ghaffari (McGill University)
- Yu Cheng (Chinese University of Hong Kong)
- Jahangir Alam (CRIM-Montreal)
- Hamidreza Mahyar (McMaster University)
- Yufei Cui (Huawei Noah's Ark Lab)
- Mahdi Biparva (Huawei Noah's Ark Lab)
- Soheila Samiee (BASF)
- Walid Ahmed (Huawei Technologies Canada)
- Ehsan Kamalloo (ServiceNow Research)
- Anderson Avila (INRS-EMT)
- Abbas Rahimi (IBM)
- David Alfonso Hermelo (Huawei Noah's Ark Lab)
- Makesh Narsimhan Sreedhar (NVIDIA)
- Ahmad Rashid (University of Waterloo & Vector Institute)
- Suyuchen Wang (Universite de Montreal & Mila)
- Tianyu Jiang (University of Cincinnati)
- Peilin Yu (Brown University)
- Khalil Bibi
- Aysegul Bumin (Amazon)
- Abderrahim Fathan (CRIM-Montreal)
- Aref Jafari (University of Waterloo)
- Dan Fu (Stanford University)
- Anusha Sabbineni (Amazon)
- Parsa Omidi (Huawei Technologies Canada)
- Young Jin Kim (Microsoft)
- Giovanni Monea (EPFL)
- Mofetoluwa Adeyemi (University of Waterloo)
- Xindi Wang (University of Western Ontario)
- Alessio Brutti (Fondazione Bruno Kessler)
- Saleh Ashkboos (ETH Zurich)
- Parsa Kavehzadeh (Huawei Noah's Ark Lab)
- Hossein Rajabzadeh (University of Waterloo)
- Mohammadreza Tayaranian (McGill University)
- Varun Gangal (ASAPP Inc.)
- Sebastian Jaszczur (IDEAS NCBR, University of Warsaw)
- Ali Edalati (Huawei Noah's Ark Lab)
- Mojtaba Valipour (University of Waterloo)
- Heitor Guimarães (INRS University)
- Jing Li (Mitsubishi Electric Research Laboratories)
- Mohammad Ruhul Amin (Fordham University)
- Mohammad Dehghan (Autodesk)
- Raffy Fahim (Microsoft)
- Feiyang Kang (Virginia Tech University)
- Ning Shi (University of Alberta)
- Daria Soboleva (Cerebras Systems)
- Qingru Zhang (Georgia Institute of Technology)
- Lilly Kumari (University of Washington)
- Thomas Ortner (IBM Research Zurich - Europe)
- Dominik Wagner (Technische Hochschule Nuernberg)
- Benyamin Jamialahmadi (University of Waterloo)
- Tianshu Zhu (Huawei Noah's Ark Lab)
- Haoran Zhao (Drexel University & University of Washington)
- Satya Sai Srinath Namburi (Amazon)
- Mouloud Belbahri (Layer 6 AI)
- Abhishek Panigrahi (Princeton University)
- Arthur Pimentel (INRS)
- Mahsa Salmani (Huawei Technologies Canada)
- Mohammad Ali Alomrani (Huawei Noah's Ark Lab)
- Abdul Hameed Azeemi (Lahore University)
- Mohammadreza Pourreza (Google Research)
- Yunan Zhang (Microsoft)
- MohammadAli SadraeiJavaheri (Sharif University)
- Omid Ghahroodi (Sharif University)
- Adam Lee (UC Berkeley)
Platinum Sponsor
Gold Sponsors