
Accepted papers

poster
Snakes and Ladders: Accelerating State Space Model Inference with Speculative Decoding
poster
GEAR: An Efficient Error Reduction Framework for KV Cache Compression in LLM Inference
poster
One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
poster
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training
poster
Post-Training Statistical Calibration for Higher Activation Sparsity
poster
ThinK: Thinner Key Cache by Query-Driven Pruning
poster
BiRNA-BERT: Adaptive Tokenization for Efficient RNA Language Modeling
poster
Disentangling Questions from Query Generation for Task-Adaptive Retrieval
poster
The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation
poster
Different Rates for Different Weights: Decoupled Relative Learning Rate Schedules
poster
Distributed Speculative Inference of Large Language Models
poster
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
poster
Text Summarization With Graph Attention Networks
poster
How Redundant Is the Transformer Stack in Speech Representation Models?
poster
Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models
poster
AdaEDL: Early Draft Stopping for Speculative Decoding of Large Language Models via an Entropy-based Lower Bound on Token Acceptance Probability
poster
OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters
poster
Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts
poster
Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection
poster
VL-Mamba: Exploring State Space Models for Multimodal Learning
poster
An Evolved Universal Transformer Memory
poster
Less is Enough: Adapting Pre-trained Vision Transformers for Audio-Visual Speaker Verification
poster
Composite Attention: A Framework for Combining Sequence Mixing Primitives
poster
Inference-Friendly Models With MixAttention
poster
S2D: Sorted Speculative Decoding For More Efficient Deployment of Large Language Models
poster
On the Efficiency of NLP-Inspired Methods for Tabular Deep Learning
poster
OLMoE: Open Mixture-of-Experts Language Models
poster
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
poster
MisD-MoE: A Multimodal Misinformation Detection Framework with Adaptive Feature Selection
poster
Sparsified State-Space Models are Efficient Highway Networks
poster
Approximations may be all you need: Towards Pre-training LLMs with Low-Rank Decomposition and Optimizers
poster
Partially Shared Query-Key for Lightweight Language Models
poster
RGP: Achieving Memory-Efficient Model Fine-tuning Via Randomized Gradient Projection
poster
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
poster
Improving Multi-candidate Speculative Decoding
poster
Efficient Alignment of Large Language Models via Data Sampling
poster
Longhorn: State Space Models are Amortized Online Learners
poster
Towards Low-bit Communication for Tensor Parallel LLM Inference
poster
CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios
poster
StructMoE: Augmenting MoEs with Hierarchically Routed Low Rank Experts
poster
Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection
poster
Dynamic layer selection in decoder-only transformers
poster
KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation
poster
Sparse Upcycling: Inference Inefficient Finetuning
poster
A Unified Framework for Speculative Decoding with Multiple Drafters as a Bandit
poster
Residual vector quantization for KV cache compression in large language models
poster
Accelerating the Low-Rank Decomposed Models
poster
CROSS-JEM: Accurate and Efficient Cross-encoders for Short-text Ranking Tasks
poster
Enhanced label noise robustness through early adaptive filtering for the self-supervised speaker verification task
poster
Speculative Streaming: Fast LLM Inference without Auxiliary Models
poster
On the Implicit Relation between Low-Rank Adaptation and Differential Privacy
poster
Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?
poster
Hysteresis Activation Function for Efficient Inference
poster
Approximate Top-k for Increased Parallelism
poster
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression
poster
RAEE: A Robust Retrieval-Augmented Early Exiting Framework for Efficient Inference
poster
Efficiently Dispatching Flash Attention For Partially Filled Attention Masks
poster
Speculative Diffusion Decoding for Accelerated Language Generation
poster
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
poster
The EarlyBird Gets the WORM: Heuristically Accelerating EarlyBird Convergence
poster
Post Training Quantization of Large Language Models with Microscaling Formats
poster
Computational Bottlenecks of Training Small-scale Large Language Models
poster
QuAILoRA: Quantization-Aware Initialization for LoRA
poster
Mai Ho‘omāuna i ka ‘Ai: Language Models Improve Automatic Speech Recognition in Hawaiian
poster
Inducing Elasticity in Foundation Models: Post-Training Techniques for Adaptable Inference
poster
NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
poster
Enabling Resource-Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines
poster
EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models
poster
ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain
poster
Scaling laws for post-training quantized large language models
poster
Beyond Token Generation: Adaptive Chunk-Distilled Language Modeling
poster
Dataset Distillation for Audio Classification: A Data-Efficient Alternative to Active Learning
poster
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
poster
SharedContextBench: Evaluating Long-Context Methods in KV Cache Reuse
poster
Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences
poster
Dynamic Vocabulary Pruning in Early-Exit LLMs
poster
FastDraft: How to Train Your Draft
poster
Lightweight Neural Networks for Speech Emotion Recognition using Layer-wise Adaptive Quantization
poster
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
poster
Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts
poster
SuperPos-Prompt: Enhancing Soft Prompt Tuning of Language Models with Superposition of Multi Token Embeddings
poster
Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning
poster
Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond