Meta AI LM-Infinite – Massive LLM improvement!
Paper: https://huggingface.co/papers/2308.16137

— Overview:
This paper identifies and addresses a key limitation of large language models (LLMs): their inability to generalize to sequence lengths longer than those seen during training. Even models using relative position encodings struggle to generate coherent text beyond their training context length. Through empirical analysis, the authors diagnose three contributing factors and propose a simple, efficient solution called LM-Infinite that enables on-the-fly length generalization without retraining. When tested on models such as LLaMA and GPT-J, LM-Infinite maintained fluency and performance at lengths up to 32x longer than the training context, while running roughly 3x faster.

— Background:
LLMs have achieved impressive results in natural language generation, but they still struggle with long sequences of text. Most training schemes cap sequence length at a fixed size to control cost, so models degrade into incoherent text when they encounter longer contexts at inference time. Relative position encodings were meant to mitigate this, yet length generalization still fails in practice, and common remedies such as fine-tuning on longer texts are compute-intensive. This motivates an efficient on-the-fly solution.

— Key Factors Limiting Length Generalization:

1. Unseen Long Distances: Relative position encodings depend on the distance between tokens. On very long sequences, some distances far exceed anything seen during training, and attention logits explode as the model tries to handle these unfamiliar distances.

2. Unseen Number of Tokens: Longer contexts mean each token attends to many more positions. Attention is diluted across them, its entropy rises, and information is washed out, as the toy illustration below shows.

3. Implicitly Encoded Absolute Position: Earlier transformer layers appear to implicitly encode some absolute position information. When the sequence length grows, this encoding of the initial tokens gets distorted.
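
To make the second factor concrete, here is a toy illustration (not from the paper; the script and its numbers are purely illustrative). With near-uniform attention scores, the entropy of the softmax distribution grows roughly like log(n) as the number of attended tokens n increases, so the weight on any single token shrinks:

```python
import numpy as np

def attention_entropy(logits: np.ndarray) -> float:
    """Entropy (in nats) of the softmax distribution over a row of attention logits."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

rng = np.random.default_rng(0)
for n in (2_000, 8_000, 32_000):                   # number of attended tokens
    logits = rng.normal(size=n)                    # near-uniform attention scores
    print(n, round(attention_entropy(logits), 2))  # entropy grows roughly like log(n)
```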

— Proposed Solution – LM-Infinite:
To address the above factors, the authors propose two simple modifications:

1. Λ-Shaped Attention Mask: Limits which tokens are attended to, preserving the recent local context while always attending to the initial salient tokens. This retains some position information and prevents attention dilution.

2. Bounding Relative Distances: Clips the effective distance used in attention to the maximum training length, which caps the exploding logits caused by unseen long distances.

These core principles make LM-Infinite model-agnostic: it can be applied to any LLM that uses relative position encodings such as RoPE or ALiBi, without retraining. For RoPE, query vectors are rotated as if at a bounded distance while key vectors remain unchanged; for ALiBi, the offset values are clipped to the training length.
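
As a rough sketch of the distance-bounding step, the snippet below builds an ALiBi-style causal bias whose relative distances are clipped at the training length. The function name and arguments are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def clipped_alibi_bias(seq_len: int, slope: float, train_len: int) -> np.ndarray:
    """Causal ALiBi-style attention bias with relative distances capped at train_len (sketch)."""
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]           # distance i - j between query i and key j
    dist = np.clip(dist, 0, train_len)           # bound unseen long distances
    bias = -slope * dist.astype(float)           # ALiBi penalty grows linearly with distance
    bias[pos[:, None] < pos[None, :]] = -np.inf  # causal mask: block attention to future tokens
    return bias

bias = clipped_alibi_bias(seq_len=8, slope=0.5, train_len=4)
```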

The Λ-mask lets each token attend to the first n_global tokens of the sequence and to all tokens within a distance of n_local. Typical values are n_global = 10-100 and n_local = the training length. Distance bounding affects only the global branch. Together, these choices retain local context and global position information while limiting both unfamiliar distances and the number of attention targets.
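
A minimal sketch of how such a Λ-shaped mask could be constructed (the function name and the small example values are assumptions for illustration, not the paper's code):

```python
import numpy as np

def lambda_mask(seq_len: int, n_global: int, n_local: int) -> np.ndarray:
    """Boolean mask: True where query position i is allowed to attend to key position j."""
    q = np.arange(seq_len)[:, None]    # query positions i
    k = np.arange(seq_len)[None, :]    # key positions j
    causal = k <= q                    # never attend to future tokens
    global_branch = k < n_global       # always keep the initial (salient) tokens
    local_branch = (q - k) <= n_local  # sliding window of recent tokens
    return causal & (global_branch | local_branch)

# Tiny example: 12 tokens, 2 global tokens, local window of 4
print(lambda_mask(12, n_global=2, n_local=4).astype(int))
```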

— Experiments and Results:
LM-Infinite was tested on a variety of LLMs, including LLaMA, GPT-J, and MPT-7B, using the ArXiv and OpenWebText datasets.

– Perplexity remained stable at lengths up to 32k tokens, 3-16x longer than the training length, indicating maintained fluency.

– BLEU and ROUGE scores also stayed consistent, with generation quality comparable to or better than fine-tuned models.

– On passkey retrieval with distractors, accuracy degraded more slowly than for baseline models, extending the range of coherent generation.

– Encoding and decoding were sped up by 3.16x and 2.72x respectively at length 32k, with no drop in quality.

Overall, LM-Infinite enabled excellent on-the-fly length generalization across models without any parameter updates, providing efficient length extension and avoiding costly retraining.

— Conclusion:
This paper identified key limitations of relative position encodings: unseen distances, excessive numbers of attended tokens, and distorted position encoding. It introduced LM-Infinite to address these issues through attention masking and distance bounding. This simple technique demonstrated consistent fluency and performance across a variety of LLMs at lengths far exceeding the training context. LM-Infinite thus provides an effective, model-agnostic solution for length generalization without requiring fine-tuning. Future work could further improve information retention from masked-out content.


