Meta’s new BLT architecture upgrades LLMs by replacing tokens
December 19, 2024

The artificial intelligence research community is constantly looking for new ways to improve large language models (LLMs), and the latest method is a new architecture introduced by scientists from Meta and the University of Washington.

Their technique, the Byte Latent Transformer (BLT), may be the next important paradigm in making LLMs more versatile and scalable.

BLT addresses one of the long-standing limitations of LLMs by operating at the byte level rather than on tokens. BLT can open the way to new models that handle raw bytes, are robust to noisy input, and do not rely on fixed vocabularies.

Tokens and Bytes

Most LLMs are trained with a static, pre-defined vocabulary of tokens, each of which stands for a sequence of bytes.

During inference, the tokenizer decomposes the input sequence into tokens and passes them to the LLM.

This makes the model more efficient in its use of computational resources, but it can also introduce biases that degrade the model’s performance on text whose words are not well represented in the vocabulary.

For example, many leading language models can become slower and more expensive when faced with languages that are less represented on the web, because their words are not included in the model’s token vocabulary. Misspelled words can also cause the tokenizer to split the input incorrectly. And tokenized models can have difficulty with character-level tasks, such as manipulating sequences of characters.
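To make that failure mode concrete, here is a minimal sketch of greedy longest-match tokenization over a tiny hypothetical vocabulary (an illustration only, not the BPE tokenizers production LLMs actually use). Text built from in-vocabulary words compresses into a few tokens, while misspellings and unseen words fragment into many single-character tokens:

```python
# Illustrative sketch (not any specific model's tokenizer): a toy greedy
# longest-match tokenizer with a tiny hypothetical vocabulary, showing how
# text outside the vocabulary falls back to many single-character tokens.
VOCAB = {"the", "cat", "sat", "on", "mat", " ", "ing", "un"}

def toy_tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Greedily take the longest vocabulary entry that matches at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # fall back to a single character
        tokens.append(match)
        i += len(match)
    return tokens

print(toy_tokenize("the cat sat"))    # few tokens: all words are in-vocabulary
print(toy_tokenize("teh czat szat"))  # misspellings fragment into many tokens
```

In a real BPE vocabulary the fallback is to byte-level tokens rather than characters, but the effect is the same: under-represented text costs more tokens, and therefore more compute.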

Additionally, modifying the vocabulary requires retraining the model. Expanding the token vocabulary may require architectural changes to the model to accommodate the increased complexity.

Alternatively, LLMs can be trained directly on individual bytes, which can solve many of the problems mentioned above. However, training byte-level LLMs at scale is prohibitively expensive, and the resulting models struggle with very long sequences, which is why tokenization remains a core component of current LLMs.

Byte Latent Transformer (BLT)

Byte Latent Transformer (BLT) is a tokenizer-less architecture that learns directly from raw bytes and matches the performance of tokenization-based models. To address the inefficiencies of other byte-level LLMs, BLT uses a dynamic approach to group bytes based on the level of information they contain.

“A core idea of our architecture is that models should dynamically allocate computation where needed,” the researchers wrote.

Unlike tokenized models, BLT does not have a fixed vocabulary. Instead, it maps arbitrary groups of bytes into patches using entropy measures. BLT performs this dynamic patching through a novel architecture with three transformer blocks: two small byte-level encoder/decoder models and a large “latent global transformer.”
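The patching rule places a boundary wherever a small byte-level language model finds the next byte hard to predict. Below is a minimal sketch of that idea; a bigram count model stands in for the small byte LM, and the entropy threshold is an illustrative assumption, not the paper’s actual component or setting:

```python
import math
from collections import Counter, defaultdict

# Minimal sketch of entropy-driven patching. The paper uses a small byte-level
# language model to estimate next-byte entropy; here a bigram count model built
# from the text itself is a stand-in, and the threshold is illustrative.

def bigram_model(data: bytes):
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    return counts

def next_byte_entropy(counts, prev: int) -> float:
    dist = counts.get(prev)
    if not dist:
        return 8.0  # unseen context: assume maximum uncertainty (8 bits/byte)
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())

def entropy_patches(data: bytes, counts, threshold: float = 2.0):
    patches, start = [], 0
    for i in range(1, len(data)):
        # Start a new patch where the next byte is hard to predict.
        if next_byte_entropy(counts, data[i - 1]) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

text = b"the cat sat on the mat, the cat sat on the mat"
model = bigram_model(text)
print([p.decode() for p in entropy_patches(text, model)])
```

Predictable stretches (such as the middle of a repeated word) are absorbed into long patches, while surprising positions start new, shorter ones.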

BLT architecture (Source: arXiv)

The encoder and decoder are lightweight models. The encoder takes in the raw input bytes and builds the patch representations that are fed to the global transformer. At the other end, the local decoder takes the patch representations processed by the global transformer and decodes them into raw bytes.

The latent global transformer is the main workhorse of the model. It takes the patch representations produced by the encoder and predicts the next patch in the sequence. When processed by the decoder, that patch is unpacked into one or more bytes.
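To make the flow through the three blocks concrete, here is a heavily simplified PyTorch sketch, not Meta’s implementation: byte embeddings pass through a small local encoder, are mean-pooled into patch vectors (a stand-in for the pooling mechanisms in the paper), processed by a larger latent transformer, and broadcast back to a small local decoder that predicts next-byte logits. Sizes, layer counts, and the pooling scheme are all illustrative assumptions, and causal masking and positional information are omitted for brevity:

```python
import torch
import torch.nn as nn

def block(d_model: int, n_layers: int, n_heads: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class ToyBLT(nn.Module):
    def __init__(self, d_local=128, d_global=512):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_local)                # one embedding per byte value
        self.encoder = block(d_local, n_layers=2, n_heads=4)      # lightweight local encoder
        self.to_global = nn.Linear(d_local, d_global)
        self.global_tf = block(d_global, n_layers=8, n_heads=8)   # large latent global transformer
        self.to_local = nn.Linear(d_global, d_local)
        self.decoder = block(d_local, n_layers=2, n_heads=4)      # lightweight local decoder
        self.byte_head = nn.Linear(d_local, 256)                  # next-byte logits

    def forward(self, byte_ids, patch_ids):
        # byte_ids: (batch, n_bytes) ints in [0, 255]
        # patch_ids: (batch, n_bytes) index of the patch each byte belongs to
        # (this demo assumes batch size 1 / a single shared patching)
        h = self.encoder(self.byte_emb(byte_ids))                 # byte-level features
        n_patches = int(patch_ids.max().item()) + 1
        # Mean-pool byte features into one vector per patch (stand-in for pooling).
        pooled = torch.zeros(byte_ids.size(0), n_patches, h.size(-1))
        pooled = pooled.index_add(1, patch_ids[0], h) / torch.bincount(
            patch_ids[0], minlength=n_patches).clamp(min=1).view(1, -1, 1)
        g = self.global_tf(self.to_global(pooled))                # model the patch sequence
        # Broadcast each patch's global state back to its bytes, then decode bytes.
        per_byte = self.to_local(g)[:, patch_ids[0], :]
        return self.byte_head(self.decoder(h + per_byte))         # (batch, n_bytes, 256)

bytes_in = torch.randint(0, 256, (1, 12))
patches = torch.tensor([[0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3]])
print(ToyBLT()(bytes_in, patches).shape)  # torch.Size([1, 12, 256])
```

Note how the expensive block runs once per patch, while the cheap encoder and decoder run per byte; that is the source of the compute savings discussed next.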

The global transformer accounts for the largest share of compute during training and inference. The patching mechanism therefore determines how the global transformer is used and can help control how much computation is spent on different parts of the input and output.

BLT redefines the trade-off between vocabulary size and computational requirements. In a standard LLM, a larger vocabulary means longer tokens on average, which can reduce the number of steps required to process a sequence. However, the projection layer at the transformer’s output also has to grow, which itself consumes more resources.

In contrast, BLT balances computational resources based on the complexity of the data rather than the size of the vocabulary. For example, most word endings are easy to predict and require few resources, while predicting the first byte of a new word or the first word of a sentence requires more compute.
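A rough back-of-the-envelope calculation illustrates the vocabulary trade-off described above. The hidden size, vocabulary sizes, and average bytes per token below are illustrative assumptions, not figures from the paper, and only the output projection is counted:

```python
# Illustrative arithmetic only: larger vocabularies mean longer tokens (fewer
# generation steps) but a bigger output projection at every step.
d_model = 4096
doc_bytes = 1_000_000

for vocab_size, avg_token_bytes in [(32_000, 3.5), (128_000, 4.5)]:
    steps = doc_bytes / avg_token_bytes               # longer tokens -> fewer steps
    proj_flops_per_step = 2 * d_model * vocab_size    # bigger projection per step
    total = steps * proj_flops_per_step
    print(f"vocab={vocab_size:>7}: steps={steps:,.0f}, "
          f"projection FLOPs/step={proj_flops_per_step:,}, total={total:.2e}")
```

With these assumed numbers, quadrupling the vocabulary trims the step count only modestly while quadrupling the per-step projection cost. BLT sidesteps that coupling: the step count of its large global transformer is set by patch size, which tracks data complexity rather than vocabulary design.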

“BLT opens a new dimension of scaling, allowing simultaneous increases in model and patch size within a fixed inference budget,” the researchers wrote. “This new paradigm becomes advantageous for compute regimes common in real-world settings.”

BLT in action

The researchers conducted experiments comparing BLT against classic transformer models at different sizes, ranging from 400 million to 8 billion parameters.

According to the authors, this is “the first FLOP-controlled scaling study of byte-level models with up to 8B parameters and 4T training bytes, showing that we can train models end-to-end from bytes without fixed-vocabulary tokenization.”

Their results show that, when controlling for the amount of compute allocated to training, BLT matches the performance of Llama 3 while reducing the number of FLOPs used during inference by up to 50%. This efficiency comes from the model’s dynamic patching, which produces longer patches on average; the saved compute can be reallocated to increase the size of the latent global transformer.

“To our knowledge, BLT is the first byte-level Transformer architecture capable of achieving scaling trends that match BPE-based models under computational optimality,” the researchers wrote.

In addition to efficiency, BLT models proved more robust to noisy inputs than tokenizer-based models. They show stronger character-level understanding and better performance on tasks such as character manipulation and low-resource machine translation. The researchers say that BLT’s ability to process raw bytes directly “provides significant improvements in modeling the long tail of the data” compared to tokenization, meaning the model is better at handling patterns that appear infrequently in the training corpus.

This is still the early days of what could become a new standard for building language models. The researchers note that existing transformer libraries and codebases are designed to be highly efficient for tokenizer-based transformer architectures, which means there is still room for BLT to benefit from software and hardware optimizations.

