Hugging Face shows how test-time scaling helps small language models punch above their weight
In a new case study, Hugging Face researchers show how small language models (SLMs) can be configured to outperform much larger models. Their results show that a Llama 3 model with 3B parameters can outperform the 70B version of the model on complex math problems.
Hugging Face has published a full account of the process, which provides a roadmap for enterprises that want to create their own customized reasoning models.
Scaling test-time compute
The work is inspired by OpenAI o1, which uses extra "thinking" to solve complex math, coding, and reasoning problems.
The key idea behind a model like o1 is to scale "test-time compute," which essentially means using more compute cycles during inference to test and verify different responses and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model.
Because o1 is a proprietary model and OpenAI has remained tight-lipped about its inner workings, researchers have been speculating about how it works and trying to reverse-engineer the process. There are already several open alternatives to o1.
The Hugging Face work is based on DeepMind research released in August, which studies the trade-off between inference-time and pre-training compute. The study provides comprehensive guidance on how to balance training and inference compute to get the best results for a fixed budget.
Beyond using extra inference-time compute, the success of the technique hinges on two key components: a reward model that evaluates the SLM's answers, and a search algorithm that optimizes the path it takes to refine those answers.
Different reasoning algorithms
The simplest way to use test-time scaling is "majority voting," in which the same prompt is sent to the model multiple times and the most common answer is chosen. Majority voting can be useful on simple problems, but its gains quickly plateau on complex reasoning problems or on tasks where errors are consistent across generations.
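As a rough illustration, here is a minimal Python sketch of majority voting. The `generate_answer` helper is a hypothetical stand-in for a sampled call to the small model (for example, via an inference endpoint); it is not part of the Hugging Face code.

```python
from collections import Counter

def generate_answer(prompt: str) -> str:
    """Hypothetical helper: one sampled completion from the small model."""
    raise NotImplementedError

def majority_vote(prompt: str, n_samples: int = 16) -> str:
    # Sample the model several times on the same prompt...
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    # ...and return whichever final answer appears most often.
    return Counter(answers).most_common(1)[0][0]
```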
A more advanced inference method is "Best-of-N." In this technique, the SLM generates multiple answers, but instead of taking a majority vote, a reward model evaluates the candidates and selects the best one. "Weighted Best-of-N," a more nuanced version of this method, factors in consistency to choose answers that are both confident and occur more frequently than others.
The researchers used a process reward model (PRM), which scores the SLM's responses based not only on the final answer but also on the intermediate steps taken to reach it. Their experiments show that weighted Best-of-N with a PRM brings Llama-3.2 1B close to the performance of Llama-3.1 8B on the difficult MATH-500 benchmark.
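The sketch below shows one way weighted Best-of-N with a reward model could be wired up. It is a simplified illustration under assumptions, not the researchers' implementation: `generate_answer` and `prm_score` are hypothetical helpers standing in for the model and the reward model.

```python
from collections import defaultdict

def generate_answer(prompt: str) -> str:
    """Hypothetical helper: one sampled completion from the small model."""
    raise NotImplementedError

def prm_score(prompt: str, answer: str) -> float:
    """Hypothetical helper: a process reward model's score for an answer,
    e.g. aggregated over its intermediate reasoning steps."""
    raise NotImplementedError

def weighted_best_of_n(prompt: str, n_samples: int = 16) -> str:
    # Accumulate reward scores per distinct answer, so answers that are
    # both high-scoring and frequent come out on top.
    totals = defaultdict(float)
    for _ in range(n_samples):
        answer = generate_answer(prompt)
        totals[answer] += prm_score(prompt, answer)
    return max(totals, key=totals.get)
```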
Adding search algorithms
To further improve the model's performance, the researchers added a search algorithm to its reasoning process. Rather than generating an answer in a single pass, they used "beam search," an algorithm that guides the model through the answer process step by step.
At each step, the SLM produces multiple partial answers. The search algorithm uses the reward model to evaluate them and selects a subset worth exploring further. The process repeats until the model exhausts its inference budget or reaches the correct answer. This way, the inference budget can be narrowed down and focused on the most promising answers.
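A minimal sketch of this kind of step-by-step search is shown below. The `extend`, `prm_score`, and `is_complete` helpers are assumptions standing in for the model's step generation, the process reward model, and an end-of-solution check; the real recipe's scoring and stopping rules differ in detail.

```python
def extend(prompt: str, partial: str, n: int) -> list[str]:
    """Hypothetical helper: sample n one-step continuations of a partial solution."""
    raise NotImplementedError

def prm_score(prompt: str, partial: str) -> float:
    """Hypothetical helper: process reward model score for a partial solution."""
    raise NotImplementedError

def is_complete(partial: str) -> bool:
    """Hypothetical helper: does the partial solution contain a final answer?"""
    raise NotImplementedError

def beam_search(prompt: str, beam_width: int = 4, expansions: int = 4,
                max_steps: int = 8) -> str:
    beams = [""]  # start from an empty solution
    for _ in range(max_steps):
        # Expand every surviving partial solution by one reasoning step.
        candidates = [c for b in beams for c in extend(prompt, b, expansions)]
        # Keep only the highest-scoring partial solutions for the next round.
        candidates.sort(key=lambda c: prm_score(prompt, c), reverse=True)
        beams = candidates[:beam_width]
        if all(is_complete(b) for b in beams):
            break
    # Return the top-ranked solution.
    return beams[0]
```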
The researchers found that while beam search improved model performance on complex problems, it often underperformed other techniques on simpler problems. To address this challenge, they added two more elements to their reasoning strategy.
The first is Diverse Verifier Tree Search (DVTS), a variant of beam search that ensures the SLM does not get stuck on incorrect reasoning paths and diversifies its answer branches. Second, they developed a "compute-optimal scaling strategy," as suggested in the DeepMind paper, which dynamically chooses the best test-time scaling strategy based on the difficulty of the input problem.
The combination of these techniques allows Llama-3.2 1B to punch above its weight and significantly outperform the 8B model. The researchers also found the strategy scales: applied to Llama-3.2 3B, it enabled the smaller model to outperform the much larger 70B model.
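A compute-optimal router could look roughly like the following sketch, which reuses the `weighted_best_of_n` and `beam_search` sketches above. The `estimate_difficulty` helper and the routing rule are illustrative assumptions, not the exact rules from the DeepMind paper or the Hugging Face recipe.

```python
def estimate_difficulty(prompt: str) -> str:
    """Hypothetical helper: bucket a problem as 'easy' or 'hard', for
    example from reward-model scores on a few quick samples."""
    raise NotImplementedError

def solve_compute_optimal(prompt: str, budget: int = 64) -> str:
    # Spend a fixed inference budget on the strategy that tends to win at
    # the estimated difficulty: parallel sampling for easier problems,
    # step-by-step search for harder ones.
    if estimate_difficulty(prompt) == "easy":
        return weighted_best_of_n(prompt, n_samples=budget)
    return beam_search(prompt, beam_width=4, expansions=budget // 4)
```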
There is no perfect solution yet
Scaling test-time compute changes the cost dynamics of models. Enterprises can now choose where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.
However, test time scaling also has its limitations. For example, in the experiments conducted by Hugging Face, the researchers used the specially trained Llama-3.1-8B model as the PRM, which required running two models in parallel (even though it was much more resource efficient than the 70B model). The researchers admit that the holy grail of test-time scaling is “self-validation,” where the original model verifies its own answers rather than relying on an external verifier. This is an open area of research.
The test-time scaling techniques presented in this study are also limited to problems where the answer can be clearly evaluated, such as coding and math. Creating reward models and verifiers for subjective tasks such as creative writing and product design requires further research.
But it is clear that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to emerge in the coming months. Enterprises would be wise to keep an eye on how the landscape develops.