Alibaba researchers unveil Marco-o1, an LLM with advanced reasoning capabilities
The recent release of OpenAI o1 has drawn much attention to large reasoning models (LRMs) and is inspiring new models that aim to solve complex problems that classic language models often struggle with. Building on the success of o1 and the concept of LRMs, Alibaba researchers have presented Marco-o1, which improves reasoning abilities and tackles open-ended problems where clear standards and measurable rewards are lacking.
OpenAI o1 uses “inference-time scaling” to improve the model’s reasoning ability by giving it “time to think.” Essentially, the model uses more compute cycles during inference to generate more tokens and review its responses, which improves its performance on reasoning-heavy tasks. o1 is known for its impressive reasoning abilities, especially on tasks with standard answers, such as math, physics and coding.
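To make the idea concrete, here is a minimal Python sketch of one common inference-time scaling pattern, best-of-N sampling with self-review. The functions `model_generate` and `model_score` are hypothetical stand-ins for a model’s sampling and self-evaluation calls; this illustrates the general pattern, not o1’s actual mechanism.

```python
import random

# Toy sketch of inference-time scaling: spend extra compute at inference by
# sampling several candidate answers and letting a scorer pick the best one.
# `model_generate` and `model_score` are hypothetical placeholders; this is
# not OpenAI's actual o1 mechanism.

def model_generate(prompt: str) -> str:
    # Placeholder: in practice, sample a chain of thought plus an answer from an LLM.
    return f"candidate-{random.randint(0, 9)}"

def model_score(prompt: str, answer: str) -> float:
    # Placeholder: in practice, have the model review and grade its own answer.
    return random.random()

def solve(prompt: str, n_samples: int = 8) -> str:
    # More samples means more tokens and more compute cycles, but a better
    # chance of surfacing a correct, well-reasoned answer.
    candidates = [model_generate(prompt) for _ in range(n_samples)]
    return max(candidates, key=lambda a: model_score(prompt, a))

print(solve("What is 17 * 24?"))
```

The trade-off is visible in `n_samples`: raising it buys better answers at the cost of proportionally more inference compute.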
However, many applications involve open-ended problems that lack clear solutions and measurable rewards. “We aimed to push the boundaries of LLMs even further, enhancing their reasoning abilities to tackle complex real-world challenges,” the Alibaba researchers write.
Marco-o1 is a fine-tuned version of Alibaba’s Qwen2-7B-Instruct that integrates advanced techniques such as chain-of-thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS) and reasoning action strategies.
The researchers trained Marco-o1 on a combination of datasets, including the Open-O1 CoT dataset; the Marco-o1 CoT dataset, a synthetic dataset generated using MCTS; and the Marco-o1 instruction dataset, a collection of custom instruction-following data for reasoning tasks.
MCTS is a search algorithm that has proven effective in complex problem-solving scenarios. It intelligently explores different solution paths by repeatedly sampling possibilities, simulating outcomes and gradually building a decision tree. The approach famously powered AI systems that beat top human players at the game of Go.
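As a rough illustration of the algorithm, a minimal, generic MCTS loop might look like the sketch below. The `expand_fn` and `rollout_fn` hooks, which propose child states and estimate a state’s value, are assumptions supplied by the caller rather than anything from Alibaba’s code.

```python
import math
import random

# A minimal, generic MCTS loop (illustrative only; not Alibaba's code).
# `expand_fn` proposes child states; `rollout_fn` simulates a state to a
# scalar reward. Both are assumptions supplied by the caller.

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        # Upper Confidence Bound: balances exploiting good branches
        # against exploring rarely visited ones.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def mcts(root_state, expand_fn, rollout_fn, n_iters=100):
    root = Node(root_state)
    for _ in range(n_iters):
        # 1. Selection: descend via UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: add children proposed by the domain.
        for s in expand_fn(node.state):
            node.children.append(Node(s, parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Simulation: estimate the value of this state.
        reward = rollout_fn(node.state)
        # 4. Backpropagation: update statistics up to the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited child's state as the chosen move/step.
    return max(root.children, key=lambda n: n.visits).state
```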
Marco-o1 uses MCTS to explore multiple reasoning paths as it generates response tokens. The model uses confidence scores of candidate response tokens to build its decision tree and explore different branches. This allows the model to consider a wider range of possibilities and reach more informed, nuanced conclusions, especially in open-ended scenarios. The researchers also introduced a flexible reasoning action strategy that lets them adjust the granularity of MCTS steps by defining the number of tokens generated at each node of the tree. This provides a trade-off between accuracy and computational cost, giving users the flexibility to balance performance and efficiency.
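One plausible way to turn token probabilities into the reward signal that guides such a search, in the spirit of what the researchers describe, is to softmax-normalize each generated token’s log-probability against its top-k alternatives and average those confidences over a rollout. The sketch below assumes access to top-k log-probabilities; the function names and exact normalization details are illustrative.

```python
import math

# Sketch of using token confidence as an MCTS reward signal: each generated
# token's probability is normalized against its top-k alternatives, and a
# rollout's reward is the average confidence of its tokens. Illustrative
# names and details; not Marco-o1's exact implementation.

def token_confidence(token_logprob: float, topk_logprobs: list[float]) -> float:
    # Normalize the chosen token's probability against the top-k candidates.
    denom = sum(math.exp(lp) for lp in topk_logprobs)
    return math.exp(token_logprob) / denom

def rollout_reward(step_logprobs: list[tuple[float, list[float]]]) -> float:
    # Average per-token confidence over the whole reasoning step or rollout.
    scores = [token_confidence(lp, topk) for lp, topk in step_logprobs]
    return sum(scores) / len(scores)

# Example: two tokens, each with its log-prob and the top-3 alternatives'
# log-probs (the chosen token included among them).
steps = [(-0.1, [-0.1, -2.3, -3.0]), (-0.5, [-0.5, -0.9, -4.0])]
print(rollout_reward(steps))  # higher = more confident reasoning path
```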
Another key innovation in Marco-o1 is the introduction of a reflection mechanism. During the reasoning process, the model periodically prompts itself with the phrase: “Wait! Maybe I made some mistakes! I have to think again from scratch.” This causes the model to reevaluate its reasoning steps, identify potential errors and refine its thought process.
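Conceptually, the reflection mechanism can be pictured as a prompt loop like the hypothetical sketch below, where `llm` stands in for a call that continues the model’s transcript; this illustrates the idea rather than Alibaba’s actual implementation.

```python
# Illustrative sketch of the reflection mechanism as a prompt loop.
# `llm` is a hypothetical callable that continues a transcript; this is a
# sketch of the idea, not Alibaba's actual implementation.

REFLECTION_TRIGGER = ("Wait! Maybe I made some mistakes! "
                      "I have to think again from scratch.")

def reason_with_reflection(llm, problem: str, n_reflections: int = 2) -> str:
    transcript = f"Problem: {problem}\nReasoning:"
    transcript += llm(transcript)
    for _ in range(n_reflections):
        # Inject the self-doubt phrase so the model re-examines its own
        # chain of thought and corrects or refines it.
        transcript += f"\n{REFLECTION_TRIGGER}\n"
        transcript += llm(transcript)
    return transcript

# Toy stand-in so the sketch runs end to end.
demo_llm = lambda ctx: " ...one step of reasoning..."
print(reason_with_reflection(demo_llm, "Is 91 prime?"))
```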
“This approach allows the model to act as its own critic, identifying potential errors in its reasoning,” the researchers write. “By explicitly encouraging the model to reconsider its initial conclusions, we encourage it to re-express and refine its thought process.”
To evaluate Marco-o1’s performance, the researchers ran experiments on several tasks, including the MGSM benchmark, a dataset of multilingual grade-school math problems. Marco-o1 significantly outperformed the base Qwen2-7B model, especially when the MCTS component was configured for single-token granularity.
However, the primary goal of Marco-o1 is to handle reasoning challenges in open-ended scenarios. To that end, the researchers tested the model on translating colloquial and slang expressions, a task that requires understanding subtle nuances of language, culture and context. Experiments showed that Marco-o1 captured and translated these expressions more effectively than traditional translation tools. For example, the model correctly translated a Chinese colloquial expression that literally means, “This shoe offers the feeling of stepping on poop,” into the English equivalent, “This shoe has a comfortable sole.” The model’s chain of reasoning shows how it weighs the various possible meanings and arrives at the correct translation.
This paradigm can prove useful for tasks such as product design and strategy, which require deep, contextual understanding and lack well-defined benchmarks and metrics.
A new wave of reasoning models
Since the release of o1, AI labs have been racing to publish reasoning models. Last week, Chinese AI lab DeepSeek announced R1-Lite-Preview, its o1 competitor, which is currently available only through the company’s online chat interface. R1-Lite-Preview reportedly beats o1 on several key benchmarks.
The open-source community is also catching up with the private model market, publishing models and datasets that take advantage of inference-time scaling laws. The Alibaba team released Marco-o1 on Hugging Face along with a partial reasoning dataset that researchers can use to train their own reasoning models. Another recently released model is LLaVA-o1, developed by researchers from multiple universities in China, which brings the inference-time reasoning paradigm to open-source vision language models (VLMs).
The release of these models comes amid uncertainty over the future of model scaling laws. Various reports indicate that the returns from training ever-larger models are diminishing and may be hitting a wall. What is certain is that we are only beginning to explore the possibilities of inference-time scaling.