Sakana AI’s CycleQD outperforms traditional fine-tuning methods for multi-skill language models
Researchers at Sakana AI have developed a resource-efficient framework for creating hundreds of language models specialized for different tasks. Called CycleQD, the technique uses evolutionary algorithms to combine the skills of different models without requiring expensive and slow training processes.
CycleQD can build large numbers of task-specific agents, offering a more sustainable alternative to the current paradigm of ever-increasing model size.
Rethinking model training
Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. However, training LLMs on multiple skills remains a challenge. When fine-tuning a model, engineers must balance data across different skills and ensure that one skill does not dominate the others. Current approaches often involve training larger and larger models, leading to ever-increasing computational and resource demands.
“We believe that rather than aiming to develop a single large model that performs well on all tasks, a population-based approach to evolving a diverse population of niche models may provide an alternative, more sustainable approach to expanding the development of artificial intelligence agents with advanced capabilities,” the Sakana researchers wrote in a blog post.
To create the model population, the researchers drew inspiration from Quality Diversity (QD), an evolutionary computing paradigm focused on discovering a diverse set of solutions from an initial population sample. QD aims to create samples with various “behavioral characteristics” (BCs), which represent different skill domains. It achieves this through an evolutionary algorithm (EA) that selects parent samples and applies crossover and mutation operations to create new samples.
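To make this concrete, here is a minimal MAP-Elites-style sketch of a quality-diversity loop. The genomes are toy 2-D vectors, and the behavior() and quality() functions are placeholders standing in for Sakana's actual models and benchmarks:

```python
# Illustrative quality-diversity (MAP-Elites-style) loop on a toy problem.
# Genomes are 2-D vectors standing in for model parameters; behavior() and
# quality() are placeholders, not CycleQD's actual metrics.
import random

def behavior(genome):
    # Discretize the genome into a behavioral-characteristic (BC) cell.
    return (round(genome[0], 1), round(genome[1], 1))

def quality(genome):
    # Toy quality score; CycleQD would use a task benchmark here.
    return -sum(x * x for x in genome)

archive = {}  # maps each BC cell to the best genome found for it

def try_insert(genome):
    cell = behavior(genome)
    if cell not in archive or quality(genome) > quality(archive[cell]):
        archive[cell] = genome

# Seed the archive with random genomes, then evolve by mutation.
for _ in range(100):
    try_insert([random.uniform(-1, 1), random.uniform(-1, 1)])

for _ in range(5000):
    parent = random.choice(list(archive.values()))
    child = [x + random.gauss(0, 0.1) for x in parent]
    try_insert(child)

print(f"archive holds {len(archive)} diverse elites")
```

The key property is that the archive retains the best solution per behavioral niche rather than a single global winner, which is what lets the population stay diverse.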
CycleQD
CycleQD incorporates QD into the post-training pipeline of LLMs to help them learn new, complex skills. CycleQD is useful when you have multiple small models that have been fine-tuned for very specific skills, such as coding or performing database and operating-system operations, and you want to create new variants that have different combinations of those skills.
In the CycleQD framework, each of these skills is treated either as a behavioral characteristic or as the quality metric used to optimize the next generation of models. In each generation, the algorithm cycles through the skills, focusing on one as its quality metric while using the others as BCs.
“This ensures that each skill receives attention, making the LLM more balanced and competent overall,” the researchers explain.
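A toy sketch of this rotation is shown below. Models are represented as dicts of per-skill scores, the skill names are hypothetical, and a single shared archive is used for simplicity; the real method operates on LLM weights rather than score dicts:

```python
# Toy sketch of CycleQD's rotation: each generation, one skill is the
# quality metric and the remaining skills act as behavioral characteristics.
# Models are dicts of per-skill scores, not real LLMs.
import random

skills = ["coding", "db_ops", "os_ops"]
archive = {}  # BC cell -> best model seen for that cell (simplified)

def insert(model, quality_skill, bc_skills):
    # BC scores determine the archive cell; keep the cell's best scorer
    # on the current quality skill.
    cell = tuple(round(model[s], 1) for s in bc_skills)
    best = archive.get(cell)
    if best is None or model[quality_skill] > best[quality_skill]:
        archive[cell] = model

# Seed with random "expert" models.
for _ in range(20):
    insert({s: random.random() for s in skills}, skills[0], skills[1:])

for gen in range(60):
    quality_skill = skills[gen % len(skills)]              # rotated each generation
    bc_skills = [s for s in skills if s != quality_skill]  # the rest act as BCs

    # Toy crossover (average of two parents) plus small Gaussian mutation.
    pa, pb = random.sample(list(archive.values()), 2)
    child = {s: (pa[s] + pb[s]) / 2 + random.gauss(0, 0.05) for s in skills}
    insert(child, quality_skill, bc_skills)
```

Rotating the quality metric is what gives each skill its turn as the optimization target, which is the balancing behavior the researchers describe.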
CycleQD starts with a population of expert LLMs, each specialized in a single skill. The algorithm then applies “crossover” and “mutation” operations to add new, higher-quality models to the population. Crossover combines the features of two parent models to create a new model, while mutation makes random changes to a model to explore new possibilities.
The crossover operation is based on model merging, a technique that combines the parameters of two LLMs to create a new model with their combined skills. Model merging is a cost-effective and fast way to develop well-rounded models without the need for fine-tuning.
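As a rough illustration, one common merging recipe is a simple linear interpolation of the two parents' parameters. CycleQD's actual crossover is more sophisticated, so treat the fixed coefficient below as a simplifying assumption:

```python
# Minimal parameter-level merge of two models (illustrative only).
# Real merging methods, including CycleQD's crossover, use more refined
# mixing than a single fixed interpolation coefficient.
import numpy as np

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return a new state dict interpolating between two parents."""
    return {name: alpha * sd_a[name] + (1 - alpha) * sd_b[name] for name in sd_a}

# Tiny usage example with fake two-layer "models".
rng = np.random.default_rng(0)
model_a = {"w1": rng.normal(size=(4, 4)), "w2": rng.normal(size=(4, 2))}
model_b = {"w1": rng.normal(size=(4, 4)), "w2": rng.normal(size=(4, 2))}
child = merge_state_dicts(model_a, model_b, alpha=0.6)
```

Because merging only averages existing weights, it requires no gradient updates at all, which is why it is so much cheaper than fine-tuning.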
The mutation operation uses singular value decomposition (SVD), a factorization method that breaks any matrix into simpler component parts, making its elements easier to understand and manipulate. CycleQD uses SVD to decompose a model’s skills into fundamental components, or sub-skills. By tweaking these sub-skills, the mutation process creates models that can explore new capabilities beyond those of their parent models. This helps the models avoid falling into predictable patterns and reduces the risk of overfitting.
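The sketch below shows one way such an SVD-based mutation could work: factor a weight matrix, randomly rescale its singular values (the “sub-skill” components), and recompose it. The function name and noise model are assumptions for illustration, not Sakana's exact procedure:

```python
# Illustrative SVD-based mutation of a weight matrix.
# Perturbing individual singular values nudges distinct components of the
# transformation independently; CycleQD's real operator differs in detail.
import numpy as np

def svd_mutate(weight, noise_scale=0.05, rng=None):
    rng = rng or np.random.default_rng()
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Randomly rescale each singular value (each "sub-skill" component).
    s_mutated = s * (1.0 + rng.normal(0.0, noise_scale, size=s.shape))
    return u @ np.diag(s_mutated) @ vt

# Usage: mutate one layer of a toy model.
rng = np.random.default_rng(1)
layer = rng.normal(size=(4, 4))
mutated = svd_mutate(layer, noise_scale=0.05, rng=rng)
```

Mutating in the SVD basis, rather than adding raw noise to every weight, concentrates the perturbation on coherent components of the layer's transformation instead of scattering it arbitrarily.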
Evaluating the performance of CycleQD
The researchers applied CycleQD to a set of Llama 3-8B expert models fine-tuned for coding, database operations, and operating-system operations. The goal was to see whether the evolutionary method could combine the skills of the three models to create a superior model.
The results showed that CycleQD outperformed traditional fine-tuning and model-merging methods across the evaluated tasks. Notably, a model fine-tuned on a combination of all the datasets performed only marginally better than the single-skill expert models, despite being trained on more data. Moreover, the traditional training process is much slower and more expensive. CycleQD was also able to produce a range of models with different performance levels on the target tasks.
“These results clearly show that CycleQD outperforms traditional methods, demonstrating its effectiveness in training LLMs in a variety of skills,” the researchers wrote.
The researchers believe CycleQD has the potential to enable lifelong learning for artificial intelligence systems, allowing them to grow, adapt and accumulate knowledge over time. This could have direct implications for real-world applications. For example, CycleQD can be used to continuously fuse the skills of expert models instead of training large models from scratch.
Another exciting direction is the development of multi-agent systems, where a large number of specialized agents developed through CycleQD can collaborate, compete and learn from each other.
“From scientific discovery to solving real-world problems, large numbers of specialized agents could redefine the limits of artificial intelligence,” the researchers wrote.