A GPU Native Generative AI Platform
December 20, 2024


Three years ago, we set out to redefine how artificial intelligence is developed and deployed. Our goal is not just to improve existing systems, but to rebuild AI infrastructure from the ground up and deliver a higher-performance, programmable, and portable platform. We recognized that to meet today's challenges and stay ahead of rapidly evolving technology, we needed to completely rethink the AI stack from first principles.

The arrival of large-scale generative AI has changed the nature of AI infrastructure. Meeting its rapidly growing resource requirements demands innovation from the lowest levels of GPU programming all the way up to the serving layer, and Modular is uniquely positioned to deliver it.

Today, we're announcing the first step in addressing these challenges: MAX 24.6, featuring a preview of MAX GPU. This release demonstrates the power of the MAX platform and is just the beginning of the advancements we will bring to AI infrastructure in the coming months as we move into 2025.

Introducing MAX GPU: a new GenAI-native serving stack

At the heart of the MAX 24.6 release is MAX GPU, the first vertically integrated generative AI serving stack that eliminates the dependency on vendor-specific compute libraries such as NVIDIA CUDA.

MAX GPU is built on two breakthrough technologies. The first is MAX Engine, a high-performance AI model compiler and runtime built with innovative Mojo GPU kernels for NVIDIA GPUs, free of CUDA kernel dependencies. The second is MAX Serve, a sophisticated Python-native serving layer designed specifically for LLM applications. MAX Serve handles complex request batching and scheduling, delivering consistent and reliable performance even under heavy workloads.
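To give a concrete sense of the kind of load MAX Serve is scheduling, here is a minimal sketch that fires a batch of concurrent chat requests at a locally running MAX Serve instance through its OpenAI-compatible API (described below). The endpoint address, port, and model name are assumptions for illustration, not documented defaults.

```python
# Minimal sketch: many concurrent requests against a locally running,
# OpenAI-compatible MAX Serve endpoint. The base_url, api_key, and model
# name below are illustrative assumptions, not documented defaults.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",  # assumed local MAX Serve address
    api_key="EMPTY",                      # self-hosted endpoints typically ignore the key
)

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model id
        messages=[{"role": "user", "content": f"Summarize request #{i} in one sentence."}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # The serving layer, not the client, is responsible for batching and
    # scheduling these simultaneous requests onto the GPU.
    results = await asyncio.gather(*(one_request(i) for i in range(32)))
    print(f"Completed {len(results)} concurrent requests")

if __name__ == "__main__":
    asyncio.run(main())
```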

Unlike existing tools that address only specific parts of the AI workflow, MAX is designed to support the entire development experience, from initial experimentation through deployment to production. MAX provides a unified platform for exploring new models, testing and optimizing them, and serving high-performance inference, capabilities that previously required cobbling together a patchwork of technologies.

Comparison of the vLLM and MAX architectures

We cannot overstate the importance of this diagram to our mission and vision of simplifying AI infrastructure and making it accessible to everyone. We strive to significantly reduce the complexity of the entire AI infrastructure stack, and as the diagram above shows, MAX cuts through the incredible fragmentation of today's ecosystem. As a developer, juggling such a large number of different technologies has become very challenging. We are always looking for ways to simplify this further, and we want the world to build with us.

Because we no longer need the CUDA toolkit for NVIDIA GPUs, we can now ship an uncompressed Docker container under 3.7 GB, compared to 10.6 GB for the vLLM container, a 65% reduction. For users who don't need PyTorch and just want to use MAX Graphs, it drops even further to just 2.83 GB, and under 1 GB when compressed.

Enterprise-grade development-to-deployment flexibility

MAX Engine supports flexible inference deployment across multiple hardware platforms, allowing developers to experiment locally on a laptop and scale seamlessly to production cloud environments. Combined with MAX Serve's native Hugging Face model support, teams can quickly develop, test, and deploy any PyTorch LLM. Custom weight support, including Llama Guard integration, further enables developers to tailor models to specific tasks.

When it's time to put a model into production, MAX Serve provides an OpenAI-compatible API packaged in a compact Docker container that runs on NVIDIA platforms. Teams can then deploy models across all major clouds, including AWS, GCP, and Azure, with options for both direct virtual machine deployment and enterprise-grade Kubernetes orchestration. This flexibility ensures you can securely host your own models and keep your GenAI infrastructure completely under your control.
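Because the API is OpenAI-compatible, existing client code can target a MAX Serve deployment by overriding only the base URL. Below is a minimal sketch using the standard `openai` Python package; the host, port, and model name are placeholders for your own deployment.

```python
# Minimal sketch: point the standard OpenAI Python client at a MAX Serve
# deployment by overriding base_url. Host, port, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-deployment-host:8000/v1",  # placeholder deployment address
    api_key="EMPTY",  # self-hosted endpoints typically don't require a real key
)

# Stream tokens back as they are generated.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "What is a GPU-native serving stack?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```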

At the core of this workflow is Magic, Modular's command-line tool that simplifies the entire MAX lifecycle. Magic handles everything from installation and environment management to development and deployment, making it easier to manage your AI infrastructure. Read more about Magic here.

High-performance GenAI models and hardware portability

We're also expanding MAX's capabilities with new high-performance models. These models provide optimized implementations of many popular LLMs such as Llama and Mistral. Out of the box on NVIDIA GPUs, MAX matches the performance of vLLM, an established AI serving framework, in standard throughput benchmarks. These models also support a range of quantization methods, and we are working hard to push this family of native MAX models toward SOTA performance in the coming weeks and months.

The industry-standard ShareGPTv3 benchmark demonstrates the performance of MAX GPU with Llama 3.1, achieving a throughput of 3,860 output tokens per second on an NVIDIA A100 GPU using only MAX's innovative NVIDIA kernels, with GPU utilization above 95%. We currently achieve this level of performance without optimizations like PagedAttention, which will land early next year. All of this underscores that we are just getting started, and our numbers will only continue to improve.
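For context on what a throughput figure like this means, the sketch below drives an OpenAI-compatible endpoint and divides generated tokens by wall-clock time. It is a simplified, sequential illustration of the arithmetic only, not the ShareGPTv3 benchmark harness, which keeps the GPU saturated with many concurrent requests; the endpoint and model name are placeholders.

```python
# Rough sketch of computing output tokens per second against an
# OpenAI-compatible endpoint. Sequential requests only illustrate the
# arithmetic; real benchmarks send many concurrent requests to keep the
# GPU saturated. Endpoint and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = [f"Write a short paragraph about topic #{i}." for i in range(16)]

start = time.perf_counter()
output_tokens = 0
for prompt in prompts:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    # usage.completion_tokens is the number of tokens generated for this request
    output_tokens += resp.usage.completion_tokens
elapsed = time.perf_counter() - start

print(f"{output_tokens / elapsed:.1f} output tokens/sec (sequential client)")
```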

MAX GPU launches with support for NVIDIA A100, L40, L4, and A10 accelerators, the industry standards for LLM inference and among the most heavily optimized GPUs on the market. Support for H100, H200, and AMD GPUs will arrive early next year.

Designed for hardware portability

Next up is support for AMD MI300X GPUs, which we are currently refining and expect to make available soon. MAX's NVIDIA and AMD kernels are built on the same underlying technology, allowing us to bring up new hardware platforms quickly. We will be sharing exciting details about AMD soon and will continue to expand our AMD support in early 2025.

Try MAX 24.6 and our nightly builds now

We're excited to invite developers to explore an early technology preview of MAX GPU and see how it can transform your AI workflow. Despite being a preview, this release is packed with features, including new high-performance models running on NVIDIA GPUs, compatibility with the OpenAI API, and interoperability with Hugging Face models.

Start running Llama 3 on MAX GPU today!

This is just the beginning. In 2025, we will continue to expand the GPU technology stack to deliver higher performance across more generative AI modalities, such as text-to-vision, along with multi-GPU support for large models. We are also working to extend portability to new hardware architectures and to introduce a complete GPU programming framework for low-level control and customization.

To help you stay ahead of the curve, we've released detailed documentation for our nightly builds, making it easier to install and take advantage of the latest GPU features directly from the development branch.

As the year comes to an end, we extend our sincerest wishes to everyone. 2025 is shaping up to be a pivotal year for AI infrastructure, and we're excited to be at the forefront of this transformation. We look forward to continuing this journey with you in the new year. See you in January!

