Google’s DeepMind tackles weather forecasting, with great performance
By some measures, AI systems are now competitive with traditional computing methods at generating weather forecasts. But because their training penalizes errors, the predictions tend to be "fuzzy": as you look further out in time, the models make less specific predictions, since specific predictions are more likely to be wrong. As a result, you start to see things like expanding storm tracks, and the storms themselves lose their clearly defined edges.
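To see why error-penalizing training produces that blur, consider a toy case (a hypothetical sketch, not anything from the paper): if a storm is equally likely to end up in one of two distinct places, the single prediction that minimizes squared error is the average of the two, a smeared-out middle that never actually occurs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two equally likely "futures" for where a storm ends up (km along a coastline).
possible_landfalls = np.array([-100.0, 100.0])
outcomes = rng.choice(possible_landfalls, size=10_000)

# A deterministic model trained to minimize squared error converges on the mean outcome...
best_single_forecast = outcomes.mean()
print(f"error-minimizing single forecast: {best_single_forecast:.1f} km")  # close to 0

# ...a "blurred" middle prediction, even though landfall at 0 km never actually happens.
print(f"how often 0 km actually occurs: {(outcomes == 0).mean():.0%}")
```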
However, the use of AI is still extremely tempting, because the alternative is a computational model of atmospheric circulation, which is extremely demanding to run. That approach is nevertheless very successful, with the ensemble model from the European Centre for Medium-Range Weather Forecasts considered the best in class.
In a paper published today, Google DeepMind claims that its new artificial intelligence system can outperform the European model at forecasts out to at least a week in advance, and often longer. DeepMind's system, called GenCast, combines some of the computational approaches used by atmospheric scientists with the diffusion models commonly used in generative AI. The result is a system that maintains high resolution while significantly reducing computational costs.
Ensemble prediction
Traditional computing methods have two main advantages over AI systems. The first is that they are directly based on atmospheric physics, incorporating the rules we know govern the behavior of our actual weather, and calculating some of the details in a way that is directly informed by empirical data. They are also run as ensembles, meaning that multiple instances of the model are run. Due to the chaotic nature of the weather, these different runs will gradually diverge, providing a measure of the uncertainty of the forecast.
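A toy illustration of that idea (hypothetical code, using the classic Lorenz-63 system as a stand-in for a real atmosphere model): start several runs from nearly identical initial conditions and watch the spread between them grow, which is exactly the quantity forecasters read as uncertainty.

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One Euler step of the chaotic Lorenz-63 system, standing in for a weather model."""
    x, y, z = state
    deriv = np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    return state + dt * deriv

rng = np.random.default_rng(0)
n_members, n_steps = 20, 2000

# Every ensemble member starts from the same analysis plus a tiny perturbation.
members = np.array([1.0, 1.0, 1.0]) + 1e-4 * rng.standard_normal((n_members, 3))

spread = []
for _ in range(n_steps):
    members = np.array([lorenz_step(m) for m in members])
    spread.append(members.std(axis=0).mean())  # member-to-member spread ~ uncertainty

print(f"spread after  100 steps: {spread[99]:.4f}")
print(f"spread after 2000 steps: {spread[-1]:.4f}")  # grows as the runs diverge
```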
At least one attempt has been made to merge some aspects of traditional weather models with AI systems. An internal Google project used a traditional atmospheric circulation model that divided the Earth's surface into a grid of cells, but used AI to predict the behavior of each cell. This greatly improved computational efficiency, but at the expense of relatively large grid cells, which resulted in relatively low resolution.
For its AI-based weather forecasting, DeepMind decided to skip the physics and instead embrace the ability to run an ensemble.
GenCast is based on diffusion models, which have a key feature that is useful here. In essence, these models are trained by starting with a mixture of an original (image, text, weather pattern) and a variation into which noise has been injected. The system is supposed to produce a version of the noisy variation that is closer to the original. Once trained, it can be fed pure noise and will evolve the result to be closer to whatever it is targeting.
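Roughly, the training loop looks like the sketch below. This is a generic denoising setup with made-up names and a toy linear "network"; GenCast's actual architecture, noise schedule, and loss are more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(noisy, noise_level, weights):
    """Stand-in for the neural network: just a linear map scaled by the noise level."""
    return noisy @ weights / (1.0 + noise_level)

def training_step(clean, weights, lr=1e-3):
    """One denoising step: corrupt a real sample, then penalize the reconstruction error."""
    noise_level = rng.uniform(0.1, 1.0)                               # random corruption strength
    noisy = clean + noise_level * rng.standard_normal(clean.shape)    # inject noise
    error = denoiser(noisy, noise_level, weights) - clean             # distance from the original
    grad = np.outer(noisy, error) / (1.0 + noise_level)               # gradient of 0.5 * ||error||^2
    return weights - lr * grad, 0.5 * float((error ** 2).sum())

dim = 16
weights = np.eye(dim)                # toy "network" parameters
sample = rng.standard_normal(dim)    # toy "weather state" drawn from the training set
for _ in range(200):
    weights, loss = training_step(sample, weights)
print(f"reconstruction loss after training: {loss:.3f}")
```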
In this case, realistic weather data is the goal, and the system takes a pure noise input and evolves it based on the current state of the atmosphere and its recent history. For longer-term forecasts, "history" includes both actual data and predicted data from previous forecasts. The system moves forward in 12-hour increments, so the forecast for day three will include the initial conditions and earlier history, plus the forecasts from days one and two.
This is useful for creating an ensemble forecast, because you can feed the system different noise patterns as input and each will produce a slightly different weather data output. It serves the same purpose as in a traditional weather model: it provides a measure of the uncertainty of the forecast.
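Putting those two ideas together, a forecast amounts to repeatedly sampling the next 12-hour state conditioned on the two most recent states, and an ensemble is just that loop repeated with different noise draws. The sketch below uses hypothetical function names and toy dynamics, not DeepMind's API:

```python
import numpy as np

STATE_DIM = 16   # stands in for the full gridded weather state
N_MEMBERS = 8    # ensemble members, each started from a different noise stream
N_STEPS = 6      # six 12-hour steps, i.e. a three-day forecast

def sample_next_state(prev_state, curr_state, rng):
    """Hypothetical diffusion sampler: start from pure noise and iteratively pull it
    toward something consistent with the two most recent 12-hour states."""
    draft = rng.standard_normal(STATE_DIM)          # pure-noise starting point
    conditioning = 0.5 * (prev_state + curr_state)  # crude stand-in for learned conditioning
    for _ in range(10):                             # iterative refinement steps
        draft = 0.8 * draft + 0.2 * conditioning
    return draft

# Two consecutive analyses, 12 hours apart, shared by every ensemble member.
init_rng = np.random.default_rng(0)
initial_history = [init_rng.standard_normal(STATE_DIM) for _ in range(2)]

members = []
for seed in range(N_MEMBERS):
    rng = np.random.default_rng(1_000 + seed)  # a different noise stream per member
    history = list(initial_history)
    for _ in range(N_STEPS):
        nxt = sample_next_state(history[-2], history[-1], rng)
        history.append(nxt)          # predictions join the "history" for later steps
    members.append(history[2:])      # keep only the forecast states

members = np.array(members)                  # (N_MEMBERS, N_STEPS, STATE_DIM)
spread = members.std(axis=0).mean(axis=-1)   # member disagreement per 12-hour step
print("ensemble spread by step:", np.round(spread, 3))
```

The important point is that the randomness enters through the noise the sampler starts from, which is what makes generating many distinct but plausible members cheap.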
For each grid square, GenCast works with six surface weather measurements, along with six that track the state of the atmosphere at 13 different altitudes, where it estimates the air pressure. Each of these grid squares is 0.2 degrees on a side, a higher resolution than the European model uses for its predictions. Despite this resolution, DeepMind estimates that a single instance (meaning not the entire ensemble) can be run out to 15 days on one of Google's Tensor Processing Units in just eight minutes.
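A back-of-the-envelope count, using the figures quoted above and assuming a regular latitude/longitude grid (the paper's exact grid layout may differ), gives a sense of how much data the model steps forward every 12 hours:

```python
# Back-of-the-envelope size of the state GenCast steps forward every 12 hours,
# using the figures quoted above (the grid layout itself is an assumption).
GRID_SPACING_DEG = 0.2
SURFACE_VARS = 6
ATMOS_VARS = 6
PRESSURE_LEVELS = 13

lat_points = int(180 / GRID_SPACING_DEG) + 1  # regular lat/lon grid, both poles included
lon_points = int(360 / GRID_SPACING_DEG)

vars_per_cell = SURFACE_VARS + ATMOS_VARS * PRESSURE_LEVELS
total_values = lat_points * lon_points * vars_per_cell

print(f"grid: {lat_points} x {lon_points} cells, {vars_per_cell} variables per cell")
print(f"values per state: {total_values:,}")   # roughly 136 million numbers
```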
It is possible to create an ensemble forecast by running multiple instances in parallel and then integrating the results. Given the amount of hardware Google has available, the entire process likely takes less than 20 minutes from start to finish. The source and training data will be hosted on the GitHub page for DeepMind's GraphCast project. Given the relatively low computational requirements, we can probably expect individual academic research teams to start experimenting with it.
Measures of success
DeepMind reports that GenCast dramatically outperforms the best traditional forecasting model. Using a standard benchmark in the field, DeepMind found that GenCast was more accurate than the European model in 97 percent of the tests it ran, which checked different output values at different times in the future. Furthermore, the confidence values, based on the uncertainty obtained from the ensemble, were generally reasonable.
Past AI weather forecasters trained on real data are generally not great at handling extreme weather because it rarely appears in the training set. But GenCast did quite well, often outperforming the European model on things like abnormally high and low temperatures and air pressure (events with a frequency of one percent or less, including at the 0.01 percentile).
DeepMind also went beyond standard tests to see whether GenCast could be useful. This included projecting the tracks of tropical cyclones, an important job for forecast models. For the first four days, GenCast was significantly more accurate than the European model, and it maintained its lead for about a week.
One of DeepMind’s most interesting tests was checking a global wind power forecast based on information from the Global Powerplant Database. This involved using it to predict the wind speed at 10 meters above the surface (which is actually lower than where most turbines are located, but the best possible approximation) and then using that number to work out how much power would be generated. The system outperformed the traditional weather model by 20 percent for the first two days and stayed ahead with a shrinking lead for up to a week.
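The power step in that test is conceptually simple: run each plant's forecast wind speed through a turbine power curve and scale by installed capacity. Here is a minimal sketch with a generic, made-up power curve and hypothetical plant data, not the specific curve or plants used in the paper:

```python
import numpy as np

def power_output(wind_speed_ms, capacity_mw,
                 cut_in=3.0, rated=12.0, cut_out=25.0):
    """Generic turbine power curve (an illustrative assumption, not the paper's exact curve):
    no output below cut-in or above cut-out, a cubic ramp between cut-in and rated speed."""
    wind_speed_ms = np.asarray(wind_speed_ms, dtype=float)
    frac = np.clip((wind_speed_ms - cut_in) / (rated - cut_in), 0.0, 1.0) ** 3
    frac = np.where((wind_speed_ms < cut_in) | (wind_speed_ms > cut_out), 0.0, frac)
    return capacity_mw * frac

# Hypothetical plants: forecast 10 m wind speed (m/s) and installed capacity (MW).
forecast_wind = np.array([4.5, 9.0, 13.5, 27.0])
capacity = np.array([50.0, 120.0, 80.0, 60.0])

per_plant = power_output(forecast_wind, capacity)
print("per-plant output (MW):", np.round(per_plant, 1))
print("total forecast output (MW):", round(float(per_plant.sum()), 1))
```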
The researchers don't spend much time investigating why performance seems to gradually decline out at about a week. Ideally, knowing more about GenCast's limitations would help inform further improvements, so this is presumably something they are looking into. In any case, today's paper marks the second time that something like a hybrid approach, mixing aspects of traditional forecasting systems with AI, has been reported to improve forecasts. And the two cases took very different approaches, raising the prospect that some of their features could be combined.
Nature, 2024. DOI: 10.1038/s41586-024-08252-9