Tuesday, January 27, 2015

Weather Forecasting Doesn't Have To Be This Bad

Picture it - winter, somewhere in the northeastern US, two to three years from now. The National Weather Service is predicting a ‘historic’, ‘crippling’ blizzard. Snow accumulations are predicted to be over 18”, and coastal storm surges could top 4 feet, destroying homes and blocking evacuation.

Everyone sees all these warnings and thinks, “Yeah, right - just like that ‘crippling’ storm we had in January 2015, the one that dropped all of 8” of snow in NYC - the one the National Weather Service actually apologized for.” Public officials, still stinging from accusations that they overreacted with transit shutdowns and travel bans in 2015, decide to keep the roads open and mass transit running. Residents of coastal communities decide against making storm preparations.

And then the unthinkable happens - the storm exceeds the forecasts. Snow accumulations top two feet. Storm surges crest at 6 feet, and high tide brings massive flooding from Atlantic City to Cape Cod. Thousands of people are stuck on stalled subways and trains for hours without power, in temperatures far below freezing. Others are trapped in their homes without power or adequate supplies. First responders are overwhelmed and can’t reach all those impacted by the storm, and the death toll climbs to over 50.

This is the devastating impact of bad weather forecasting - when people see a storm warning that doesn’t pan out, they are less likely to respond to the next one. It is a pattern already familiar in Gulf Coast communities that are frequently warned of impending hurricanes.

But it doesn’t have to be this bad. We have the technology, today, to make weather forecasting phenomenally better. There are two main avenues to better forecasting: better sensing data and added compute power for modeling.

Sensing data is the bedrock of a forecast: it sets the model’s initial conditions, and the more data you have on those initial conditions, the better your model will be. But the state of the art in sensing is still ridiculously close to what we were doing 30 years ago. Worldwide, about 800 weather balloons are launched each day - meaning each balloon is responsible, on average, for covering roughly 200,000 square miles of the Earth’s surface.
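As a rough sanity check on that coverage figure, here is the back-of-the-envelope arithmetic, assuming a total Earth surface area of about 197 million square miles and the 800 launches per day quoted above:

    # Rough check of the per-balloon coverage figure quoted above.
    # Assumption: ~197 million square miles of total Earth surface area.
    EARTH_SURFACE_SQ_MI = 197e6
    BALLOON_LAUNCHES_PER_DAY = 800   # figure cited in the post

    coverage_per_balloon = EARTH_SURFACE_SQ_MI / BALLOON_LAUNCHES_PER_DAY
    print(f"~{coverage_per_balloon:,.0f} square miles per balloon")
    # prints ~246,250 square miles per balloon -- the same order of magnitude
    # as the ~200,000 square miles cited above.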

If we really want to get great data on initial conditions, we need to make massive investments in weather-sensing drones - sensing platforms that can be deployed on demand and piloted into storm systems for detailed, fine-grained data on pressure and temperature. And we need to replace the costly system of oceanic weather buoys with autonomously-deployed sensing platforms, like those being developed by Liquid Robotics. Within five years, we could increase our sensing capacity by 10x simply through the cost reductions possible with autonomous deployment.

Once you have all this data, what are you going to do with it? NOAA’s current approach is to make massive investments in single-purpose computing platforms. Earlier this month, the agency announced a $45 million investment aimed at bringing its compute capacity for modeling up to 5 petaflops. It’s worth examining whether single-purpose computing is still the right approach in a world where massive numbers of cores are available on demand.

On a sunny July day, 5 petaflops is probably vastly more power than is needed to produce the forecast “sunny and mid-80s today”. But in the face of extreme weather, why not scale up to twice that capacity - lease 600,000 cores from Google or Amazon for a day, run the severe weather model at 10 petaflops, and produce a significantly more accurate forecast?
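A quick sketch of what those figures imply per leased core (the core count and petaflop target are the ones from the paragraph above; the plausibility note at the end is my assumption):

    # What the 600,000-core / 10-petaflop pairing above implies per core.
    TARGET_PETAFLOPS = 10
    LEASED_CORES = 600_000

    gigaflops_per_core = TARGET_PETAFLOPS * 1e6 / LEASED_CORES   # 1 PF = 1e6 GF
    print(f"{gigaflops_per_core:.1f} sustained gigaflops needed per core")
    # prints ~16.7 gigaflops per core -- a plausible, if optimistic, sustained
    # rate for well-vectorized code on commodity cloud hardware.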

The bottom line: it is within our technical capacity to end bad severe-weather forecasts in the next 5-10 years, but getting there will require some radical new approaches to sensing and computing.


5 comments:

Sean Cier said...

High-end weather modeling is notoriously difficult to parallelize, and it requires more than just raw compute. HPC isn't always chosen because the scientists using it are stuck in the seventies; it's often because clusters with slow interconnects and limited I/O just aren't suited to the models they run. Which isn't to say cloud capacity couldn't help augment things more than it does right now, of course -- some of this is undoubtedly just inertia, given the difficulty of refactoring the models they use.

Jonathan Betz said...

Educate me more about this, Sean. Isn't it fundamentally a fluid dynamics model? Shouldn't I be able to divide the space into cells which can be modeled independently with cross-communication only at the edges of the cells? Is the interconnect bandwidth requirement that high?

My understanding is that, due to compute constraints, global forecast models use a time step on the order of minutes. My assertion is that if you could increase the compute power 10x on demand, you could reduce the step size accordingly and see a vast improvement in the quality of the resulting forecast.
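To make the cells-with-edge-communication idea concrete, here is a minimal toy sketch in Python/NumPy - a 1-D diffusion problem split across two subdomains that exchange only their single boundary cells ("halos") each time step. It is a stand-in for a real dynamical core, meant only to show the communication pattern:

    # Toy domain decomposition: two subdomains, each padded with one halo cell
    # per side, stepping a simple explicit diffusion stencil. Only the halo
    # values cross the boundary, so per-step communication scales with the
    # size of the boundary, not the volume of the subdomain.
    import numpy as np

    def step(u, nu=0.1):
        """One explicit diffusion step over the interior cells of u."""
        new = u.copy()
        new[1:-1] = u[1:-1] + nu * (u[:-2] - 2 * u[1:-1] + u[2:])
        return new

    left = np.zeros(102)    # 100 interior cells + 2 halo cells
    right = np.zeros(102)
    left[99] = 1.0          # an initial disturbance near the subdomain boundary

    for _ in range(200):
        # Halo exchange: copy each subdomain's edge cell into the other's halo.
        left[-1] = right[1]
        right[0] = left[-2]
        left, right = step(left), step(right)

    print(f"share of the disturbance now in the right subdomain: {right[1:-1].sum():.2f}")
    # A sizable fraction has diffused across, even though only two numbers
    # were exchanged per time step.

In a real model each halo is a slab of many variables rather than a single number, but the scaling argument is the same: per-step communication grows with the surface of each subdomain while compute grows with its volume.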

Sean Cier said...

It's been years since I worked on the edges of HPC, and so the little relevant knowledge I had has decayed. I ran some numbers to try to figure out what the data transfer per time step would look like, as that's one of the key constraints, but failed (I'm not sure how much data per cell is stored by these models, for instance); it didn't look like those numbers were coming out that high, for what it's worth, so that's probably not the main constraint.

One issue is that in large clusters, nodes are unreliable; if you need to gather all the results of step 1 before starting step 2, you'll be waiting on the slowest or failed nodes, so you need a technique like redundancy to reduce the variation between nodes -- which instantly cuts the throughput of the cluster by a constant factor.

Let's ignore that and other supercomputer advantages like interconnect speed, though. You suggest that a cloud-style solution could provide more bang for the buck per petaflop. That's true of highly bursty, on-demand needs, but NOAA's needs aren't bursty; there's really no such thing as a "sunny day" across the globe. GFS, for example, is run 4 times a day, every day; in other words, that capacity is pretty much fully utilized. Building a 5 petaflop cluster on, say, EC2 would run something like $65 million per year, if it were even possible (back-of-the-envelope calculations based on http://arstechnica.com/information-technology/2013/11/18-hours-33k-and-156314-cores-amazon-cloud-hpc-hits-a-petaflop/, assuming the properties of the model being run there weren't entirely dissimilar to those of a weather forecast). These supercomputers will last longer than a year -- if NOAA gets 5 years out of this latest one (I assume they'll get more, but at that point it starts to have diminishing value as EC2 prices come down), that's easily a 7x savings over a cloud solution. That means if they use more than about 15% of the capacity of their new computer, a single-purpose machine is actually cheaper than EC2 -- still ignoring those other advantages above, which could easily add another factor of 2 or 10.
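For reference, the breakeven arithmetic works out roughly like this (the $65M/year estimate and the 5-year lifetime are the assumptions stated above):

    # Breakeven between a dedicated machine and cloud capacity, using the
    # figures quoted above ($65M/yr EC2 estimate, $45M purchase, 5-year life).
    EC2_COST_PER_YEAR = 65e6
    SUPERCOMPUTER_COST = 45e6
    LIFETIME_YEARS = 5

    ec2_lifetime_cost = EC2_COST_PER_YEAR * LIFETIME_YEARS
    savings_factor = ec2_lifetime_cost / SUPERCOMPUTER_COST
    breakeven_utilization = SUPERCOMPUTER_COST / ec2_lifetime_cost

    print(f"cloud cost over {LIFETIME_YEARS} years: ${ec2_lifetime_cost / 1e6:.0f}M")
    print(f"dedicated-hardware savings factor: {savings_factor:.1f}x")
    print(f"breakeven utilization: {breakeven_utilization:.0%}")
    # -> $325M, 7.2x, 14% -- matching the "7x" and "about 15%" figures above.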

Jonathan Betz said...

Let's stipulate for the moment that single-purpose supercomputing is the right approach for the nominal case of weather forecasting. I maintain that there are a handful of events each year where the economic downside of an inaccurate forecast creates a strong incentive to temporarily scale up the compute power applied to the problem.

Estimates of the economic loss to NYC from this week's over-cautious shutdown are in the $100M - $200M range; against that kind of downside, even spending as little as $1M on expanded compute power for a few days to improve forecast accuracy would be money well spent.
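To spell that tradeoff out, here is a toy expected-value check; the 1% figure is a purely hypothetical assumption about how much better forecasting shifts the odds of an unnecessary shutdown:

    # Toy expected-value check on the $1M-vs-$100M+ tradeoff above.
    # The 1% improvement figure is a hypothetical assumption, not a measurement.
    SHUTDOWN_LOSS_LOW, SHUTDOWN_LOSS_HIGH = 100e6, 200e6   # estimated NYC loss range
    EXTRA_COMPUTE_COST = 1e6                               # surge-compute spend
    ASSUMED_IMPROVEMENT = 0.01   # shift in odds of avoiding one unnecessary shutdown

    expected_savings = (ASSUMED_IMPROVEMENT * SHUTDOWN_LOSS_LOW,
                        ASSUMED_IMPROVEMENT * SHUTDOWN_LOSS_HIGH)
    print(f"expected savings: ${expected_savings[0] / 1e6:.1f}M - "
          f"${expected_savings[1] / 1e6:.1f}M vs. ${EXTRA_COMPUTE_COST / 1e6:.1f}M spent")
    # Even a 1% shift in the odds roughly covers the extra compute bill.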

FWIW, some digging surfaced this paper which does a much more scientific examination of the topic and actually gets into cost analysis.
