Petabyte database will track meteors

SQL Server chosen as platform for early-warning project

Our fascination with the prospect of asteroids smashing into the Earth is as deep as the craters that can result from such cosmic fireballs. Think of all the movies Hollywood has made, from little-seen B flicks such as A Fire in the Sky to scientifically shaky blockbusters such as Meteor and Armageddon.

Once dismissed as the province of fringe cult groups, the fear of what astronomers call "impact events" turns out, with improved satellite and telescopic monitoring, to not be so irrational after all.

The latest and most ambitious effort to detect 'near-Earth objects' (NEOs) is the Panoramic Survey Telescope and Rapid Response System, or Pan-STARRS.

A joint venture of the University of Hawaii, a number of other universities and the US Air Force, Pan-STARRS is today testing a telescope mounted with the finest digital camera in existence, which boasts a resolution of 1.4 billion pixels.

When Pan-STARRS is fully operational several years from now, it will have four telescopes, each with a 1.4-gigapixel camera.

That will give Pan-STARRS a wider, faster and more powerful view into space, and enable it to meet its mandate of tracking virtually all NEOs larger than 300 meters in diameter as well as many smaller NEOs.

It'll have plenty to see. About once a year, an asteroid between 5 and 10 meters in diameter explodes in the Earth's upper atmosphere, releasing as much energy as the atomic bomb used at Hiroshima. And it doesn't take a big one to slip through and cause a lot of damage; certainly that 300-meter line has been crossed before, to devastating effect. The asteroid behind 1908's Tunguska Event in Siberia created an explosion equivalent to 10-15 megatons of TNT (about 1,000 times the Hiroshima bomb), knocking over an estimated 80 million trees and causing an earthquake that's estimated to have measured 5.0 on the Richter scale (which had not yet been invented at the time). It was only about 50 meters in diameter.

Searching the skies for meteors creates a lot of material — with just a single telescope, Pan-STARRS already generates 1.4TB of raw image data nightly. Compressing, storing and crunching that data in an economical fashion turns out to be a feat of database engineering as impressive as the collection process.
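To put the nightly figure in perspective, the arithmetic can be sketched as follows. The camera resolution and nightly volume come from the figures above; the two bytes per pixel (16-bit samples) is an assumption made for illustration, not something the project has stated here.

```python
# Back-of-the-envelope arithmetic; BYTES_PER_PIXEL is an assumption.
PIXELS_PER_EXPOSURE = 1.4e9    # 1.4-gigapixel camera (figure from the article)
RAW_BYTES_PER_NIGHT = 1.4e12   # 1.4TB per night, decimal units (from the article)
BYTES_PER_PIXEL = 2            # assumed 16-bit pixels, for illustration only

# Size of one uncompressed full-frame exposure under that assumption.
bytes_per_exposure = PIXELS_PER_EXPOSURE * BYTES_PER_PIXEL

# Number of full-frame exposures implied by the nightly data volume.
exposures_per_night = RAW_BYTES_PER_NIGHT / bytes_per_exposure

print(f"{bytes_per_exposure / 1e9:.1f} GB per exposure")
print(f"{exposures_per_night:.0f} exposures per night")
```

Under that assumption, each exposure is a few gigabytes and the nightly total corresponds to several hundred full-frame exposures. A different pixel depth would scale the result proportionally.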

Rather than turning to an expensive supercomputer equipped with hundreds or thousands of processors, Pan-STARRS will use a cluster of 50 PC servers connected to 1.1 petabytes of disk storage via fast Infiniband networking gear, says Alex Szalay, a physics and astronomy professor at Johns Hopkins University and one of the architects of Pan-STARRS' database.

And rather than using a database management program better known for ultra-large data warehouses, such as IBM's DB2, Teradata, or Oracle Database, Pan-STARRS will use Microsoft's just-released SQL Server 2008.

Even Microsoft would probably admit that despite improved data compression and a resource governor to manage multiple workloads, SQL Server 2008 is not the most intuitive choice for this clustered, 'scaled-out' schema.

"SQL Server 2008 takes us to the next level, but that is within the'scale-up' model," says Ted Kummert, corporate vice president of Microsoft's data storage and platform division. Rather, Microsoft's recent acquisition of large data warehouse-focused startup vendor DATAllegro "will take us to the greatest level of scale-out," he says.

There are several reasons why Pan-STARRS went with SQL Server 2008.

One is cost. Deploying Pan-STARRS will cost just US$750,000 (NZ$1.057 million), due to the low cost of the PC hardware and the heavy academic discounts offered by Microsoft for SQL Server and Windows Server 2008.

"People in academia are always operating on a shoestring budget, so we wanted to be able to create something others could emulate," Szalay says.

More important, however, is Microsoft's long involvement with the astronomical community, especially via its technical ambassador, Jim Gray. The noted database researcher, who disappeared at sea in early 2007 and is now presumed dead, was instrumental in building predecessor databases, such as TerraServer, a massive free web archive of satellite pictures of the Earth stored in SQL Server, and the 40TB SkyServer, a similar repository of astronomical images.

Indeed, the distributed database platform that Pan-STARRS (and, it is hoped, other applications) will run on is called GrayWulf in Gray's honor.

"Gray worked with us for more than a decade. All the credit should go to him," Szalay says.

"He changed astronomy as we know it," said Maria A Nieto-Santisteban, a software engineer at Johns Hopkins and Pan-STARRS' technical lead. "We still ask ourselves, 'How would Jim do this?'"

Astronomers first began storing data digitally in the mid-1970s, shortly after they began replacing conventional photographic plates with digital camera technology.

Efficiency-wise, digital cameras were still a vast improvement over those photographic plates, which required astronomers to hunch over them with magnifying glasses, counting galaxies and stars. But the digital image resolution back then left something to be desired — just 260,000 pixels.

Data storage was also crude. Image data was, and still is, stored in a low-level format based on 80-character-long punch cards. But the flat files used to store the data proved difficult to search and otherwise manipulate.
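The article does not name the format, but the description is consistent with FITS, whose headers are sequences of 80-character records known as cards. The sketch below shows what reading one such record involves, assuming the common layout of a keyword in the first eight columns, a value indicator in columns nine and ten, and an optional comment after a slash. The example card and the `parse_card` helper are hypothetical, and the slash handling is simplified: it would mis-split a quoted string value that itself contains a slash.

```python
CARD_LENGTH = 80  # one header record is as wide as one punch card


def parse_card(card: str):
    """Split one 80-character header card into (keyword, value, comment)."""
    if len(card) != CARD_LENGTH:
        raise ValueError(f"expected {CARD_LENGTH} characters, got {len(card)}")
    keyword = card[:8].strip()
    if card[8:10] != "= ":
        # Cards without a value indicator (for example COMMENT) carry free text.
        return keyword, "", card[8:].strip()
    # Simplification: treat the first slash as the start of the comment.
    value, _, comment = card[10:].partition("/")
    return keyword, value.strip(), comment.strip()


# Hypothetical card, padded with spaces to the fixed record width.
example = "NAXIS1  =                 4096 / hypothetical image width in pixels".ljust(CARD_LENGTH)
print(parse_card(example))
```

Every piece of metadata has to be recovered by slicing fixed-width text in this way, which helps explain why such flat files are awkward to search compared with rows in a database.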

Gray guided the building of SkyServer, which holds 100 billion rows of data, has been accessed from a million distinct IP addresses, and serves 10,000 to 15,000 professional astronomers as well as countless schoolchildren who use SkyServer to complete astronomy reports.

Pan-STARRS, which Gray helped conceive, will be far larger, containing, by the end of 2010, 300TB of data, with some individual tables as large as 20 TB, Szalay says. The repository will include data on more than 140 billion cosmic objects and 5.5 billion actively tracked ones.

Though Pan-STARRS won't use up all 1.1 PB of storage for many years, it will still rank as one of the world's largest databases.

As a clustered system, the data will be partitioned, with a separate names database serving as the index. Since most cosmic objects don't have names such as Earth or Alpha Centauri, most searches will be done via a graphical interface that, according to Szalay, "looks and feels a lot like MapQuest or Google Maps".

Besides looking up data on individual stars or galaxies, Pan-STARRS will also be used to do some deep data mining: astronomical intelligence, if you will. For instance, Szalay hopes to import old astronomical data from the pre-digital age and run the information through a spatial cross-matching engine in order to create a master database that links all past and present data about every single star or planet.
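The idea behind spatial cross-matching can be sketched in a few lines: two records are treated as the same object if their sky positions fall within a small angular tolerance of each other. The following is a minimal illustration only, not the Pan-STARRS engine. The catalog entries, identifiers and the 2-arcsecond tolerance are all hypothetical, and the brute-force comparison of every pair would be replaced by a spatial index at any realistic scale.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, a))))


def cross_match(old_catalog, new_catalog, tolerance_arcsec=2.0):
    """Pair each old object with its nearest new object within the tolerance."""
    tolerance_deg = tolerance_arcsec / 3600.0
    matches = []
    for old_id, old_ra, old_dec in old_catalog:
        best = None  # (new_id, separation) of the closest candidate so far
        for new_id, new_ra, new_dec in new_catalog:
            sep = angular_separation_deg(old_ra, old_dec, new_ra, new_dec)
            if sep <= tolerance_deg and (best is None or sep < best[1]):
                best = (new_id, sep)
        if best is not None:
            matches.append((old_id, best[0]))
    return matches


# Hypothetical entries: (identifier, right ascension, declination) in degrees.
old_catalog = [("plate-1", 10.0000, 20.0000), ("plate-2", 150.0, -30.0)]
new_catalog = [("ccd-7", 10.0002, 20.0001), ("ccd-9", 200.0, 5.0)]
print(cross_match(old_catalog, new_catalog))
```

Here the first plate object lies well under an arcsecond from one of the newer detections and is linked to it, while the second has no counterpart within the tolerance and is left unmatched.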

Pan-STARRS will also serve as a cloud database for outside astronomers, who will be allowed to remotely run queries and store results within Pan-STARRS.
