Big data the NASA way

Chris Mattmann, a senior computer scientist at NASA's Jet Propulsion Laboratory, will be speaking at the ITEX conference next week.

When the Curiosity rover arrived on Mars two months ago it was just about the best public relations exercise that NASA could have hoped for, short of actually landing a human on the red planet.

"They've really done a lot for the agency to make people think it's cool to work at NASA again," says senior computer scientist Chris Mattmann, who works at Jet Propulsion Laboratories (JPL), one of ten NASA centres.

Mattmann is speaking at the ITEX conference in Auckland on November 8 at the Viaduct Events Centre. He has worked for NASA since he was an undergraduate at the University of Southern California, when he took a part-time academic position.

While not directly involved in the Curiosity mission, Mattmann has worked on Apache Software Foundation data processing and information integration projects that help power NASA's Planetary Data System, the archive for its planetary missions.

Mattmann became involved in Nutch, an open source search engine project, while studying for his doctorate. Nutch was created by Doug Cutting, who went on to create the big data framework Hadoop.

Cutting was inspired to create Nutch because of a frustration with the 'black box' approach that Google had towards its search technology.

"He really felt that search should be more open and people should be able to tinker with things like ranking," says Mattmann.

Mattmann has used Nutch in his work at NASA. At JPL he leads teams that build large-scale data systems managing hundreds of terabytes of information.

"Part of the stuff that I help to do is organise the information for scientists," he says.

"Organisation can range from the way that files are specifically named and the information that's captured in file names, to their organisation on discs, to the way the information is disseminated to the public."

Using Nutch within NASA, contributing code and helping people on the mailing lists has led to Mattmann becoming an Apache 'committer'. According to the Apache website that means he has access to the source code repository, and can help make strategic decisions around bug fixes and new software releases.

Mattmann explains that Nutch could originally only scale to around 100 million web pages, whereas the big search engines such as Yahoo and Google were in the four-billion-page range. So Cutting set about creating a new system, which he called Hadoop, after his son's stuffed toy elephant.

Inspired by Cutting's work and sense of humour, Mattmann started his own project called Tika, a toolkit that detects and extracts metadata and structured text content from various document formats using existing parser libraries. It is named after a soft toy belonging to the daughter of his partner in the project.
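
Tika itself is a Java library, but a quick way to see what it does is through the tika-python bindings that wrap it. The sketch below is a minimal example, assuming those bindings are installed (pip install tika) and a Java runtime is available; the file name is just a placeholder.

```python
# Minimal Tika sketch using the tika-python bindings.
# The bindings start a local Tika server in the background the first time
# they are used, which is why a Java runtime needs to be present.
from tika import parser

# Tika detects the document type, hands it to an appropriate parser library,
# and returns both the extracted metadata and the plain-text content.
parsed = parser.from_file("report.pdf")  # placeholder file name

print(parsed["metadata"])  # e.g. Content-Type, author, creation date
print(parsed["content"])   # text content extracted from the document
```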

"Most of the open source work I do is through Apache, a lot of it has to do with the Apache licence being a very permissive licence," Mattman says. "It allows people downstream that leverage Apache based software to use that upstream open source component in arbitrary ways. It makes it so the software I build -- when we distribute it to customers, or others we collaborate with, we don't have to give them any surprises."

Mattmann says NASA has been an active user of open source software for around 15 years but only recently has it become active on the production side. For the past two years NASA has held open source summits, outlining its contribution to open source.

NASA categorises its data into different levels, and in the next-generation Earth science satellite systems area where Mattmann works it is publicly distributed via DAACs (Distributed Active Archive Centres). He says the programs and tools used to process data vary depending on the preferences of the scientists involved in the project. "A lot of times the software itself is coupled to the instrument."

Level zero data is the raw data that comes straight off the instrument, while level one data has started to be calibrated from raw voltages.

Mattmann says that the public can have access, through the DAACs, to level two data. This is data that is calibrated, geospatially identified and mapped to a physical model (measurements that can be mapped in space and time).

"It's so voluminous, because it's raw measurements in space and time from an instrument. You probably won't use that in your IT organisations, it might be too big for you," he says.

It's when you get to level three data, which is typically mapped or gridded information, that the user can really "crank on it", because the files are a lot smaller and more manageable, says Mattmann. This information is often used in discussions about temperature and climate change.
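
To make the jump from level two to level three concrete, here is a rough Python sketch of one common approach: averaging calibrated point measurements onto a regular one-degree grid. The data is randomly generated and the one-degree resolution is just an assumption for illustration; real level three products involve far more careful processing.

```python
import numpy as np

# Synthetic "level 2"-style measurements: latitude, longitude and a value
# (think of a temperature-like field). All numbers here are made up.
rng = np.random.default_rng(0)
n = 10_000
lat = rng.uniform(-90.0, 90.0, n)
lon = rng.uniform(-180.0, 180.0, n)
value = rng.normal(loc=15.0, scale=5.0, size=n)

# Assign each measurement to a 1-degree grid cell (180 x 360 cells).
rows = np.clip(np.floor(lat + 90.0).astype(int), 0, 179)
cols = np.clip(np.floor(lon + 180.0).astype(int), 0, 359)

# Accumulate sums and counts per cell, then average: the gridded result
# plays the role of a "level 3" product in this toy example.
grid_sum = np.zeros((180, 360))
grid_count = np.zeros((180, 360))
np.add.at(grid_sum, (rows, cols), value)
np.add.at(grid_count, (rows, cols), 1)

with np.errstate(invalid="ignore"):
    level3 = grid_sum / grid_count  # NaN where a cell has no measurements

print(level3.shape)  # (180, 360): a much smaller, regular map
```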

"With each level of processing there are more assumptions that are codified into the data. More scientific assumptions that you didn't necessarily make," Mattmann points out.

His enthusiasm for big data projects is contagious, but when asked how he came to have a career as a NASA computer scientist, he says it's a "lame story".

He grew up playing video games, but it wasn't until his last year of high school, when he worked with Adobe Illustrator on the student yearbook, that he decided he needed to understand more about computers. So he followed some of his friends into the computer science department at college.

"I have no secret to being successful at computer science other than hard work and then sticking my head in the books and deciding I was going to do well," Mattmann says.

At the ITEX conference Mattmann will talk about his work with the Square Kilometre Array project. The $1.9 billion project will be split between an Australia-New Zealand consortium and a consortium of African nations led by South Africa.

This means New Zealand has been left on the outer edge of the project, but Mattmann says there is plenty for New Zealanders to be fascinated about.

He's been working with the South African team on the data processing side, "figuring out how to leverage NASA technology and open source data management technology."

"Some of the requirements of SKA make it mind boggling," he says.

"The 700 terabytes of data per second and how to prepare ourselves for the next decade through the use of data systems, software architecture design, and open source. How to be able to handle that data deluge."
