Computerworld

INSIGHT: Metadata as the Rosetta Stone to data management

In 1964, The New Statesman coined the term Information Explosion to describe the deluge of paper-based data and the problems it created.

Fast-forward to 2015 and organisations of all sizes are facing a deluge of digital data.

Studies from Gartner and IDC show CIOs and IT administrators rank managing data growth at the top of their concerns.

This is not simply a case of working out how to manage the cost of storing and protecting their ever-expanding volumes of data.

IT decision makers need to determine which data are worth preserving, and how to extract value from these data.

This issue affects organisations of all sizes and most use cases. Big Data is not only about the speed (velocity) and amount (volume) of data created; it is also about the variety of data types and storage types, a much harder issue to manage.

The crux of the problem is enabling intelligence and unified management across a variety of data types and storage types.

In a data-centric world, limiting users’ and IT’s ability to globally manage and analyse all types of data leads to greater complexity, increases the cost of managing that complexity and, most importantly, reduces your organisation’s ability to drive better business decisions.

Data stovepipes limit data value

The storage industry sees the rapid growth of data as a lucrative problem for which it positions itself as the solution. Whatever your requirement, there are plenty of options for storing data to meet the needs of the moment.

However, even within a single organisation, different data types and workflows can drive diverse storage requirements, as can the different stages of the data lifecycle, each with its own performance needs.

This leads to a disparate mix of storage infrastructure, representing different storage types, of various ages, and from multiple vendors. Costs rise as data stovepipes emerge.

Stale or persistent data gets stranded in expensive high-performance systems by inertia, or by the complexity of trying to figure out which data should be disposed of, which should be retained, and how to keep track of where these data reside.

Simply building more and better storage containers does not solve the problem of managing data variety. It would be like car manufacturers adding more fuel tanks to vehicles in response to decreased engine efficiency.

Metadata is the Rosetta Stone

The storage industry is focused on the data, particularly capacity and bandwidth, since that is what drives the cost and design decisions for the infrastructure. But within all data are multiple sorts of metadata that hold the keys to solving these problems.

Metadata is data about data. It’s like a roadmap that gives you a bird’s-eye view of everything, without your needing to access the data directly.

Traditional infrastructure-based approaches are like planning a trip by first driving all the available routes before deciding which is best. With a roadmap, the decision is simple, and you can be confident of choosing the best route.

Storage-centric solutions to data management simply cannot provide intelligence about the data they store, nor were they designed to. Coalescing metadata provides an intelligent roadmap to data management and enables new insights without altering the underlying infrastructure.

Every digital file contains multiple types of metadata. Many systems manage the file-system metadata describing a file’s basic attributes, such as its size, location, name and last-modified time.
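
To make that concrete, here is a minimal sketch, in Python, of harvesting exactly those file-system attributes. It uses only the standard library, and the directory it scans is purely illustrative:

    from datetime import datetime, timezone
    from pathlib import Path

    def filesystem_metadata(path: Path) -> dict:
        """Collect the basic file-system metadata for one file:
        name, location, size and last-modified time."""
        st = path.stat()  # a single stat() call yields the core attributes
        return {
            "name": path.name,
            "location": str(path.parent.resolve()),
            "size_bytes": st.st_size,
            "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
        }

    # Illustrative scan of the current directory.
    for entry in Path(".").iterdir():
        if entry.is_file():
            print(filesystem_metadata(entry))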

Very few systems are able to classify and manage the richer, more descriptive metadata that enrich the roadmap and give you more information to work with.

These richer metadata types include the geospatial metadata embedded in satellite imagery, and the domain-specific metadata found in MRI files, genome sequences, medical records and so on.
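
As a small illustration of reading embedded metadata, here is a sketch that pulls the EXIF tags out of an ordinary image file. It assumes the third-party Pillow library and a hypothetical JPEG; richer formats such as DICOM (used for MRI) have their own readers, but the principle is the same:

    from PIL import ExifTags, Image  # third-party: pip install Pillow

    def embedded_metadata(path: str) -> dict:
        """Return the EXIF tags embedded in an image as {tag_name: value}."""
        with Image.open(path) as img:
            exif = img.getexif()  # empty mapping if the file has no EXIF data
            return {
                ExifTags.TAGS.get(tag_id, str(tag_id)): value
                for tag_id, value in exif.items()
            }

    # "satellite_capture.jpg" is a hypothetical file name, for illustration only.
    for tag, value in embedded_metadata("satellite_capture.jpg").items():
        print(f"{tag}: {value}")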

So rather than creating a giant data lake of all physical data, why not create a virtual lake of metadata? Why not use the metadata roadmap to plot the journey?

The available metadata from many different data types and locations can be made centrally searchable, allowing new patterns to be found and new discoveries to be made.

This approach can drive decisions about how to manage data, without needing to physically move it, or to alter the underlying storage infrastructure.
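
What follows is a minimal sketch of that idea, using Python’s standard library, with SQLite standing in for a real metadata catalogue; any search index would serve, and the mount points named are hypothetical. Only metadata enters the “lake”; the files themselves never move:

    import sqlite3
    from pathlib import Path

    # A tiny "virtual metadata lake": one searchable table holding metadata
    # records harvested from many storage locations.
    db = sqlite3.connect("metadata_lake.db")
    db.execute("""CREATE TABLE IF NOT EXISTS metadata (
        site TEXT, path TEXT, name TEXT, size_bytes INTEGER, modified REAL
    )""")

    def harvest(site: str, root: str) -> None:
        """Walk one storage location and index its file metadata centrally."""
        for p in Path(root).rglob("*"):
            if p.is_file():
                st = p.stat()
                db.execute("INSERT INTO metadata VALUES (?, ?, ?, ?, ?)",
                           (site, str(p), p.name, st.st_size, st.st_mtime))
        db.commit()

    # Hypothetical mount points standing in for two separate data stores.
    harvest("node-a", "/mnt/node_a")
    harvest("node-b", "/mnt/node_b")

    # One query spans every site, as though the storage were consolidated.
    query = ("SELECT site, path, size_bytes FROM metadata "
             "WHERE name LIKE '%.tif' ORDER BY size_bytes DESC LIMIT 10")
    for row in db.execute(query):
        print(row)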

This approach provides the Research Data Storage Infrastructure (RDSI) project with the ability to manage 55 petabytes of nationally significant data across eight nodes (data stores), representing hundreds of data sets, from multiple higher education institutions across Australia.

Mediaflux, the powerful data management platform from Arcitecta, is the engine that leverages the power of metadata to enable seamless collaboration across these eight sites.

Although each location has its own data centres, with different storage environments and use cases, researchers can now search across all sites as though everything were consolidated into a single infrastructure.

Mediaflux harvested the metadata from files of all kinds, enabling rich querying and data mining across otherwise incompatible environments.

Whether in a small enterprise or a nationwide research network such as RDSI, Mediaflux empowers organisations to create a virtual metadata lake to get the advantages of Big Data methodologies without recreating their entire infrastructures – addressing the issue of data variety, not just volume and velocity.

By Jason Lohrey, CTO, Arcitecta & Floyd Christofferson, CMO, Arcitecta