Avoiding hits and flops with data

Steve Illingworth explains how big-data analytics can benefit today's CIOs.

Steve Illingworth looks back at a 27-year career, starting as an engineer in South Africa who worked with technologists in building big instrument control panels. His growing interest in computers led him to take up a master’s degree in computer science. Back in his native United Kingdom, he progressed to roles at Oracle and EMC, and is now chief technology officer for Asia Pacific and Japan at Pivotal.

Pivotal provides an enterprise-ready data analysis platform as a service (PaaS) based on software from VMware and parent company EMC.

Harley Ogier of CIO New Zealand caught up with Illingworth at the EMC Forum in Sydney. Here are excerpts from the interview:

CIO New Zealand: What was the rationale for forming Pivotal?

We’ve had twenty-five years of what we call open systems, hardware with Unix technology, relational technologies, et cetera. But you look at all of these – what we call the ‘consumer internet giants’ – they don’t have mainframes. They don’t put all of their data into a relational database. But what they’re doing is admired by just about every corporation on the planet.

You take a bank or telco – could they grow at the same rate that Google, Amazon and eBay grew? Could they suddenly triple their customer base in two months, and their IT systems manage it? No. So we think this is a whole new data fabric. They’ve pioneered it. They had to do a lot of the development work themselves.

They’re all laying on Hadoop, and we think the whole data fabric is changing, and we think that new [fabric] demands – VMware started, it often did very well – your infrastructure should be a service. If you just expand on that a little bit and think “how did we used to do it?” – I want to buy an SAP system, therefore I need a platform to run it on. I buy the server, I buy the storage, I buy the network, and I provision it for the busiest day of the year. And the rest of the time, it’s only running at 30 percent utilisation, but I need that SLA. And then I do that for my HR. I do it for my CRM, and I’ve got all of this stuff that I’m not using.

VMware solves that beautifully. It makes all of the hardware a service and now you set service level agreements for the applications, separate the two. What we’re doing is beyond that now. What we’re doing is saying yes, infrastructure as a service is a really good thing, but you need to do the same for all of your developers in provisioning software.

So [your developers] want a development environment, they want a database, they want an application server, they want APIs. You don’t go out and buy it – what you do is, just as a service, you say “I want that, that, that, that, that” and you’re building the application in minutes, not months. Not only that, but the ability to measure it – so now, a big organisation can say all of this is now provided. You can come in and use it, and we will bill you on how much you’ve used, because we can measure it.
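The measure-and-bill idea can be illustrated with a minimal sketch. The service names, rates, and record layout below are hypothetical and chosen only for illustration; this is not Pivotal's or Cloud Foundry's billing interface.

```python
from collections import defaultdict

# Hypothetical rates in cents per hour for each provisioned service.
RATES_CENTS_PER_HOUR = {"database": 40, "app_server": 25, "api_gateway": 10}

def bill_by_team(usage_records):
    """Sum metered hours per team and price them with RATES_CENTS_PER_HOUR.

    usage_records: iterable of (team, service, hours) tuples.
    Returns {team: total_charge_in_cents}.
    """
    totals = defaultdict(int)
    for team, service, hours in usage_records:
        totals[team] += RATES_CENTS_PER_HOUR[service] * hours
    return dict(totals)

# Made-up usage for two internal teams.
usage = [
    ("payments", "database", 100),
    ("payments", "app_server", 40),
    ("marketing", "api_gateway", 200),
]
charges = bill_by_team(usage)
print(charges)
```

Working in whole cents keeps the arithmetic exact; a real chargeback system would also need metering windows, currency handling, and audit trails.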

That’s where we’re heading – providing the whole of the infrastructure as a service. Built on top of a cloud architecture, we’re providing all of this software. And it’s configurable.

…So you can take any of your own technology, and provide that in our software as a service, as another service to your developers. So you’re not locked in to what we give you – you could add in other bits and pieces if you want.

That could run on any cloud. You could do it on Amazon Web Services, on Google, on Microsoft, or if you want it inside your firewall.

The big banks and the telcos are very uncomfortable putting their customer information [in the public cloud] so they want to run it inside the firewall.

Now maybe – it’s a true story – I’ve come across banks where developers want to prototype something, they build the application on AWS, then they come in and show their boss. But what’s been happening is they also take some of the customer data. So what we’re doing is saying “no”. You can now take Cloud Foundry, take that whole platform as a service and put it inside your firewall running on whatever. Now you’ve got both worlds that are kind of insulated.

[You could] develop the prototype in the cloud, but don’t put the real data there. When the prototype has been accepted, move it inside the firewall. It’s Cloud Foundry running both sides – it’s the same.

If you look at Pivotal, we’re also providing an MPP database, a Hadoop implementation that’s supported by Pivotal as well, so all of these are just services. It’s not just building a new application – you’ve also got, as a service, the ability to do the analytics and provide real-time analytics. You’ve got a platform that scales.

So you want to do social network analysis, if you do it the traditional way, you’re going to need a lot of storage, right? A lot. But now you could trial this in the cloud. We can do it in the cloud, we can do the streaming for you. You can grab a subset of the data, bang it onto the platform and do analytics. So it’s a very powerful – we think the whole of the data fabric is changing.

You can’t do this kind of stuff in a relational database.

CIO New Zealand: Why should CIOs be abreast of what Pivotal does?

Rapid application development. Get your ideas in front of business users quicker and cheaper, and save yourself a tonne of money. Separate the compute from the storage.

Big data is not new to the CIO – they are storing it. This is why EMC has grown phenomenally in the last 15 years. They were a storage company – they are not now – 30 percent of their business is storage, 70 percent is something else. But they’ve grown phenomenally on the back of all of these major corporations storing humongous amounts of data.

The difference with Pivotal is now not just storing – which is what [the CIO is] doing. They store [that big data] in a relational database, the relational database fills up, so they take it out, put it on disk, to make room for the new stuff. And then a line of business says, ‘I want to analyse that old stuff’ [and IT responds] ‘Uh… no. You can’t. Here’s the file. Go do something.’ That is not IT responding to business use cases. What we’re doing is putting a platform together where yes, you can take it out of the database and just put it on storage – you just store on a different architecture, you store on Hadoop. You do that, you’ve now got the concept of a ‘data lake’.

Here’s a fact, I’ve got a whole talk on this: the concept of a data warehouse. You take all of your source data, you put it into one central point, a single source of truth, it’s clean, and you do all of your reporting. A true enterprise data warehouse that’s up to date – do you know how many there are in the world?

Zero. Not one.

Why? The types of data change too fast. The volume. And once you start to build it, and you get all of your ETL working, you get all of your reports working, it’s hard to change. You’re not agile enough. So the concept of a virtual data warehouse is you just bang the data into this data lake, and you can query it, in minutes. And that’s what we do. You throw the data onto Hadoop, I create an external table, I give you the permission, I say ‘go’. In minutes.
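The pattern described here (store the raw file first, declare its structure only when someone wants to query it) can be sketched in a few lines. This is a stand-in using a local directory and Python's standard library, not Hadoop or Pivotal's SQL engine; the file name, column names, and the `external_table` helper are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def external_table(path, columns):
    """Yield rows from a raw delimited file as dicts, applying the schema at read time.

    columns: list of (name, type) pairs; nothing is copied or transformed up front.
    """
    with open(path, newline="") as handle:
        for raw in csv.reader(handle):
            yield {name: cast(value) for (name, cast), value in zip(columns, raw)}

# Drop a raw file into a stand-in "data lake" directory (no ETL step).
lake = Path(tempfile.mkdtemp())
raw_file = lake / "atm_transactions.csv"
raw_file.write_text("c1,2013-08-01,120.50\nc2,2013-08-01,40.00\nc1,2013-08-02,60.00\n")

# Declare the schema only when someone wants to query the file.
schema = [("customer_id", str), ("day", str), ("amount", float)]

# Query: total amount per customer, read straight from the raw file.
totals = {}
for row in external_table(raw_file, schema):
    totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["amount"]
print(totals)
```

The design point is that a new question needs only a new schema declaration over the same file, rather than a change to a central data model.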

CIO New Zealand: You mentioned in your presentation at the EMC Forum that Hadoop itself has very poor accessibility via SQL, and that’s part of what Pivotal adds – SQL. So you’ve actually got a proper query optimiser and so forth.

What we’ve done: it’s called HAWQ, and I’ve got a customer already live in Singapore with this.

What we did is take years of investment in a powerful query optimiser, MPP, from the Greenplum database. We’ve taken all of that technology, and we’ve pushed it down onto Hadoop. It actually runs on the Hadoop name nodes. So now you’ve got a powerful SQL engine – we’re not selling it as a relational database – this is a powerful query optimiser, and it also gives you 100 percent ANSI compliance with all of the 2003 extensions.

CIO New Zealand: But you can still access it natively as Hadoop data, for applications built on Hadoop?

Correct.

You can develop the applications on GemFire and SQLFire – in-memory databases, very, very fast, very scalable – but they persist their data onto Hadoop, which is the cheapest storage.

Start to think of it this way: where did EMC come from 9 to 10 years ago? Storage. But they then started to take over VMware, RSA, Isilon, Data Domain, Greenplum – one after another. This is a big data strategy, and it’s come to fruition in [Pivotal] because what I can do as Greenplum technology, or Pivotal technology, is I can use [EMC’s] storage and their storage is raw disk. So Isilon is pure disk, but it can go up to 15 petabytes. As a single file system. So I’ve got this raw disk, and I can just put compute power in front of that. So now I’ve separated compute from storage. That is phenomenal.

The customer in Singapore wants to do video analytics – the video files are huge. They did not want servers to store this data – they wanted to separate the two. So how do I do backups of this data? I talked about the data lake – how are you going to back it up? Data Domain has a dedicated switch into my technology. Data Domain de-duplicates. So your backup file ends up at about 20 percent of the real data file. So you save – if you’ve got a 14-day backup strategy, you need Data Domain, and you can – by the way – use it on all of your Word, Excel, all of the other files.
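Why de-duplication pays off when the same data is backed up night after night can be shown with a toy sketch. It uses fixed-size chunks and SHA-256 digests purely for illustration; it is not Data Domain's implementation, and the chunk size, sample data, and function names are assumptions.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny so the example is easy to follow

def dedup_store(backups, chunk_size=CHUNK_SIZE):
    """Store each distinct chunk once, keyed by its SHA-256 digest.

    Returns (store, recipes): store maps digest -> chunk bytes, and each
    recipe is the list of digests needed to rebuild one backup.
    """
    store, recipes = {}, []
    for data in backups:
        recipe = []
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes

def restore(store, recipe):
    """Rebuild a backup from its recipe of chunk digests."""
    return b"".join(store[d] for d in recipe)

# Three nightly backups where only the last chunk changes each night.
backups = [b"AAAABBBBCCCC0001", b"AAAABBBBCCCC0002", b"AAAABBBBCCCC0003"]
store, recipes = dedup_store(backups)

logical = sum(len(b) for b in backups)          # bytes backed up
physical = sum(len(c) for c in store.values())  # bytes actually stored
print(logical, physical)
```

The more nights that share unchanged chunks, the smaller the stored size becomes relative to the logical backup size, which is why a multi-day retention policy benefits most.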

Who builds the security for us? RSA. World-class. So the EMC strategy has been phenomenal. Their business, only about 30 percent is storage now. 70 percent is all of the other stuff. There’s a whole strategy that’s gone on, that I think pads out the picture – so you don’t just think about the data lake managing big data, querying the data – that system could become mission-critical. And if that’s the case, how are you going to back it up? Data Domain gives me snapshots, mirror replication, that you do not get with Hadoop.

There’s a real good story, and my story has to be about all of this technology, but if you’re an IBM customer, or a NetApp customer, you can be talking to me as well, because I can – this is why we’re spinning off – because I can run on IBM hardware. I can run on HP hardware. Dell, Huawei. Any x86 servers can run our technology as well.

I run it on my laptop. I know I’m an MPP platform, but we own VMware, so I’ve got my MPP platform, plus Hadoop, running in a VMware image on my Apple Mac.

CIO New Zealand: So how can enterprises take advantage of Pivotal’s technology?

I talk to a lot of customers as I go around, and one of the biggest problems is agility.

Most of them have a data warehouse of some description – it’s already there. It’s not an enterprise data warehouse, but it’s also become pretty static. It’s hard to change. Because the ETL’s working, the reports are working, the portals are working, so they’re now in the position when somebody wants to do something new – social network analytics, right? Twitter, Facebook, even web logs – you want to analyse the web logs?

If you want to see a bank’s customers, and what they’ve been looking at on the website – you know the user ID on the website, but how do you link that to all of the relational data to get a holistic view of that customer? That’s very, very difficult. You can’t get the web logs into a relational database, the marketing organisation who wants to do the analytics – what you do is you throw them the web logs, say “go do something yourself, ’cause I can’t change”. They say “hang on a minute, I also want the customer information, the financials information…” [and you respond] “here, another file, another file…”

You’ve started off with “I want to put everything in a centralised data warehouse”, now you’re taking it out, and throwing files all over the place. We solve that problem.

I don’t need to change your data warehouse’s data model. A lot of my business is migrating non-performant relational data warehouses into a performant system, but with the ability that you recognise the vision in the future is to take any data, social network, web logs, ATM transactions, call data records, text from your CRM – anything that’s unstructured, throw it on the Hadoop cluster and it becomes a data lake that you can query in minutes, not months. No changes to the data model.

It gives you the ability to be agile, to get back ownership of the data, to attract users and to attract data. Not doing what you’re doing now – you’re repelling users, and repelling data – but get back ownership, put it in a secure environment, and the answer is “yes” to any line of business. That is agility.

There was an acronym that came out of [The Data Warehousing Institute], they came out with it about six years ago, and they called it “MAD analytics”.

MAD is an acronym – Magnetic: attract users and attract data, don’t repel them, which is what data warehouse technology does at the moment. The A is Agile: have the agility to adapt to the business requirement in minutes not months. D, Deep: bring the analytics to the data, not the data to the analytics. We’ve actually got – it’s a whole other topic of discussion – but we open-sourced MADlib as well, which is all of the advanced algorithms which run on the Pivotal platform to do the advanced analytics.

I think the real value from storing data comes from advanced analytics, not from reporting. What’s going to happen, what will the sales be next month, how many new accounts in the next quarter am I likely to open, what is the impact, how much am I likely to make from this marketing campaign before I actually run it? To know that.

I’ll tell you one story that brings it home. I think this is one of the most powerful stories about the advanced analytics.

Did you see the movie John Carter? I loved it, but it was a huge flop for Disney. What did they used to do? The big movie studios? The most important number to a movie studio: opening weekend in the US. How much money?

That’s what they’re trying to forecast. John Carter was a huge flop because they were using the old way – they were using call centres in the Philippines. They were calling up all around the US, “Have you heard about this movie, what’s it about, are you going to go see it, are your friends going to go see it?” They got it absolutely wrong. You watch now, even around Sydney, watch what the studios are doing. The marketing is changing every couple of weeks. The posters, et cetera.

So, did you see the movie The Avengers? Big difference between the two. Avengers, they did not use call centres. Walt Disney have Pivotal technology.

What they decided to do in every movie when they advertise it, they now say “oh and by the way, go to avengersmovie.com”, or whatever-movie-dot-com. You’re giving feedback there. They’re analysing social network data from Twitter, from Facebook, they’re looking at this. They’re getting a much more powerful story, and they’re doing it much more accurately. They actually forecast The Avengers number – they forecast it with over 90 percent accuracy.

Where is the value in that to them? Because if they can do that for two years, and they’re going to the theatres and saying “look, every time I tell you, I’m 90 percent correct, listen to me, I want more theatres for this movie”, the theatres will start to believe them. It’s a different way of doing business, it’s much more accurate, and it’s cheaper. And they’re agile enough to be able to do this.

So, this is what our technology does. Now you can’t take Twitter, and web logs, and Facebook data and put it in a relational database. You don’t want to do that. You can store it as a blob, but it’s horribly expensive to do it – that’s the key thing, that’s why we don’t.

The idea is you throw it onto disk on Hadoop, you link it to your relational data, and now you can do advanced analytics on it. You can use your SAS, your SPSS, and you can start to do advanced analytics on all of that data. As a user, it’s just data. You don’t know where it’s stored, you know what, 99 percent [of the time], you don’t care. As long as I see a table, I don’t care where it is. That is kind of what we are doing.
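Linking raw log data to relational records by a shared key, as described above, can be sketched as follows. In-memory Python structures stand in for both the Hadoop files and the relational table; the log format, customer rows, and the `page_views_by_segment` function are hypothetical.

```python
from collections import Counter

# Stand-in for relational customer data (hypothetical rows).
customers = {
    "u100": {"name": "A. Ng", "segment": "mortgage"},
    "u200": {"name": "B. Rao", "segment": "savings"},
}

# Stand-in for raw web log lines: "<timestamp> <user_id> <path>" (hypothetical format).
web_log = """\
2013-08-01T09:00:00 u100 /loans/home
2013-08-01T09:05:00 u200 /accounts/term-deposit
2013-08-01T09:07:00 u100 /loans/calculator
2013-08-01T09:09:00 u999 /contact
"""

def page_views_by_segment(log_text, customer_table):
    """Join each log line to its customer by user ID and count views per segment."""
    views = Counter()
    for line in log_text.splitlines():
        _timestamp, user_id, _path = line.split()
        customer = customer_table.get(user_id)
        if customer is None:
            continue  # visitor with no matching customer record
        views[customer["segment"]] += 1
    return dict(views)

result = page_views_by_segment(web_log, customers)
print(result)
```

The analyst sees one joined result and never handles the log file directly, which is the sense in which the data's storage location stops mattering to the user.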

For a CIO, it’s that. Magnetic: attract users and attract data. Agile: the answer’s ‘yes, what’s the question?’ And D, Deep: bring the analytics to the data, do not take the data to the analytics. Stay in control of your data and provide a service to the business. And oh, by the way, do it as a service. Don’t go out and procure technology. If you need a proof-of-concept, go to our website and download the software, as I’ve done, as a VMware image, and try this out. Go to Amazon Web Services, and try CloudFoundry.com. That’s where we’re heading, and I’m really excited about the technology.

Harley Ogier attended the EMC Forum in Sydney as a guest of EMC.
