Computerworld

Heat maps can monitor system latency

New method of latency analysis identified by Oracle staffer

While datacentre managers have long used heat maps to help determine where to best position racks of servers and cooling units, this mode of visualisation can also be handy for better understanding system latency, argues an Oracle engineer in the July issue of Communications of the ACM.

(The journal is the official publication of the US Association for Computing Machinery.)

"Presenting latency as a heat map is an effective way to identify subtle characteristics that may otherwise be missed," writes Brendan Gregg, a principal software engineer at Oracle, in the article "Visualising system latency."

Gregg also cautions that while such visualisation can give a greater overview of what is taking place, it doesn't always explain the behaviour being observed. Still, heat maps can provide insight into tackling the next generation of datacentre latency issues.

Pinpointing the causes of system sluggishness has long been a frustration for datacentre managers and system administrators. Network analysis tools are available to visualise network performance, though other aspects of a system, such as the responsiveness of disks in a storage array, have been harder to quantify.

Sun Microsystems has long offered a tool for its Solaris operating system, called DTrace, that can characterise latency within various parts of a system on a second-by-second basis. The overwhelming volume of data it can produce, however, still needs to be boiled down into a readily understandable form.

Enter Gregg's heat maps. Heat maps are a simple visualisation technique in which, on a two-dimensional graph, different values are represented by different colours.

Heat maps can reveal more than the line graphs on most network analysis tools, because while line graphs "would allow average latency to be examined over time, the actual makeup or distribution of that latency cannot be identified beyond a maximum, if provided," he writes.
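To illustrate the point about averages, here is a minimal sketch in Python. The two sample sets and the `histogram` helper are hypothetical and are not taken from Gregg's article. Both workloads share the same average latency, yet only the per-bucket counts reveal that one of them contains a single very slow operation.

```python
from statistics import mean

# Hypothetical latency samples in milliseconds for one time interval.
# Both workloads have the same average but very different makeups.
steady = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
bimodal = [1, 1, 1, 1, 1, 1, 1, 1, 1, 91]


def histogram(samples, bucket_ms=10):
    """Count samples per latency bucket of width bucket_ms."""
    counts = {}
    for sample in samples:
        bucket = (sample // bucket_ms) * bucket_ms
        counts[bucket] = counts.get(bucket, 0) + 1
    return dict(sorted(counts.items()))


steady_mean = mean(steady)
bimodal_mean = mean(bimodal)
steady_hist = histogram(steady)
bimodal_hist = histogram(bimodal)

print(steady_mean, steady_hist)
print(bimodal_mean, bimodal_hist)
```

A line graph of the average would draw these two intervals at the same height. A heat map column built from the bucket counts would not.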

Heat maps are also good for rapidly identifying outliers, which then can be examined in greater detail, he argued.

For the article, Gregg plotted a variety of unusual workload conditions, using the Oracle Analytics visualisation software to render data gathered by DTrace. He set the X axis to represent time and the Y axis to represent latency. The darkest colours marked the time and latency ranges with the most input-output operations.
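The layout described above can be sketched in a few lines of Python. This is an illustration only, not the software Gregg used: the function names, the bucket sizes and the sample events are all assumptions. Each operation is recorded as a (timestamp, latency) pair and counted into a grid cell, and cells with higher counts are drawn with a denser character.

```python
def build_heat_map(events, time_step, latency_step, n_cols, n_rows):
    """Bucket (timestamp_s, latency_ms) pairs into an n_rows x n_cols grid.

    Row 0 holds the lowest latencies. Latencies are assumed non-negative;
    values above the top row are clamped into it so outliers stay visible.
    """
    grid = [[0] * n_cols for _ in range(n_rows)]
    for timestamp, latency in events:
        col = int(timestamp // time_step)
        row = min(int(latency // latency_step), n_rows - 1)
        if 0 <= col < n_cols:  # ignore events outside the plotted time range
            grid[row][col] += 1
    return grid


def render(grid, shades=" .:*#"):
    """Draw the grid as text, highest latency on top, denser shade = more I/O."""
    peak = max(max(row) for row in grid) or 1
    top = len(shades) - 1
    lines = []
    for row in reversed(grid):
        # Round up so that any non-zero count gets a visible shade.
        lines.append("".join(shades[(count * top + peak - 1) // peak] for count in row))
    return "\n".join(lines)


# Hypothetical sample: six operations over three seconds.
events = [(0.2, 1.0), (0.7, 1.5), (0.9, 9.0), (1.1, 1.2), (1.5, 4.8), (2.3, 25.0)]
grid = build_heat_map(events, time_step=1.0, latency_step=2.0, n_cols=3, n_rows=5)
print(render(grid))
```

Clamping slow operations into the top row is one design choice among several. It keeps outliers on the chart at the cost of hiding how slow they really were.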

In many cases, he found simple workloads can produce a variety of complex, and sometimes unexplainable, patterns.

In one case, a small amount of data was sequentially written to a pool of disks. Gregg expected to see only "white noise" representing random latency. Instead, the heat map showed latency levels rising and falling in distinct patterns for some unknown reason. "Visualising latency in this way clearly poses more questions than it provides answers," he said.

Another pattern proved equally mysterious. The test involved sending a stream of data to 44 disks. First, data would be sent to only one disk, then to two disks, and so on, until all 44 disks were receiving data.

Gregg expected disk latency to increase in a linear fashion as the system buses became saturated with data.

Instead, the latency would increase, then subside somewhat, before increasing some more.

Gregg also used a heat map to reveal the shock effects that loud noise has on servers, a phenomenon that he demonstrated a few years back on YouTube.

Although these heat maps were produced on a system running the Zettabyte File System (ZFS) over the Network File System (NFS) protocol, this approach could be used to characterise the operations of other file systems, and even other components such as CPUs, Gregg writes.