N-Tier Is the New Frontier of IT Operations Management

Size Matters
The complexity of n-tier distributed application architecture, in and of itself, complicates management of those applications. Instead of monolithic code running exclusively on dedicated servers, we now have many software components spread across many platforms. In J2EE environments in particular, you might have multiple Enterprise JavaBeans (EJB) components plus an additional middleware tier in the form of application servers such as BEA WebLogic or IBM WebSphere.
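
To make that middleware indirection concrete, consider a minimal, hypothetical J2EE client sketch: the caller resolves a logical JNDI name rather than a physical host, and the application server decides which clustered instance actually services the call. The OrderService interfaces and the ejb/OrderService name are invented for illustration.

import java.rmi.RemoteException;
import javax.ejb.CreateException;
import javax.ejb.EJBHome;
import javax.ejb.EJBObject;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.rmi.PortableRemoteObject;

// Hypothetical remote and home interfaces for an order-processing EJB.
interface OrderService extends EJBObject {
    void placeOrder(String sku, int quantity) throws RemoteException;
}

interface OrderServiceHome extends EJBHome {
    OrderService create() throws CreateException, RemoteException;
}

public class OrderClient {
    public static void main(String[] args) throws Exception {
        // The client resolves a logical JNDI name; the provider URL comes
        // from jndi.properties, and the application server decides which
        // clustered instance actually handles each call.
        Context ctx = new InitialContext();
        Object ref = ctx.lookup("ejb/OrderService");
        OrderServiceHome home = (OrderServiceHome)
                PortableRemoteObject.narrow(ref, OrderServiceHome.class);
        OrderService service = home.create();
        service.placeOrder("SKU-1234", 2); // remote call; may land on any node
    }
}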

For all the wonderful flexibility and economy such an approach offers, the advantages come with a price. Application servers deliver the tremendous benefit of distributing workloads and providing load-balancing options, but they also introduce a level of complexity because their style of middleware creates numerous many-to-many connections. Though the basic design of a distributed application might be well understood, and its underlying hardware elements well monitored, at a given moment it can be daunting to obtain a comprehensive picture of the application's health. A single business transaction often kicks off a sequence of processes. Each process might be supported by events that transpire at the business-logic, hardware, or network level. A glitch in one can ripple through the others.
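
One way to picture the problem is a naive probe that checks, tier by tier, every component a single business transaction touches. The sketch below is illustrative only; the host names and ports are hypothetical, and a production monitor would check far more than socket reachability.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;

// Naive tier-by-tier reachability probe. A single business transaction
// depends on every entry here; one failed tier degrades the whole path.
public class TierHealthCheck {
    public static void main(String[] args) {
        Map<String, InetSocketAddress> tiers =
                new LinkedHashMap<String, InetSocketAddress>();
        tiers.put("web tier", new InetSocketAddress("web01.example.com", 80));
        tiers.put("app tier (WebLogic)", new InetSocketAddress("app01.example.com", 7001));
        tiers.put("database tier", new InetSocketAddress("db01.example.com", 1521));

        for (Map.Entry<String, InetSocketAddress> tier : tiers.entrySet()) {
            Socket socket = new Socket();
            try {
                socket.connect(tier.getValue(), 2000); // 2-second timeout
                System.out.println(tier.getKey() + ": reachable");
            } catch (IOException e) {
                System.out.println(tier.getKey() + ": UNREACHABLE ("
                        + e.getMessage() + ")");
            } finally {
                try { socket.close(); } catch (IOException ignored) { }
            }
        }
    }
}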

Server and application deployments were once infrequent. Now it is common to roll out new servers and patches weekly, taxing scarce system administration resources, increasing costs, and putting system availability at risk from misconfigured systems.

The Trouble With Troubleshooting
N-tier distributed applications usually have a web of interdependencies: for data, for bandwidth, for backup, for authentication. When a problem crops up, where do you start looking for the cause? In most cases, IT operations troubleshooters have ready sources of information. Monitoring software pushes alerts into the visual field of datacenter staff, and operations logs readily show lists of recent configuration changes. Fortunately, the collective experience and intelligence of IT operations staff bring most troubleshooting situations to rapid and successful conclusions.

But what about other occasions? What happens when the time required for diagnosis far exceeds the time needed for the actual fix? What happens when the monitoring software pops up dozens of alerts and it isn't obvious which ones are redundant? What happens when it becomes increasingly apparent that the most recent configuration change, the most frequent source of problems, is not the source of the application's illness? Savvy IT operations people will often turn to the change management database to look for pointers to the problem. However, that database usually records the intent of each change, not necessarily the patches and configurations that actually exist on the datacenter floor. Tick, tock. The escalation process begins. While outright application outages are rare, nagging degradations are not. Worse, IT operations can be embarrassed by a major degradation at a crucial time, even if it's a once-in-a-blue-moon occurrence.
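
The gap between intended and actual state can at least be made visible. As a rough sketch, assume the intended settings from the change management database have been exported to intended.properties and the server's live settings to actual.properties (both file names are hypothetical); a simple diff then flags drifted or missing keys.

import java.io.FileInputStream;
import java.util.Properties;

// Compares intended configuration (what the change database says should be
// deployed) against actual configuration (what is really on the server).
public class ConfigDrift {
    public static void main(String[] args) throws Exception {
        Properties intended = load("intended.properties");
        Properties actual = load("actual.properties");

        for (String key : intended.stringPropertyNames()) {
            String want = intended.getProperty(key);
            String have = actual.getProperty(key);
            if (have == null) {
                System.out.println("MISSING  " + key + " (intended: " + want + ")");
            } else if (!want.equals(have)) {
                System.out.println("DRIFTED  " + key
                        + " intended=" + want + " actual=" + have);
            }
        }
    }

    private static Properties load(String path) throws Exception {
        Properties p = new Properties();
        FileInputStream in = new FileInputStream(path);
        try { p.load(in); } finally { in.close(); }
        return p;
    }
}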

Putting Them Through Changes
Veteran IT operations directors will tell you that, paradoxically, it's the planned changes that can cause the most serious issues. Typically, unplanned changes are easily noticed and quickly correctable. Hardware breakdowns, server crashes, and router misconfigurations tend to stand out. A swap, a restart, or a redo can often restore an application to full health.

But planned changes, especially the kind that involves propagating software across a set of devices, can be notoriously difficult. Not only do immediate rollbacks send expenses (and blood pressures) up, they often force delays until the next deployment window, sometimes a week or more away. Despite deployment teams' best efforts, it's difficult to model complex n-tier distributed applications in staging environments, at least without considerable expense in facilities, equipment, and personnel. And as applications grow, it becomes more likely that differences between test and production environments will appear. Even subtle differences can wreck a deployment. This is an area in deep need of release-simulation software, and thankfully, it's coming.
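
One inexpensive precaution is to fingerprint each environment and diff the results before a deployment window. The sketch below captures only a handful of JVM system properties as a hypothetical starting point; a real comparison would also cover installed patches, classpaths, and network configuration.

import java.util.LinkedHashMap;
import java.util.Map;

// Captures a minimal runtime "fingerprint" so a staging snapshot can be
// diffed against production. Run once per environment and compare output.
public class EnvFingerprint {
    public static void main(String[] args) {
        Map<String, String> fingerprint = new LinkedHashMap<String, String>();
        fingerprint.put("java.version", System.getProperty("java.version"));
        fingerprint.put("java.vendor", System.getProperty("java.vendor"));
        fingerprint.put("os.name", System.getProperty("os.name"));
        fingerprint.put("os.version", System.getProperty("os.version"));
        fingerprint.put("user.timezone", System.getProperty("user.timezone"));
        fingerprint.put("file.encoding", System.getProperty("file.encoding"));

        // Even subtle mismatches (encoding, timezone) can wreck a rollout.
        for (Map.Entry<String, String> e : fingerprint.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}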

In the worst case, a deployed change or patch might appear successful at first, only for higher load to reveal a hitch later. This situation can give rise to mutual finger-pointing because, after all, IT operations might be unable to prove instantly what is causing the n-tier distributed application's illness. With so many runtime cross-dependencies, the evidence can be hard to assemble.
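
The "fine until load arrives" failure mode is easy to demonstrate. The following sketch ramps up concurrent requests against a hypothetical endpoint and prints the elapsed time at each step; a real load test would use a dedicated tool, but the shape of the exercise is the same.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Doubles the number of concurrent requests on each pass so that latency
// problems hidden at low load have a chance to surface.
public class LoadRamp {
    public static void main(String[] args) throws Exception {
        final URL url = new URL("http://app01.example.com/order/status");
        for (int threads = 1; threads <= 64; threads *= 2) {
            long start = System.currentTimeMillis();
            Thread[] workers = new Thread[threads];
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() { hit(url); }
                });
                workers[i].start();
            }
            for (Thread w : workers) w.join();
            System.out.println(threads + " concurrent requests took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }

    static void hit(URL url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            InputStream in = conn.getInputStream();
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) { /* drain the response */ }
            in.close();
        } catch (Exception e) {
            System.out.println("request failed: " + e.getMessage());
        }
    }
}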


