"We show you how to process the future".
 
SYSTEMS MANAGER CORNER
 


» Security Corner

 

ITUG Connection

Sleuthing a problem: Case Study

We recently were asked to help find the cause of a problem that had been plaguing a system for several weeks: Activity on the system was high, apparently for no discernable reason. CPU activity was high, with the average CPU busy for all CPUs over 70% busy. The CPUs were spending hundreds of minutes each day above 90% CPU busy. The disks were unusually busy, with basically constant activity throughout the day. The client was facing a major upgrade, but was reluctant to do the upgrade before they knew what was causing the problem.

The system: S72000 server with 4 CPUs and 16 disks running OS level G06.08. Each CPU has 1048 MB of memory. It is a stand-alone system with only TCP/IP connectivity.

The application: This is a web-enabled system, with clients accessing the system via the Internet. There are two applications, one of which is accessed by telnet, the other via remote service calls (RSC). It uses NonStop-SQL for the database, with a very low insert rate. There are few batch jobs.

Unfortunately, there are no application logs that tell us what the users are doing. Normally one would hope to find an application log which records when users sign on or off, what kind of requests are made, and what are the results of those requests. A log of this type should be built into every application, so that customer usage patterns (at least) can be identified and used for optimizing the system.

Without the log we will have to do some investigation to find out what the users are doing. We are used to working with systems on a "black box" basis: i.e. we don't want or need to understand the intricacies of the customer's code to investigate problems.

The good news is that the client's users were not complaining. They seemed to be getting reasonable response, as measured by real-users connecting via the WWW.

A quick glance of the system using Viewsys confirmed all the CPUs were over 70% busy. To get a comprehensive view of the system we started capturing performance data 24 x 7 for CPUs, disks, TMF, and processes using Measure. After collecting a week's worth of data, we began our in-depth analysis of the problem. We plotted key statistics for each entity on a half-hour basis. Creating week by half-hour charts for each entity gave us a graphical view of what the system was doing for a complete 7 x 24-hour period.

These charts showed there was very little swapping and plenty of memory available on each CPU. At its peak the TMF transaction rate for a CPU was less than 6 transactions per second. The disk queues were less than 20% busy, but CPU 2 was averaging around 80 to 100 i/o's per second. The most interesting charts were the CPU utilization and process charts. The CPU utilization charts showed that all CPUs were equally busy and except for a few periods the average CPU busy remained above 55% busy - even at 3:00 in the morning. Chart 1 shows the CPU and disk i/o rate for one 24-hour period, by half-hour.

Chart 1: CPU and Disk I/O Rate - One Day by Half-Hour

For a customer access system, this chart shows one of two things: Either the customers are evenly distributed across the globe and accessing the system constantly, or someone or something is submitting requests and receiving data around the clock. From talking to our client, we know that the former was not the case. So why weren't we seeing the usual "bell curve" for usage following customer activity?

We then looked at the TCP/IP, telnet, and RSC process classes. We plotted the sent and received message counts by half-hour for each process classes. These charts revealed some very interesting and unusual trends. The TCP/IP sent counts followed a normal demand curve peaking at around 18,000 messages per half-hour, but with the exception of a couple of periods the received counts remained above 1.7 million messages per half-hour (or 944 msg/sec) throughout the day. The RSC sent and received counts followed a normal demand curve peaking at around 250,000 messages and 4,500 messages per half-hour, respectively. The most interesting chart was the telnet chart. This chart showed that the sent and received messages for telnet were consistently above 1.4 million messages per half-hour. Charts 2 through 4 show the message counts for each of these process classes for one 24-hour period, by half-hour.

Chart 2: TCP/IP sent and received messages - One Day by Half-Hour

Chart 3: Telnet sent and received messages - One Day by Half-Hour

Chart 4: RSC sent and received messages - One Day by Half-Hour

Based upon the process charts, we deduced that some user (or set of users) was automatically generating requests to the telnet application. The application was responding with lots of output messages.

The next step was to gather evidence to support our theory. At this point, we took a look a closer look at the TCP/IP and telnet process classes. We wanted to see if we could narrow the traffic to a specific TCP/IP and telnet process. Plotting the received message counts for each TCP/IP process and the sent message counts for each telnet process, revealed that the TCP/IP process, $ZTC1, and the telnet process, $ZTN1, were receiving and sending the bulk of the messages, respectively. (Charts 5 and 6 show the message counts by TCP/IP and telnet process.)

Chart 5: Telnet messages sent - One Day by Half-Hour

Chart 6: TCP/IP messages received - One Day by Half-Hour

So, yes, most of the traffic was coming through a single IP address. However, in conversations with our client, we learned that although there were about a dozen IP addresses available, the majority of the end-users access the system via the IP address associated with the TCP/IP process in question. So narrowing down to a specific TCP/IP and telnet process did not narrow down the user pool. A minor setback.

The final piece of evidence we needed was to tie the high CPU usage with the large message counts. To do this we studied the busiest process classes in terms of messages counts and CPU usage. The busiest process classes were the TSYSDP2, DBS, MLP, HDEPOT, and TI. At this point, we knew very little on how the application processes interacted with each other. But looking strictly at the performance data, we were able to gather the facts listed below. Charts 7 and 8 show the message and CPU busy for the process class.

Chart 7: CPU usage by process class - One Day by Half-Hour

Chart 8: Messages sent by process class - One Day by Half-Hour

  1. TI has the most sent messages with over 1.5 million messages per half-hour or 833 msg/sec; however, it used less than 20% of the CPU. These message counts are about the same number of messages being sent and received as the telnet. Between 00:00 and 07:00, $T13 and $T14 account for majority of the sent messages.
  2. MLP message counts hovers between 750,00 and 1 million and CPU utilization is between 45% and 64%.
  3. The three busiest process classes in term of CPU utilization are tsysdp2, dbs, and mlp. Between 00:00 and 07:00, these 3 process classes account for about 150% of the CPU utilization or CPU busy of 37.5% per CPU. We did know the dbs process is the process that handles queries and updates to the SQL database. We figured the high CPU utilization rate for the DP2 process class could be explained by the high access rate to the SQL database. The function of the mlp process class was unknown to us, but we were able to determine that certain mlp processes were much busier in terms of % CPU busy than others. The same held true for the DBS process class.
  4. None of these process classes follow a typical demand curve either in terms of CPU usage or message counts.

At this point, we felt we had gathered enough evidence to support our theory. As we discussed the evidence with our clients, we learned the TI processes handled the requests from the telnet processes and the MLP process formatted the data from the SQL query performed by the DBS process for display. So far so good, our evidence points to the same application. When we revealed our theory that an end-user was causing the problem, our client disclosed that they had one end-user that was the "resident number cruncher". This "number cruncher" has 4 PCs running automated scripts, which send query requests every 10 minutes 24 x 7 to the telnet application resulting in a very large response. The "number cruncher" fit our profile in that he/she accessed the application on the IP address in question and his/her requests hit the processes in question.

We had our suspect. Our next step is to nail him/her. The client plans to work with the "number cruncher" to resolve this problem. The first step is to have the "number cruncher" move to a different IP address. If the TCP/IP and telnet traffic also moves then we know we are on the right trail. The next step is to have the "number cruncher" shutdown the scripts or block "number cruncher's" access to the application for a few hours and see the impact on the system. If we see that dramatically lower message counts and lower CPU usage then we know our suspect is the culprit.

This article was reprinted with permission from The Connection , November/December 2001, Vol. 22 No. 6.

 
©Copyright 2009
Company | Ban Bottlenecks | Consulting | Software | Papers | Home | Sitemap