"We show you how to process the future".
 
ITUG Connection

Fundamentals of Capacity Management

The title of this article refers to capacity "management" deliberately. Management is an ongoing process, and capacity management should be an important ongoing function at every shop. For us in the fault-tolerant OLTP arena, capacity management is critical. Our organizations have invested a lot of money to ensure that the computers continue to process through a hardware failure. However, we know lots of shops who shrug their shoulders if they have a processing problem during their peak season. "It just happens," they say. It doesn't have to.

The unfortunate thing about capacity problems is that they surface when the transaction volume is highest. A capacity-related outage has high impact on lots of transactions. It has high visibility with management. And most times it can't be fixed on the fly.

The fact is, capacity-related outages can usually be anticipated and avoided. Wouldn't it be nice to get through the peak season without being paged for a problem?

Capacity and performance are not the same thing. While the term "performance analysis" is all-inclusive, capacity analysis is a distinctly different discipline. Performance analysis deals primarily with the application. Its measurements are based on the transactions: throughput and response. Performance is primarily affected by the design of the application, and secondarily by the capacity issues of the OS and the platform.

Capacity management is the discipline of managing the resources used by the applications. It focuses on the computer, devices and other entities that are in the transaction path. How much CPU, disk, memory, and comm capacity is being used to process the workload? How busy are the devices and server processes? When will we run out of capacity on any entity?

If it's done properly, capacity management ensures that there will always be adequate processing power to handle the expected workload that the outside world throws at the system.

Quantifying and tracking demand.

"The systems administration professional responsible for managing the computer complex must ensure that there is enough capacity to handle anticipated demand on the system." That's all well and good, but where do we start? We start with understanding the "demand" that is placed on a computer system, and how to measure it.

We all know what a transaction is, right? Wrong! A "transaction" is usually defined differently for each system, and each application. A transaction means one thing to a systems professional, and may mean something entirely different to the business partner. To a systems professional, a transaction relates to some unit of work on the computer. It typically is a message, a screen of data, or a batch of data. It may be a customer session, with lots of data being retrieved interactively. If we are to relate (as capacity managers) the transaction to the usage of the computer resources, it must be measurable.

But the first rule of tracking demand is to talk to your business partner. Understand thoroughly his or her concept of a transaction, and then relate it to the system's measurable transaction. For example, a gasoline purchase at the pump is typically one business transaction (a credit card purchase), but two system transactions (a pre-authorization, and then a completion). As a systems manager, you must understand the two definitions and be able to relate, explain, and measure them.

The relevant word in the last sentence is "measure". Without the ability to measure transaction data you cannot properly manage the capacity of your system. We recommend that you capture and compare three numbers each month: 1) The total transactions for the month, and from it calculate the average day's volume for the month; 2) The peak day of the month; and 3) The peak half-hour of the month.
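As a rough sketch of how these three numbers might be derived, assume the month's volume is already available as a count per half-hour. The function name and the input layout below are illustrative assumptions, not part of any tool described in this article.

```python
from collections import defaultdict
from datetime import datetime

SECONDS_PER_HALF_HOUR = 1800


def monthly_demand(half_hour_counts):
    """Summarize one month of demand.

    half_hour_counts: dict mapping the start of each half-hour (datetime)
    to the number of transactions counted in that half-hour.
    Assumption: every day of the month appears at least once in the input,
    so the number of distinct dates equals the days in the month.
    """
    # Roll the half-hour counts up to one total per calendar day.
    daily = defaultdict(int)
    for start, count in half_hour_counts.items():
        daily[start.date()] += count

    total = sum(daily.values())
    peak_day = max(daily, key=daily.get)
    peak_slot = max(half_hour_counts, key=half_hour_counts.get)

    return {
        "total": total,
        "average_day": total / len(daily),
        "peak_day": (peak_day, daily[peak_day]),
        "peak_half_hour": (peak_slot, half_hour_counts[peak_slot]),
        # Transactions per second during the peak half-hour.
        "peak_tps": half_hour_counts[peak_slot] / SECONDS_PER_HALF_HOUR,
    }
```

Keeping the half-hour as the base unit means the same input feeds all three numbers, and the peak half-hour converts directly to a TPS rate.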

Capturing a system's transaction volume will depend on the application and requires some thought. The first and ideal choice is to use existing reports, but be careful. Make sure you understand the source of the report. For example, on an ATM system where the NSK system acts as a switch routing transactions to a mainframe host, a report generated by the host would not be appropriate. It won't include the number of transactions that were processed by the NSK but failed to make it to the host due to routing or other problems.

Your next choice is to write a program to capture the volume using an application log file as input. For ACI's BASE24 ATM and POS application, we have written programs that use the TLF and PTLF as inputs and report the transaction volumes by half-hour in CSV (comma-separated) format. We can then use the CSV data in Excel or a database.
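The bucketing step of such a program can be sketched as follows. This is only an illustration: it assumes the transaction timestamps have already been extracted from the log, and it does not attempt to read the TLF or PTLF, whose record layouts are not covered here. The function names and CSV column names are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def half_hour_bucket(ts):
    """Round a timestamp down to the start of its half-hour."""
    return ts.replace(minute=0 if ts.minute < 30 else 30,
                      second=0, microsecond=0)


def count_by_half_hour(timestamps):
    """Count transactions per half-hour from an iterable of datetimes."""
    return Counter(half_hour_bucket(ts) for ts in timestamps)


def write_csv(counts, out):
    """Write the counts, oldest half-hour first, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["half_hour_start", "transactions"])
    for start in sorted(counts):
        writer.writerow([start.strftime("%Y-%m-%d %H:%M"), counts[start]])


if __name__ == "__main__":
    sample = [
        datetime(2000, 8, 15, 17, 29, 59),
        datetime(2000, 8, 15, 17, 30, 0),
        datetime(2000, 8, 15, 17, 45, 10),
    ]
    buffer = io.StringIO()
    write_csv(count_by_half_hour(sample), buffer)
    print(buffer.getvalue())
```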

We've also used the process entity's messages sent/received counters from Measure for tracking the volume. We choose a process class that best represents a system transaction, measure those processes, and then calculate the total messages for each half-hour. On an EMS-based application, we've used the EMS distributors as the process class; for Pathway applications, we've used PATHTCP2.

Once you have the numbers, create a small spreadsheet and keep 18 months of history in it. Watch month-to-month and 12-month growth for the average day of the month, the peak day of the month, and the peak half-hour of the month. We think you'll see that they don't grow in proportion to each other. Business patterns change, and while the total for the month may not grow at all, the peak day or the peak half-hour may grow significantly. (See Tables 1 and 2 for examples.)
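The two growth figures are simple ratios, and a spreadsheet handles them easily. For readers who prefer a script, here is a minimal sketch; the function names are our own, and the sample values are the August 2000, July 2000, and August 1999 average-day volumes from Table 1.

```python
def growth_pct(current, previous):
    """Percentage growth of current over previous."""
    return (current / previous - 1.0) * 100.0


def growth_table(series):
    """series: monthly values, oldest first.

    Returns one (month_to_month, twelve_month) pair per month.
    A value is None when there is not enough history to compute it.
    """
    rows = []
    for i, value in enumerate(series):
        month_to_month = growth_pct(value, series[i - 1]) if i >= 1 else None
        twelve_month = growth_pct(value, series[i - 12]) if i >= 12 else None
        rows.append((month_to_month, twelve_month))
    return rows


if __name__ == "__main__":
    # Average-day volumes from Table 1.
    aug_2000, jul_2000, aug_1999 = 1_633_060, 1_621_998, 1_544_953
    print(round(growth_pct(aug_2000, jul_2000), 2))  # month-to-month growth
    print(round(growth_pct(aug_2000, aug_1999), 2))  # 12-month growth
```

Running the same calculation on the average day, the peak day, and the peak half-hour makes it easy to see when the three are drifting apart.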

For an even clearer view of your system, chart the data. Every month, for each of our systems, we produce 18-month historical charts, one-month charts, and one-week charts by the half-hour. With these tables and charts we can see cycles over the course of time. Certain times of the day will have heavier traffic than others; certain days of the week will be heavier; certain times of the month will be heavier; and certain seasons or individual days of the year will be heavier. For example, an ATM system will typically show two peaks each weekday: one around lunchtime, and one around five o'clock. ATM systems will be quite busy on payday, but the holiday season doesn't bring that large an increase. POS systems are always busiest on weekends. They have a distinct and frequently daunting set of peaks to climb during the holidays. Pharmacy systems worry about cold and flu season. Cash management systems always have their heaviest load after long weekends, especially Thanksgiving. We cannot stress enough the importance of understanding the demand on your system.

TABLE 1

TOTAL TRANSACTIONS AND PEAK DAY

            FROM                 TO         AVG        TOTAL    GROWTH    12 MOS
'Tue Aug 01 2000'  'Thu Aug 31 2000'   1,633,060   50,624,860     0.68%     5.70%
'Sat Jul 01 2000'  'Mon Jul 31 2000'   1,621,998   50,281,945    -2.96%     7.26%
'Thu Jun 01 2000'  'Fri Jun 30 2000'   1,671,417   50,142,530     1.29%    18.19%
      :
'Sun Aug 01 1999'  'Tue Aug 31 1999'   1,544,953   47,893,555     2.17%          
'Thu Jul 01 1999'  'Sat Jul 31 1999'   1,512,181   46,877,640     6.93%          
'Tue Jun 01 1999'  'Wed Jun 30 1999'   1,414,152   42,424,576     0.92%          
'Sat May 01 1999'  'Mon May 31 1999'   1,401,208   43,437,462     4.49%          
'Thu Apr 01 1999'  'Fri Apr 30 1999'   1,340,984   40,229,546                    

         PEAK DAY  TRANSACTIONS    GROWTH    12 MOS
'Fri Aug 04 2000'     2,011,097    -1.41%     4.91%
'Fri Jul 14 2000'     2,039,925    -5.77%     2.58%
'Fri Jun 30 2000'     2,164,920     6.28%    23.68%
      :
'Fri Aug 13 1999'     1,916,928    -3.60%          
'Fri Jul 30 1999'     1,988,597    13.61%          
'Fri Jun 18 1999'     1,750,424    -3.92%          
'Fri May 28 1999'     1,821,888     3.00%          
'Fri Apr 30 1999'     1,768,903                     

TABLE 2

PEAK HALF-HOUR

                 DATETIME  TRANSACTIONS       TPS    GROWTH     12 Mo
      '17:30 Fr 08/15/00'        87,049     48.36     2.57%    11.80%
      '17:00 Fr 07/14/00'        84,867     47.15    -4.17%     0.33%
      '17:00 Fr 06/30/00'        88,564     49.20     5.38%          
              :
      '17:00 Fr 09/01/99'        88,127     48.96    13.19%          
      '14:00 Sa 08/21/99'        77,858     43.25    -7.96%          
      '17:00 Fr 07/30/99'        84,592     47.00               
                                    

Relating demand to usage.

Now that we are capturing the "demand" and "usage" on the system, the next step is to relate the demand to the usage. When asking the question "How do I relate demand to usage?" you should be asking "What do I design the system for?" The answer is the peak half-hour, of course. A performance management friend of ours says "Designing for the average is like building a bridge for the average height of a sailboat's mast." That guarantees problems. Also from our friend: "And what are your clients paying you for? The performance you deliver during those peak times. You're always measured by your worst performance, not your best." If you design for the anticipated peak half-hour, based on the observed history and the business partner's projections, you have an excellent chance of getting through your next peak season without getting paged.

First, it is important to collect usage data around the clock, every day of the year. We use Measure to collect usage data for CPU, disk, process, and comm lines. Then, on a nightly basis, a batch job reduces the usage data to CSV format, which can be easily downloaded to our NT server for further analysis. Using this procedure, we can keep 18 months' worth of data on our NT server and 30 days' worth of Measure data on the NSK system. With this wealth of data, we can analyze the peak half-hour and research any unusual activity. All this is done with negligible impact on the measured system.

Once we have usage data, we begin by comparing the NSK system's usage to the demand at the peak half-hour. We create a table that shows the "node activity at 1 TPS". From Table 2, the rate for the peak half-hour that occurred on 8/15/00 at 17:32 was 48.36 TPS. We divide this number into the node usage numbers. For example, the average CPU was 42.85% busy during the peak half-hour. The CPU % cost at 1 TPS would be (42.85% * 16 CPUs)/48.36, or 14.18%, for this 16-CPU system. Table 3 shows a complete table with an 18-month history.
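The arithmetic behind the CPU % cost column can be written out in a few lines. This is a sketch using the figures quoted above; the function names are illustrative only.

```python
SECONDS_PER_HALF_HOUR = 1800


def tps_for_half_hour(transactions):
    """Average transactions per second over one half-hour."""
    return transactions / SECONDS_PER_HALF_HOUR


def cpu_cost_per_tps(avg_cpu_busy_pct, cpu_count, tps):
    """Total CPU % consumed across the node for each 1 TPS of demand."""
    return avg_cpu_busy_pct * cpu_count / tps


if __name__ == "__main__":
    tps = tps_for_half_hour(87_049)            # peak half-hour from Table 2
    print(round(tps, 2))                       # TPS rate for the half-hour
    print(round(cpu_cost_per_tps(42.85, 16, 48.36), 2))  # CPU % cost at 1 TPS
```

The other columns of Table 3 follow the same pattern: take the node total for the half-hour and divide by the TPS rate.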

Table 3 provides us with some interesting information about our NSK system. Between June and July of 1999, two things occurred: transaction volume increased by 7% and a new component was added to the application. The new component increased the CPU % cost from 12% to 15% per TPS. Although in later months the CPU % cost settled to a 13-14% cost, with the Measure data we were able to pinpoint the additional cost to the new component.

Also, note that on 12/24/99 the TPS rate was 56, but the CPU % cost was a low 12.39%, which suggests the system runs more efficiently at high transaction volumes. Lastly, we can use this table to do some capacity planning. If our business partner expects to grow to 120,000 transactions for the peak half-hour during the next Christmas season, some simple arithmetic tells us whether our system has enough CPU capacity to make it through the holiday season. A half-hour is 1,800 seconds, so 120,000 transactions works out to about 67 TPS. Taking a conservative 15% as the CPU % cost, on a well-balanced system the average CPU will be (15%/TPS * 67 TPS)/16, or approximately 63% busy. That's comfortable enough to make it through the Christmas season!
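The same projection can be sketched in code. The function names are hypothetical, and the 80% ceiling in the second example is purely an illustrative assumption, not a recommended threshold. Because the code does not round the TPS rate up to 67, it gives 62.5% rather than the rounded 63% above.

```python
SECONDS_PER_HALF_HOUR = 1800


def projected_avg_cpu_busy(peak_half_hour_txns, cost_pct_per_tps, cpu_count):
    """Projected average CPU busy % on a well-balanced node."""
    tps = peak_half_hour_txns / SECONDS_PER_HALF_HOUR
    return cost_pct_per_tps * tps / cpu_count


def max_tps_at_busy(target_busy_pct, cost_pct_per_tps, cpu_count):
    """TPS rate at which the average CPU reaches target_busy_pct."""
    return target_busy_pct * cpu_count / cost_pct_per_tps


if __name__ == "__main__":
    # 120,000 transactions in the peak half-hour, 15% CPU cost per TPS, 16 CPUs.
    print(round(projected_avg_cpu_busy(120_000, 15.0, 16), 1))
    # Illustrative only: the TPS rate at which the average CPU would hit 80% busy.
    print(round(max_tps_at_busy(80.0, 15.0, 16), 1))
```

Both functions assume a well-balanced system and a constant cost per TPS; Table 3 shows that the cost does move from month to month, which is why a conservative figure is used.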

TABLE 3

Activity/1 TPS At Peak Half-Hour For the Node

                                COST/1 CPU
           Datetime     TPS     CPU %    INTR %      SWAP  DISK I/O  DISPATCHES
'17:32 Fr 08/15/00'   48.36     14.18      2.47      0.01     10.95      281.37
'17:02 Fr 07/14/00'   47.15     14.36      2.53      0.01     10.75      292.67
'17:02 Fr 06/30/00'   49.20     14.38      2.53      0.01     10.78      293.28
'17:02 Fr 05/12/00'   46.69     14.98      2.53      0.01     10.93      293.43
'18:32 Fr 04/14/00'   46.38     14.98      2.53      0.01     10.21      292.52
'17:02 Fr 03/31/00'   49.19     14.37      2.56      0.01      9.95      294.54
'17:04 Fr 02/18/00'   46.52     13.34      2.36      0.00      9.68      272.09
'17:03 Fr 01/14/00'   45.51     14.89      2.60      0.01      9.43      323.44
'12:03 Fr 12/24/99'   55.56     12.39      2.19      0.00      9.58      253.23
'16:32 We 11/24/99'   45.03     13.01      2.32      0.01     10.65      267.38
'17:02 Fr 10/29/99'   46.16     13.04      2.37      0.01     11.77      268.70
'17:02 Fr 09/01/99'   48.96     13.08      2.41      0.01     13.12      267.61
'14:04 Sa 08/21/99'   43.25     13.24      2.29      0.01     13.48      273.02
'17:02 Fr 07/30/99'   47.00     14.84      2.52      0.00     14.27      304.60
'12:02 Fr 06/11/99'   38.47     11.96      2.07      0.01      9.37      235.13
'12:02 Fr 05/28/99'   41.16     11.98      2.10      0.00      9.47      240.58
'05:02 Fr 04/30/99'   41.91     11.86      2.18      0.00     11.08      242.72
'05:02 Fr 03/19/99'   35.51     11.02      2.09      0.00     12.37      238.82

Another table we create is the processor profiles table for the current peak half-hour. Table 4 shows usage information for each CPU. This table shows us how well balanced our system is. The busiest CPU is CPU 12 at 60.81% and the least busy is CPU 4 at 21.16%. Here's an opportunity to perform some re-balancing. Also, note that there are no disk I/Os on CPUs 6, 11, 13, and 15. Was this an oversight, or is it intentional? It is worth investigating.

TABLE 4

Processor Profiles During Peak Half-Hour on 8/15/00 at 17:32

CPU      CPUBZ      INTBZ      SWAPS      DISKIO      MINMB      QTIME
  0      32.31       6.05       0.05       34.43      96.00       0.65
  1      32.81       5.86       0.00       30.75      65.00       0.63
  2      26.98       5.68       0.13       54.60     104.00       0.45
  3      27.56       4.84       0.00       34.70      91.00       0.45
  4      21.16       4.07       0.00       34.59     158.00       0.26
  5      33.79       7.74       0.00       61.24     138.00       0.58
  6      49.09       8.60       0.01        0.00      90.00       0.89
  7      40.01       8.66       0.00       43.55     101.00       1.00
  8      45.14       7.16       0.00       43.44     118.00       0.76
  9      48.18       8.90       0.00       75.97     107.00       0.92
 10      60.73      11.65       0.00       21.33     135.00       1.65
 11      52.46       8.50       0.00        0.00     138.00       1.13
 12      60.81       9.73       0.00       55.73     142.00       1.42
 13      59.60       8.06       0.08        0.00     142.00       1.28
 14      49.53       7.75       0.03       39.39     149.00       0.96
 15      45.45       6.18       0.02        0.00     152.00       0.78
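The balance checks described for Table 4 are easy to automate once the profile is in CSV form. The sketch below hard-codes the CPUBZ and DISKIO columns from Table 4; the function name and result layout are our own assumptions.

```python
# CPU busy % and disk I/O rate per CPU, copied from Table 4.
CPU_BUSY = {
    0: 32.31, 1: 32.81, 2: 26.98, 3: 27.56, 4: 21.16, 5: 33.79,
    6: 49.09, 7: 40.01, 8: 45.14, 9: 48.18, 10: 60.73, 11: 52.46,
    12: 60.81, 13: 59.60, 14: 49.53, 15: 45.45,
}
DISK_IO = {
    0: 34.43, 1: 30.75, 2: 54.60, 3: 34.70, 4: 34.59, 5: 61.24,
    6: 0.00, 7: 43.55, 8: 43.44, 9: 75.97, 10: 21.33, 11: 0.00,
    12: 55.73, 13: 0.00, 14: 39.39, 15: 0.00,
}


def balance_summary(cpu_busy, disk_io):
    """Report the busiest and least busy CPUs and any CPUs with no disk I/O."""
    busiest = max(cpu_busy, key=cpu_busy.get)
    least_busy = min(cpu_busy, key=cpu_busy.get)
    return {
        "average_busy": sum(cpu_busy.values()) / len(cpu_busy),
        "busiest": (busiest, cpu_busy[busiest]),
        "least_busy": (least_busy, cpu_busy[least_busy]),
        # Gap between the busiest and least busy CPU, in percentage points.
        "spread": cpu_busy[busiest] - cpu_busy[least_busy],
        "no_disk_io": sorted(cpu for cpu, rate in disk_io.items() if rate == 0.0),
    }


if __name__ == "__main__":
    for name, value in balance_summary(CPU_BUSY, DISK_IO).items():
        print(name, value)
```

A large spread between the busiest and least busy CPU is the cue to look at re-balancing, and the list of CPUs with no disk I/O is the list to investigate.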

In Summary

We'll say it again: The first rule is to talk to your business partner. In many companies, this person is the best source of information about the company's plans. If your system is chugging along fat and happy at 50% busy, and you're unaware that the business partner is about to launch a new advertising campaign or a new product which is expected to double sales within three months, you're going to be in trouble.

Just knowing that sales are going to double in the near future is not enough to confidently predict the impact on your NSK system. You need to understand and relate the difference between business and systems transactions. You need historical data. You need to be collecting demand and usage data around the clock, every day of the year. And you need to relate usage to demand.

This is much easier said than done. The process of collecting and reporting usage and demand data can be a daunting task. But we hope this article has shown that taking the time and effort to produce these tables will give you the confidence to plan for your system's future in a simple and efficient manner.

This article was reprinted with permission from The Connection, March/April 2001, Volume 22, No. 2.

 