"We show you how to process the future".
 
BAN BOTTLENECKS
 



 

Ban Bottlenecks

Real Life Examples

Here are recent real-life examples that illustrate the ways the Ban Bottlenecks service has saved our clients money.

  1. PROBLEM: The client called us because their software vendor was struggling with a problem which was occurring across a complex of nearly a dozen systems, including database, Java, and web servers.

    SOLUTION: The techniques used in our Ban Bottlenecks service provide graphic coverage of every system, using multiple perspectives and high granularity.  For this client we were able to demonstrate to the teams (vendor and internal) that their problems included a lack of memory, memory leaks, occasional network dropouts, and poor process scheduling.  We also found some unexpected consequences of their fail-over procedures.

    Continuing work with the client confirmed the effectiveness of the solutions implemented by the team.

  2. PROBLEM: We noticed that the service levels of the client's ATM system were starting to degrade from their historical levels. While transactions were not yet timing out, we were seeing a problem.  The client hadn't yet had an indication of problems, other than our report.

    SOLUTION: Using our standard reports and analysis, we were able to demonstrate that the service level problem occurred on weekdays around 9:00 a.m. and appeared to be related to the service provided by a back-end external authorizer.  With a little more analysis, we determined that the link to that authorizer was saturated at the problem time.  After we discussed the issue with the client, they upgraded the comm lines from 19.2 Kbps to 64 Kbps, and the problem went away.

    All this occurred before the system started dropping transactions.
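The saturation reasoning in this example comes down to simple arithmetic. The sketch below is illustrative only: it assumes the line speeds are 19,200 and 64,000 bits per second, and the message rate and message size are hypothetical figures, not the client's measurements.

```python
def link_utilization(messages_per_second, bytes_per_message, line_bits_per_second):
    """Fraction of the line's capacity consumed by the offered traffic."""
    offered_bits_per_second = messages_per_second * bytes_per_message * 8
    return offered_bits_per_second / line_bits_per_second

# Hypothetical 9:00 a.m. load: 10 authorization messages per second, 250 bytes each.
before = link_utilization(10, 250, 19_200)   # above 1.0 means the line is saturated
after = link_utilization(10, 250, 64_000)

print(f"old line: {before:.0%} utilized, new line: {after:.0%} utilized")
```

Any value at or above 100% means messages are queuing on the link, which shows up as degraded response time well before transactions start timing out.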

  3. PROBLEM: An insurance claims processor recognized that they had a severe problem with their batch processing.  They were not meeting their batch window, and wanted to explore all the alternatives before they invested in an expensive hardware upgrade.

    SOLUTION: Ban Bottlenecks is able to produce "emphasis reports" which focus on a specific time period, in this case the batch window.  We worked with the client to a) identify the processes and programs which seemed to use an unusual amount of CPU so that they could be optimized; b) recommend to the client that they institute a program to minimize the number of transient processes created by the batch streams; c) optimize the CPU usage with intelligent allocation of the heaviest processes across multiple CPUs; and d) increase the amount of disk cache pages allocated to the busiest disks.
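Step c) above, spreading the heaviest processes across multiple CPUs, can be sketched with a greedy "heaviest first" assignment. This is one common balancing heuristic, not necessarily the method used for this client; the process names and CPU-second figures are hypothetical.

```python
import heapq

def assign_to_cpus(cpu_seconds_by_process, cpu_count):
    """Greedy assignment: place the heaviest process on the least-loaded CPU."""
    # Min-heap of (total load, cpu index); popping yields the least-loaded CPU.
    heap = [(0.0, cpu) for cpu in range(cpu_count)]
    heapq.heapify(heap)
    assignment = {cpu: [] for cpu in range(cpu_count)}
    heaviest_first = sorted(
        cpu_seconds_by_process.items(), key=lambda item: item[1], reverse=True
    )
    for name, cost in heaviest_first:
        load, cpu = heapq.heappop(heap)
        assignment[cpu].append(name)
        heapq.heappush(heap, (load + cost, cpu))
    loads = {cpu: load for load, cpu in heap}
    return assignment, loads

# Hypothetical batch-window measurements (CPU seconds per process).
batch = {
    "claims_load": 900.0,
    "edit_check": 600.0,
    "pricing": 500.0,
    "reports": 400.0,
    "archive": 200.0,
}
assignment, loads = assign_to_cpus(batch, 2)
print(assignment)
print(loads)
```

The heuristic does not guarantee a perfectly even split, but it is simple and usually gets close when a few processes dominate the batch window.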

  4. PROBLEM: A nationwide pharmacy client began an initiative to process a new format of claims message. This involved a major software change to their on-line system.  They were concerned about the increased resource utilization and how it would affect their ability to handle their peak season.

    SOLUTION: As part of each standard Ban Bottlenecks report we do a cost-per-transaction (processing effectiveness) analysis, and track the results over 24 months.  As they converted stores to the new application, we could see that the CPU and disk utilization per transaction was increasing significantly.  We projected what this would mean for their peak season, and advised them when and what to upgrade in advance.  They took our advice and were fully prepared when their peak hit.
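A cost-per-transaction projection of this kind can be sketched as follows. This is a minimal illustration that assumes a straight-line trend; the monthly totals, the three-month horizon, and the peak volume are hypothetical, and the actual report may use a different model.

```python
def cost_per_transaction(cpu_seconds, transactions):
    """CPU seconds consumed per transaction for one reporting period."""
    return cpu_seconds / transactions

def linear_fit(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical monthly totals: (CPU seconds, transactions).
monthly = [(1000.0, 100_000), (1100.0, 100_000), (1200.0, 100_000), (1300.0, 100_000)]
costs = [cost_per_transaction(cpu, txns) for cpu, txns in monthly]

slope, intercept = linear_fit(costs)
months_ahead = 3                      # assumed distance to peak season
projected_cost = intercept + slope * (len(costs) - 1 + months_ahead)
peak_transactions = 150_000           # assumed peak-season volume
projected_cpu_seconds = projected_cost * peak_transactions

print(f"projected CPU seconds at peak: {projected_cpu_seconds:.0f}")
```

Comparing the projected figure with the CPU seconds actually available in the peak period indicates whether, and roughly when, an upgrade is needed.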

  5. PROBLEM: A client performed a revision-level upgrade of its transaction processing application. Unbeknownst to them, the upgrade had introduced a memory leak into their system. In one of our key graphs, "Detailed Memory Usage," the amount of memory consumed by the application clearly drifted continuously upward as the week progressed. Left unchecked, this problem would begin to threaten this bank’s sterling track record for processing ATM requests in a timely fashion.

    SOLUTION: Since Ban Bottlenecks is a proactive collaboration, we were able to notify our client of this development as part of our monthly web conference.  First, we noted the existence of the problem. Second, we stated that we had checked our historical database that we kept for them and confirmed that the problem was not in evidence in previous months. In fact, we were able to pinpoint the time the leak started: early in the morning on the first Sunday of the month. The customer responded that an upgrade of the application had been performed that morning.

    As a next step, we needed to "drill-down" into the application to determine which components were gobbling up memory. The TDI Toolkit contains a tool which does just that. We scheduled it to execute and collect process-level stats at successive intervals throughout a chosen day. We assembled a quick report from the data collected. Two processes – each connecting the server to another computer for "host authorization" – were clearly the culprits.

    As part of the service, we provide complete documentation and access to the TDI Toolkit®. We were able to give a quick lesson to our client’s technology staff on how to use the memory collection tool and how to interpret the data collected. The client contacted its main application vendor and notified them of the problem. The vendor was asked to identify the particular instructions within the identified processes that had changed in the upgrade. Using the Toolkit, our client and their vendor have a way to make a series of proposed modifications and test the effect on memory after each change.
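The drill-down described in this example, sampling process-level memory at successive intervals and looking for steady upward drift, can be sketched as below. This is not the TDI Toolkit's implementation; the process names, sample values, and threshold are hypothetical.

```python
def growth_per_interval(samples):
    """Least-squares slope of memory samples taken at equal intervals."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    return sxy / sxx

def suspected_leaks(samples_by_process, min_growth_kb):
    """Flag processes whose memory never shrinks and grows at least min_growth_kb per interval."""
    flagged = []
    for name, samples in samples_by_process.items():
        never_shrinks = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
        if never_shrinks and growth_per_interval(samples) >= min_growth_kb:
            flagged.append(name)
    return sorted(flagged)

# Hypothetical memory samples in KB, one per collection interval.
samples_by_process = {
    "host_auth_a": [100, 140, 180, 220, 260],
    "host_auth_b": [200, 230, 260, 290, 320],
    "logger": [50, 52, 49, 51, 50],
    "router": [80, 80, 80, 80, 80],
}
leaks = suspected_leaks(samples_by_process, min_growth_kb=10)
print(leaks)
```

Requiring both a positive slope and no shrinkage filters out processes whose memory simply fluctuates with load.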

  6. PROBLEM: A client operated a system that exhibited excessively high levels of CPU usage. When we began working with them, they were ready to sign off on an expensive system upgrade. We initiated Ban Bottlenecks coverage and began to analyze the information we collected. We immediately noticed something peculiar: while the vast majority of our clients have a very discernible business (or "demand") cycle, this new client was experiencing a consistently high level of CPU activity over a 24-hour period.

    SOLUTION: This problem needed to be tackled with urgency. As part of the standard Ban Bottlenecks offering, we produce an initial report set after the first complete week of collecting data. This information became the basis of our first teleconference with our new client. In the conference, we asked first to learn more about their business. What business functions did the server perform? What did a typical business day look like? We had identified that the consistently high CPU levels were due to unrelenting waves of TCP/IP requests. Were customers staying up all night asking for data?

    Working with us, the client was able to target and put a name to the potential source of this odd system behavior. Specifically, there was a single customer who had mastered the system. He had set up four or more home-based PCs to perform lengthy, repeated, automated queries involving the entire database. This "master number-cruncher" had continually ratcheted up his activity to the point where he began to jeopardize the response time for all other users. With this proof in hand, our client contacted this individual and began to work with him on modifying and streamlining his usage habits. As a result of this episode, our client has now tabled its plans for the expensive CPU upgrade.
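Tracing a flat, round-the-clock load back to its source amounts to counting requests per origin and checking how many hours of the day each origin is active. The sketch below assumes a request log reduced to (source, hour) pairs; the source names and volumes are hypothetical, and it does not represent the tooling used in this engagement.

```python
from collections import Counter

def top_sources(requests, limit):
    """Return (source, request count, share of all requests) for the busiest sources."""
    counts = Counter(source for source, _hour in requests)
    total = sum(counts.values())
    return [(source, n, n / total) for source, n in counts.most_common(limit)]

def active_hours(requests, source):
    """Number of distinct hours of the day in which the source sent requests."""
    return len({hour for src, hour in requests if src == source})

# Hypothetical one-day request log: (source, hour of day).
requests = (
    [("pc-heavy", hour) for hour in range(24) for _ in range(40)]
    + [("pc-office", hour) for hour in range(9, 17) for _ in range(5)]
    + [("pc-casual", 10)] * 20
)

top = top_sources(requests, 1)
busiest, count, share = top[0]
print(busiest, count, f"{share:.0%}", active_hours(requests, busiest))
```

A source that dominates the request count and is active in all 24 hours is a strong candidate for automated, rather than interactive, usage.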

  7. PROBLEM: By examining the response times associated with each transaction, we were able to alert a client to a debilitating cycle that had begun to take shape on its system. Each day, in a predictable pattern, the average response time recorded for all transactions would grow. Further research revealed an alarming increase in the system resources that were required to process an individual transaction. Our reports showed that external response time did not rise commensurately, so the problem must have been due to internal factors. After midnight, the problem disappeared. The same cycle was repeated the following day.

    SOLUTION: The client’s application vendor had – at our client’s request – added a "velocity check" to its online processing transaction path. In other words, a single customer would be prevented from performing more than X number of transactions per day. The check ran from midnight to midnight. The application vendor had instituted a read of the transaction log file to check for the presence of earlier transactions by the same customer. This log file contained two lengthy indices. There were well over 1,000,000 records in the file by the end of the day. Our knowledge of the inner workings of this particular operating system led us to comprehend the complexity of this task. By the end of the calendar day, a seemingly simple query of this nature could in fact need to traverse up to four internal index levels before hitting upon its target record.

    Using our reports as evidence, we demonstrated to our client the insidious nature of this newly added feature. We suggested they advise the application vendor to redesign the velocity check around the principle of avoiding the transaction log file. Specifically, we recommended they make use of a facility known as an "item index" file. These index-only constructs were perfect for the velocity check because they represent the leanest way possible to perform indexing activities on information that builds up quickly over a 24-hour period.

    The Ban Bottlenecks service is also capable of reporting on ad-hoc testing. In this case, we were able to use it to capture the "before and after" effect of our suggested change. The results clearly showed the efficacy of our approach vs. the withering impact of the previous attempt. As a result, the customer was able to implement the new approach without fear of negatively impacting system performance.
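The principle behind the redesign, keeping a small per-customer structure instead of searching the full transaction log, can be illustrated with an in-memory counter. This is a sketch of the idea only: it is not the vendor's code and does not model the operating system's "item index" file; the class name, customer ID, and limit are hypothetical.

```python
from collections import defaultdict
from datetime import date

class VelocityCheck:
    """Limit each customer to daily_limit transactions per calendar day."""

    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self._day = None
        self._counts = defaultdict(int)

    def allow(self, customer_id, today):
        # The check runs midnight to midnight, so reset when the date changes.
        if today != self._day:
            self._day = today
            self._counts.clear()
        if self._counts[customer_id] >= self.daily_limit:
            return False
        self._counts[customer_id] += 1
        return True

check = VelocityCheck(daily_limit=2)
monday = date(2009, 1, 5)
results = [check.allow("customer-1", monday) for _ in range(3)]
next_day = check.allow("customer-1", date(2009, 1, 6))
print(results, next_day)
```

Each check touches one small keyed entry, so its cost stays flat as the day's transaction volume grows, instead of rising with the size of the log.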

  8. PROBLEM: A client was undergoing sudden, tremendous growth through acquisition. They ran both pharmacy and POS transactions through a single, fault-tolerant server. They posed the question to us: would two servers – one running POS, the other pharmacy – hold them through their holiday season?

    SOLUTION: We asked our client to factor in all the acquisition activity and estimate for us the total volume increase they expected for the holidays. Simultaneously, we reviewed the behavior of their system over the previous years. [We maintain a complete 24-month system history for all our clients here in our offices.] We then examined the growth information provided by our client in the context of the system’s past behavior. We were able to extrapolate an answer very quickly: a quite emphatic ‘NO.’ We were able to prove that the current system would not be large enough to hold POS traffic alone throughout the holidays.

    This was a throat-clearing moment for our client. The acquisition budget called for one additional server to divide – at some point – POS and pharmacy traffic. Now they were being informed that one additional server might not cut it. The client asked us to dig deeper and determine whether anything could be done within the confines of the current system.

    We had the ability to go deeper because the Ban Bottlenecks service integrates information from the system and application levels of the operating environment. We decided a special analysis was in order, one designed to look at the relative performance of the POS vs. pharmacy applications. The results were quite striking: comparative analysis showed that each POS transaction was consuming 3x the CPU of a pharmacy transaction, with no apparent need! We urged our client to contact their application vendor to determine whether this glaring disparity was warranted or inadvertent.

    Sure enough, the vendor determined that the transaction path taken by the POS transaction set was convoluted and packed with "features" that could probably be dropped. The result was a 1:1 performance relationship between POS and pharmacy transactions which "bought back" enormous slices of CPU. By dividing the two classes up into two separate servers, our client made it through the holiday season that year without incident.

  9. PROBLEM: We monitor a worldwide network of servers for an international financial services client. Our "group diagnostic report" option allows us to provide a comparative assessment of the performance of one particular server vs. the rest. In one recent monthly review, a glaring anomaly shouted out at us as we prepared the report for this client: one of the servers had suddenly begun handling over 100 times the Ethernet volume of any other system on the network. In fact, it was 100 times more than this server had handled in any previous month. Of course, it takes CPU cycles to handle these Ethernet requests, so performance on the server as a whole had taken a major hit.

    SOLUTION: If there is no urgent message to deliver to the client, we normally hold a regularly scheduled web conference to review the report and discuss strategy and business. However, if we find alarming results while preparing a client’s report, it is part of our standard procedure to call the client and notify them immediately concerning the incident in question.

    In this case, we called our client straightaway, noted the discrepancies found with this particular server, and suggested they investigate the situation immediately. We were able to provide the offending server and port information. In short order, the client was able to locate an FTP operation that had spun out of control and had been flooding this server with uninterrupted streams of data for a number of days. The client called the owner of this process and got the job terminated. Each side instituted safeguards designed to prevent this situation from recurring.
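The comparative assessment in this example, one server measured against the rest of its group, can be sketched as a ratio to the median of its peers. This is a minimal illustration, not the "group diagnostic report" itself; the server names, volumes, and threshold factor are hypothetical.

```python
from statistics import median

def volume_outliers(volume_by_server, factor):
    """Flag servers whose volume is at least `factor` times the median of their peers."""
    flagged = {}
    for server, volume in volume_by_server.items():
        peers = [v for name, v in volume_by_server.items() if name != server]
        baseline = median(peers)
        if baseline > 0 and volume / baseline >= factor:
            flagged[server] = volume / baseline
    return flagged

# Hypothetical monthly Ethernet volume per server, in GB.
volume_by_server = {"nyc": 12.0, "lon": 10.0, "tok": 11.0, "syd": 1500.0}
flagged = volume_outliers(volume_by_server, factor=10)
print(flagged)
```

Using the median of the peers rather than the mean keeps a single runaway server from distorting the baseline it is compared against. The same ratio can be computed against the server's own previous months.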

  10. PROBLEM: A client had a new manager of its financial transaction processing system. A couple of weeks prior to that appointment, the client made a business decision to migrate off of its long-time system to a new, unproven architecture. The newly appointed manager assumed this move had been made to address some shortcoming with the current system in place. Since her internal turnover was less than complete, could we fill her in on the status of her installation?

    SOLUTION: Since our service covers a number of clients on similar operating environments solving similar problems, we were able to show this new manager that their installation was far from jeopardy status. To the contrary, they were actually one of our most stable clients: lots of free capacity; no critical incidents in months; and a smooth usage pattern that closely followed consumer demand. In fact, we made the comment that if all of our clients were this stable, our business would be rather boring.

    Our insight provided this manager with the information she needed to advocate a "go-slow approach" to her management regarding the move to a new, unproven architecture. Besides being a smart move for the company and its consumers, this decision also allowed the new manager to get her feet beneath her in terms of understanding the complexities and scope of her new role.

Please contact us for more information.

 
©Copyright 2009