
The Service
Summary
Ban Bottlenecks® is a performance and capacity audit service. For those organizations that don't have the talent, tools, or, most importantly,
time to implement a performance/capacity review, the Ban Bottlenecks service will do it with minimal impact on your systems and staff.
Ban Bottlenecks is a thorough "black-box"
analysis of your critical systems. It is
perfectly secure since it does not require
either physical or network access to your
systems.
It provides:
- Our free Toolkit for connectionless
high-resolution data collection with low
overhead.
- Coverage of all important objects, including processes
and communications ports.
- Business transaction traffic analysis with service level
reporting by service providers, transaction originators, and transaction authorizers.
- Automatic daily reports covering the business, CPUs,
memory, paging, disks, processes, and communications,
suitable for feeding SAS.
- Automatic weekly peak and problem analysis reports.
- Near-time ad-hoc analysis reports.
- Analyst-scored, easily-navigated, superbly-detailed,
customized Ban Bottlenecks reports with 24-month
historical and forward-looking perspectives.
- “Needle in the haystack” multi-system, multi-tier,
and/or multi-architecture problem determination.
- Web conferences to review the report, discuss our
findings, understand your technical plans, and learn
your business plans to plan for the future.
The ITIL Model: Continuous, Closed-Loop Process
Ban Bottlenecks follows a process very similar
to the IT Infrastructure Library definitions
for Capacity Management. It includes:
- Continuous measurement of the system
factors.
- Capture of the business traffic and
service levels received and provided by the
system.
- Correlation and projection of the
business and system factors.
- Frequent review of
the information provided with the operations,
applications, capacity/performance, and
systems teams.
- Review of the business traffic
and service levels with the business
representative, and incorporation of the
projected business plans.
- Monthly repetition of
these steps.
Architectures Supported
- HP NonStop (Tandem) running the NSK operating system and OSS.
- Stratus Continuum and V-series running the proprietary VOS operating system.
- UNIX variants: AIX, HP-UX, Solaris (SUN), and Red Hat Linux.
- Windows 2000 and later.
Features
The Ban Bottlenecks® Report
The Ban Bottlenecks report is a customized audit of a client's computer systems. It contains a tremendous amount of information on each system
and is designed to satisfy the needs of management, and the Business, Operations, Systems Administration, Capacity Planning, and Application departments within an organization.
The analyst-created scoring and reporting within the report allow the user to quickly find and focus on the issues affecting the systems.
It is HTML-based and delivered via email, FTP, or CD/DVD. It includes the following sections:
Executive Summary
The Executive Summary is a top-level view of all the systems. It highlights problems or issues with the systems, and provides top-level links and explanations. It may contain customized reports and charts documenting system and application data including transaction rates and service levels.
Group Diagnostics
Group diagnostic reports compare and contrast sets of systems. Systems may be grouped by architecture and/or application. The diagnostic reports are excellent at finding the problem areas or devices across large populations of systems, disks, communications interfaces, and application sets.
Emphasis Reporting
Some clients have clearly defined processing windows such as the "market open," "trading day," or batch processing periods. In such cases, additional reports and charts can be created for each window.
Individual System or "Cluster" Audits
Each system or cluster of systems will
receive an individual comprehensive audit
report.
Main Report
The main report begins with a Report Card
where the system is assigned a grade by our
analysts against a dozen or so system and
application factors. Each factor is graded A-F
according to our grading scale:
- A - Adequate
- B - Basically adequate
- C - Cause for concern
- D - Deficient - should be fixed
- F - Failure - caused a problem or outage
The Report Card includes links (when necessary) to an In-Depth or Incident Report.
Checkup Report
The Checkup Report is an inventory and configuration report of the system.
Diagnostic Report
The Diagnostic Report contains a thorough analysis of the system performance and capacity factors. Each factor is usually analyzed in at least three reports:
- Worst (high/low) or heaviest five days for the preceding month
- Worst or heaviest day each month for the last 24 months
- Average day for the last 24 months
The factors which usually appear in the report are:
- Application traffic
- Peak application half-hour
- CPU
- Memory
- Paging rates
- Paging space
- Disk I/O
- Disk space
- TCP/IP traffic
Application traffic is summarized by:
- 24-month history of the total for the month, average day, and peak day each month
- Year-to-date growth, month-to-month growth, and 12-month growth
- 24-month history of the peak half-hour traffic and growth
- Worst service level days
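As an illustration only, the month-to-month and 12-month growth figures listed above can be computed from a series of monthly traffic totals. The sketch below is not taken from the TDI Toolkit; the function names and the input layout (a list of monthly totals, oldest first) are assumptions. Year-to-date growth is left out because its exact definition is not stated here.

```python
from typing import Dict, Sequence


def pct_growth(current: float, previous: float) -> float:
    """Percentage change from `previous` to `current`."""
    if previous == 0:
        raise ValueError("previous period has no traffic; growth is undefined")
    return (current - previous) * 100.0 / previous


def growth_summary(monthly_totals: Sequence[float]) -> Dict[str, float]:
    """Growth of the latest month against the prior month and the same month a year ago.

    `monthly_totals` is a hypothetical input: one traffic total per month,
    oldest first, with the latest month as the last entry.
    """
    if len(monthly_totals) < 13:
        raise ValueError("need at least 13 months of totals")
    latest = monthly_totals[-1]
    return {
        "month_to_month": pct_growth(latest, monthly_totals[-2]),
        "twelve_month": pct_growth(latest, monthly_totals[-13]),
    }
```

Calling `growth_summary` on 24 monthly totals returns both percentages for the most recent month; the same helper can be applied to the peak-day or peak half-hour series.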
Peak Reports
The Peak Reports, in conjunction with the In-Depth Reports, provide an additional level of historical detail on the operation of the system. The Peak Reports contain a 24-month history of the peak half-hour by month for the following factors:
- Application traffic
- CPU
- Disk usage
- TCP/IP traffic in and out
- OS transaction statistics when available
- File operations when available
Each report includes:
- 24-month history and month-to-month and 12-month growth analysis
- 24-month history of the computer profile during the peak
- 24-month history of the computer profile per TPS or transaction
- 24-month history of the utilization by processor
- Disk busy/queuing/cache utilization analysis of current peak
- TCP/IP throughput analysis of the current peak
- Legacy/proprietary protocol throughput analysis of the current peak
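A minimal sketch of how a peak half-hour and a "profile per TPS" figure might be derived from one day of per-minute samples follows. It assumes fixed clock half-hours (00:00, 00:30, and so on) and a hypothetical sample layout of (minute of day, transactions, CPU busy percent); neither assumption comes from the Toolkit itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical sample layout: (minute_of_day 0..1439, transactions, cpu_busy_pct)
Sample = Tuple[int, int, float]


def peak_half_hour(samples: Iterable[Sample]) -> Dict[str, object]:
    """Find the clock half-hour with the most transactions and profile it."""
    tx_by_bucket = defaultdict(int)
    cpu_by_bucket = defaultdict(list)
    for minute, transactions, cpu_pct in samples:
        bucket = minute // 30  # 48 half-hour buckets per day
        tx_by_bucket[bucket] += transactions
        cpu_by_bucket[bucket].append(cpu_pct)
    if not tx_by_bucket:
        raise ValueError("no samples supplied")

    peak = max(tx_by_bucket, key=tx_by_bucket.get)
    # Assumes the bucket is fully sampled: 30 minutes = 1800 seconds.
    avg_tps = tx_by_bucket[peak] / 1800.0
    avg_cpu = sum(cpu_by_bucket[peak]) / len(cpu_by_bucket[peak])
    return {
        "start": f"{peak // 2:02d}:{(peak % 2) * 30:02d}",
        "transactions": tx_by_bucket[peak],
        "avg_tps": avg_tps,
        "avg_cpu_pct": avg_cpu,
        # CPU percent consumed per transaction-per-second during the peak
        "cpu_pct_per_tps": avg_cpu / avg_tps if avg_tps else None,
    }
```

Tracking `cpu_pct_per_tps` month over month is one way to see whether the cost of a transaction is drifting, independent of traffic growth.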
In-Depth Reports
In-Depth reports show the system in extreme detail during the peaks shown above. The In-Depth Reports are in HTML and CSV format, and show every
process, processor, memory, disk, communications interface, etc. during the peak half-hour selected.
Incident Reports
Incident reports discuss problems discovered during the analyst's review of the system. The report discusses the problem or issue, provides additional data through links to reports or charts, and may include a further analysis of the system and processes similar to the In-Depth report. The Incident Report may also include a recommendation on fixing the problem or avoiding it in the future.
Last Month Charts
The Last Month charts are a suite of charts showing daily totals, averages, and distributions for the various system factors for the preceding month.
Historical Charts
The Historical charts are a suite of charts showing daily totals, averages, and distributions for the various system factors for the preceding 18 months.
Detailed Charts
The Detailed charts are a suite of charts showing totals, averages, and distributions for the various system factors for a selected week each month. The Detailed charts typically show the week by half-hour intervals.
The Ban Bottlenecks® Web Conference
System management and capacity planning cannot exist in a vacuum. For this reason we schedule a monthly web conference with our clients to discuss the report and the business. A typical agenda may include:
- A review of the report, with focus on business growth and any problems or issues which we may have found.
- A discussion with the client to elicit any problems or issues they may be concerned about.
- A discussion with the client about plans for the system and application.
- A discussion with the client about business plans such as marketing promotions, mergers/acquisitions/divestitures, and any other factor which could affect the business traffic through the system.
The Ban Bottlenecks® Process
One-Time or Ongoing
Ban Bottlenecks is available as a one-time process. In order to get a representative view of the system under investigation we recommend that the process include data collection over at least a one-month period. During this time we will typically produce two to three reports (audits) with web conferences to discuss our findings.
Most of our clients work with us on a continuing basis, with reports on an arranged schedule. This enables us to collect and report a 24-month history of the system, with comprehensive growth analysis.
We don't:
- Modify the OS or application.
- Fix anything. We're strictly an
advisory service.
Basic and Full Report Schedules
The Ban Bottlenecks report is available in Basic and Full versions. Basic reports don't include the Peak Report, the Incident/In-Depth Reports, or Historical or Detailed Chart sets. They do not include review by our analyst. While most of our clients request Full reports each month, others elect to receive as few as four Full reports per year. Clients with multiple systems frequently institute a rolling schedule specifying which systems are reported in Full each month.
Clients receive at least a Basic report on each system each month.
Pricing
Ban Bottlenecks is a fixed-price service. It is extremely affordable, and in many cases it is less than the cost of competing software-only products. Any cost-justification of Ban Bottlenecks must also include the significant staff time reduction that Ban Bottlenecks provides.
The Ban Bottlenecks service includes a license of the TDI Toolkit® data collection/reporting software at no extra charge.
Ban Bottlenecks pricing is dependent on the number and type of systems, the operating systems used, and the size of the systems. There may be an additional charge for custom Executive Summary reports and application-specific reports.
Data Collection
Agent: TDI Toolkit®
The TDI Toolkit is the proprietary agent which is installed on each system under review. It is provided at no additional cost as part of the service. It is a process or service on the box which does its work with minimal impact. It does not require modification of the OS or the application.
"Don't become part of the problem!" is a rule which we strictly observe.
Data Objects Collected
We require the use of the proprietary TDI Toolkit because we collect data on more objects, in greater detail, and at a higher frequency than most other agents. This gives us the ability to quickly discover the cause and impact of transient or intermittent problems or software "features."
Data collection on the objects listed below is architecture-dependent. Collection intervals are designed to provide the most information with the least overhead. The most important objects may be sampled each minute. Others may be sampled at intervals up to once every thirty minutes.
- System CPU
- Individual processors/cores/processor
threads
- Physical memory
- Virtual memory
- Paging activity and paging space
- Physical disk activity and space
- Logical disk activity and space
- Disk cache size and usage
- File system activity and space
- File activity
- Ethernet traffic by individual physical port (packets, bytes, errors)
- Legacy communications by individual physical port (BSC, X.25/LAPB, SNA/SDLC, Token Ring, Expand, ServerNet) (packets, bytes, errors)
- All processes, including transients
- Threads
- Process queues (HP NonStop and Stratus VOS)
- OS statistics such as logins, system data transfer rates, context switches, system/exec calls, processor and disk queuing, forks
- SQL Server
On-System Reports
The TDI Toolkit includes a daily batch job which creates on-box daily summary and detailed reports. These reports are created in text and/or HTML and CSV formats. These reports are not the Ban Bottlenecks report. They are a summary of what occurred on the box during the previous day. The daily job runs at a low priority, usually in the early morning, when it will have a minimal impact on the system. It takes from 1 to 10 minutes to do its work.
The daily job can be configured to copy the reports to a shared disk for access by support teams.
Application data reporting is provided for custom and industry-standard applications. As a matter of best practice we always attempt to relate the business traffic to the system usage. Whenever possible we also collect and report service levels and success rates. We do not modify the application. We provide report programs which look at the previous day's log.
- Banking and point of sale (Base24, ON/2, Open/2, Connix)
- Securities industry (ticker plant and trading systems)
- Messaging (Omnimessaging, Network Express)
- Data transfer (Data Express)
- Webserver traffic
- Warehouse management
Weekly Extract
Once a week the daily job will run a series of additional reports and then create an extract file (packed or zipped) which must be sent to us via email or FTP. The extract includes:
- The tdi_checkup report, which is an inventory and status of the system
- An extract of the daily summary CSV reports
- A set of In-Depth reports showing the system during various peak events. These peak events may include:
  - Application traffic peak
  - Ethernet traffic peak
  - CPU peak
  - Disk I/O peak
  - Page fault peak
  - Page eviction peak
  - TMF peak
- An extract of the detailed CSV reports for a selected week, typically the week for the application peak
Ad-hoc reports
The TDI Toolkit is available for ad-hoc reporting by client staff. The utilities may be used to investigate issues, including (for most architectures) issues which have happened in the last couple of minutes.
Unattended operation
The TDI Toolkit is designed for unattended operation. This means:
- The batch processing schedules itself
- Files are cleaned up automatically
- The Toolkit restarts itself after a
system reboot
For some architectures we request that the
Toolkit be configured as a "persistent"
process, and/or be included in normal site
monitoring to ensure that it is up. The only
attention required is the transmission of the
weekly extract.
Benefits
Ban Bottlenecks is intended for organizations that need to implement a stronger system management program for their most critical systems. Most organizations have departments tasked with functions similar to those we provide through Ban Bottlenecks. However, with a population of hundreds or thousands of systems to manage, these departments find themselves short of time or other resources when challenged to implement a complete, proactive program based on business traffic.
Typically, customer-facing systems or high dollar-volume systems such as banking ATM, retail POS, customer loyalty, pharmacy, and brokerage require the attention to detail and forward-looking orientation that Ban Bottlenecks provides.
Ban Bottlenecks provides, as a service:
This results in:
Problem identification
Because of what we do and how we do it, we find problems in supposedly mature, stable systems. Our techniques examine every minute of every day, looking for unusual and out-of-pattern use of the various system factors. Since we are capturing and analyzing down to the individual disk, process, and comm interface we will find things like:
- Intermittent, self-correcting application issues which may cause short disruptions and be hard to find.
- Operations problems or errors, such as running backups or batch at the wrong time.
- "Window" issues, such as batch jobs beginning to encroach on the peak online periods.
- Service level problems. These problems may be:
  - Internal, caused by conflicts with other applications on the system
  - External, caused by poor service from back-ends
  - Growth- or capacity-related, where the system is starting to hit a bottleneck
- Customer-induced problems, such as
broken or looping scripted customer
interactions.
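The method used to spot out-of-pattern behavior is not described here. Purely as an illustration of the idea, the sketch below flags the minutes of a day whose value departs sharply from the same minute on earlier days, using a simple z-score. The function name, the input layout, and the threshold of 3.0 are all assumptions rather than the analysts' actual technique.

```python
from statistics import mean, stdev
from typing import List, Sequence


def out_of_pattern(
    history: Sequence[Sequence[float]],
    today: Sequence[float],
    z_limit: float = 3.0,
) -> List[int]:
    """Return the indices of today's samples that break the historical pattern.

    `history` holds one sequence per earlier day (at least two days), each
    aligned sample-for-sample with `today`. This is a stand-in technique.
    """
    if len(history) < 2:
        raise ValueError("need at least two earlier days for a baseline")
    flagged = []
    for i, value in enumerate(today):
        past = [day[i] for day in history]
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # A perfectly flat baseline: any different value is out of pattern.
            if value != mu:
                flagged.append(i)
        elif abs(value - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged
```

A flagged index points the analyst to a minute worth examining in detail; it does not by itself identify the cause.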
Business trend analysis
Whenever possible we capture statistics
directly related to the business traffic being
processed through the system. We do this
without modifying the application. We provide
a program which examines "yesterday's"
application log. When possible, we extract and
analyze:
- Transaction (as defined by the
application) counts and rates
- Service levels and response times
- Success or denial rates
- Transactions and response by transaction
originator and/or authorizer
- Transactions by type or product
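To make the idea concrete, here is a minimal sketch of a report program that reads the previous day's log and summarizes it by transaction originator. Real application logs differ by product; the CSV layout used here (columns `originator`, `status`, `response_ms`) and the status value `approved` are hypothetical and chosen only for the example.

```python
import csv
import io
from collections import defaultdict
from typing import Dict


def summarize_log(log_text: str) -> Dict[str, Dict[str, float]]:
    """Count transactions and compute success rate and mean response time per originator.

    Assumes a hypothetical CSV log with a header row containing the columns
    `originator`, `status`, and `response_ms`.
    """
    stats = defaultdict(lambda: {"count": 0, "approved": 0, "total_ms": 0.0})
    for row in csv.DictReader(io.StringIO(log_text)):
        entry = stats[row["originator"]]
        entry["count"] += 1
        if row["status"] == "approved":
            entry["approved"] += 1
        entry["total_ms"] += float(row["response_ms"])

    return {
        originator: {
            "count": entry["count"],
            "success_rate": entry["approved"] / entry["count"],
            "avg_response_ms": entry["total_ms"] / entry["count"],
        }
        for originator, entry in stats.items()
    }
```

Because the program only reads a finished log, the application itself is left untouched, which matches the approach described above.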
Capacity planning
The bottom line is that Ban Bottlenecks
provides the client with a forward-looking
view of the systems. In practice, we give three
to four months' advance warning if we discover
that the system is approaching a bottleneck.
We identify the bottleneck and recommend
action, and after the action is taken we
confirm that the bottleneck is fixed.
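One simple way to estimate how far away a bottleneck is, shown here only as a sketch, is to fit a straight line to the monthly peak values of a factor and see when that line reaches a limit. A linear trend, the function name, and the choice of limit are assumptions; an analyst's projection would also take business plans into account.

```python
from typing import Optional, Sequence


def months_until_threshold(
    monthly_peaks: Sequence[float], threshold: float
) -> Optional[float]:
    """Months from the latest data point until a least-squares trend line reaches `threshold`.

    Returns None when the trend is flat or falling, and 0.0 when the trend
    line is already at or above the threshold.
    """
    n = len(monthly_peaks)
    if n < 2:
        raise ValueError("need at least two months of data")
    mean_x = (n - 1) / 2.0
    mean_y = sum(monthly_peaks) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(monthly_peaks))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope  # month index where the line hits the limit
    return max(0.0, crossing - (n - 1))
```

For example, monthly CPU peaks rising steadily by two points per month from 50 to 60 percent would reach a 68 percent limit about four months after the latest measurement.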
Continuous service improvement
The ongoing Ban Bottlenecks report service
implements a best practice of discovery,
repair, and confirmation of the effect of the
repair. In practice, the collaboration between
our analysts and the client teams produces
systems that run cleanly and predictably,
implementing "no surprises" processing.
High availability
Non-reactive capacity planning combined
with continuous service improvement results in
improved uptime.
Increased staff effectiveness
The Ban Bottlenecks process relieves the
client staff from many time-consuming chores.
Performance/capacity data collection,
presentation, and analysis is at best tedious
and at worst so time-consuming that it may not
be possible to motivate the team to do the job
as effectively as required. Our experience
with clients is that the performance and
systems teams are happy to have the "dirty
work" performed for them, so that they can
concentrate on their core business of system
management and improvement.
Capital budget planning
Lastly, Ban Bottlenecks provides all the
information that a client needs to plan its
capital budget. The proactive view that Ban
Bottlenecks provides allows the client to
anticipate growth and to plan for just-in-time
system improvements. The Ban Bottlenecks
report enables the client to relate system,
device, and communications usage to the
business traffic.
Contact us for a sample web conference and report.