Is your company preparing for a capacity test? Are you unsure how to prepare for it? This article proposes a few strategies to help companies address common issues: how to select business processes for a capacity test, the value of automation, and how to mitigate risk. A poorly planned and coordinated capacity test can prove costly and can have a deleterious impact on end-users' ability to carry out everyday tasks.
Preparing and planning for a capacity test are activities that can prevent many headaches for a company that is expecting to release or deploy an application. Capacity tests are time-consuming and require a great deal of expertise to successfully execute. Below are a few pointers that will help your organization prepare for a capacity test from the very beginning.
Follow the 80/20 Rule
Around the turn of the twentieth century, Italian economist Vilfredo Pareto observed that about 80 percent of Italy's wealth was held by 20 percent of its population, an observation that later gave the 80/20 rule its name. Pareto's work also led to the Pareto chart, which has played a pivotal role in statistical process control (SPC). Pareto analysis in SPC works like this: roughly 80 percent of problems stem from 20 percent of the causes, an idea captured in quality pioneer Joseph Juran's famous phrase "the vital few and the trivial many."
Pareto's principle applies to all types of capacity testing: stress, performance, volume, load, soak, and so on. Applying the 80/20 rule to capacity testing lets test managers focus on the 20 percent of business processes that are likely to cause the majority of problems, that is, the processes most likely to create bottlenecks, degrade performance, or drive traffic through the application.
An organization can also build Pareto charts together with stakeholders to identify and select the business processes for a capacity test. Stakeholders could include database administrators (DBAs), infrastructure engineers, middleware engineers, subject matter experts (SMEs), developers, and business owners.
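The selection step can be sketched in code. The Python below is a minimal sketch, not a prescribed method: it assumes each candidate business process has already been given a single traffic figure (the hypothetical `transactions_per_hour` field), ranks the processes by that figure, and keeps the smallest set whose cumulative share of traffic reaches 80 percent. All process names and volumes are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ProcessStats:
    name: str
    transactions_per_hour: int  # hypothetical traffic measure per business process


def vital_few(processes, threshold=0.80):
    """Return the smallest set of processes, ranked by traffic, whose
    cumulative share of total traffic reaches `threshold`."""
    ranked = sorted(processes, key=lambda p: p.transactions_per_hour, reverse=True)
    total = sum(p.transactions_per_hour for p in ranked)
    if total == 0:
        return []
    selected, cumulative = [], 0
    for p in ranked:
        selected.append(p.name)
        cumulative += p.transactions_per_hour
        if cumulative / total >= threshold:
            break
    return selected


# Illustrative volumes only; real figures would come from stakeholders or logs.
SAMPLE = [
    ProcessStats("create_sales_order", 5000),
    ProcessStats("post_invoice", 2500),
    ProcessStats("create_purchase_order", 1000),
    ProcessStats("update_vendor_master", 600),
    ProcessStats("run_sales_report", 400),
    ProcessStats("maintain_user_roles", 300),
    ProcessStats("change_customer_address", 200),
]

print(vital_few(SAMPLE))
```

With these numbers, three of the seven processes account for 85 percent of the traffic and would be the first candidates for scripting. Another ranking measure, such as error counts or CPU time, could be substituted for the traffic figure.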
Seasonal Surges in Demand
Many organizations forecast annual business volumes for processes such as purchase orders, invoices, and sales orders, and then convert those volumes into hourly averages. While this approach may correctly emulate average business volumes for a volume test, it overlooks surges caused by outliers or seasonal demand that deviate significantly from the average.
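The gap between an hourly average and a seasonal peak is easy to quantify. This sketch assumes round-the-clock operation in a non-leap year and uses hypothetical monthly invoice volumes for a retailer whose volumes climb in November and December; it compares the annual hourly average with each month's own hourly rate.

```python
# Days per month in a non-leap year; assumes round-the-clock (24x7) operation.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
HOURS_PER_YEAR = sum(DAYS_IN_MONTH) * 24  # 8760


def hourly_rates(monthly_volumes):
    """Return (annual hourly average, list of per-month hourly rates)."""
    annual_average = sum(monthly_volumes) / HOURS_PER_YEAR
    monthly = [v / (days * 24) for v, days in zip(monthly_volumes, DAYS_IN_MONTH)]
    return annual_average, monthly


# Hypothetical invoice volumes for a retailer with a holiday surge.
invoices = [60_000] * 10 + [90_000, 150_000]

average, monthly = hourly_rates(invoices)
peak = max(monthly)
print(f"annual hourly average: {average:.1f}")
print(f"peak month hourly rate: {peak:.1f} ({peak / average:.2f}x the average)")
```

Here the December hourly rate is roughly twice the annual hourly average, so a test sized on the average alone would never exercise the peak.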
Examples of surges in business volumes are:
- Insurance and hardware companies experience higher demand for their services and products after a natural disaster.
- Toy stores, retailers, and airlines face increased sales during the holiday season.
- A shipping company has to ship more packages after one of its competitors has a strike.
- Military facilities need to procure more equipment during times of war.
All of these examples converge on a single question: Do these entities have information systems capable of handling an increase in business volumes without degrading end-user response times or becoming inoperable?
The answer to this question is critical. When planning a capacity test, it is essential to discover the operating limits and breaking points of the application under test. Testing for average volume alone is not sufficient; also test for volumes that are 20 to 30 percent greater than the average. The application under test should be scalable and robust enough to handle unexpected or seasonal surges in demand without drastically affecting end-user response times.
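One way to turn this advice into concrete test targets is sketched below. It assumes a hypothetical average of 1,250 transactions per hour, computes the 20 and 30 percent uplifts, and builds a stepped ramp for probing the breaking point. The step size and the 200 percent ceiling are assumptions to adjust for each application.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def load_levels(average_per_hour: int, uplift_percents=(0, 20, 30)) -> dict:
    """Target hourly volume for each uplift over the average, rounded up."""
    return {p: ceil_div(average_per_hour * (100 + p), 100) for p in uplift_percents}


def stepped_ramp(average_per_hour: int, step_percent: int = 10, max_percent: int = 200) -> list:
    """Hourly volumes from 100% of the average up to max_percent in equal steps,
    for probing where response times degrade or the application breaks."""
    return [ceil_div(average_per_hour * pct, 100)
            for pct in range(100, max_percent + 1, step_percent)]


print(load_levels(1_250))
print(stepped_ramp(1_250, step_percent=25))
```

Integer ceiling division keeps every target a whole number of transactions and never rounds a target below the intended uplift.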
Automate for Repeatability and Consistency
Capacity tests frequently have to be repeated with multiple sets of data in order to fine-tune a system. The cycle works like this: a capacity test is conducted and problems are reported; the responsible parties identify the root cause of each problem and fix it; and the test engineer repeats the capacity test to confirm that the fix worked. This cycle of retesting and fixing can be time-consuming and resource-intensive, yet it is critical to successfully deploying or releasing an application.
Performing such a repetitive process manually may not be feasible. Automation is usually the most practical way to repeat capacity tests with multiple sets of data while emulating many end-users on just a few machines. Automation tools also produce graphs and charts that help pinpoint an application's bottlenecks and response-time degradation. Above all, automation makes it possible to repeat a capacity test consistently.
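The core of what a load-testing tool automates can be sketched with Python's standard library: a fixed pool of threads acts as concurrent virtual users, each data set is replayed under identical settings, and a fixed random seed keeps the record order the same from run to run. `submit_order` is a hypothetical stand-in for a real transaction against the application under test; this illustrates the idea and is not a substitute for a real load-testing tool.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def submit_order(order):
    """Hypothetical stand-in for one business transaction. Replace with a real
    call to the application under test; here it sleeps briefly and rejects
    orders with a zero quantity to simulate a failure."""
    time.sleep(0.005)
    return order["quantity"] > 0


def timed_call(order):
    start = time.perf_counter()
    ok = submit_order(order)
    return ok, time.perf_counter() - start


def run_scenario(data_set, virtual_users=5, seed=42):
    """Replay a data set with a fixed number of concurrent virtual users.
    A fixed seed shuffles the records into the same order on every run."""
    orders = list(data_set)
    random.Random(seed).shuffle(orders)
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        results = list(pool.map(timed_call, orders))
    timings = [elapsed for _, elapsed in results]
    return {
        "transactions": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        "median_seconds": statistics.median(timings),
    }


# Two hypothetical data sets, replayed with identical settings.
DATA_SETS = {
    "baseline": [{"id": i, "quantity": i % 5} for i in range(20)],
    "large_orders": [{"id": i, "quantity": 100 + i} for i in range(20)],
}

for name, data in DATA_SETS.items():
    print(name, run_scenario(data))
```

Because every run uses the same seed, user count, and data, a difference between two runs points to a change in the system rather than in the test.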
Some automation tools provide only black box measurements, such as end-user response times and timeouts, after a capacity test is performed, which makes it harder to pinpoint the cause of a particular problem. Organizations that expect to perform a capacity test should consider bringing in additional monitoring tools to collect white box measurements from inside the system, such as CPU, memory, disk, and database activity.
Black box measurements do not tell the whole story about why a transaction timed out or failed to complete within a given time frame, which forces the test engineer to interpret many graphs and charts together with other parties such as DBAs and network engineers. Companies conducting time-sensitive tests, which cannot afford to wait while black box measurements are interpreted, should combine these measurements with tools that report white box measurements.
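Combining the two kinds of measurement can start as simply as lining them up in time. The sketch below assumes the load tool exports (timestamp, response time) pairs and a server monitor exports (timestamp, CPU percent) pairs; it flags 60-second windows in which slow responses coincide with high CPU. The thresholds and sample data are hypothetical, and a coincidence in time only suggests where to look; it does not prove the cause.

```python
from collections import defaultdict


def bucket(timestamp, width):
    """Start of the fixed-width time window (in seconds) containing timestamp."""
    return int(timestamp // width) * width


def slow_and_busy(responses, cpu_samples, slow_seconds=2.0, busy_percent=85.0, width=60):
    """Flag time windows where slow responses coincide with high server CPU.

    responses:   (timestamp, response_seconds) pairs from the load tool (black box)
    cpu_samples: (timestamp, cpu_percent) pairs from server monitoring (white box)
    Returns (window_start, slow_response_count, peak_cpu_percent) tuples.
    """
    slow = defaultdict(int)
    for ts, seconds in responses:
        if seconds > slow_seconds:
            slow[bucket(ts, width)] += 1
    peak_cpu = {}
    for ts, pct in cpu_samples:
        w = bucket(ts, width)
        peak_cpu[w] = max(pct, peak_cpu.get(w, pct))
    return [(w, slow[w], peak_cpu[w])
            for w in sorted(slow)
            if peak_cpu.get(w, 0.0) >= busy_percent]


# Hypothetical samples; timestamps are seconds from the start of the test.
responses = [(0, 0.5), (10, 2.5), (65, 3.0), (70, 3.1), (130, 0.4)]
cpu_samples = [(5, 40.0), (62, 92.0), (125, 30.0)]
print(slow_and_busy(responses, cpu_samples))
```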
Identifying the Need
Many organizations struggle to define the basis for a capacity test because they lack test requirements. Many factors could affect an application's response time and performance, for instance:
- Inefficient SQL statements
- Hardware equipment
- Software version
- Underlying database system and database design
- Customizations to the application
- End-users added to the production environment since the previous release
- Running additional batch jobs
A company preparing to deploy a system needs to review these factors as well as documents such as service level agreements and end-user complaints about the application's response times. Anything that is likely to degrade an application's response time or performance deserves attention and testing.
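Where service level agreements state response-time targets, measured results can be checked against them directly. This sketch assumes each SLA is expressed as a percentile target (for example, 90 percent of transactions within 2 seconds) and uses the nearest-rank definition of a percentile, one of several in common use; the process names, targets, and timings are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_breaches(measurements, slas):
    """Compare measured response times with SLA targets.

    measurements: {process: [response_seconds, ...]}
    slas:         {process: (percentile, max_seconds)}, e.g. (90, 2.0) means
                  "90 percent of transactions complete within 2 seconds".
    Returns {process: (observed_seconds, max_seconds)} for each breach.
    """
    breaches = {}
    for process, (pct, limit) in slas.items():
        observed = percentile(measurements[process], pct)
        if observed > limit:
            breaches[process] = (observed, limit)
    return breaches


# Hypothetical measurements and SLA targets.
measurements = {
    "create_sales_order": [1.0, 1.2, 1.1, 1.3, 4.0, 1.0, 1.2, 1.1, 1.4, 1.5],
    "post_invoice": [2.0, 2.5, 3.5, 2.2, 3.8],
}
slas = {"create_sales_order": (90, 2.0), "post_invoice": (90, 3.0)}
print(sla_breaches(measurements, slas))
```

Note that a single 4.0-second outlier does not breach the sales-order target, because the SLA is stated as a percentile rather than a maximum.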
Fail to Plan and Consequently Plan to Fail
What are the consequences of a crashed server or an entire application going down? Would they cause financial harm to your company? What if the capacity tests were conducted carelessly and the application is deployed anyway?
Capacity testing is an inexact science, and its results can be unpredictable. A contingency plan that includes risk mitigation and manual workarounds is essential to preventing chaos if a capacity test produces unexpected results. List every potential negative consequence of conducting the capacity test and how each can be mitigated.
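A simple risk register makes this list actionable. The sketch below assumes a 1-to-5 scale for likelihood and impact and orders risks by their product, a common but by no means mandatory scoring scheme; the entries are hypothetical examples of consequences and mitigations.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    consequence: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)
    mitigation: str

    @property
    def exposure(self) -> int:
        """Simple likelihood x impact score used to order the register."""
        return self.likelihood * self.impact


def prioritize(risks):
    """Highest-exposure risks first, so their mitigations are prepared first."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


# Hypothetical entries; a real register would be built with the stakeholders.
RISKS = [
    Risk("Test traffic slows real users on a shared network", 3, 3,
         "Schedule the test outside business hours and notify users"),
    Risk("Shared database server crashes during the test", 2, 5,
         "Take a backup beforehand and keep a DBA on standby to restore it"),
    Risk("Test transactions reach external legacy interfaces", 2, 4,
         "Stub or disable the interfaces and agree on a manual cleanup procedure"),
]

for risk in prioritize(RISKS):
    print(f"{risk.exposure:>2}  {risk.consequence} -> {risk.mitigation}")
```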
State the Conditions Under Which the Tests Occur
Before starting a capacity test, it's essential to document under what conditions the test will be conducted. Conditions may include:
- Hardware specifications
- Software version and client type (thin or fat)
- Time of day when the test will be conducted
- Concurrent processes running in the background
- Number of end-users emulated and other end-users performing tasks in parallel to the capacity test
- Network segments
- Size and type of environment, and how closely the environment resembles the production environment
- Database size
The objective is to document precisely the conditions under which the test took place and under which its results were reported. Recording the test conditions can help troubleshoot end-user complaints and find their root cause.
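Recording the conditions in a structured, machine-readable form makes them hard to lose. The sketch below is one possible shape, assuming a Python dataclass serialized to JSON and stored alongside the results; the field names mirror the list of conditions, and every value is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CapacityTestConditions:
    """Conditions recorded alongside every capacity test result.
    Field names are illustrative; adapt them to your environment."""
    hardware: str
    software_version: str
    client_type: str            # "thin" or "fat"
    start_time: str             # ISO 8601 local time the test began
    emulated_users: int
    other_active_users: int     # real users working in parallel with the test
    environment: str            # size/type and how closely it mirrors production
    database_size_gb: float
    background_processes: list = field(default_factory=list)
    network_segments: list = field(default_factory=list)


conditions = CapacityTestConditions(
    hardware="2 x app server, 8 vCPU / 32 GB each",
    software_version="4.2.1",
    client_type="thin",
    start_time="2024-03-05T22:00",
    emulated_users=250,
    other_active_users=0,
    environment="staging, half the size of production",
    database_size_gb=480.0,
    background_processes=["nightly_invoice_batch"],
    network_segments=["HQ LAN"],
)

# Store this JSON next to the test results so the two are never separated.
record = json.dumps(asdict(conditions), indent=2)
print(record)
```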
For instance, an end-user may report problems with an application's response times that are incongruent with the test results reported for the capacity test. But by looking at the test conditions under which the capacity test was conducted, it may be possible to resolve the end-user's complaint by noticing that the end-user is working with a desktop/laptop that is significantly older than those used during the capacity test.
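This kind of comparison can be done mechanically once both sets of conditions are recorded as key-value pairs. The sketch assumes such records exist (the field names and values are hypothetical) and reports every field that differs or appears in only one of the two records.

```python
def condition_differences(tested, reported):
    """Return {field: (tested_value, reported_value)} for every field whose
    value differs or that appears in only one of the two records."""
    fields = sorted(tested.keys() | reported.keys())
    return {f: (tested.get(f), reported.get(f))
            for f in fields
            if tested.get(f) != reported.get(f)}


# Hypothetical records: capacity test conditions vs. the complaining user's setup.
tested = {"client_hardware": "quad-core, 16 GB RAM", "client_type": "thin",
          "network_segment": "HQ LAN"}
reported = {"client_hardware": "dual-core, 4 GB RAM", "client_type": "thin",
            "network_segment": "HQ LAN", "browser": "legacy"}

print(condition_differences(tested, reported))
```

Fields that match, such as the network segment here, drop out, leaving only the differences worth investigating.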
Stakeholders are very protective of the areas that they support. After a capacity test is conducted and results are reported, the stakeholders do not want to hear that the area they support is causing bottlenecks. For instance, developers do not want to believe that their programs have inefficient SQL statements that are causing performance problems. Similarly, the infrastructure team does not want to believe that the LAN/routers/switching is the chokepoint for the application under test, the DBA does not want to hear that the database system is not properly tuned, and so forth.
After a capacity test is conducted, results are reported, problems are identified, and much finger-pointing occurs over the root cause. Because stakeholders rarely want to be accountable for resolving a problem, organizations with a bureaucratic culture tend to stifle the tester and, consequently, the resolution of problems found during the capacity test. The test manager should play a pivotal role in helping the test engineer assign tasks and monitor progress on resolving the identified problems.
Many companies have an expert use automated test tools to simulate end-users and business processes before an application is deployed or released into production. One problem with having an automated test tool expert emulate traffic is that the expert may not understand the business processes of the application under test, or which of those processes generate the most traffic. Selecting business processes becomes even murkier for organizations that lack service level agreements (SLAs) or documentation of expected annual business volumes.
A recommended approach to deciding which business processes to select initially is for the test tool engineer to rely on the expertise of SMEs, business analysts, and middleware engineers. Selection criteria could include high CPU utilization, number of concurrent users, business criticality, interfaces to external legacy systems, high swapping and disk activity, paging, table scans, long disk queues, and high load averages.
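The selection criteria can be combined into a single ranking with a weighted score. The sketch below assumes stakeholders rate each candidate process from 0 to 5 on each criterion and agree on a weight per criterion; the weights, ratings, and process names are all hypothetical.

```python
# Hypothetical weights agreed with stakeholders; one per selection criterion.
CRITERIA_WEIGHTS = {
    "cpu_utilization": 3,
    "concurrent_users": 3,
    "business_criticality": 4,
    "legacy_interfaces": 2,
    "swapping_and_disk_activity": 2,
    "paging": 1,
    "table_scans": 1,
    "disk_queue_length": 1,
    "load_average": 1,
}


def score(ratings):
    """Weighted sum of 0-5 ratings; criteria without a rating count as 0."""
    return sum(weight * ratings.get(criterion, 0)
               for criterion, weight in CRITERIA_WEIGHTS.items())


def rank_processes(candidates):
    """Return (process, score) pairs, highest score first."""
    scored = [(name, score(ratings)) for name, ratings in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings collected from SMEs, business analysts, and middleware engineers.
CANDIDATES = {
    "update_vendor_address": {"cpu_utilization": 1, "concurrent_users": 1,
                              "business_criticality": 1},
    "create_sales_order": {"cpu_utilization": 4, "concurrent_users": 5,
                           "business_criticality": 5, "legacy_interfaces": 2,
                           "swapping_and_disk_activity": 3, "table_scans": 1},
    "month_end_close": {"cpu_utilization": 5, "concurrent_users": 1,
                        "business_criticality": 4, "legacy_interfaces": 1,
                        "swapping_and_disk_activity": 5, "table_scans": 4},
}

for name, total in rank_processes(CANDIDATES):
    print(f"{total:>3}  {name}")
```

The ranked list is a starting point for discussion with the stakeholders, not a verdict; the weights are worth revisiting if the ranking contradicts their experience.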
The test engineer should explain the selection criteria to the stakeholders. Once the initial business processes are selected, the test engineer can construct a Pareto chart to focus on the few processes that are the most likely to cause the majority of the traffic.
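Without a charting tool, a text Pareto chart is enough to discuss the selection with stakeholders. This sketch assumes hypothetical hourly volumes for the initially selected processes, sorts them in descending order, scales each bar to the largest volume, and marks the row where the cumulative share first reaches 80 percent.

```python
def pareto_chart(volumes, width=30, threshold=80.0):
    """Render a text Pareto chart: processes sorted by volume, a bar scaled to
    the largest volume, and the cumulative percentage of total traffic. The
    row where the cumulative share first reaches `threshold` is marked."""
    total = sum(volumes.values())
    if total == 0:
        return []
    ranked = sorted(volumes.items(), key=lambda kv: kv[1], reverse=True)
    largest = ranked[0][1]
    lines, cumulative, marked = [], 0, False
    for name, volume in ranked:
        cumulative += volume
        share = 100 * cumulative / total
        bar = "#" * round(width * volume / largest)
        marker = ""
        if not marked and share >= threshold:
            marker, marked = "  <- vital few end here", True
        lines.append(f"{name:<24}{bar:<{width}} {share:5.1f}%{marker}")
    return lines


# Hypothetical hourly volumes for the initially selected processes.
VOLUMES = {
    "create_sales_order": 5000, "post_invoice": 2500, "create_purchase_order": 1000,
    "update_vendor_master": 600, "run_sales_report": 400,
}
print("\n".join(pareto_chart(VOLUMES)))
```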
Integrating these pointers into the capacity test plan can help companies mitigate the risk of deploying or releasing an application with unacceptable response times, bottlenecks, and degradation points. They can also help the test manager develop a more comprehensive and robust strategy for the various types of capacity tests, and help organizations confront the uncertainties of capacity testing, which is, indeed, an inexact science that can produce unpredictable results.