Python - RRDTool Utilities (module and scripts for RRDs)

April 29, 2008 – 8:37 am

I started a project on Google Code to create a set of Python tools to make dealing with Round Robin Databases (RRDs) less painful. Setting up RRDs can be tough if you don't know what you are doing.

Anyone interested can check it out here: rrdpy

I used it to create a simple HTTP monitoring script (included in the source) to graph web response latency.
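
For a feel of what such a monitoring script has to measure, here is a minimal sketch of the timing step only. It is not taken from rrdpy: the name `measure_latency` and the `opener` parameter are hypothetical, chosen for illustration, and the step that stores the value in an RRD is left out.

```python
import time
import urllib.request


def measure_latency(url, opener=urllib.request.urlopen, timeout=10.0):
    """Return the seconds taken to fetch url and read the full response body.

    `opener` is a hypothetical hook (not part of rrdpy) so that the timing
    logic can be exercised without a network connection.
    """
    start = time.perf_counter()
    with opener(url, timeout=timeout) as response:
        response.read()  # include the body transfer in the measurement
    return time.perf_counter() - start
```

Calling `measure_latency("http://example.com/")` at a fixed interval (the URL is a placeholder) would give the series of values that ends up in the graph.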

Multiple Dimensions of Performance Testing

April 24, 2008 – 6:18 pm

Almost all experts agree that pre-deployment "waterfall" performance testing (which, together with the record/playback method, many people mistake for performance testing itself) is not enough: too little, too late. It is actually just one very specific way of performance testing. There is a full spectrum of other approaches, but they are used so infrequently (at least as intentional performance testing techniques) that I don't recall finding any good classification. Thinking about it, I see several dimensions of performance testing which, although definitely correlated, can probably be considered somewhat independently. Of course, this is just a raw idea for the moment, an effort to order my thoughts a little.

Solaris Performance Primer: Introduction

April 23, 2008 – 9:38 am

This web log is the start of a small series on Solaris performance monitoring tools. We are going to present freely available tools from Solaris, the Tools CD and the DTrace Toolkit.

This blog will help the impatient ones who want to solve 90% of their day-to-day performance issues without long studies.

Being able to scale quickly is pivotal in the web age, where the load can grow by an order of magnitude overnight. Our little primer will be released in small chapters that apply the following methodology:

  • Chapter 2: Document your performance issue: How to instrument your data center with dimSTAT to get an idea of when the performance problems happened.
  • Chapter 3: What are the features and limits of my system? How to analyze your Solaris installation, patches, etc. Quantify the abilities of your hardware: How many processors, disks, how much memory, etc. do I have?
  • Chapter 4: What is my system actually doing? Checking out your processes, top users and resource consumption. Understand your system utilization.
  • Chapter 5: What is my process doing? Tools for process introspection. Learn which files are being used, which libraries are being loaded, which call stacks you're currently using, etc.
  • Chapter 6: Monitoring the processes and threads in projects and zones with prstat.
  • Chapter 7: Understand your IO: Who is writing how much to where? How is the disk subsystem doing?
  • Chapter 8: Understanding network traffic: Which network interface is working how hard? Is my interface overloaded? etc.
  • Chapter 9: Tracing at large: How to get information about current system calls. How to use DTrace to answer common questions like: Who is writing to which file right now?

This upcoming sequence of blogs will hopefully answer the most common performance questions with standard tools available for Solaris.

Other sites which have Solaris Performance related information are: 

  • Those who are interested in a more complete (and professional) resource may want to consider Darryl Gove's book "Solaris Application Programming". A great chapter detailing the system tools is available for free online.
  • solarisinternals.com: A comprehensive wiki with available tools and best-practice documents.
  • Solaris Performance and Tools: a book by Richard McDougall, Jim Mauro and Brendan Gregg from July 2006.
     

    The content of this document was created by Thomas Bastian, a performance tuning specialist who gained his knowledge through many performance projects with Sun Microsystems' key software partners in EMEA.

Adding dates to vmstat logs

April 23, 2008 – 7:07 am

One of the problems I have with vmstat is that it doesn’t have the option to output timestamps (the way iostat does).

As a result, if you have vmstat logs and you don’t know exactly when vmstat was run or at what interval it was collecting data, the data is less meaningful, or at least it becomes a hassle to track that information down.

It is easy enough to add the date, using a simple shell script. Create a file, add_date.sh, containing the following:

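#!/bin/bash
# Prefix every line read from stdin with the current date and time.

while read -r data; do
        # $data is deliberately unquoted: runs of whitespace collapse to single spaces
        echo "$(date '+%m/%d/%Y %H:%M:%S')" $data
done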

Then, execute vmstat as you normally would, but pipe the output through our shell script, add_date.sh:

-bash-3.00$ vmstat 5 5 | ./add_date.sh
04/23/2008 10:02:38 kthr memory page disk faults cpu
04/23/2008 10:02:38 r b w swap free re mf pi po fr de sr cd cd f0 s0 in sy cs us sy id
04/23/2008 10:02:38 0 0 0 3760076 2099016 1 2 5 1 1 0 1 1 251 -0 -0 402 150 237 1 0 99
04/23/2008 10:02:43 0 0 0 3518948 2258028 10 126 0 0 0 0 0 0 0 0 0 443 207 285 0 0 100
04/23/2008 10:02:48 0 0 0 3518948 2258048 3 38 0 0 0 0 0 0 0 0 0 433 110 281 0 0 100
04/23/2008 10:02:53 0 0 0 3518948 2258048 3 38 0 0 0 0 0 0 0 0 0 422 111 273 0 1 99
04/23/2008 10:02:58 0 0 0 3518948 2258048 3 38 0 0 0 0 0 0 0 0 0 446 110 282 0 0 100

Now you have vmstat data with timestamps!
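
If you would rather keep vmstat's column alignment, the same filter can be sketched in Python. This is only an illustration of the idea above, not a replacement for the shell script; the names `add_timestamps` and `main`, and the `now` parameter, are my own assumptions.

```python
import sys
from datetime import datetime


def add_timestamps(lines, now=datetime.now):
    """Yield each input line prefixed with the time at which it was read."""
    for line in lines:
        stamp = now().strftime("%m/%d/%Y %H:%M:%S")
        # Strip only the trailing newline so the original column spacing survives.
        yield "%s %s" % (stamp, line.rstrip("\n"))


def main():
    """Filter stdin to stdout, one stamped line per input line."""
    for stamped in add_timestamps(sys.stdin):
        print(stamped, flush=True)  # flush so each sample shows up immediately
```

Saved as, say, add_date.py with a call to `main()` at the bottom, it would be used the same way: `vmstat 5 5 | python add_date.py`.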

Roll Your Own SiteScope, a Simple Alternative

April 23, 2008 – 1:06 am

In working with SiteScope of late, I’ve found that it doesn’t always collect performance metrics the way I want it to. More importantly, it can often turn a simple monitoring activity into a complex disaster. Take monitoring via JMX, for example. SiteScope has a rather complicated (and sometimes broken) interface for communicating with a busy MBean Server. You can quite easily roll your own JMX monitor using open source tools in about 65 lines of code, as I demonstrated here.

But we still all use tools like LoadRunner in these commercial 9-5 contracts, right? Wouldn’t it be nice if you could roll your own custom monitors in Ruby, Perl or whatever language you like, store that data in a simple repository, let’s say a MySQL database, and still be able to hook into those metrics from a LoadRunner Controller during test execution?

It is possible, with one PHP file and a simple WAMP (or LAMP) installation all wrapped up in a SiteScope-like alternative.
(more…)

StickyMinds.com Weekly Column: Peeling the Performance Onion

April 21, 2008 – 3:06 am

Performance tuning is often a frustrating process, especially when you remove one bottleneck after another with little performance improvement. Danny Faught and Rex Black describe the reasons why this happens and how to avoid getting into that situation. They also discuss why you can't work on performance without also dealing with reliability and robustness.

Hands-off Load Testing with JMeter and Ant

April 18, 2008 – 12:00 pm

In a recent post, automation expert Paul Duvall highlights the value of earlier and continuous integration of load tests throughout the development cycle and presents simple step-by-step techniques for creating a scheduled integration build that runs JMeter tests.

By Alexander Olaru

Java Memory Leaks

April 17, 2008 – 7:20 pm

Finding (and fixing) memory leaks in Java can sometimes be a tricky process. I recently had to do some analysis on a leaking application and hit a snag.

My game plan was to do the following:

  • Capture heap dumps with jmap
  • Analyze the dumps with jhat
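
As a rough sketch of that plan, the two steps can be scripted as below. The helper names and the `heap.bin` file name are hypothetical, and the `-dump:format=b,file=...` option assumes a JDK whose jmap supports it (JDK 6 or later); this only shows the intended sequence, not the problem I ran into.

```python
import subprocess


def jmap_dump_command(pid, dump_file):
    """Build the jmap command that writes a binary heap dump of a running JVM."""
    return ["jmap", "-dump:format=b,file=%s" % dump_file, str(pid)]


def jhat_command(dump_file):
    """Build the jhat command that parses a heap dump for browsing."""
    return ["jhat", dump_file]


def capture_and_analyze(pid, dump_file="heap.bin"):
    """Run both steps in order; needs jmap and jhat from a JDK on the PATH."""
    subprocess.check_call(jmap_dump_command(pid, dump_file))
    # jhat keeps running and serves the parsed dump until it is interrupted.
    subprocess.check_call(jhat_command(dump_file))
```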

Along the way, I ran into a problem with jmap and discovered a couple of new tools.
Read the rest of this entry »

Regular Expressions in LoadRunner

April 15, 2008 – 7:24 pm

Dmitry Motevich posted a challenge for getting regular expressions working in LoadRunner. Regular expressions in C aren’t pretty, but here it is:
Read the rest of this entry »

Podcast: "Diving into Capacity Planning"

April 11, 2008 – 11:46 am

A podcast that I did for TeamQuest Corporation back in December is now available. It's a somewhat unconventional take on the motivations for doing CaP, based on taking into account the apparently frustrating but otherwise very realistic perspective of management. During the podcast, I refer to the CMG keynote given by Jerred Ruble (CEO of TeamQuest Corp.), entitled "Is Capacity Planning Still Relevant?"

Simple registration is required to download the 25 MB mp3 file. This podcast also gives you an idea of some of the things we will be covering in the Guerrilla Boot Camp class on April 28-29, 2008.

Automation for the people: Hands-off load testing

April 8, 2008 – 9:00 am

Load testing is often relegated to late-cycle activities, but it doesn't need to be that way. In this installment of Automation for the people, automation expert Paul Duvall describes how you can discover and fix problems throughout the development cycle by creating a scheduled integration build that runs JMeter tests.

Testing for performance

April 3, 2008 – 3:53 am

The third and final article in my SearchSoftwareQuality.com Testing for Performance series has been posted. The complete series can be found here:

  • Testing for performance, part 1: Assess the problem space
  • Testing for performance, part 2: Build out the test assets
  • Testing for performance, part 3: Provide information

20 New Rules for Faster Web Pages

March 29, 2008 – 9:31 am

Update: There is a nice explanation in "The importance of bandwidth versus latency" of how long latencies cause cascading delays in resource loading. Doloto tries to optimize how resources are loaded.

Twenty new rules have been added to the original 14 rules for sizzling web performance. Part of scalability is worrying about performance too. The front-end is where 80-90% of end-user response time is spent and following these best practices improved the performance of Yahoo! properties by 25-50%. The rules are divided into server, content, cookie, JavaScript, CSS, images, and mobile categories. The new rules are:

read more

Pylot Dev Update - Web Performance - Release 1.0

March 19, 2008 – 7:16 am

Finally did the version 1.0 release! Visit www.pylot.org to download.

Pylot is still lacking some features I want to add for it to become a serious performance/load testing tool, but the current release delivers very usable functionality.

Current Features:

  • multi-threaded load generator
  • HTTP and HTTPS (SSL) support
  • response verification with regular expressions
  • execution/monitoring console
  • real-time stats
  • results reports with graphs
  • GUI mode
  • shell/console mode
  • cross-platform
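
To give an idea of what "response verification with regular expressions" means in practice, here is a small sketch. It is not Pylot's actual code or API; the function name `verify_response` is an assumption made for illustration.

```python
import re


def verify_response(body, pattern):
    """Return True if the regular expression matches anywhere in the response body."""
    return re.search(pattern, body) is not None
```

A load test would call this on each response and count a failed match as an error, even when the HTTP status code looks fine.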

Aside from the GUI, there is also a new shell/console interface mode with real-time output for quickly profiling the performance of your application/service under test from the command line. In this mode, Pylot can run cross-platform (tested on Windows XP, Vista, Cygwin, Ubuntu, MacOS).

Note: Extra special thanks to Vasil Vangelovski for implementing the original console output and C++ extension


[Screenshots: GUI and new shell/console UI output]

Performance Models: Software versus System

March 8, 2008 – 9:52 am

JXInsight is a comprehensive performance management and problem diagnostics solution that, unlike most competing solutions, can be used across all application life cycle phases, from development through to production. Unfortunately, this benefit presents its own set of issues to new users, who must select the type and degree of instrumentation and measurement for each phase (and environment).

The main issue we see with new users, especially those who have not taken our software performance engineering training course, is the inappropriate use of fine-grained tracing to profile an application in production, which incurs more overhead than is required to manage the application's performance effectively. Tracing, especially when contextual, is important for understanding the execution flow patterns within an application, but the degree of tracing should be trimmed back as an application moves from one phase to the next, as long as enough tracing remains to link the execution flow patterns captured in one phase to those captured in another.

By the time an application gets into production, contextual call tracing should be limited to the main entry points (inbound requests) and exit points (outbound requests) within a process. If you have not captured detailed execution flows of a software application prior to production, you are unlikely to be able to diagnose a problem effectively unless the issue is obvious (low-hanging fruit), which raises the question: “How did the issue get into production in the first place?”

Without prior knowledge of the execution patterns (software execution model), devoid of system level concerns (concurrency, contention, co-ordination, and capacity), even the most experienced performance engineer will be overwhelmed by the amount of workload related monitoring data (system execution model). This is why the majority of ad hoc performance consultants focus on basic tuning rather than on software performance engineering, even though the benefits of tuning a system pale in comparison to what is achievable by engineering and tuning the software itself.

This does not mean one should revert to using high level system metrics as the main source of monitoring data during diagnostics and problem resolution, as this would represent a step (or more) backwards in the evolution of system/application management. Metrics are just one source of performance data, and they are much more relevant and useful when correlated with other sources of performance data, including resource usage hotspots (Probes). For performance management it is extremely important to be able to determine and understand what is happening within an application and across its many threads of execution at any moment in time, especially at the moment one or more problems are reported. But one must be careful to reduce the overhead of monitoring to a minimum whilst ensuring the level of tracing is sufficient to identify high level execution patterns, which can then be related to the more detailed execution patterns previously recorded in a snapshot catalog maintained across releases and deployments of an application.

The software execution model derived from detailed tracing and transaction analysis is much more useful during development as it helps developers to understand the runtime behavior of the static software artifacts under construction and not just in terms of performance - performance is just one aspect of the execution. During development additional overhead can be traded for improved insight into the sequence(s) of execution which itself can help avoid many common performance problems such as excessive client->server->database round trips. But as an application moves from development towards production this level of information can overburden and perturb the analysis of the software as the focus shifts to system concerns with the construction of a system execution model.

At this stage you might be asking what the difference is between a software execution model and a system execution model, and how each one relates to levels of application monitoring. One analogy I commonly use during our performance workshops is that of road traffic management in a busy city like London or New York. In terms of traffic analysis, the software execution model would consist of the route a driver would take in driving from address X to address Y, with the timing for each leg of the journey derived from the distance and the maximum allowed speed. Most importantly, this performance model assumes the driver is the only person on the road in London (film: 28 Days Later) or New York (film: Vanilla Sky). Such a software performance model is generally constructed by recording the execution of each major application use case by a single user, with analysis of the resulting model focused specifically on eliminating possible redundant legs (round-tripping) or providing alternative routes (fast call paths).

For the system execution model, let's add back into the picture (call frame) all those crazy zombies (component state infections), taxi drivers (runaway worker threads), fellow drivers (concurrent requests), pedestrians (wait monitors), and road works (blocking monitors and resource bottlenecks). Getting from address X to address Y is not straightforward anymore, and the time to travel each leg of the journey is subject to random and wild fluctuations (response time outliers) with the possibility of non-arrival (timeouts and failures). Here the analysis of the software performance model is focused on reducing levels of usage (resource consumption) and congestion (contention) on various streets (hotspots), junctions (thread monitors), inbound city motorways (request queues), and outbound motorways (external resources). The level of monitoring and management moves away from end-to-end traffic patterns (execution patterns) and onto the identification (via metering) and resolution of specific trouble points (tuning), in conjunction with overall traffic management strategies (application management and capacity planning). Even so, it is still of great importance that decisions (system and software changes) reflect the underlying end-to-end traffic patterns (software execution model) of various commuter groups; otherwise a local change (reduced service times) could result in further congestion elsewhere (increased wait times). It is for this reason that many transportation companies in the airline and rail business conduct surveys with traveling passengers, collecting the start and end points of the journey.
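
To put numbers on the analogy, here is a toy sketch of the two models. The function names and the figures are invented for illustration only and have nothing to do with how JXInsight works internally: the software model times an empty road, while the system model adds the waiting caused by everyone else.

```python
def software_model_time(legs):
    """Single-user journey time: each leg takes distance / speed limit."""
    return sum(distance / speed_limit for distance, speed_limit, _wait in legs)


def system_model_time(legs):
    """Journey time under load: add the waiting time observed on each leg."""
    return sum(distance / speed_limit + wait for distance, speed_limit, wait in legs)


# Hypothetical legs: (distance in km, speed limit in km/h, wait in hours).
legs = [(15.0, 30.0, 0.25), (25.0, 50.0, 0.5)]

empty_road = software_model_time(legs)   # 1.0 hour of pure travel (service time)
rush_hour = system_model_time(legs)      # 1.75 hours once waiting is included
```

The gap between the two results is the wait time that tuning at trouble points tries to reduce, while removing a redundant leg shrinks both.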

Finally
Hopefully, having got to this point in the blog entry, you will have realized what I was trying to say: detailed contextual profiling of call trees (Trace) and resource transaction path analysis (Transact) are much more relevant during the early phases of the application life cycle, and as the application edges closer to production the emphasis should shift to much lower overhead approaches such as resource metering (Probes) of call sites and monitoring of component related counters (Metrics) and state (Diagnostics). There should still be tracing at entry and exit points in production in order to relate back to execution patterns previously recorded in much greater detail, but the cost-benefit analysis favors metering and metrics, especially when these are combined, related, and correlated.