March 8, 2008 – 9:52 am
JXInsight is a comprehensive performance management and problem diagnostics solution that, unlike most competing solutions, can be used across all application life cycle phases - from development through to production. Unfortunately, this benefit presents its own set of issues to new users, who must select the type and degree of instrumentation and measurement for each phase (and environment).
The main issue we see with new users, especially those who have not taken our software performance engineering training course, is the inappropriate use of fine-grained tracing to profile an application in production, which incurs more overhead than is required to manage the application's performance effectively. Tracing, especially when contextual, is important in understanding the execution flow patterns within an application, but the degree of tracing should be trimmed back as an application moves from one phase to the next, as long as enough tracing remains to link the execution flow patterns captured in one phase to those captured in another.
By the time an application gets into production, contextual call tracing should be limited to the main entry points (inbound requests) and exit points (outbound requests) within a process. If you have not captured detailed execution flows of a software application prior to production, you are unlikely to be able to diagnose a problem effectively unless, of course, the issue is obvious (low-hanging fruit), which does beg the question “How did the issue get into production in the first place?”.
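The idea of trimming tracing back phase by phase can be sketched as a simple filter. This is only an illustration in Python, not JXInsight configuration or API: the `Phase` and `Kind` enums, the `should_trace` function, the intermediate test phase, its depth threshold, and the sample call names are all assumptions made for the example.

```python
from enum import Enum


class Phase(Enum):
    DEVELOPMENT = 1
    TEST = 2          # assumed intermediate phase, for illustration only
    PRODUCTION = 3


class Kind(Enum):
    ENTRY = "entry"        # inbound request into the process
    EXIT = "exit"          # outbound request to an external resource
    INTERNAL = "internal"  # any call made inside the process


def should_trace(phase, kind, depth, test_max_depth=3):
    """Decide whether a call is traced in the given life cycle phase."""
    if phase is Phase.DEVELOPMENT:
        # Full detail while the software execution model is being captured.
        return True
    if phase is Phase.TEST:
        # Trimmed back: boundaries plus shallow internal calls (hypothetical rule).
        return kind is not Kind.INTERNAL or depth <= test_max_depth
    # Production: entry and exit points only.
    return kind is not Kind.INTERNAL


# Hypothetical calls made while serving one request: (name, kind, call depth).
calls = [
    ("OrderServlet.doPost", Kind.ENTRY, 0),
    ("OrderService.place", Kind.INTERNAL, 1),
    ("PriceCalculator.total", Kind.INTERNAL, 5),
    ("jdbc:INSERT orders", Kind.EXIT, 6),
]


def traced(phase):
    """Names of the calls that would be traced in the given phase."""
    return [name for name, kind, depth in calls if should_trace(phase, kind, depth)]


for phase in Phase:
    print(phase.name, traced(phase))
```

The entry and exit points survive in every phase, which is what allows a coarse production trace to be lined up with the detailed flows recorded earlier.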
Without prior knowledge of the execution patterns (software execution model), devoid of system level concerns (concurrency, contention, co-ordination, and capacity), even the most experienced performance engineer will be overwhelmed by the amount of workload related monitoring data (system execution model). This is why the majority of ad hoc performance consultants focus on basic tuning rather than on software performance engineering, though the benefits of tuning a system pale in comparison to what is achievable by engineering and tuning the software itself.
This does not mean one should revert to using high level system metrics as the main source of monitoring data during diagnostics and problem resolution, as this would represent a step (or more) backwards in the evolution of system/application management. Metrics are just one source of performance data, and they are much more relevant and useful when correlated with other sources of performance data, including resource usage hotspots (Probes). For performance management it is extremely important to be able to determine and understand what is happening within an application and across its many threads of execution at any moment in time, especially at the moment one or more problems are reported. But one must be careful to reduce the overhead of monitoring to a minimum whilst ensuring the level of tracing is sufficient to identify high level execution patterns, which can then be related to the more detailed execution patterns previously recorded in a snapshot catalog maintained across releases and deployments of an application.
The software execution model derived from detailed tracing and transaction analysis is much more useful during development as it helps developers to understand the runtime behavior of the static software artifacts under construction and not just in terms of performance - performance is just one aspect of the execution. During development additional overhead can be traded for improved insight into the sequence(s) of execution which itself can help avoid many common performance problems such as excessive client->server->database round trips. But as an application moves from development towards production this level of information can overburden and perturb the analysis of the software as the focus shifts to system concerns with the construction of a system execution model.
At this stage you might be asking what the difference is between a software execution model and a system execution model, and how each one relates to levels of application monitoring. One analogy I commonly use during our performance workshops is that of road traffic management in a busy city like London or New York. In terms of traffic analysis, the software execution model would consist of the route a driver would take in driving from address X to address Y, with timing for each leg of the journey derived from the distance and the maximum allowed speed. Most importantly, this performance model assumes the driver is the only person on the road in London (film: 28 Days Later) or New York (film: Vanilla Sky). Such a software performance model is generally constructed by recording the execution of each major application use case by a single user, with analysis of the resulting model focused specifically on eliminating possible redundant legs (round-tripping) or providing alternative routes (fast call paths).
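To make the "redundant legs" analysis concrete, here is a minimal sketch in Python. The recorded legs, their timings, and the function names are invented for illustration and do not come from any JXInsight snapshot; each timing is assumed to be the time spent on that leg alone.

```python
from collections import Counter

# Hypothetical single-user recording of one use case:
# (caller, callee, operation, milliseconds spent on that leg alone)
legs = [
    ("client", "server", "getOrder", 12),
    ("server", "database", "SELECT order", 4),
    ("server", "database", "SELECT item", 3),
    ("server", "database", "SELECT item", 3),
    ("server", "database", "SELECT item", 3),
]


def total_time(recorded_legs):
    """Journey time when nobody else is on the road: the sum of the legs."""
    return sum(leg[3] for leg in recorded_legs)


def repeated_legs(recorded_legs, threshold=2):
    """Legs travelled at least `threshold` times: candidates for round-trip removal."""
    counts = Counter((caller, callee, op) for caller, callee, op, _ in recorded_legs)
    return {leg: n for leg, n in counts.items() if n >= threshold}


print(total_time(legs))
print(repeated_legs(legs))
```

In this made-up recording the repeated `SELECT item` leg is the kind of round-tripping that a single batched query might eliminate, and it is visible without any concurrent load on the system.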
For the system execution model, let's add back into the picture (call frame) all those crazy zombies (component state infections), taxi drivers (runaway worker threads), fellow drivers (concurrent requests), pedestrians (wait monitors), and road works (blocking monitors and resource bottlenecks). Getting from address X to address Y is not straightforward anymore, and the time to travel each leg of the journey is subject to random and wild fluctuations (response time outliers) with the possibility of non-arrival (timeouts and failures). Here the analysis of the system performance model is focused on reducing levels of usage (resource consumption) and congestion (contention) on various streets (hotspots), junctions (thread monitors), inbound city motorways (request queues), and outbound motorways (external resources). The level of monitoring and management moves away from end-to-end traffic patterns (execution patterns) and onto the identification (via metering) and resolution of specific trouble points (tuning), in conjunction with overall traffic management strategies (application management and capacity planning). Even so, it is still of great importance that decisions (system and software changes) reflect the underlying end-to-end traffic patterns (software execution model) of the various commuter groups; otherwise a local change (reduced service times) could result in further congestion elsewhere (increased wait times). It is for this reason that many transportation companies in the airline and rail business conduct surveys with traveling passengers, collecting the start and end points of their journeys.
Finally
Hopefully, after getting to this point in the blog entry, you will have realized that what I was trying to say is that detailed contextual profiling of call trees (Trace) and resource transaction path analysis (Transact) are much more relevant during the early phases of the application life cycle, and that as the application edges closer to production the emphasis should be placed on much lower overhead approaches such as resource metering (Probes) of call sites and monitoring of component related counters (Metrics) and state (Diagnostics). There should still be tracing at entry and exit points in production in order to relate back to execution patterns previously recorded in much greater detail, but the cost-benefit analysis favors metering and metrics, especially when these are combined, related, and correlated.
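One way to see why metering costs less than tracing: a tracer keeps a record for every call, whereas a meter keeps only a running count and total per call site. The sketch below is a generic Python illustration of that difference and is not how JXInsight Probes are implemented; the `Meter` class, its methods, and the call site names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Meter:
    """Keeps one running count and total per call site, not one record per call."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock              # injectable so the sketch can be tested
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    @contextmanager
    def probe(self, name):
        """Meter the block executed inside the `with` statement."""
        start = self._clock()
        try:
            yield
        finally:
            self.count[name] += 1
            self.total[name] += self._clock() - start

    def hotspots(self):
        """Call sites ordered by total metered time, largest first."""
        return sorted(self.total, key=self.total.get, reverse=True)


meter = Meter()
for _ in range(1000):
    with meter.probe("cache.get"):   # hypothetical call site
        pass
with meter.probe("db.query"):        # hypothetical call site
    time.sleep(0.01)

# Memory stays constant per call site however many calls are metered.
print(meter.count["cache.get"], len(meter.total))
```

The aggregates say nothing about the order of calls, which is why the entry and exit point traces are still needed to tie the hotspots back to a previously recorded execution pattern.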
Posted in From The Web, JXInsight, Metrics, Probes, SPE, Trace