Aggregations
When instrumenting the system to answer performance-related questions, it is useful to consider how data can be aggregated to answer a specific question rather than thinking in terms of data gathered by individual probes. For example, if you wanted to know the number of system calls by user ID, you would not necessarily care about the datum collected at each system call. You simply want to see a table of user IDs and system calls. Historically, you would answer this question by gathering data at each system call, and post-processing the data using a tool like awk(1) or perl(1). However, in DTrace the aggregating of data is a first-class operation. This chapter describes the DTrace facilities for manipulating aggregations.
9.1. Aggregating Functions
An aggregating function is one that has the following property:
f(f(x0) ∪ f(x1) ∪ ... ∪ f(xn)) = f(x0 ∪ x1 ∪ ... ∪ xn)
where xn is a set of arbitrary data. That is, applying an aggregating function to subsets of the whole and then applying it again to the results gives the same result as applying it to the whole itself. For example, consider a function SUM that yields the summation of a given data set. If the raw data consists of {2, 1, 2, 5, 4, 3, 6, 4, 2}, the result of applying SUM to the entire set is {29}. Similarly, the result of applying SUM to the subset consisting of the first three elements is {5}, the result of applying SUM to the set consisting of the subsequent three elements is {12}, and the result of applying SUM to the remaining three elements is also {12}. SUM is an aggregating function because applying it to the set of these results, {5, 12, 12}, yields the same result, {29}, as applying SUM to the original data.
Not all functions are aggregating functions. An example of a non-aggregating function is the function MEDIAN that determines the median element of the set. (The median is defined to be that element of a set for which as many elements in the set are greater than it as are less than it.) The MEDIAN is derived by sorting the set and selecting the middle element. Returning to the original raw data, if MEDIAN is applied to the set consisting of the first three elements, the result is {2}. (The sorted set is {1, 2, 2}; {2} is the set consisting of the middle element.) Likewise, applying MEDIAN to the next three elements yields {4} and applying MEDIAN to the final three elements yields {4}. Applying MEDIAN to each of the subsets thus yields the set {2, 4, 4}. Applying MEDIAN to this set yields the result {4}. However, sorting the original set yields {1, 2, 2, 2, 3, 4, 4, 5, 6}. Applying MEDIAN to this set thus yields {3}. Because these results do not match, MEDIAN is not an aggregating function.
Many common functions for understanding a set of data are aggregating functions. These functions include counting the number of elements in the set, computing the minimum value of the set, computing the maximum value of the set, and summing all elements in the set. The arithmetic mean of the set can be constructed from the function that counts the number of elements in the set and the function that sums the elements in the set, as sketched below.
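For example, the following minimal sketch records both the sum and the count of bytes read per program; the probe and datum are illustrative only (arg0 of a syscall return probe is the return value, here the number of bytes read). The per-key mean is the quotient of the two, which is exactly what the avg aggregating function computes:

syscall::read:return
/arg0 >= 0/
{
	@total[execname] = sum(arg0);   /* sum of bytes read                   */
	@n[execname] = count();         /* number of successful reads          */
	@mean[execname] = avg(arg0);    /* equals the sum divided by the count */
}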
However, several useful functions are not aggregating functions. These include computing the mode (the most common element) of a set, the median value of a set, and the standard deviation of a set.
Applying aggregating functions to data as it is traced has a number of advantages:
- The entire data set need not be stored. Whenever a new element is to be added to the set, the aggregating function is calculated given the set consisting of the current intermediate result and the new element. After the new result is calculated, the new element may be discarded. This process reduces the amount of storage required by a factor of the number of data points, which is often quite large.
- Data collection does not induce pathological scalability problems. Aggregating functions enable intermediate results to be kept per-CPU instead of in a shared data structure. DTrace then applies the aggregating function to the set consisting of the per-CPU intermediate results to produce the final system-wide result.
9.2. Aggregations
DTrace stores the results of aggregating functions in objects called aggregations. The aggregation results are indexed using a tuple of expressions similar to those used for associative arrays. In D, the syntax for an aggregation is:
@name[ keys ] = aggfunc ( args );
where name is the name of the aggregation, keys is a comma-separated list of D expressions, aggfunc is one of the DTrace aggregating functions, and args is
a comma-separated list of arguments appropriate for the aggregating function.
The aggregation name is a D identifier that is prefixed with the special character @. All aggregations named in your D programs are global variables; there are no thread- or clause-local aggregations. Aggregation names are kept in a separate identifier namespace from other D global variables. Remember that a and @a are not the same variable if you reuse names. The special aggregation name @ can be used to name an anonymous aggregation in simple D programs. The D compiler treats this name as an alias for the aggregation name @_.
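The following minimal sketch illustrates these rules; the profile probe, key, and variable names are chosen only for illustration:

profile-997
{
	a++;                   /* a D global variable                          */
	@a[cpu] = count();     /* a distinct aggregation in its own namespace  */
	@ = count();           /* the anonymous aggregation, an alias for @_   */
}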
The DTrace aggregating functions are shown in the following table. Most aggregating functions take just a single argument that represents the new datum.
| Function Name | Arguments | Result |
|---|---|---|
| count | none | The number of times called. |
| sum | scalar expression | The total value of the specified expressions. |
| avg | scalar expression | The arithmetic average of the specified expressions. |
| min | scalar expression | The smallest value among the specified expressions. |
| max | scalar expression | The largest value among the specified expressions. |
| lquantize | scalar expression, lower bound, upper bound, step value | A linear frequency distribution, sized by the specified range, of the values of the specified expressions. Increments the value in the highest bucket that is less than the specified expression. |
| quantize | scalar expression | A power-of-two frequency distribution of the values of the specified expressions. Increments the value in the highest power-of-two bucket that is less than the specified expression. |
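Most of these functions are demonstrated later in this chapter. As a brief sketch of min and max, the following clause records the smallest and largest write(2) sizes per program; the probe and keys are illustrative only, and arg2 is the nbyte argument to write(2):

syscall::write:entry
{
	@smallest[execname] = min(arg2);   /* smallest write size seen */
	@largest[execname] = max(arg2);    /* largest write size seen  */
}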
For example, to count the number of write(2) system calls in the system, you could use an informative string as a key and the count aggregating function:
syscall::write:entry
{
@counts["write system calls"] = count();
}
The dtrace command prints aggregation results by default when the process terminates, either as the result of an explicit END action or when the user presses Control-C. The following example output shows the result of running this command, waiting for a few seconds, and pressing Control-C:
# dtrace -s writes.d
dtrace: script './writes.d' matched 1 probe
^C

  write system calls                                              179
#
You can count system calls per process name using the execname variable as the key to an aggregation:
syscall::write:entry
{
@counts[execname] = count();
}
The following example output shows the result of running this command, waiting for a few seconds, and pressing Control-C:
# dtrace -s writesbycmd.d
dtrace: script './writesbycmd.d' matched 1 probe
^C

  dtrace                                                            1
  cat                                                               4
  sed                                                               9
  head                                                              9
  grep                                                             14
  find                                                             15
  tail                                                             25
  mountd                                                           28
  expr                                                             72
  sh                                                              291
  tee                                                             814
  def.dir.flp                                                    1996
  make.bin                                                       2010
#
Alternatively, you might want to further examine writes organized by both executable name and file descriptor. The file descriptor is the first argument to write(2), so the following example uses a key consisting of both execname and arg0:
syscall::write:entry
{
@counts[execname, arg0] = count();
}
Running this command results in a table with both executable name and file descriptor, as shown in the following example:
# dtrace -s writesbycmdfd.d
dtrace: script './writesbycmdfd.d' matched 1 probe
^C

  cat                                                      1        58
  sed                                                      1        60
  grep                                                     1        89
  tee                                                      1       156
  tee                                                      3       156
  make.bin                                                 5       164
  acomp                                                    1       263
  macrogen                                                 4       286
  cg                                                       1       397
  acomp                                                    3       736
  make.bin                                                 1       880
  iropt                                                    4      1731
#
The following example displays the average time spent in the write system call, organized by process name. This example uses the avg aggregating function, specifying the expression to average as the argument. The example averages the wall clock time spent in the system call:
syscall::write:entry
{
self->ts = timestamp;
}
syscall::write:return
/self->ts/
{
@time[execname] = avg(timestamp - self->ts);
self->ts = 0;
}
The following example output shows the result of running this command, waiting for a few seconds, and pressing Control-C:
# dtrace -s writetime.d
dtrace: script './writetime.d' matched 2 probes
^C

  iropt                                                          31315
  acomp                                                          37037
  make.bin                                                       63736
  tee                                                            68702
  date                                                           84020
  sh                                                             91632
  dtrace                                                        159200
  ctfmerge                                                      321560
  install                                                       343300
  mcs                                                           394400
  get                                                           413695
  ctfconvert                                                    594400
  bringover                                                    1332465
  tail                                                         1335260
#
The average can be useful, but often does not provide sufficient detail to understand the distribution of data points. To understand the distribution in further detail, use the quantize aggregating function as shown in the following example:
syscall::write:entry
{
self->ts = timestamp;
}
syscall::write:return
/self->ts/
{
@time[execname] = quantize(timestamp - self->ts);
self->ts = 0;
}
Because each line of output becomes a frequency distribution diagram, the output of this script is substantially longer than previous ones. The following example shows a selection of sample output:
  lint
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |                                         2
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@                      74
          131072 |@@@@@@@@@@@@@@@                          59
          262144 |@@@                                      14
          524288 |                                         0

  acomp
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@                             840
           16384 |@@@@@@@@@@@                              750
           32768 |@@                                       165
           65536 |@@@@@@                                   460
          131072 |@@@@@@                                   446
          262144 |                                         16
          524288 |                                         0
         1048576 |                                         1
         2097152 |                                         0

  iropt
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@                  4149
           16384 |@@@@@@@@@@                               1798
           32768 |@                                        332
           65536 |@                                        325
          131072 |@@                                       431
          262144 |                                         3
          524288 |                                         2
         1048576 |                                         1
         2097152 |                                         0
Notice that the rows for the frequency distribution are always power-of-two values. Each row indicates the count of the number of elements greater than or equal to the corresponding value, but less than the next larger row value. For example, the above output shows that iropt had 4,149 writes taking between 8,192 nanoseconds and 16,383 nanoseconds, inclusive.
While quantize is useful for getting quick insight into the data, you might want to examine a distribution across linear values instead. To display a linear value distribution, use the lquantize aggregating function. The lquantize function takes three arguments in addition to a D expression: a lower bound, an upper bound, and a step. For example, if you wanted to look at the distribution of writes by file descriptor, a power-of-two quantization would not be effective. Instead, use a linear quantization with a small range, as shown in the following example:
syscall::write:entry
{
@fds[execname] = lquantize(arg0, 0, 100, 1);
}
Running this script for several seconds yields a large amount of information. The following example shows a selection of typical output:
  mountd
           value  ------------- Distribution ------------- count
              11 |                                         0
              12 |@                                        4
              13 |                                         0
              14 |@@@@@@@@@@@@@@@@@@@@@@@@@                70
              15 |                                         0
              16 |@@@@@@@@@@@@                             34
              17 |                                         0

  xemacs-20.4
           value  ------------- Distribution ------------- count
               6 |                                         0
               7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  521
               8 |                                         0
               9 |                                         1
              10 |                                         0

  make.bin
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  3596
               2 |                                         0
               3 |                                         0
               4 |                                         42
               5 |                                         50
               6 |                                         0

  acomp
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@                                    1156
               2 |                                         0
               3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         6635
               4 |@                                        297
               5 |                                         0

  iropt
           value  ------------- Distribution ------------- count
               2 |                                         0
               3 |                                         299
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  20144
               5 |                                         0
You can also use the lquantize aggregating function to aggregate on time since some point in the past. This technique allows you to observe a change in behavior over time. The following example displays the change in system call behavior over the lifetime of a process executing the date(1) command:
syscall::exec:return,
syscall::exece:return
/execname == "date"/
{
self->start = vtimestamp;
}
syscall:::entry
/self->start/
{
/*
* We linearly quantize on the current virtual time minus our
* process's start time. We divide by 1000 to yield microseconds
* rather than nanoseconds. The range runs from 0 to 10
* milliseconds in steps of 100 microseconds; we expect that no
* date(1) process will take longer than 10 milliseconds to
* complete.
*/
@a["system calls over time"] =
lquantize((vtimestamp - self->start) / 1000, 0, 10000, 100);
}
syscall::rexit:entry
/self->start/
{
self->start = 0;
}
The preceding script provides greater insight into system call behavior when many date(1) processes are executed. To see this result, run sh -c 'while true; do date >/dev/null; done' in one window, while executing the D script in another. The script produces a profile of the system call behavior of the date(1) command:
# dtrace -s dateprof.d
dtrace: script './dateprof.d' matched 218 probes
^C

  system calls over time
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@                                       20530
             100 |@@@@@@                                   48814
             200 |@@@                                      28119
             300 |@                                        14646
             400 |@@@@@                                    41237
             500 |                                         1259
             600 |                                         218
             700 |                                         116
             800 |@                                        12783
             900 |@@@                                      28133
            1000 |                                         7897
            1100 |@                                        14065
            1200 |@@@                                      27549
            1300 |@@@                                      25715
            1400 |@@@@                                     35011
            1500 |@@                                       16734
            1600 |                                         498
            1700 |                                         256
            1800 |                                         369
            1900 |                                         404
            2000 |                                         320
            2100 |                                         555
            2200 |                                         54
            2300 |                                         17
            2400 |                                         5
            2500 |                                         1
            2600 |                                         7
            2700 |                                         0
This output provides a rough idea of the different phases of the date(1) command with respect to the services required of the kernel. To better understand these phases, you might want to understand which system calls are being called when. If so, you could change the D script to aggregate on the probefunc variable instead of a constant string, as sketched below.
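The following sketch shows that change applied to the preceding script; only the aggregation key differs, so each system call name gets its own distribution over time:

syscall::exec:return,
syscall::exece:return
/execname == "date"/
{
	self->start = vtimestamp;
}

syscall:::entry
/self->start/
{
	/* Key on the system call name rather than a constant string. */
	@a[probefunc] =
	    lquantize((vtimestamp - self->start) / 1000, 0, 10000, 100);
}

syscall::rexit:entry
/self->start/
{
	self->start = 0;
}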
For some operations, visibility at microsecond resolution is critical, while for other (longer) operations, millisecond resolution is sufficient; some classes of operations may take hundreds or thousands of milliseconds to complete. If one chooses the wrong resolution for an aggregation, one either clogs the system with unnecessarily fine-grained data or discards valuable information in overly coarse-grained data. One might think that the interesting latency range could be inferred from the nature of the operation, but this is unfortunately not the case: I/O operations (typically measured in milliseconds) can take mere microseconds, and CPU scheduling operations (typically measured in microseconds) can take hundreds of milliseconds. Indeed, the problem is that one often does not know what order of magnitude of resolution is interesting for a particular operation until one has a feel for the data for that specific operation; one needs to instrument the system to best know how to instrument the system! For these cases, use the llquantize aggregating function, which quantizes log-linearly.
Here is a simple example of how an llquantize aggregation is both logarithmic and linear:
tick-1ms
{
/*
* A log-linear quantization with factor of 10 ranging from
* magnitude 0 to magnitude 6 (inclusive) with twenty steps per
* magnitude.
*/
@ = llquantize(i++, 10, 0, 6, 20);
}
tick-1ms
/i == 1500/
{
exit(0);
}
This example outputs:
           value  ------------- Distribution ------------- count
             < 1 |                                         1
               1 |                                         1
               2 |                                         1
               3 |                                         1
               4 |                                         1
               5 |                                         1
               6 |                                         1
               7 |                                         1
               8 |                                         1
               9 |                                         1
              10 |                                         5
              15 |                                         5
              20 |                                         5
              25 |                                         5
              30 |                                         5
              35 |                                         5
              40 |                                         5
              45 |                                         5
              50 |                                         5
              55 |                                         5
              60 |                                         5
              65 |                                         5
              70 |                                         5
              75 |                                         5
              80 |                                         5
              85 |                                         5
              90 |                                         5
              95 |                                         5
             100 |@                                        50
             150 |@                                        50
             200 |@                                        50
             250 |@                                        50
             300 |@                                        50
             350 |@                                        50
             400 |@                                        50
             450 |@                                        50
             500 |@                                        50
             550 |@                                        50
             600 |@                                        50
             650 |@                                        50
             700 |@                                        50
             750 |@                                        50
             800 |@                                        50
             850 |@                                        50
             900 |@                                        50
             950 |@                                        50
            1000 |@@@@@@@@@@@@@                            500
            1500 |                                         0
9.3. Printing Aggregations
By default, multiple aggregations are displayed in the order they are introduced in the D program. You can override this behavior by using the printa function to print the aggregations. The printa function also enables you to precisely format the aggregation data using a format string, as described in Output Formatting.
If an aggregation is not formatted with a printa statement in your D program, the dtrace command will snapshot the aggregation data and print the results once after tracing has completed, using the default aggregation format. If a given aggregation is formatted using a printa statement, the default behavior is disabled. You can achieve equivalent results by adding the statement printa(@aggregation-name) to a dtrace:::END probe clause in your program. The default output format for the avg, count, min, max, and sum aggregating functions displays an integer decimal value corresponding to the aggregated value for each tuple. The default output format for the lquantize and quantize aggregating functions displays an ASCII table of the results. Aggregation tuples are printed as if trace had been applied to each tuple element.
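For example, the following sketch formats a count aggregation explicitly from an END probe; the probe, key, and format string are illustrative only. The %@d conversion in the printa format string prints the aggregated value, while ordinary conversions consume tuple elements:

syscall::write:entry
{
	@counts[execname] = count();
}

END
{
	printf("%-20s %10s\n", "EXECNAME", "WRITES");
	printa("%-20s %@10d\n", @counts);
}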
9.4. Data Normalization
When aggregating data over some period of time, you might want to normalize the data with respect to some constant factor. This technique enables you to compare disjoint data more easily. For example, when aggregating system calls, you might want to output system calls as a per-second rate instead of as an absolute value over the course of the run. The DTrace normalize action enables you to normalize data in this way. The parameters to normalize are an aggregation and a normalization factor. The output of the aggregation shows each value divided by the normalization factor.
The following example shows how to aggregate data by system call:
#pragma D option quiet
BEGIN
{
/*
* Get the start time, in nanoseconds.
*/
start = timestamp;
}
syscall:::entry
{
@func[execname] = count();
}
END
{
/*
* Normalize the aggregation based on the number of seconds we
* have been running. (There are 1,000,000,000 nanoseconds in
* one second.)
*/
normalize(@func, (timestamp - start) / 1000000000);
}
Running the above script for a brief period of time results in the following output on a desktop machine:
# dtrace -s ./normalize.d
^C
  syslogd                                                           0
  rpc.rusersd                                                       0
  utmpd                                                             0
  xbiff                                                             0
  in.routed                                                         1
  sendmail                                                          2
  echo                                                              2
  FvwmAuto                                                          2
  stty                                                              2
  cut                                                               2
  init                                                              2
  pt_chmod                                                          3
  picld                                                             3
  utmp_update                                                       3
  httpd                                                             4
  xclock                                                            5
  basename                                                          6
  tput                                                              6
  sh                                                                7
  tr                                                                7
  arch                                                              9
  expr                                                             10
  uname                                                            11
  mibiisa                                                          15
  dirname                                                          18
  dtrace                                                           40
  ksh                                                              48
  java                                                             58
  xterm                                                           100
  nscd                                                            120
  fvwm2                                                           154
  prstat                                                          180
  perfbar                                                         188
  Xsun                                                           1309
  .netscape.bin                                                  3005
normalize sets the normalization factor for the specified aggregation, but this action does not modify the underlying data. denormalize takes only an aggregation as its argument. Adding the denormalize action to the preceding example returns both raw system call counts and per-second rates:
#pragma D option quiet
BEGIN
{
start = timestamp;
}
syscall:::entry
{
@func[execname] = count();
}
END
{
this->seconds = (timestamp - start) / 1000000000;
printf("Ran for %d seconds.\n", this->seconds);
printf("Per-second rate:\n");
normalize(@func, this->seconds);
printa(@func);
printf("\nRaw counts:\n");
denormalize(@func);
printa(@func);
}
Running the above script for a brief period of time produces output similar to the following example:
# dtrace -s ./denorm.d
^C
Ran for 14 seconds.
Per-second rate:

  syslogd                                                           0
  in.routed                                                         0
  xbiff                                                             1
  sendmail                                                          2
  elm                                                               2
  picld                                                             3
  httpd                                                             4
  xclock                                                            6
  FvwmAuto                                                          7
  mibiisa                                                          22
  dtrace                                                           42
  java                                                             55
  xterm                                                            75
  adeptedit                                                        118
  nscd                                                             127
  prstat                                                           179
  perfbar                                                          184
  fvwm2                                                            296
  Xsun                                                             829

Raw counts:

  syslogd                                                           1
  in.routed                                                         4
  xbiff                                                            21
  sendmail                                                         30
  elm                                                              36
  picld                                                            43
  httpd                                                            56
  xclock                                                           91
  FvwmAuto                                                         104
  mibiisa                                                          314
  dtrace                                                           592
  java                                                             774
  xterm                                                           1062
  adeptedit                                                       1665
  nscd                                                            1781
  prstat                                                          2506
  perfbar                                                         2581
  fvwm2                                                           4156
  Xsun                                                           11616
Aggregations can also be renormalized. If normalize is called more than once for the same aggregation, the normalization factor will be the factor specified in the most recent call. The following example prints per-second rates over time:
#pragma D option quiet
BEGIN
{
start = timestamp;
}
syscall:::entry
{
@func[execname] = count();
}
tick-10sec
{
normalize(@func, (timestamp - start) / 1000000000);
printa(@func);
}
9.5. Clearing Aggregations
When using DTrace to build simple monitoring scripts, you can periodically clear the values in an aggregation using the clear function. This function takes an aggregation as its only parameter. The clear function clears only the aggregation's values; the aggregation's keys are retained. Therefore, the presence of a key in an aggregation that has an associated value of zero indicates that the key had a non-zero value that was subsequently set to zero as part of a clear. To discard both an aggregation's values and its keys, use the trunc function. See Truncating Aggregations for details.
The following example adds clear to the earlier renormalize.d (Renormalizing an Aggregation) example:
#pragma D option quiet
BEGIN
{
last = timestamp;
}
syscall:::entry
{
@func[execname] = count();
}
tick-10sec
{
normalize(@func, (timestamp - last) / 1000000000);
printa(@func);
clear(@func);
last = timestamp;
}
While renormalize.d (Renormalizing an Aggregation) shows the system call rate over the lifetime of the dtrace invocation, the preceding example shows the system call rate only for the most recent ten-second period.
9.6. Truncating Aggregations
When looking at aggregation results, you often care only about the top several results. The keys and values associated with anything other than the highest values are not interesting. You might also wish to discard an entire aggregation result, removing both keys and values. The DTrace trunc function is used for both of these situations.
The parameters to trunc are an aggregation and an optional truncation value. Without the truncation value, trunc discards both aggregation values and aggregation keys for the entire aggregation. When a truncation value n is present, trunc discards aggregation values and keys except for those values and keys associated with the highest n values. That is, trunc(@foo, 10) truncates the aggregation named foo after the top ten values, while trunc(@foo) discards the entire aggregation. The entire aggregation is also discarded if 0 is specified as the truncation value.
To see the bottom n values instead of the top n, specify a negative truncation value to trunc. For example, trunc(@foo, -10) truncates the aggregation named foo after the bottom ten values.
The following example augments the system call example to only display the per-second system call rates of the top ten system-calling applications in a ten-second period:
#pragma D option quiet
BEGIN
{
last = timestamp;
}
syscall:::entry
{
@func[execname] = count();
}
tick-10sec
{
trunc(@func, 10);
normalize(@func, (timestamp - last) / 1000000000);
printa(@func);
clear(@func);
last = timestamp;
}
The following example shows output from running the above script on a lightly loaded laptop:
  FvwmAuto                                                          7
  telnet                                                           13
  ping                                                             14
  dtrace                                                           27
  xclock                                                           34
  MozillaFirebird-                                                 63
  xterm                                                           133
  fvwm2                                                            146
  acroread                                                         168
  Xsun                                                             616

  telnet                                                            4
  FvwmAuto                                                          5
  ping                                                             14
  dtrace                                                           27
  xclock                                                           35
  fvwm2                                                            69
  xterm                                                            70
  acroread                                                         164
  MozillaFirebird-                                                 491
  Xsun                                                            1287
9.7. Minimizing Drops
Because DTrace buffers some aggregation data in the kernel, space might not be available when a new key is added to an aggregation. In this case, the data will be dropped, a counter will be incremented, and dtrace will generate a message indicating an aggregation drop. This situation rarely occurs because DTrace keeps long-running state (consisting of the aggregation's key and intermediate result) at user level, where space may grow dynamically. In the unlikely event that aggregation drops occur, you can increase the aggregation buffer size with the aggsize option to reduce the likelihood of drops. You can also use this option to minimize the memory footprint of DTrace. As with any size option, aggsize may be specified with any size suffix. The resizing policy of this buffer is dictated by the bufresize option. For more details on buffering, see Buffers and Buffering. For more details on options, see Options and Tunables.
An alternative method to eliminate aggregation drops is to increase the rate at which aggregation data is consumed at user level. This rate defaults to once per second, and may be explicitly tuned with the aggrate option. As with any rate option, aggrate may be specified with any time suffix, but defaults to rate-per-second. For more details on the aggrate option, see Options and Tunables.
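A minimal sketch of tuning both options from within a D program follows; the specific values are illustrative, not recommendations:

#pragma D option aggsize=4m
#pragma D option aggrate=10hz

syscall:::entry
{
	/* Many distinct keys; a larger buffer and a faster aggrate reduce drops. */
	@[probefunc, execname] = count();
}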