# Aggregations

When instrumenting the system to answer performance-related questions, it is useful to consider how data can be aggregated to answer a specific question rather than thinking in terms of data gathered by individual probes. For example, if you wanted to know the number of system calls by user ID, you would not necessarily care about the datum collected at each system call. You simply want to see a table of user IDs and system calls. Historically, you would answer this question by gathering data at each system call, and post-processing the data using a tool like awk(1) or perl(1). However, in DTrace the aggregating of data is a first-class operation. This chapter describes the DTrace facilities for manipulating aggregations.

## 9.1. Aggregating Functions

An aggregating function is one that has the following property:

``f(f(x`0`) U f(x`1`) U ... U f(x`n`)) = f(x`0 `U x`1 `U ... U x`n`)``

where xn is a set of arbitrary data. That is, applying an aggregating function to subsets of the whole and then applying it again to the results gives the same result as applying it to the whole itself. For example, consider a function `SUM` that yields the summation of a given data set. If the raw data consists of {2, 1, 2, 5, 4, 3, 6, 4, 2}, the result of applying `SUM` to the entire set is {29}. Similarly, the result of applying `SUM` to the subset consisting of the first three elements is {5}, the result of applying `SUM` to the set consisting of the subsequent three elements is {12}, and the result of of applying `SUM` to the remaining three elements is also {12}. `SUM` is an aggregating function because applying it to the set of these results, {5, 12, 12}, yields the same result, {29}, as applying `SUM` to the original data.

Not all functions are aggregating functions. An example of a non-aggregating function is the function `MEDIAN` that determines the median element of the set. (The median is defined to be that element of a set for which as many elements in the set are greater than it as are less than it.) The `MEDIAN` is derived by sorting the set and selecting the middle element. Returning to the original raw data, if `MEDIAN` is applied to the set consisting of the first three elements, the result is {2}. (The sorted set is {1, 2, 2}; {2} is the set consisting of the middle element.) Likewise, applying `MEDIAN` to the next three elements yields {4} and applying `MEDIAN` to the final three elements yields {4}. Applying `MEDIAN` to each of the subsets thus yields the set {2, 4, 4}. Applying `MEDIAN` to this set yields the result {4}. However, sorting the original set yields {1, 2, 2, 2, 3, 4, 4, 5, 6}. Applying `MEDIAN` to this set thus yields {3}. Because these results do not match, `MEDIAN` is not an aggregating function.

Many common functions for understanding a set of data are aggregating functions. These functions include counting the number of elements in the set, computing the minimum value of the set, computing the maximum value of the set, and summing all elements in the set. Determining the arithmetic mean of the set can be constructed from the function to count the number of elements in the set and the function to sum the number the elements in the set.

However, several useful functions are not aggregating functions. These functions include computing the mode (the most common element) of a set, the median value of the set, or the standard deviation of the set.

Applying aggregating functions to data as it is traced has a number of advantages:

• The entire data set need not be stored. Whenever a new element is to be added to the set, the aggregating function is calculated given the set consisting of the current intermediate result and the new element. After the new result is calculated, the new element may be discarded. This process reduces the amount of storage required by a factor of the number of data points, which is often quite large.

• Data collection does not induce pathological scalability problems. Aggregating functions enable intermediate results to be kept per-CPU instead of in a shared data structure. DTrace then applies the aggregating function to the set consisting of the per-CPU intermediate results to produce the final system-wide result.

## 9.2. Aggregations

DTrace stores the results of aggregating functions in objects called aggregations. The aggregation results are indexed using a tuple of expressions similar to those used for associative arrays. In D, the syntax for an aggregation is:

``@name[ keys ] = aggfunc ( args );``

where name is the name of the aggregation, keys is a comma-separated list of D expressions, aggfunc is one of the DTrace aggregating functions, and args is a comma-separated list of arguments appropriate for the aggregating function. The aggregation name is a D identifier that is prefixed with the special character `@`. All aggregations named in your D programs are global variables; there are no thread- or clause-local aggregations. The aggregation names are kept in a separate identifier namespace from other D global variables. Remember that `a` and `@a` are not the same variable if you reuse names. The special aggregation name `@` can be used to name an anonymous aggregation in simple D programs. The D compiler treats this name as an alias for the aggregation name `@_`.

The DTrace aggregating functions are shown in the following table. Most aggregating functions take just a single argument that represents the new datum.

DTrace Aggregating Functions

Function Name

Arguments

Result

`count`

none

The number of times called.

`sum`

scalar expression

The total value of the specified expressions.

`avg`

scalar expression

The arithmetic average of the specified expressions.

`min`

scalar expression

The smallest value among the specified expressions.

`max`

scalar expression

The largest value among the specified expressions.

`lquantize`

scalar expression, lower bound, upper bound, step value

A linear frequency distribution, sized by the specified range, of the values of the specified expressions. Increments the value in the highest bucket that is less than the specified expression.

`quantize`

scalar expression

A power-of-two frequency distribution of the values of the specified expressions. Increments the value in the highest power-of-two bucket that is less than the specified expression.

For example, to count the number of write(2) system calls in the system, you could use an informative string as a key and the `count` aggregating function:

``````syscall::write:entry
{
@counts["write system calls"] = count();
}``````

The `dtrace` command prints aggregation results by default when the process terminates, either as the result of an explicit `END` action or when the user presses Control-C. The following example output shows the result of running this command, waiting for a few seconds, and pressing Control-C:

```# dtrace -s writes.d
dtrace: script './writes.d' matched 1 probe
^C

write system calls                                              179
#```

You can count system calls per process name using the `execname` variable as the key to an aggregation:

``````syscall::write:entry
{
@counts[execname] = count();
}``````

The following example output shows the result of running this command, waiting for a few seconds, and pressing Control-C:

```# dtrace -s writesbycmd.d
dtrace: script './writesbycmd.d' matched 1 probe
^C

dtrace                                                            1
cat                                                               4
sed                                                               9
grep                                                             14
find                                                             15
tail                                                             25
mountd                                                           28
expr                                                             72
sh                                                              291
tee                                                             814
def.dir.flp                                                    1996
make.bin                                                       2010
#```

Alternatively, you might want to further examine writes organized by both executable name and file descriptor. The file descriptor is the first argument to write(2), so the following example uses a key consisting of both `execname` and `arg0`:

``````syscall::write:entry
{
@counts[execname, arg0] = count();
}``````

Running this command results in a table with both executable name and file descriptor, as shown in the following example:

```# dtrace -s writesbycmdfd.d
dtrace: script './writesbycmdfd.d' matched 1 probe
^C

cat                                                       1      58
sed                                                       1      60
grep                                                      1      89
tee                                                       1     156
tee                                                       3     156
make.bin                                                  5     164
acomp                                                     1     263
macrogen                                                  4     286
cg                                                        1     397
acomp                                                     3     736
make.bin                                                  1     880
iropt                                                     4    1731
#```

The following example displays the average time spent in the write system call, organized by process name. This example uses the `avg` aggregating function, specifying the expression to average as the argument. The example averages the wall clock time spent in the system call:

``````syscall::write:entry
{
self->ts = timestamp;
}

syscall::write:return
/self->ts/
{
@time[execname] = avg(timestamp - self->ts);
self->ts = 0;
}``````

The following example output shows the result of running this command, waiting for a few seconds, and pressing Control-C:

```# dtrace -s writetime.d
dtrace: script './writetime.d' matched 2 probes
^C

iropt                                                         31315
acomp                                                         37037
make.bin                                                      63736
tee                                                           68702
date                                                          84020
sh                                                            91632
dtrace                                                       159200
ctfmerge                                                     321560
install                                                      343300
mcs                                                          394400
get                                                          413695
ctfconvert                                                   594400
bringover                                                   1332465
tail                                                        1335260
#```

The average can be useful, but often does not provide sufficient detail to understand the distribution of data points. To understand the distribution in further detail, use the `quantize` aggregating function as shown in the following example:

``````syscall::write:entry
{
self->ts = timestamp;
}

syscall::write:return
/self->ts/
{
@time[execname] = quantize(timestamp - self->ts);
self->ts = 0;
}``````

Because each line of output becomes a frequency distribution diagram, the output of this script is substantially longer than previous ones. The following example shows a selection of sample output:

```  lint
value  ------------- Distribution ------------- count
8192 |                                         0
16384 |                                         2
32768 |                                         0
65536 |@@@@@@@@@@@@@@@@@@@                      74
131072 |@@@@@@@@@@@@@@@                          59
262144 |@@@                                      14
524288 |                                         0

acomp
value  ------------- Distribution ------------- count
4096 |                                         0
8192 |@@@@@@@@@@@@                             840
16384 |@@@@@@@@@@@                              750
32768 |@@                                       165
65536 |@@@@@@                                   460
131072 |@@@@@@                                   446
262144 |                                         16
524288 |                                         0
1048576 |                                         1
2097152 |                                         0

iropt
value  ------------- Distribution ------------- count
4096 |                                         0
8192 |@@@@@@@@@@@@@@@@@@@@@@@                  4149
16384 |@@@@@@@@@@                               1798
32768 |@                                        332
65536 |@                                        325
131072 |@@                                       431
262144 |                                         3
524288 |                                         2
1048576 |                                         1
2097152 |                                         0```

Notice that the rows for the frequency distribution are always power-of-two values. Each rows indicates the count of the number of elements greater than or equal to the corresponding value, but less than the next larger row value. For example, the above output shows that `iropt` had 4,149 writes taking between 8,192 nanoseconds and 16,383 nanoseconds, inclusive.

While `quantize` is useful for getting quick insight into the data, you might want to examine a distribution across linear values instead. To display a linear value distribution, use the `lquantize` aggregating function. The `lquantize` function takes three arguments in addition to a D expression: a lower bound, an upper bound, and a step. For example, if you wanted to look at the distribution of writes by file descriptor, a power-of-two quantization would not be effective. Instead, use a linear quantization with a small range, as shown in the following example:

``````syscall::write:entry
{
@fds[execname] = lquantize(arg0, 0, 100, 1);
}``````

Running this script for several seconds yields a large amount of information. The following example shows a selection of typical output:

```  mountd
value  ------------- Distribution ------------- count
11 |                                         0
12 |@                                        4
13 |                                         0
14 |@@@@@@@@@@@@@@@@@@@@@@@@@                70
15 |                                         0
16 |@@@@@@@@@@@@                             34
17 |                                         0

xemacs-20.4
value  ------------- Distribution ------------- count
6 |                                         0
7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  521
8 |                                         0
9 |                                         1
10 |                                         0

make.bin
value  ------------- Distribution ------------- count
0 |                                         0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  3596
2 |                                         0
3 |                                         0
4 |                                         42
5 |                                         50
6 |                                         0

acomp
value  ------------- Distribution ------------- count
0 |                                         0
1 |@@@@@                                    1156
2 |                                         0
3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         6635
4 |@                                        297
5 |                                         0

iropt
value  ------------- Distribution ------------- count
2 |                                         0
3 |                                         299
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  20144
5 |                                         0```

You can also use the `lquantize` aggregating function to aggregate on time since some point in the past. This technique allows you to observe a change in behavior over time. The following example displays the change in system call behavior over the lifetime of a process executing the date(1) command:

``````syscall::exec:return,
syscall::exece:return
/execname == "date"/
{
self->start = vtimestamp;
}

syscall:::entry
/self->start/
{
/*
* We linearly quantize on the current virtual time minus our
* process's start time. We divide by 1000 to yield microseconds
* rather than nanoseconds. The range runs from 0 to 10
* milliseconds in steps of 100 microseconds; we expect that no
* date(1) process will take longer than 10 milliseconds to
* complete.
*/
@a["system calls over time"] =
lquantize((vtimestamp - self->start) / 1000, 0, 10000, 100);
}

syscall::rexit:entry
/self->start/
{
self->start = 0;
}``````

The preceding script provides greater insight into system call behavior when many date(1) processes are executed. To see this result, run `sh -c 'while true; do date >/dev/null; done'` in one window, while executing the D script in another. The script produces a profile of the system call behavior of the date(1) command:

```# dtrace -s dateprof.d
dtrace: script './dateprof.d' matched 218 probes
^C

system calls over time
value  ------------- Distribution ------------- count
< 0 |                                         0
0 |@@                                       20530
100 |@@@@@@                                   48814
200 |@@@                                      28119
300 |@                                        14646
400 |@@@@@                                    41237
500 |                                         1259
600 |                                         218
700 |                                         116
800 |@                                        12783
900 |@@@                                      28133
1000 |                                         7897
1100 |@                                        14065
1200 |@@@                                      27549
1300 |@@@                                      25715
1400 |@@@@                                     35011
1500 |@@                                       16734
1600 |                                         498
1700 |                                         256
1800 |                                         369
1900 |                                         404
2000 |                                         320
2100 |                                         555
2200 |                                         54
2300 |                                         17
2400 |                                         5
2500 |                                         1
2600 |                                         7
2700 |                                         0```

This output provides a rough idea of the different phases of the date(1) command with respect to the services required of the kernel. To better understand these phases, you might want to understand which system calls are being called when. If so, you could change the D script to aggregate on the variable `probefunc` instead of a constant string.

For some operations, visibility into microsecond resolution is critical, while for other (longer) operations, millisecond resolution is sufficient — and some classes of operations may take hundreds or thousands of milliseconds to complete. If one makes the wrong choice as the resolution of aggregation, one either ends up clogging the system with unnecessarily fine-grained data, or discarding valuable information in overly coarse-grained data. One might think that one could infer the interesting latency range from the nature of the operation, but this is unfortunately not the case: I/O operations (a typically-millisecond resolution operation) can take but microseconds, and CPU scheduling operations (a typically-microsecond resolution operation) can take hundreds of milliseconds. Indeed, the problem is that one often does not know what order of magnitude of resolution is interesting for a particular operation until one has a feel for the data for the specific operation — one needs to instrument the system to best know how to instrument the system! To understand this case use the `llquantize` aggregating function.

Here is a simple example of how the aggregation is both logarithmic and linear:

``````
tick-1ms
{
/*
* A log-linear quantization with factor of 10 ranging from
* magnitude 0 to magnitude 6 (inclusive) with twenty steps per
* magnitude.
*/
@ = llquantize(i++, 10, 0, 6, 20);
}

tick-1ms
/i == 1500/
{
exit(0);
}
``````

This example outputs:

```

value  ------------- Distribution ------------- count
< 1 |                                         1
1 |                                         1
2 |                                         1
3 |                                         1
4 |                                         1
5 |                                         1
6 |                                         1
7 |                                         1
8 |                                         1
9 |                                         1
10 |                                         5
15 |                                         5
20 |                                         5
25 |                                         5
30 |                                         5
35 |                                         5
40 |                                         5
45 |                                         5
50 |                                         5
55 |                                         5
60 |                                         5
65 |                                         5
70 |                                         5
75 |                                         5
80 |                                         5
85 |                                         5
90 |                                         5
95 |                                         5
100 |@                                        50
150 |@                                        50
200 |@                                        50
250 |@                                        50
300 |@                                        50
350 |@                                        50
400 |@                                        50
450 |@                                        50
500 |@                                        50
550 |@                                        50
600 |@                                        50
650 |@                                        50
700 |@                                        50
750 |@                                        50
800 |@                                        50
850 |@                                        50
900 |@                                        50
950 |@                                        50
1000 |@@@@@@@@@@@@@                            500
1500 |                                         0
```

## 9.3. Printing Aggregations

By default, multiple aggregations are displayed in the order they are introduced in the D program. You can override this behavior using the `printa` function to print the aggregations. The `printa` function also enables you to precisely format the aggregation data using a format string, as described in Output Formatting.

If an aggregation is not formatted with a `printa` statement in your D program, the `dtrace` command will snapshot the aggregation data and print the results once after tracing has completed using the default aggregation format. If a given aggregation is formatted using a `printa` statement, the default behavior is disabled. You can achieve equivalent results by adding the statement `printa(@`aggregation-name`)` to a `dtrace:::END` probe clause in your program. The default output format for the `avg`, `count`, `min`, `max`, and `sum` aggregating functions displays an integer decimal value corresponding to the aggregated value for each tuple. The default output format for the `lquantize` and `quantize` aggregating functions displays an ASCII table of the results. Aggregation tuples are printed as if `trace` had been applied to each tuple element.

## 9.4. Data Normalization

When aggregating data over some period of time, you might want to normalize the data with respect to some constant factor. This technique enables you to compare disjoint data more easily. For example, when aggregating system calls, you might want to output system calls as a per-second rate instead of as an absolute value over the course of the run. The DTrace `normalize` action enables you to normalize data in this way. The parameters to `normalize` are an aggregation and a normalization factor. The output of the aggregation shows each value divided by the normalization factor.

The following example shows how to aggregate data by system call:

``````#pragma D option quiet

BEGIN
{
/*
* Get the start time, in nanoseconds.
*/
start = timestamp;
}

syscall:::entry
{
@func[execname] = count();
}

END
{
/*
* Normalize the aggregation based on the number of seconds we
* have been running.  (There are 1,000,000,000 nanoseconds in
* one second.)
*/
normalize(@func, (timestamp - start) / 1000000000);
}``````

Running the above script for a brief period of time results in the following output on a desktop machine:

```# dtrace -s ./normalize.d
^C
syslogd                                                           0
rpc.rusersd                                                       0
utmpd                                                             0
xbiff                                                             0
in.routed                                                         1
sendmail                                                          2
echo                                                              2
FvwmAuto                                                          2
stty                                                              2
cut                                                               2
init                                                              2
pt_chmod                                                          3
picld                                                             3
utmp_update                                                       3
httpd                                                             4
xclock                                                            5
basename                                                          6
tput                                                              6
sh                                                                7
tr                                                                7
arch                                                              9
expr                                                             10
uname                                                            11
mibiisa                                                          15
dirname                                                          18
dtrace                                                           40
ksh                                                              48
java                                                             58
xterm                                                           100
nscd                                                            120
fvwm2                                                           154
prstat                                                          180
perfbar                                                         188
Xsun                                                           1309
.netscape.bin                                                  3005```

`normalize` sets the normalization factor for the specified aggregation, but this action does not modify the underlying data. `denormalize` takes only an aggregation. Adding the denormalize action to the preceding example returns both raw system call counts and per-second rates:

``````#pragma D option quiet

BEGIN
{
start = timestamp;
}

syscall:::entry
{
@func[execname] = count();
}

END
{
this->seconds = (timestamp - start) / 1000000000;
printf("Ran for %d seconds.\n", this->seconds);

printf("Per-second rate:\n");
normalize(@func, this->seconds);
printa(@func);

printf("\nRaw counts:\n");
denormalize(@func);
printa(@func);
}``````

Running the above script for a brief period of time produces output similar to the following example:

```# dtrace -s ./denorm.d
^C
Ran for 14 seconds.
Per-second rate:

syslogd                                                           0
in.routed                                                         0
xbiff                                                             1
sendmail                                                          2
elm                                                               2
picld                                                             3
httpd                                                             4
xclock                                                            6
FvwmAuto                                                          7
mibiisa                                                          22
dtrace                                                           42
java                                                             55
xterm                                                            75
nscd                                                            127
prstat                                                          179
perfbar                                                         184
fvwm2                                                           296
Xsun                                                            829

Raw counts:

syslogd                                                           1
in.routed                                                         4
xbiff                                                            21
sendmail                                                         30
elm                                                              36
picld                                                            43
httpd                                                            56
xclock                                                           91
FvwmAuto                                                        104
mibiisa                                                         314
dtrace                                                          592
java                                                            774
xterm                                                          1062
nscd                                                           1781
prstat                                                         2506
perfbar                                                        2581
fvwm2                                                          4156
Xsun                                                          11616```

Aggregations can also be renormalized. If `normalize` is called more than once for the same aggregation, the normalization factor will be the factor specified in the most recent call. The following example prints per-second rates over time:

## 9.5. Clearing Aggregations

When using DTrace to build simple monitoring scripts, you can periodically clear the values in an aggregation using the `clear` function. This function takes an aggregation as its only parameter. The `clear` function clears only the aggregation's values; the aggregation's keys are retained. Therefore, the presence of a key in an aggregation that has an associated value of zero indicates that the key had a non-zero value that was subsequently set to zero as part of a `clear`. To discard both an aggregation's values and its keys, use the `trunc`. See Truncating aggregations for details.

The following example adds `clear` to renormalize.d: Renormalizing an Aggregation:

``````#pragma D option quiet

BEGIN
{
last = timestamp;
}

syscall:::entry
{
@func[execname] = count();
}

tick-10sec
{
normalize(@func, (timestamp - last) / 1000000000);
printa(@func);
clear(@func);
last = timestamp;
}``````

While renormalize.d: Renormalizing an Aggregation shows the system call rate over the lifetime of the `dtrace` invocation, the preceding example shows the system call rate only for the most recent ten-second period.

## 9.6. Truncating aggregations

When looking at aggregation results, you often care only about the top several results. The keys and values associated with anything other than the highest values are not interesting. You might also wish to discard an entire aggregation result, removing both keys and values. The DTrace `trunc` function is used for both of these situations.

The parameters to `trunc` are an aggregation and an optional truncation value. Without the truncation value, `trunc` discards both aggregation values and aggregation keys for the entire aggregation. When a truncation value n is present, `trunc` discards aggregation values and keys except for those values and keys associated with the highest n values. That is, `trunc(@foo, 10)` truncates the aggregation named `foo` after the top ten values, where `trunc(@foo)` discards the entire aggregation. The entire aggregation is also discarded if `0` is specified as the truncation value.

To see the bottom n values instead of the top n, specify a negative truncation value to `trunc`. For example, `trunc(@foo, -10)` truncates the aggregation named `foo` after the bottom ten values.

The following example augments the system call example to only display the per-second system call rates of the top ten system-calling applications in a ten-second period:

``````#pragma D option quiet

BEGIN
{
last = timestamp;
}

syscall:::entry
{
@func[execname] = count();
}

tick-10sec
{
trunc(@func, 10);
normalize(@func, (timestamp - last) / 1000000000);
printa(@func);
clear(@func);
last = timestamp;
}``````

The following example shows output from running the above script on a lightly loaded laptop:

```  FvwmAuto                                                          7
telnet                                                           13
ping                                                             14
dtrace                                                           27
xclock                                                           34
MozillaFirebird-                                                 63
xterm                                                           133
fvwm2                                                           146
Xsun                                                            616

telnet                                                            4
FvwmAuto                                                          5
ping                                                             14
dtrace                                                           27
xclock                                                           35
fvwm2                                                            69
xterm                                                            70
Because DTrace buffers some aggregation data in the kernel, space might not be available when a new key is added to an aggregation. In this case, the data will be dropped, a counter will be incremented, and `dtrace` will generate a message indicating an aggregation drop. This situation rarely occurs because DTrace keeps long-running state (consisting of the aggregation's key and intermediate result) at user-level where space may grow dynamically. In the unlikely event that aggregation drops occur, you can increase the aggregation buffer size with the `aggsize` option to reduce the likelihood of drops. You can also use this option to minimize the memory footprint of DTrace. As with any size option, `aggsize` may be specified with any size suffix. The resizing policy of this buffer is dictated by the `bufresize` option. For more details on buffering, see Buffers and Buffering. For more details on options, see Options and Tunables.
An alternative method to eliminate aggregation drops is to increase the rate at which aggregation data is consumed at user-level. This rate defaults to once per second, and may be explicitly tuned with the `aggrate` option. As with any rate option, `aggrate` may be specified with any time suffix, but defaults to rate-per-second. For more details on the `aggsize` option, see Options and Tunables.