Buffers and Buffering
Data buffering and management is an essential service provided by the DTrace framework for its clients, such as dtrace(1M). This chapter explores data buffering in detail and describes options you can use to change DTrace's buffer management policies.
11.1. Principal Buffers
The principal buffer is present in every DTrace invocation and is the buffer to which tracing actions record their data by default. These actions include, among others, trace, printf, printa, stack, ustack, tracemem, and exit.
The principal buffers are always allocated on a
per-CPU basis. This policy is not tunable, but tracing and buffer allocation
can be restricted to a single CPU by using the cpu
option.
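As a minimal sketch of the cpu option, the following script restricts tracing, and therefore principal buffer allocation, to CPU 0. The CPU number and the probe are illustrative choices, not values taken from this chapter:

```d
/* Illustrative: confine tracing and buffer allocation to CPU 0. */
#pragma D option cpu=0

syscall:::entry
{
	trace(execname);
}
```

The same setting can be given on the command line with -x cpu=0.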
11.2. Principal Buffer Policies
DTrace permits tracing in highly constrained contexts in the kernel.
In particular, DTrace permits tracing in contexts in which kernel software
may not reliably allocate memory. The consequence of this flexibility of context
is that there always exists a possibility that DTrace
will attempt to trace data when there isn't space available. DTrace must have
a policy to deal with such situations when they arise, but you might wish
to tune the policy based on the needs of a given experiment. Sometimes the
appropriate policy might be to discard the new data. Other times it might
be desirable to reuse the space containing the oldest recorded data to trace
new data. Most often, the desired policy is to minimize the likelihood of
running out of available space in the first place. To accommodate these varying
demands, DTrace supports several different buffer policies. This support is
implemented with the bufpolicy
option, and can be set on
a per-consumer basis. See Options and Tunables for more details on setting options.
11.2.1. switch Policy
By default, the
principal buffer has a switch
buffer policy. Under this
policy, per-CPU buffers are allocated in pairs: one buffer is active and the
other buffer is inactive. When a DTrace consumer attempts to read a buffer,
the kernel first switches the inactive and active buffers.
Buffer switching is done in such a manner that there is no window in which
tracing data may be lost. Once the buffers are switched, the newly inactive
buffer is copied out to the DTrace consumer. This policy assures that the
consumer always sees a self-consistent buffer: a buffer is never simultaneously
traced to and copied out. This technique also avoids introducing a window
in which tracing is paused or otherwise prevented. The rate at which the buffer
is switched and read out is controlled by the consumer with the switchrate
option. As with any rate option, switchrate
may
be specified with any time suffix, but defaults to rate-per-second. For more
details on switchrate
and other options, see Options and Tunables.
To process the principal buffer at user-level at a rate faster
than the default of once per second, tune the value of switchrate.
The system processes actions that induce user-level activity (such as printa
and system) when the corresponding record in the principal buffer is
processed. The value of switchrate dictates the rate at which the system
processes such actions.
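For example, a script that prints with printf might raise the switch rate so that its output appears more promptly. This is only a sketch; the rate of 10hz and the choice of probe are arbitrary illustrative values:

```d
/* Illustrative: switch and read the principal buffers ten times per second. */
#pragma D option switchrate=10hz

syscall::write:entry
{
	/* arg2 is the byte count requested by the caller of write(2). */
	printf("%s requested a write of %d bytes\n", execname, arg2);
}
```

A higher rate makes output more timely at the cost of more frequent buffer processing by the consumer.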
Under the switch
policy, if a given
enabled probe would trace more data than there is space available in the active
principal buffer, the data is dropped and a per-CPU drop
count is incremented. In the event of one or more drops, dtrace(1M) displays a message similar
to the following example:
dtrace: 11 drops on CPU 0
If a given record is larger than the total buffer size, the record will
be dropped regardless of buffer policy. You can reduce or eliminate drops
by either increasing the size of the principal buffer with the bufsize
option
or by increasing the switching rate with the switchrate
option.
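Both remedies can be combined in one script. The following sketch assumes an enabling that records a kernel stack on every system call, which produces comparatively large records; the size and rate shown are illustrative and should be tuned to the workload:

```d
/* Illustrative values: a larger per-CPU buffer and a faster switch rate. */
#pragma D option bufsize=16m
#pragma D option switchrate=20hz

syscall:::entry
{
	stack();
}
```

If dtrace still reports drops, either value can be increased further.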
Under the switch policy, scratch space for copyin, copyinstr, and alloca
is allocated out of the active buffer.
11.2.2. fill Policy
For some problems,
you might wish to use a single in-kernel buffer. While this approach can be
implemented with the switch
policy and appropriate D constructs
by incrementing a variable in D and predicating an exit
action
appropriately, such an implementation does not eliminate the possibility of
drops. To request a single, large in-kernel buffer, and continue tracing until
one or more of the per-CPU buffers has filled, use the fill
buffer
policy. Under this policy, tracing continues until an enabled probe attempts
to trace more data than can fit in the remaining principal buffer space. When
insufficient space remains, the buffer is marked as filled and the consumer
is notified that at least one of its per-CPU buffers has filled. Once dtrace(1M) detects a single filled buffer,
tracing is stopped, all buffers are processed and dtrace
exits.
No further data will be traced to a filled buffer even if the data would fit
in the buffer.
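The switch-policy alternative mentioned above, counting records in a D variable and predicating an exit action on that count, might be sketched as follows. The variable name seen and the limit of 1000 are hypothetical, and because the global counter is updated from every CPU the limit is only approximate. As noted, this form can still drop data:

```d
/* Hypothetical counter-based limit under the default switch policy. */
int seen;

syscall:::entry
/seen < 1000/
{
	seen++;
	trace(execname);
}

/* Once the limit is reached, stop tracing. */
syscall:::entry
/seen >= 1000/
{
	exit(0);
}
```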
To use the fill policy, set the bufpolicy option to fill. For example, the
following command traces every system call entry into a per-CPU 2K buffer
with the buffer policy set to fill:
# dtrace -n syscall:::entry -b 2k -x bufpolicy=fill
fill Policy and END Probes
END
probes
normally do not fire until tracing has been explicitly stopped by the DTrace
consumer. END
probes are guaranteed to only fire on one
CPU, but the CPU on which the probe fires is undefined. With fill
buffers,
tracing is explicitly stopped when at least one of the per-CPU principal buffers
has been marked as filled. If the fill
policy is selected,
the END
probe may fire on a CPU that has a filled buffer.
To accommodate END
tracing in fill
buffers,
DTrace calculates the amount of space potentially consumed by END
probes
and subtracts this space from the size of the principal
buffer. If the net size is negative, DTrace will refuse to start, and dtrace(1M) will output a corresponding
error message:
dtrace: END enablings exceed size of principal buffer
The reservation mechanism ensures that a full buffer always has sufficient
space for any END
probes.
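A script that combines the fill policy with an END clause might look like the following sketch. The buffer size and the message printed by the END clause are illustrative:

```d
#pragma D option bufpolicy=fill
#pragma D option bufsize=2k

syscall:::entry
{
	trace(execname);
}

/* Space for this record is reserved in the principal buffer in advance. */
END
{
	printf("tracing stopped at %d\n", timestamp);
}
```

Because the space for the END record is reserved, the record can be traced even if END fires on a CPU whose buffer has been marked as filled.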
11.2.3. ring Policy
The DTrace ring buffer policy helps you trace the events leading up to a
failure. If reproducing the failure takes hours or days, you might wish to
keep only the most recent data. Under this policy, once a principal buffer
has filled, tracing wraps around to the first entry, thereby overwriting
older tracing data. You establish the ring buffer by setting the bufpolicy
option to the string ring:
# dtrace -s foo.d -x bufpolicy=ring
When used to create a ring buffer, dtrace(1M) will not display any output
until the process is terminated. At that time, the ring buffer is consumed
and processed. dtrace processes each ring buffer in CPU order. Within a
CPU's buffer, trace records are displayed in order from oldest to youngest.
Just as with the switch buffering policy, no ordering exists between records
from different CPUs. If such an ordering is required, you should trace the
timestamp variable as part of your tracing request.
The following example demonstrates the use of a #pragma option
directive
to enable ring buffering:
#pragma D option bufpolicy=ring
#pragma D option bufsize=16k

syscall:::entry
/execname == $1/
{
	trace(timestamp);
}

syscall::rexit:entry
{
	exit(0);
}
11.3. Other Buffers
Principal buffers exist in every DTrace enabling. Beyond principal buffers, some DTrace consumers may have additional in-kernel data buffers: an aggregation buffer, discussed in Aggregations, and one or more speculative buffers, discussed in Speculative Tracing.
11.4. Buffer Sizes
The size of each buffer can be tuned on a per-consumer basis. Separate options are provided to tune each buffer size, as shown in the following table:
Buffer | Size Option
---|---
Principal | bufsize
Speculative | specsize
Aggregation | aggsize
Each of these options is set with a value that denotes the size. As
with any size option, the value may have an optional size suffix. See Options and Tunables for more
details. For example, to set the buffer size to one megabyte on the command
line to dtrace, you can use -x to set the option:
# dtrace -P syscall -x bufsize=1m
Alternatively, you can use the -b option to dtrace:
# dtrace -P syscall -b 1m
Finally, you could set bufsize using #pragma D option:
#pragma D option bufsize=1m
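The buffer size you select denotes the size of the buffer on each CPU. Moreover, for the switch
buffer policy, bufsize denotes the size of each buffer on each CPU, so the
active and inactive buffers together consume twice bufsize per CPU. The
buffer size defaults to four megabytes. For example, with the default
bufsize on a system with 4 CPUs, the switch policy allocates 2 x 4 x 4 = 32
megabytes of principal buffer space in total.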
11.5. Buffer Resizing Policy
Occasionally, the system might not have adequate free kernel memory
to allocate a buffer of desired size either because not enough memory is available
or because the DTrace consumer has exceeded one of the tunable limits described
in Options and Tunables.
You can configure the policy for buffer allocation failure using the
bufresize option, which defaults to auto. Under the auto buffer resize
policy, the size of a buffer is halved until a successful allocation occurs.
dtrace(1M) generates a message if the buffer as allocated is smaller than
the requested size:
# dtrace -P syscall -b 4g
dtrace: description 'syscall' matched 430 probes
dtrace: buffer size lowered to 128m
...
or:
# dtrace -P syscall'{@a[probefunc] = count()}' -x aggsize=1g
dtrace: description 'syscall' matched 430 probes
dtrace: aggregation size lowered to 128m
...
Alternatively, you can require manual intervention after buffer allocation
failure by setting bufresize to manual. Under this policy, a failure to
allocate will cause DTrace to fail to start:
# dtrace -P syscall -x bufsize=1g -x bufresize=manual
dtrace: description 'syscall' matched 430 probes
dtrace: could not enable tracing: Not enough space
#
The buffer resizing policy of all buffers (principal, speculative, and
aggregation) is dictated by the bufresize option.
11.6. Buffer Ordering Policy
DTrace consumes its principal buffers on a per-CPU basis. As a result, output is ordered first by the order in which the buffers are retrieved from the CPUs and second by the order of records within each principal buffer. Consider the following script and its output:
syscall:::entry
{
	trace(timestamp);
}

CPU     ID                    FUNCTION:NAME
 23     24                      close:entry  3302220933052713
 23     24                      close:entry  3302220933064286
 23     24                      close:entry  3302220933066326
 23     16                      rexit:entry  3302220933111500
  1     20                      write:entry  3302220705802875
  1     20                      write:entry  3302220705807694
  1     20                      write:entry  3302220705812112
  1    106                      ioctl:entry  3302220705815463
Notice that the timestamps are not in the order that you might expect. All of the events on CPU 23 are ordered and all of the events on CPU 1 are ordered; however, there is no total ordering based on time.
To order the output by time instead, use the temporal option. This option
can be set on a per-consumer basis.
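As a sketch, the script shown earlier in this section can request time-ordered output by setting the option with a pragma. This assumes a DTrace version that supports the temporal option:

```d
/* Assumes the temporal option is available in this DTrace version. */
#pragma D option temporal

syscall:::entry
{
	trace(timestamp);
}
```

The option can also be given on the command line with -x temporal.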