INTRO(9F) Kernel Functions for Drivers INTRO(9F)

NAME


Intro - Introduction to kernel and device driver functions

SYNOPSIS


#include <sys/ddi.h>
#include <sys/sunddi.h>

DESCRIPTION


Section 9F of the manual describes functions that are used for device
drivers, kernel modules, and the implementation of the kernel itself. This
page first provides an overview of the use of kernel functions and of the
portions of the manual that are specific to the kernel. After that, most of
the available functions are grouped together by use, with some brief
commentary and introduction.

Most manual pages are similar to those in other sections. They have common
fields such as the NAME, a SYNOPSIS to show which header files to include
and prototypes, an extended DESCRIPTION discussing its use, and the common
combination of RETURN VALUES and ERRORS. Some manuals will have examples
and additional manuals to reference in the SEE ALSO section.

RETURN VALUES and ERRORS


One major difference when programming in the kernel versus userland is that
there is no equivalent to errno. Instead, there are a few common patterns
that are used throughout the kernel that we'll discuss. While there are
common patterns, please be aware that due to the natural evolution of the
system, you will need to read the specifics in each function's manual page.

+o Many functions will return a specific DDI (Device Driver Interface)
value, which is commonly one of DDI_SUCCESS or DDI_FAILURE, indicating
success and failure respectively. Some functions will return
additional error codes to indicate why something failed. In general,
when checking a return code, it is always preferred to compare whether
the value equals or does not equal DDI_SUCCESS, as there can be many
different error cases and additional ones can be added over time.

+o Many routines explicitly return 0 on success and will return an
explicit error number on failure. Intro(2) has a list of error numbers.

+o There are classes of functions that return either a pointer or a
boolean type, either the C99 bool or the system's traditional type
boolean_t. In these cases, sometimes a more detailed error is provided
via an additional argument such as an int *. Absent such an argument,
there is generally no more detailed information available.
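
The first pattern above can be sketched as follows. This is a minimal
illustration that compiles outside the kernel, not driver code: the
DDI_SUCCESS and DDI_FAILURE definitions are stand-ins for the ones a driver
gets from <sys/sunddi.h>, and SKETCH_EXTRA_ERROR and every sketch_*
function are hypothetical names invented for this example.

```c
/*
 * Stand-ins so that this sketch compiles outside the kernel. A driver
 * gets the real definitions from <sys/sunddi.h>. SKETCH_EXTRA_ERROR is a
 * made-up additional error code.
 */
#ifndef DDI_SUCCESS
#define	DDI_SUCCESS	0
#define	DDI_FAILURE	(-1)
#endif
#define	SKETCH_EXTRA_ERROR	(-2)

/* Hypothetical operation that can fail in more than one way. */
static int
sketch_op(int mode)
{
	if (mode == 0)
		return (DDI_SUCCESS);
	return (mode == 1 ? DDI_FAILURE : SKETCH_EXTRA_ERROR);
}

/* Fragile: only notices the single failure code that it names. */
static int
sketch_failed_fragile(int mode)
{
	return (sketch_op(mode) == DDI_FAILURE);
}

/* Preferred: anything other than DDI_SUCCESS is treated as a failure. */
static int
sketch_failed_robust(int mode)
{
	return (sketch_op(mode) != DDI_SUCCESS);
}
```

With these stand-ins, sketch_failed_fragile(2) reports no failure even
though the operation did not succeed, while sketch_failed_robust(2)
catches it.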

CONTEXT


The CONTEXT section of a manual page describes the times in which the
function may be called. In general, there are three different contexts
that come up:

User User context implies that the thread of execution is operating
because a user thread has entered the kernel for an operation.
When an application issues a system call such as open(2), read(2),
write(2), or ioctl(2) then we are said to be in user context. When
in user context, one can copy in or out data from a user's address
space. When writing a character or block device driver, the
majority of the time that a character device operation such as the
corresponding open(9E), read(9E), write(9E), or ioctl(9E) entry
point is called, it is executing in user context. However, it is
possible to call those entry points through the kernel's layered
device interface, so, strictly speaking, drivers cannot assume that
those entry points will always have a user process present.

Interrupt
Interrupt context refers to when the operating system is handling
an interrupt (See Interrupt Related Functions) and executing a
registered interrupt handler. Interrupt context is split into two
different sets: high-level and low-level interrupts. Most device
drivers are always going to be executing low-level interrupts. To
determine whether an interrupt is considered high level or not, you
should pass the interrupt handle to the ddi_intr_get_pri(9F)
function and compare the resulting priority with
ddi_intr_get_hilevel_pri(9F).

When executing high-level interrupts, the thread may only execute a
limited number of functions. In particular, it may call
ddi_intr_trigger_softint(9F), mutex_enter(9F), and mutex_exit(9F).
It is critical that the mutex being used be properly initialized
with the driver's interrupt priority. The system will
transparently pick the correct implementation of a mutex based on
the interrupt type. Aside from the above, one must not block while
in high-level interrupt context.

On the other hand, when a thread is not in high-level interrupt
context, most of these restrictions are lifted. Kernel memory may
be allocated (if using a non-blocking allocation such as KM_NOSLEEP
or KM_NOSLEEP_LAZY), and many of the other documented functions may
be called.

Regardless of whether a thread is in high-level or low-level
interrupt context, it will never have a user context associated
with it and therefore cannot use routines like ddi_copyin(9F) or
ddi_copyout(9F).

Kernel Kernel context refers to all other times in the kernel. Whenever
the kernel is executing something on a thread that is not
associated with a user process, then one is in kernel context. The
most common situations for writers of kernel modules are things like
timeout callbacks, such as timeout(9F) or ddi_periodic_add(9F),
cases where the kernel is invoking a driver's device operation
routines such as attach(9E) and detach(9E), or many of the device
driver's registered callbacks from frameworks such as mac(9E),
usba_hcdi(9E), and various portions of SCSI, USB, and block
devices.

Framework-specific Contexts
Some manuals will discuss more specific constraints about when they
can be used. For example, some functions may only be called while
executing a specific entry point like attach(9E). Another example
of this is that the mac_transceiver_info_set_present(9F) function
is only meant to be used while executing a networking driver's
mct_info(9E) entry point.

PARAMETERS


In kernel manual pages (section 9), each function and entry point
description generally has a separate list of parameters which are arguments
to the function. The parameters section describes the basic purpose of
each argument and should explain where such things often come from and any
constraints on their values.

INTERFACES


Functions below are organized into categories that describe their purpose.
Individual functions are documented in their own manual pages. For each of
these areas, we discuss high-level concepts behind each area and provide a
brief discussion of how to get started with it. Note, some deprecated
functions or older frameworks are not listed here.

Every function listed below has its own manual page in section 9F and can
be read with man(1). In addition, some corresponding concepts are
documented in section 9 and some groups of functions are present to support
a specific type of device driver, which is discussed more in section 9E.

Logging Functions


Throughout the kernel there is often a need to log messages that make it
into either the system log or the console. These kinds of messages can be
logged with the cmn_err(9F) function or one of its more specific
variants that operate in the context of a device (dev_err(9F)) or a zone
(zcmn_err(9F)).

The console should be used sparingly. While a notice may be found there,
one should assume that it may be missed, whether due to overflow, nothing
being connected to, say, a serial console at the time, or some other
reason. While the system log is better than the console, take care not to
spam the log. If a driver logged every time a network packet was generated
or received, the system could quickly run out of space and useful messages
about bizarre behavior would become harder to find. It's also
important to remember that only system administrators and privileged users
can actually see this log. Where possible and appropriate, use programmatic
errors in routines that allow it.

The system also supports a structured event log called a system event that
is processed by syseventd(8). This is used by the OS to provide
notifications for things like device insertion and removal or the change of
a data link. These are driven by the ddi_log_sysevent(9F) function and
allow arbitrary additional structured metadata in the form of an nvlist_t.

cmn_err(9F) dev_err(9F)
vcmn_err(9F) vzcmn_err(9F)
zcmn_err(9F) ddi_log_sysevent(9F)

Memory Allocation


At the heart of most device drivers is memory allocation. The primary
kernel allocator is called "kmem" (kernel memory) and it is based on the
"vmem" (virtual memory) subsystem. Most of the time, device drivers should
use kmem_alloc(9F) and kmem_zalloc(9F) to allocate memory and free it with
kmem_free(9F). Based on the original kmem and subsequent vmem papers, the
kernel is internally using object caches and magazines to allow high-
throughput allocation in a multi-CPU environment.

When allocating memory, an important choice must be made: whether or not to
block for memory. If one opts to perform a sleeping allocation, then the
caller can be guaranteed that the allocation will succeed, but it may take
some time and the thread will be blocked during that entire duration. This
is the KM_SLEEP flag. On the other hand, there are many circumstances
where this is not appropriate, especially because a thread that is inside a
memory allocation function cannot currently be cancelled. If the thread
corresponds to a user process, then it will not be killable.

Given that there are many situations where this is not appropriate, the
kernel offers an allocation mode where it will not block for memory to be
available: KM_NOSLEEP and KM_NOSLEEP_LAZY. These allocations can fail and
return NULL when they do fail. Even though these are said to be no sleep
operations, that does not mean that the caller may not end up temporarily
blocked due to mutex contention or due to trying a bit more aggressively to
reclaim memory in the case of KM_NOSLEEP. Unless operating in special
circumstances, using KM_NOSLEEP_LAZY should be preferred to KM_NOSLEEP.

If a device driver has its own complex object that has more significant set
up and tear down costs, then the kmem cache function family should be
considered. To use a kmem cache, it must first be created using the
kmem_cache_create(9F) function, which requires specifying the size,
alignment, and constructors and destructors. Individual objects are
allocated from the cache with the kmem_cache_alloc(9F) function. An
important constraint when using the caches is that when an object is freed
with kmem_cache_free(9F), it is the caller's responsibility to ensure that
the object is returned to its constructed state prior to freeing it. If
the object is reused prior to the kernel reclaiming the memory for other
uses, then the constructor will not be called again. Most device drivers
do not need to create a kmem cache for their own allocations.

If you are writing a device driver that is trying to interact with the
networking, STREAMS, or USB subsystems, then they are generally using the
mblk_t data structure which is managed through a different set of APIs,
though they are leveraging kmem under the hood.

The vmem set of interfaces allows for the management of abstract regions of
integers, generally representing memory or some other object, each with an
offset and length. While it is not common that a device driver needs to do
its own such management, vmem_create(9F) and vmem_alloc(9F) are what to
reach for when the need arises. Rather than using vmem, if one needs to
model a set of integers where each is a valid identifier, that is, you need
to allocate every integer between 0 and 1000 as a distinct identifier,
instead use id_space_create(9F), which is discussed in Identifier
Management. For more information on vmem, see vmem(9).

kmem_alloc(9F) kmem_cache_alloc(9F)
kmem_cache_create(9F) kmem_cache_destroy(9F)
kmem_cache_free(9F) kmem_cache_set_move(9F)
kmem_free(9F) kmem_zalloc(9F)
vmem_add(9F) vmem_alloc(9F)
vmem_contains(9F) vmem_create(9F)
vmem_destroy(9F) vmem_free(9F)
vmem_size(9F) vmem_walk(9F)
vmem_xalloc(9F) vmem_xcreate(9F)
vmem_xfree(9F) bufcall(9F)
esbbcall(9F) qbufcall(9F)
qunbufcall(9F) unbufcall(9F)

String and libc Analogues


The kernel has many analogues for classic libc functions that deal with
string processing, memory copying, and related tasks. For the most part,
these behave similarly to their userland analogues, but there can be some
differences in return values and, for example, in the set of supported
format characters in the case of snprintf(9F) and related functions.

ASSERT(9F) bcmp(9F)
bzero(9F) bcopy(9F)
ddi_strdup(9F) ddi_strtol(9F)
ddi_strtoll(9F) ddi_strtoul(9F)
ddi_strtoull(9F) ddi_ffs(9F)
ddi_fls(9F) max(9F)
memchr(9F) memcmp(9F)
memcpy(9F) memmove(9F)
memset(9F) min(9F)
numtos(9F) snprintf(9F)
sprintf(9F) stoi(9F)
strcasecmp(9F) strcat(9F)
strchr(9F) strcmp(9F)
strcpy(9F) strdup(9F)
strfree(9F) string(9F)
strlcat(9F) strlcpy(9F)
strlen(9F) strlog(9F)
strncasecmp(9F) strncat(9F)
strncmp(9F) strncpy(9F)
strnlen(9F) strqget(9F)
strqset(9F) strrchr(9F)
strspn(9F) swab(9F)
vsnprintf(9F) va_arg(9F)
va_copy(9F) va_end(9F)
va_start(9F) vsprintf(9F)

Tree Data Structures


These functions provide access to an intrusive self-balancing binary tree
that is generally used throughout illumos. The primary type here is the
avl_tree_t. Structures can be present in multiple trees and there are
built-in walkers for the data structure in mdb(1).

avl_add(9F) avl_create(9F)
avl_destroy_nodes(9F) avl_destroy(9F)
avl_find(9F) avl_first(9F)
avl_insert_here(9F) avl_insert(9F)
avl_is_empty(9F) avl_last(9F)
avl_nearest(9F) AVL_NEXT(9F)
avl_numnodes(9F) AVL_PREV(9F)
avl_remove(9F) avl_swap(9F)

Linked Lists


These functions provide a standard, intrusive doubly-linked list whose type
is the list_t. This list implementation is used extensively throughout
illumos, has debugging support through mdb(1) walkers, and is generally
recommended rather than creating your own list. Due to its intrusive
nature, a given structure can be present on multiple lists.

list_create(9F) list_destroy(9F)
list_head(9F) list_insert_after(9F)
list_insert_before(9F) list_insert_head(9F)
list_insert_tail(9F) list_is_empty(9F)
list_link_active(9F) list_link_init(9F)
list_link_replace(9F) list_move_tail(9F)
list_next(9F) list_prev(9F)
list_remove_head(9F) list_remove_tail(9F)
list_remove(9F) list_tail(9F)
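
To illustrate what "intrusive" means here, the following is a minimal
userland sketch and not the kernel's list_t implementation. The link is
embedded in the object itself and the list records the offset of that link,
which is why one object can sit on several lists at once. Every sketch_*
name is hypothetical; real drivers should use list_create(9F) and friends.

```c
#include <stddef.h>

/* The link that gets embedded in each object. */
typedef struct sketch_node {
	struct sketch_node *sn_next;
	struct sketch_node *sn_prev;
} sketch_node_t;

typedef struct sketch_list {
	size_t		sl_offset;	/* offset of the link in the object */
	sketch_node_t	sl_head;	/* sentinel of a circular list */
} sketch_list_t;

static void
sketch_list_create(sketch_list_t *l, size_t offset)
{
	l->sl_offset = offset;
	l->sl_head.sn_next = l->sl_head.sn_prev = &l->sl_head;
}

static void
sketch_list_insert_tail(sketch_list_t *l, void *obj)
{
	/* Find the embedded link from the object pointer. */
	sketch_node_t *n = (sketch_node_t *)((char *)obj + l->sl_offset);

	n->sn_prev = l->sl_head.sn_prev;
	n->sn_next = &l->sl_head;
	l->sl_head.sn_prev->sn_next = n;
	l->sl_head.sn_prev = n;
}

static void *
sketch_list_head(sketch_list_t *l)
{
	if (l->sl_head.sn_next == &l->sl_head)
		return (NULL);
	/* Convert the link back into a pointer to its object. */
	return ((char *)l->sl_head.sn_next - l->sl_offset);
}

/* An object with two links, so it can be on two lists at the same time. */
typedef struct sketch_dev {
	int		sd_id;
	sketch_node_t	sd_all_link;
	sketch_node_t	sd_busy_link;
} sketch_dev_t;
```

A caller would create one list with offsetof(sketch_dev_t, sd_all_link) and
another with offsetof(sketch_dev_t, sd_busy_link), then insert the same
sketch_dev_t on both without any extra allocation.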

Name-Value Pairs


The kernel often uses the nvlist_t data structure to pass around a list of
typed name-value pairs. This data structure is used in diverse areas,
particularly because of its ability to be serialized in different formats
that are suitable not only for use between userland and the kernel, but
also persistently to a file.

An nvlist_t structure is initialized with the nvlist_alloc(9F) function and
can operate with two different degrees of uniqueness: a mode where only
names are unique or one where every name is qualified by its type. The
former means that if there is an integer named "foo" and a string, array,
or any other value with the same name is then added, the integer will be
replaced. However, if we were using the name and type as unique, then the
value would only be replaced if both the pair's type and the name "foo"
matched a pair that was already present. Otherwise, the two different
entries would co-exist.

When constructing an nvlist, it is normally backed by the normal kmem
allocator and may either use sleeping or non-sleeping allocations. It is
also possible to use a custom allocator, though that generally has not been
necessary in the kernel.

Specific keys and values can be looked up directly with the nvlist_lookup
family of functions, but the entire list can be iterated as well, which is
especially useful when trying to validate that no unknown keys are present
in the list. The iteration API nvlist_next_nvpair(9F) allows one to then
get the key's name, the type of the pair's value, and then the value
itself.

nv_alloc_fini(9F) nv_alloc_init(9F)
nvlist_add_boolean_array(9F) nvlist_add_boolean_value(9F)
nvlist_add_boolean(9F) nvlist_add_byte_array(9F)
nvlist_add_byte(9F) nvlist_add_int16_array(9F)
nvlist_add_int16(9F) nvlist_add_int32_array(9F)
nvlist_add_int32(9F) nvlist_add_int64_array(9F)
nvlist_add_int64(9F) nvlist_add_int8_array(9F)
nvlist_add_int8(9F) nvlist_add_nvlist_array(9F)
nvlist_add_nvlist(9F) nvlist_add_nvpair(9F)
nvlist_add_string_array(9F) nvlist_add_string(9F)
nvlist_add_uint16_array(9F) nvlist_add_uint16(9F)
nvlist_add_uint32_array(9F) nvlist_add_uint32(9F)
nvlist_add_uint64_array(9F) nvlist_add_uint64(9F)
nvlist_add_uint8_array(9F) nvlist_add_uint8(9F)
nvlist_alloc(9F) nvlist_dup(9F)
nvlist_exists(9F) nvlist_free(9F)
nvlist_lookup_boolean_array(9F) nvlist_lookup_boolean_value(9F)
nvlist_lookup_boolean(9F) nvlist_lookup_byte_array(9F)
nvlist_lookup_byte(9F) nvlist_lookup_int16_array(9F)
nvlist_lookup_int16(9F) nvlist_lookup_int32_array(9F)
nvlist_lookup_int32(9F) nvlist_lookup_int64_array(9F)
nvlist_lookup_int64(9F) nvlist_lookup_int8_array(9F)
nvlist_lookup_int8(9F) nvlist_lookup_nvlist_array(9F)
nvlist_lookup_nvlist(9F) nvlist_lookup_nvpair(9F)
nvlist_lookup_pairs(9F) nvlist_lookup_string_array(9F)
nvlist_lookup_string(9F) nvlist_lookup_uint16_array(9F)
nvlist_lookup_uint16(9F) nvlist_lookup_uint32_array(9F)
nvlist_lookup_uint32(9F) nvlist_lookup_uint64_array(9F)
nvlist_lookup_uint64(9F) nvlist_lookup_uint8_array(9F)
nvlist_lookup_uint8(9F) nvlist_merge(9F)
nvlist_next_nvpair(9F) nvlist_pack(9F)
nvlist_remove_all(9F) nvlist_remove(9F)
nvlist_size(9F) nvlist_t(9F)
nvlist_unpack(9F) nvlist_xalloc(9F)
nvlist_xdup(9F) nvlist_xpack(9F)
nvlist_xunpack(9F) nvpair_name(9F)
nvpair_type(9F) nvpair_value_boolean_array(9F)
nvpair_value_byte_array(9F) nvpair_value_byte(9F)
nvpair_value_int16_array(9F) nvpair_value_int16(9F)
nvpair_value_int32_array(9F) nvpair_value_int32(9F)
nvpair_value_int64_array(9F) nvpair_value_int64(9F)
nvpair_value_int8_array(9F) nvpair_value_int8(9F)
nvpair_value_nvlist_array(9F) nvpair_value_nvlist(9F)
nvpair_value_string_array(9F) nvpair_value_string(9F)
nvpair_value_uint16_array(9F) nvpair_value_uint16(9F)
nvpair_value_uint32_array(9F) nvpair_value_uint32(9F)
nvpair_value_uint64_array(9F) nvpair_value_uint64(9F)
nvpair_value_uint8_array(9F) nvpair_value_uint8(9F)

Identifier Management


A common challenge in the kernel is the management of a series of different
IDs. There are three different families of routines for managing
identifiers presented here, but we recommend the use of the
id_space_create(9F) and id_alloc(9F) family for new use cases. The ID
space can cover all or a subset of the 32-bit integer space and provides
different allocation strategies for this.

Due to the current implementation, callers should generally prefer the non-
sleeping variants because the sleeping ones are not cancellable (currently
this is backed by vmem, but this should not be assumed and may change in
the future).

id_alloc_nosleep(9F) id_alloc_specific_nosleep(9F)
id_alloc(9F) id_allocff_nosleep(9F)
id_allocff(9F) id_free(9F)
id_space_create(9F) id_space_destroy(9F)
id_space_extend(9F) id_space(9F)
id32_alloc(9F) id32_free(9F)
id32_lookup(9F) rmalloc_wait(9F)
rmalloc(9F) rmallocmap_wait(9F)
rmallocmap(9F) rmfree(9F)
rmfreemap(9F)

Bit Manipulation Routines


Many device drivers that are working with registers often need to get a
specific range of bits out of an integer. These functions provide safe
ways to set (bitset) and extract (bitx) bit ranges, as well as modify an
integer to remove a set of bits entirely (bitdel). Using these functions
is preferred to constructing manual masks and shifts particularly when a
programming manual for a device is specified in ranges of bits. On debug
builds, these provide extra checking to try and catch programmer error.

bitdel64(9F) bitset8(9F)
bitset16(9F) bitset32(9F)
bitset64(9F) bitx8(9F)
bitx16(9F) bitx32(9F)
bitx64(9F)
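
As an illustration of what these routines replace, here is a minimal
userland sketch of the mask-and-shift logic for a 32-bit value. The
sketch_* functions are hypothetical stand-ins, and the argument order
(value, high bit, low bit, both inclusive) is an assumption made for this
example. Unlike the real functions, the sketch does no checking of its
arguments.

```c
#include <stdint.h>

/* Extract bits high..low (inclusive) of reg, shifted down to bit 0. */
static uint32_t
sketch_bitx32(uint32_t reg, unsigned int high, unsigned int low)
{
	uint32_t mask = (high == 31) ? UINT32_MAX :
	    ((UINT32_C(1) << (high + 1)) - 1);

	return ((reg & mask) >> low);
}

/* Return reg with bits high..low (inclusive) replaced by val. */
static uint32_t
sketch_bitset32(uint32_t reg, unsigned int high, unsigned int low,
    uint32_t val)
{
	unsigned int width = high - low + 1;
	uint32_t field = (width == 32) ? UINT32_MAX :
	    ((UINT32_C(1) << width) - 1);
	uint32_t mask = field << low;

	return ((reg & ~mask) | ((val << low) & mask));
}
```

For a register documented as "bits 15:8 hold the queue depth", a driver
reads the field with a single call that names 15 and 8 directly, which is
easier to check against a datasheet than a hand-written mask.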

Synchronization Primitives


The kernel provides a set of basic synchronization primitives that can be
used by the system. These include mutexes, condition variables,
reader/writer locks, and semaphores. When creating mutexes and
reader/writer locks, the kernel requires that one pass in the interrupt
priority of a mutex if it will be used in interrupt context. This is
required so the kernel can determine the correct underlying type of lock to
use. This ensures that if for some reason a mutex needs to be used in
high-level interrupt context, the kernel will use a spin lock, but
otherwise can use the standard adaptive mutex that might block. For
developers familiar with other operating systems, this is somewhat
different in that the consumer generally does not need to figure out this
level of detail, which is why no separate spin lock interface is listed
here.

In addition, condition variables provide variants for waiting that also
detect that a signal has been delivered. These variants are particularly
useful when writing character device operations for device drivers as they
allow users the chance to cancel an operation and not be blocked
indefinitely on something that may not occur. These _sig variants should
generally be preferred where applicable.

The kernel also provides memory barrier primitives. See the Memory
Barriers section for more information. There is no need to use manual
memory barriers when using the synchronization primitives. The
synchronization primitives guarantee that the appropriate barriers are
present to ensure coherency while the lock is held.

cv_broadcast(9F) cv_destroy(9F)
cv_init(9F) cv_reltimedwait_sig(9F)
cv_reltimedwait(9F) cv_signal(9F)
cv_timedwait_sig(9F) cv_timedwait(9F)
cv_wait_sig(9F) cv_wait(9F)
ddi_enter_critical(9F) ddi_exit_critical(9F)
mutex_destroy(9F) mutex_enter(9F)
mutex_exit(9F) mutex_init(9F)
mutex_owned(9F) mutex_tryenter(9F)
rw_destroy(9F) rw_downgrade(9F)
rw_enter(9F) rw_exit(9F)
rw_init(9F) rw_read_locked(9F)
rw_tryenter(9F) rw_tryupgrade(9F)
sema_destroy(9F) sema_init(9F)
sema_p_sig(9F) sema_p(9F)
sema_tryp(9F) sema_v(9F)
semaphore(9F)

Atomic Operations


This group of functions provides a general way to perform atomic operations
on integers of different sizes and explicit types. The atomic_ops(9F)
manual page describes the different classes of functions in more detail,
but there are functions that take care of using the CPU's instructions for
addition, compare and swap, and more. If data is being protected and only
accessed under a synchronization primitive such as a mutex or reader-writer
lock, then there isn't a reason to use an atomic operation for that data,
generally speaking.
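
The compare and swap pattern mentioned above usually takes the form of a
retry loop. The following is a userland analogue written with C11
<stdatomic.h>, not the kernel interfaces; in a driver, a function from the
atomic_cas family listed below would take the place of
atomic_compare_exchange_weak(). The function sketch_inc_below() is a
hypothetical name used only for this example.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Atomically increment *counter, but only while it is below limit.
 * Returns 1 if the increment happened and 0 if the limit was reached.
 */
static int
sketch_inc_below(_Atomic uint32_t *counter, uint32_t limit)
{
	uint32_t old = atomic_load(counter);

	do {
		if (old >= limit)
			return (0);
		/*
		 * On failure the compare-exchange stores the value it
		 * found into old, so the loop re-checks the limit
		 * against fresh data before trying again.
		 */
	} while (!atomic_compare_exchange_weak(counter, &old, old + 1));

	return (1);
}
```

If the counter were only ever touched while holding a mutex, a plain
increment under that mutex would do and the loop above would be
unnecessary.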

atomic_add_8_nv(9F) atomic_add_8(9F)
atomic_add_16_nv(9F) atomic_add_16(9F)
atomic_add_32_nv(9F) atomic_add_32(9F)
atomic_add_64_nv(9F) atomic_add_64(9F)
atomic_add_char_nv(9F) atomic_add_char(9F)
atomic_add_int_nv(9F) atomic_add_int(9F)
atomic_add_long_nv(9F) atomic_add_long(9F)
atomic_add_ptr_nv(9F) atomic_add_ptr(9F)
atomic_add_short_nv(9F) atomic_add_short(9F)
atomic_and_8_nv(9F) atomic_and_8(9F)
atomic_and_16_nv(9F) atomic_and_16(9F)
atomic_and_32_nv(9F) atomic_and_32(9F)
atomic_and_64_nv(9F) atomic_and_64(9F)
atomic_and_uchar_nv(9F) atomic_and_uchar(9F)
atomic_and_uint_nv(9F) atomic_and_uint(9F)
atomic_and_ulong_nv(9F) atomic_and_ulong(9F)
atomic_and_ushort_nv(9F) atomic_and_ushort(9F)
atomic_cas_16(9F) atomic_cas_32(9F)
atomic_cas_64(9F) atomic_cas_8(9F)
atomic_cas_ptr(9F) atomic_cas_uchar(9F)
atomic_cas_uint(9F) atomic_cas_ulong(9F)
atomic_cas_ushort(9F) atomic_clear_long_excl(9F)
atomic_dec_8_nv(9F) atomic_dec_8(9F)
atomic_dec_16_nv(9F) atomic_dec_16(9F)
atomic_dec_32_nv(9F) atomic_dec_32(9F)
atomic_dec_64_nv(9F) atomic_dec_64(9F)
atomic_dec_ptr_nv(9F) atomic_dec_ptr(9F)
atomic_dec_uchar_nv(9F) atomic_dec_uchar(9F)
atomic_dec_uint_nv(9F) atomic_dec_uint(9F)
atomic_dec_ulong_nv(9F) atomic_dec_ulong(9F)
atomic_dec_ushort_nv(9F) atomic_dec_ushort(9F)
atomic_inc_8_nv(9F) atomic_inc_8(9F)
atomic_inc_16_nv(9F) atomic_inc_16(9F)
atomic_inc_32_nv(9F) atomic_inc_32(9F)
atomic_inc_64_nv(9F) atomic_inc_64(9F)
atomic_inc_ptr_nv(9F) atomic_inc_ptr(9F)
atomic_inc_uchar_nv(9F) atomic_inc_uchar(9F)
atomic_inc_uint_nv(9F) atomic_inc_uint(9F)
atomic_inc_ulong_nv(9F) atomic_inc_ulong(9F)
atomic_inc_ushort_nv(9F) atomic_inc_ushort(9F)
atomic_or_8_nv(9F) atomic_or_8(9F)
atomic_or_16_nv(9F) atomic_or_16(9F)
atomic_or_32_nv(9F) atomic_or_32(9F)
atomic_or_64_nv(9F) atomic_or_64(9F)
atomic_or_uchar_nv(9F) atomic_or_uchar(9F)
atomic_or_uint_nv(9F) atomic_or_uint(9F)
atomic_or_ulong_nv(9F) atomic_or_ulong(9F)
atomic_or_ushort_nv(9F) atomic_or_ushort(9F)
atomic_set_long_excl(9F) atomic_swap_8(9F)
atomic_swap_16(9F) atomic_swap_32(9F)
atomic_swap_64(9F) atomic_swap_ptr(9F)
atomic_swap_uchar(9F) atomic_swap_uint(9F)
atomic_swap_ulong(9F) atomic_swap_ushort(9F)

Memory Barriers


The kernel provides general purpose memory barriers that can be used when
required. In general, when using items described in the Synchronization
Primitives section, these are not required.

membar_consumer(9F) membar_enter(9F)
membar_exit(9F) membar_producer(9F)

Virtual Memory and Pages


All platforms that the operating system supports have some form of virtual
memory which is managed in units of pages. The page size varies between
architectures and platforms. For example, the smallest x86 page size is 4
KiB while SPARC traditionally used 8 KiB pages. These functions can be
used to convert between pages and bytes.

btop(9F) btopr(9F)
ddi_btop(9F) ddi_btopr(9F)
ddi_ptob(9F) ptob(9F)
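
The arithmetic behind these conversions can be sketched as follows,
assuming a 4 KiB page for the sake of the example. The sketch_* names and
the fixed SKETCH_PAGESHIFT are hypothetical; real code must use the
functions above, since the page size is not a constant across platforms.

```c
#include <stddef.h>

#define	SKETCH_PAGESHIFT	12	/* assumed 4 KiB page */
#define	SKETCH_PAGESIZE		((size_t)1 << SKETCH_PAGESHIFT)

/* Bytes to pages, rounding down. */
static size_t
sketch_btop(size_t bytes)
{
	return (bytes >> SKETCH_PAGESHIFT);
}

/* Bytes to pages, rounding up to cover a partial page. */
static size_t
sketch_btopr(size_t bytes)
{
	return ((bytes + SKETCH_PAGESIZE - 1) >> SKETCH_PAGESHIFT);
}

/* Pages to bytes. */
static size_t
sketch_ptob(size_t pages)
{
	return (pages << SKETCH_PAGESHIFT);
}
```

Under this assumption a 5000 byte buffer is one whole page when rounded
down and needs two pages when rounded up, which is the difference between
the btop and btopr variants.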

Module and Device Framework


These functions are used as part of implementing kernel modules and
registering device drivers with the various kernel frameworks. There are also
functions here that are suitable for use in the dev_ops(9S), cb_ops(9S),
etc. structures and for interrogating module information.

The mod_install(9F) and mod_remove(9F) functions are used during a driver's
_init(9E) and _fini(9E) functions.

There are two different ways that drivers often manage their instance state
which is created during attach(9E). The first is the use of
ddi_set_driver_private(9F) and ddi_get_driver_private(9F). This stores a
driver-specific value on the dev_info_t structure which allows it to be
used during other operations. Some device driver frameworks may use this
themselves, making this unavailable to the driver.

The other path is to use the soft state suite of functions which
dynamically grows to cover the number of instances of a device that exist.
The soft state is generally initialized in the _init(9E) entry point with
ddi_soft_state_init(9F) and then instances are allocated and freed during
attach(9E) and detach(9E) with ddi_soft_state_zalloc(9F) and
ddi_soft_state_free(9F), and then retrieved with ddi_get_soft_state(9F).

ddi_get_driver_private(9F) ddi_get_soft_state(9F)
ddi_modclose(9F) ddi_modopen(9F)
ddi_modsym(9F) ddi_no_info(9F)
ddi_report_dev(9F) ddi_set_driver_private(9F)
ddi_soft_state_fini(9F) ddi_soft_state_free(9F)
ddi_soft_state_init(9F) ddi_soft_state_zalloc(9F)
mod_info(9F) mod_install(9F)
mod_modname(9F) mod_remove(9F)
nochpoll(9F) nodev(9F)
nulldev(9F)

Device Tree Information


Devices are organized into a tree that is partially seeded by the platform
based on information discovered at boot and augmented with additional
information at runtime. Every instance of a device driver is given a
dev_info_t * (device information) data structure which corresponds to
information about an instance and has a place in the tree. When a driver
requests an operation, such as allocating memory for DMA, that request is
passed up the tree and modified. The same is true for other things like
interrupts, event notifications, or properties.

There are many different informational properties about a device driver.
For example, ddi_driver_name(9F) returns the name of the device driver,
ddi_get_name(9F) returns the name of the node in the tree,
ddi_get_parent(9F) returns a node's parent, and ddi_get_instance(9F)
returns the instance number of a specific driver.

There are a series of properties that exist on the tree, the exact set of
which depend on the class of the device and are often documented in a
specific device class's manual. For example, the "reg" property is used
for PCI and PCIe devices to describe the various base address registers,
their types, and related details, which are documented in pci(5).

When getting a property, one can either constrain it to the current instance
or ask for a parent to try to look up the property. Which mode is
appropriate depends on the specific class of driver, its parent, and the
property.

Using a dev_info_t * pointer has to be done carefully. When a device
driver is in any of its dev_ops(9S), cb_ops(9S), or similar callback
functions that it has registered with the kernel, then it can always safely
use its own dev_info_t and those of any parents it discovers through
ddi_get_parent(9F). However, it cannot assume the validity of any siblings
or children unless there are other circumstances that guarantee that they
will not disappear. In the broader kernel, one should not assume that it
is safe to use a given dev_info_t * structure without the appropriate NDI
(nexus driver interface) hold having been applied.

ddi_binding_name(9F) ddi_dev_is_sid(9F)
ddi_driver_major(9F) ddi_driver_name(9F)
ddi_get_devstate(9F) ddi_get_instance(9F)
ddi_get_name(9F) ddi_get_parent(9F)
ddi_getlongprop_buf(9F) ddi_getlongprop(9F)
ddi_getprop(9F) ddi_getproplen(9F)
ddi_node_name(9F) ddi_prop_create(9F)
ddi_prop_exists(9F) ddi_prop_free(9F)
ddi_prop_get_int(9F) ddi_prop_get_int64(9F)
ddi_prop_lookup_byte_array(9F) ddi_prop_lookup_int_array(9F)
ddi_prop_lookup_int64_array(9F) ddi_prop_lookup_string_array(9F)
ddi_prop_lookup_string(9F) ddi_prop_lookup(9F)
ddi_prop_modify(9F) ddi_prop_op(9F)
ddi_prop_remove_all(9F) ddi_prop_remove(9F)
ddi_prop_undefine(9F) ddi_prop_update_byte_array(9F)
ddi_prop_update_int_array(9F) ddi_prop_update_int(9F)
ddi_prop_update_int64_array(9F) ddi_prop_update_int64(9F)
ddi_prop_update_string_array(9F) ddi_prop_update_string(9F)
ddi_prop_update(9F) ddi_root_node(9F)
ddi_slaveonly(9F)

Copying Data to and from Userland


The kernel operates in a different context from userland. One does not
simply access user memory. This is enforced either by the architecture's
memory model, where user address space isn't even present in the kernel's
virtual address space or by architectural mechanisms such as Supervisor
Mode Access Protect (SMAP) on x86.

To facilitate accessing memory, the kernel provides a few routines that can
be used. In most contexts the main thing to use is ddi_copyin(9F) and
ddi_copyout(9F). These will safely dereference addresses and ensure that
the address is appropriate depending on whether this is coming from the
user or kernel. When operating with the kernel's uio_t structure, which is
mostly used when processing read and write requests, uiomove(9F) is the
go-to function instead.

When reading data from userland into the kernel, there is another concern:
the data model. The most common place this comes up is in an ioctl(9E)
handler or other places where the kernel is operating on data that isn't
fixed size. Particularly in C, though this applies to other languages,
structures and unions vary in the size and alignment requirements between
32-bit and 64-bit processes. The same even applies if one uses pointers or
the long, size_t, or similar types in C. In supported 32-bit and 64-bit
environments these types are 4 and 8 bytes respectively. To account for
this, when data is not fixed size between all data models, the driver must
look at the data model of the process it is copying data from.

The simplest way to solve this problem is to try to make the data structure
the same across the different models. It's not sufficient to just use the
same structure definition and fixed size types as the alignment and padding
between the two can vary. For example, the alignment of a 64-bit integer
like a uint64_t can change between a 32-bit and 64-bit data model. One way
to check for the data structures being identical is to leverage the
ctfdiff(1) program, generally with the -I option.

However, there are times when a structure simply can't be the same, such as
when we're encoding a pointer into the structure or a type like size_t.
When this happens, the most natural way to handle it is to use the
ddi_model_convert_from(9F) function, which can determine the appropriate
model from the ioctl's arguments. This provides a natural way to copy a
structure in and out in the appropriate data model and convert it at those
points to the kernel's native form.
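The conversion step can be sketched in plain C as follows. This is a
userland model rather than driver code: the demo_ioc32_t and demo_ioc_t
structures and both helper functions are hypothetical names invented for
illustration. In a driver, the choice between the two layouts would be
made from the result of ddi_model_convert_from(9F), and the structure
would be moved with ddi_copyin(9F) and ddi_copyout(9F).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical ioctl argument as laid out by a 32-bit (ILP32) caller:
 * the pointer and size members are each 4 bytes wide.
 */
typedef struct demo_ioc32 {
	uint32_t	di_buf;		/* user address of a buffer */
	uint32_t	di_len;		/* length of that buffer */
	uint32_t	di_flags;
} demo_ioc32_t;

/* The same hypothetical argument in the native form used internally. */
typedef struct demo_ioc {
	uintptr_t	di_buf;
	size_t		di_len;
	uint32_t	di_flags;
} demo_ioc_t;

/* Widen a 32-bit request into native form after it has been copied in. */
void
demo_ioc_from32(const demo_ioc32_t *src, demo_ioc_t *dst)
{
	dst->di_buf = (uintptr_t)src->di_buf;
	dst->di_len = (size_t)src->di_len;
	dst->di_flags = src->di_flags;
}

/*
 * Narrow a native result for a 32-bit caller before it is copied out.
 * Returns 0 on success or -1 if a value does not fit in 32 bits.
 */
int
demo_ioc_to32(const demo_ioc_t *src, demo_ioc32_t *dst)
{
	if ((uint64_t)src->di_buf > UINT32_MAX ||
	    (uint64_t)src->di_len > UINT32_MAX)
		return (-1);

	dst->di_buf = (uint32_t)src->di_buf;
	dst->di_len = (uint32_t)src->di_len;
	dst->di_flags = src->di_flags;
	return (0);
}
```

Keeping the range check in the narrowing direction is the important
part: a native value that cannot be represented for the caller should be
treated as an error rather than silently truncated.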

An alternate way to approach the data model is to use the STRUCT_DECL(9F)
family of macros. However, as this requires wrapping every access to every
member, the ddi_model_convert_from(9F) approach is often preferred, with
the driver taking care of converting values and ensuring that limits aren't
exceeded at the end.

bp_copyin(9F) bp_copyout(9F)
copyin(9F) copyout(9F)
ddi_copyin(9F) ddi_copyout(9F)
ddi_model_convert_from(9F) SIZEOF_PTR(9F)
SIZEOF_STRUCT(9F) STRUCT_BUF(9F)
STRUCT_DECL(9F) STRUCT_FADDR(9F)
STRUCT_FGET(9F) STRUCT_FGETP(9F)
STRUCT_FSET(9F) STRUCT_FSETP(9F)
STRUCT_HANDLE(9F) STRUCT_INIT(9F)
STRUCT_SET_HANDLE(9F) STRUCT_SIZE(9F)
uiomove(9F) ureadc(9F)
uwritec(9F)

Device Register Setup and Access


The kernel abstracts out accessing registers on a device on behalf of
drivers. This allows a similar set of interfaces to be used whether the
registers are found within a PCI BAR, utilizing I/O ports, memory mapped
registers, or some other scheme. Devices with registers all have a "reg"
property that is set up by their parent device, generally a kernel
framework as is the case for PCIe devices, and the meaning is a contract
between the two. Register sets are identified by a numeric ID, which
varies based on the device type. For example, the first BAR of a PCI
device is defined as register set 1. On the other hand, the AMD GPIO
controller might have three register sets because of how the hardware
design splits them up. The meaning of the registers and their semantics is
still device-specific. The kernel doesn't know how to interpret the actual
registers of, say, a PCIe device, just that they exist.

To begin with register setup, one often first looks at the number of
register sets that exist and their size. Most PCI-based device drivers
will skip calling ddi_dev_nregs(9F) and will just move straight to calling
ddi_dev_regsize(9F) to determine the size of a register set that they are
interested in. To actually map the registers, a device driver will call
ddi_regs_map_setup(9F) which requires both a register set and a series of
attributes and returns an access handle that is used to actually read and
write the registers. When setting up registers, one must have a
corresponding ddi_device_acc_attr_t structure which is used to define what
endianness the register set is in, whether any kind of reordering is
allowed (if in doubt specify DDI_STRICTORDER_ACC), and whether any
particular error handling is being used. The structure and all of its
different options are described in ddi_device_acc_attr(9S).

Once a register handle is obtained, then it's easy to read and write the
register space. Functions are organized based on the size of the access.
Most situations call for the use of the ddi_get8(9F), ddi_get16(9F),
ddi_get32(9F), and ddi_get64(9F) functions to read a register and the
ddi_put8(9F), ddi_put16(9F), ddi_put32(9F), and ddi_put64(9F) functions to
write a register. While the ddi_io_ and ddi_mem_ families of functions are
listed below, they are not generally needed and are mostly present for
compatibility. The kernel will automatically perform the appropriate type
of register access for the device type in question.
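To make the endianness attribute concrete, the following is a small
userland model of what a little-endian 32-bit register access means. It
is not how the kernel implements its access functions; demo_get32_le()
and demo_put32_le() are hypothetical helpers that operate on an ordinary
byte array standing in for a register set. With a little-endian
attribute, the ddi_get32(9F) and ddi_put32(9F) family takes care of the
equivalent conversion on the driver's behalf.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value at a byte offset, on any host CPU. */
uint32_t
demo_get32_le(const uint8_t *base, size_t off)
{
	const uint8_t *p = base + off;

	return ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

/* Write a 32-bit value in little-endian byte order at a byte offset. */
void
demo_put32_le(uint8_t *base, size_t off, uint32_t val)
{
	uint8_t *p = base + off;

	p[0] = (uint8_t)(val & 0xff);
	p[1] = (uint8_t)((val >> 8) & 0xff);
	p[2] = (uint8_t)((val >> 16) & 0xff);
	p[3] = (uint8_t)((val >> 24) & 0xff);
}
```

Because the byte order is fixed by the helpers rather than by the host,
the same driver logic would produce the same register contents on a
big-endian and a little-endian CPU.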

Once a register set is no longer being used, the ddi_regs_map_free(9F)
function should be used to release resources. In most cases, this happens
while executing the detach(9E) entry point.

ddi_dev_nregs(9F) ddi_dev_regsize(9F)
ddi_device_copy(9F) ddi_device_zero(9F)
ddi_regs_map_free(9F) ddi_regs_map_setup(9F)
ddi_get8(9F) ddi_get16(9F)
ddi_get32(9F) ddi_get64(9F)
ddi_io_get8(9F) ddi_io_get16(9F)
ddi_io_get32(9F) ddi_io_put8(9F)
ddi_io_put16(9F) ddi_io_put32(9F)
ddi_io_rep_get8(9F) ddi_io_rep_get16(9F)
ddi_io_rep_get32(9F) ddi_io_rep_put8(9F)
ddi_io_rep_put16(9F) ddi_io_rep_put32(9F)
ddi_map_regs(9F) ddi_mem_get8(9F)
ddi_mem_get16(9F) ddi_mem_get32(9F)
ddi_mem_get64(9F) ddi_mem_put8(9F)
ddi_mem_put16(9F) ddi_mem_put32(9F)
ddi_mem_put64(9F) ddi_mem_rep_get8(9F)
ddi_mem_rep_get16(9F) ddi_mem_rep_get32(9F)
ddi_mem_rep_get64(9F) ddi_mem_rep_put8(9F)
ddi_mem_rep_put16(9F) ddi_mem_rep_put32(9F)
ddi_mem_rep_put64(9F) ddi_peek8(9F)
ddi_peek16(9F) ddi_peek32(9F)
ddi_peek64(9F) ddi_poke8(9F)
ddi_poke16(9F) ddi_poke32(9F)
ddi_poke64(9F) ddi_put8(9F)
ddi_put16(9F) ddi_put32(9F)
ddi_put64(9F) ddi_rep_get8(9F)
ddi_rep_get16(9F) ddi_rep_get32(9F)
ddi_rep_get64(9F) ddi_rep_put8(9F)
ddi_rep_put16(9F) ddi_rep_put32(9F)
ddi_rep_put64(9F)

DMA Related Functions


Most high-performance devices provide first-class support for DMA (direct
memory access). DMA allows a transfer between a device and memory to occur
asynchronously and generally without a thread's specific involvement.
Today, most DMA is provided directly by devices and the corresponding
device scheme. Take PCI and PCI Express for example. The idea of DMA is
built into the PCIe standard, so basic support for it exists and there
isn't a lot of special programming required. However, this hasn't always
been true, and there are still some cases where there is a third-party DMA
engine. If we consider the PCIe example, the PCIe device directly performs
reads and writes to main memory on its own. However, in the third-party
case, there is a distinct controller that is neither the device nor memory
that facilitates this, which is called a DMA engine. For the most part,
DMA engines are not something that needs to be thought about on the
platforms that illumos runs on; however, they still exist in some embedded
and related contexts.

The first thing that a driver needs to do to set up DMA is to understand
the constraints of the device and bus. These constraints are described in
a series of attributes in the ddi_dma_attr_t structure which is defined in
ddi_dma_attr(9S). Attributes exist because different devices, and
sometimes different memory uses with a device, have different requirements
for memory. A simple example of this is that not all devices can accept
memory addresses that are 64 bits wide and may have to be constrained to
the lower 32 bits of memory. Another common constraint is how this memory
is chunked up. Some devices may require that all of the DMA memory be
contiguous, while others can allow that to be broken up into, say, up to 4
or 8 different regions.

When memory is allocated for DMA it isn't immediately mapped into the
kernel's address space. The addresses that describe a DMA address are
defined in a DMA cookie, several of which may make up a request. However,
those addresses are always physical addresses or addresses that are
virtualized by an IOMMU. There are some cases where the kernel or a driver
needs to be able to access that memory, such as memory that represents a
networking packet. The IP stack will expect to be able to actually read
the data it's given.

To begin with allocating DMA memory, a driver first fills out its attribute
structure. Once that's ready, the DMA allocation process can begin. This
starts off by a driver calling ddi_dma_alloc_handle(9F). This handle is
used throughout the lifetime of a given DMA memory buffer, but it can be
used across multiple operations that a device or the kernel may perform.
The next step is to actually request that the kernel allocate some amount
of memory in the kernel for this DMA request. This phase actually
allocates addresses in virtual address space for the activity and also
requires a register attribute object that is discussed in Device Register
Setup and Access. Armed with this, a driver can now call
ddi_dma_mem_alloc(9F) to specify how much memory it is looking for. If
this is successful, a virtual address, the actual length of the region, and
an access handle will be returned.

At this point, the virtual address region is present. Most drivers will
access this virtual address range directly and will ignore the register
access handle. The side effect of this is that they will handle all
endianness issues with the memory region themselves. If the driver would
prefer to go through the handle, then it can use the register access
functions discussed earlier.

Before the memory can be programmed into the device, it must be bound to a
series of physical addresses or addresses virtualized by an IOMMU. While
the kernel presents the illusion of a single consistent virtual address
range for applications, the physical reality can be quite different. When
the driver is ready it calls ddi_dma_addr_bind_handle(9F) to create the
mapping to well known physical addresses.

These addresses are stored in a series of cookies. A driver can determine
the number of cookies for a given request by utilizing its DMA handle and
calling ddi_dma_ncookies(9F) and then pairing that with
ddi_dma_cookie_get(9F). These DMA cookies will not change and can be used
time and time again until ddi_dma_unbind_handle(9F) is called. With this
information in hand, a physical device can be programmed with these
addresses and let loose to perform I/O.
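As a rough illustration of that last step, the sketch below turns a list
of cookies into entries for a device descriptor ring. Everything here is
a stand-in: demo_cookie_t only models the address and size that a real
DMA cookie carries, and demo_desc_t, DEMO_DESC_LAST, and
demo_fill_descs() describe an imaginary device. A real driver would
obtain the cookies with ddi_dma_cookie_get(9F) or a related function and
would write descriptors in the layout and byte order its hardware
defines.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a DMA cookie. */
typedef struct demo_cookie {
	uint64_t	dc_addr;	/* device-visible address */
	size_t		dc_size;	/* length in bytes */
} demo_cookie_t;

/* Descriptor format of an imaginary device. */
typedef struct demo_desc {
	uint32_t	dd_addr_lo;
	uint32_t	dd_addr_hi;
	uint32_t	dd_len;
	uint32_t	dd_flags;
} demo_desc_t;

#define	DEMO_DESC_LAST	0x1U	/* marks the final segment of a request */

/*
 * Fill one descriptor per cookie. Returns the number of descriptors used,
 * or -1 if there are not enough descriptors or a cookie cannot be encoded.
 */
int
demo_fill_descs(const demo_cookie_t *cookies, unsigned int ncookies,
    demo_desc_t *descs, unsigned int ndescs)
{
	unsigned int i;

	if (ncookies == 0 || ncookies > ndescs)
		return (-1);

	for (i = 0; i < ncookies; i++) {
		if (cookies[i].dc_size == 0 ||
		    (uint64_t)cookies[i].dc_size > UINT32_MAX)
			return (-1);

		descs[i].dd_addr_lo =
		    (uint32_t)(cookies[i].dc_addr & 0xffffffffU);
		descs[i].dd_addr_hi = (uint32_t)(cookies[i].dc_addr >> 32);
		descs[i].dd_len = (uint32_t)cookies[i].dc_size;
		descs[i].dd_flags =
		    (i == ncookies - 1) ? DEMO_DESC_LAST : 0;
	}

	return ((int)ncookies);
}
```

The bounds check against the ring size mirrors why the DMA attributes
limit the number of segments: the device can only accept as many regions
as it has room to describe.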

When performing I/O to and from a device, synchronization is vitally
important: it ensures that the actual state in memory is coherent with the
rest of the CPU's internal structures such as caches. In general, a given
DMA request is only going in one direction: for a device or for the local
CPU. In either case, the ddi_dma_sync(9F) function must be called. When
the kernel has written to a region of DMA memory, it must call
ddi_dma_sync(9F) after it is done writing and before it triggers the
device. When the device has told the kernel that some activity has
completed, the kernel must call ddi_dma_sync(9F) before it checks the
results.

Some DMA operations utilize what are called DMA windows. The most common
consumer is something like a disk device where DMA operations to a given
series of sectors can be split up into different chunks where as long as
all the transfers are performed, the intermediate states are acceptable.
Put another way, because of how SCSI and SAS commands are designed, block
devices can basically take a given I/O request and break it into multiple
independent I/Os that will equate to the same final item.

When a device supports this mode of operation and it is opted into, then a
DMA allocation may result in the use of DMA windows. This allows for cases
where the kernel can't perform a DMA allocation for the entire request, but
instead can allocate a partial region and then walk through each part one
at a time. This is uncommon outside of block devices and usually also is
related to calling ddi_dma_buf_bind_handle(9F).

ddi_dma_addr_bind_handle(9F) ddi_dma_alloc_handle(9F)
ddi_dma_buf_bind_handle(9F) ddi_dma_burstsizes(9F)
ddi_dma_cookie_get(9F) ddi_dma_cookie_iter(9F)
ddi_dma_cookie_one(9F) ddi_dma_free_handle(9F)
ddi_dma_getwin(9F) ddi_dma_mem_alloc(9F)
ddi_dma_mem_free(9F) ddi_dma_ncookies(9F)
ddi_dma_nextcookie(9F) ddi_dma_numwin(9F)
ddi_dma_set_sbus64(9F) ddi_dma_sync(9F)
ddi_dma_unbind_handle(9F) ddi_dmae_1stparty(9F)
ddi_dmae_alloc(9F) ddi_dmae_disable(9F)
ddi_dmae_enable(9F) ddi_dmae_getattr(9F)
ddi_dmae_getcnt(9F) ddi_dmae_prog(9F)
ddi_dmae_release(9F) ddi_dmae_stop(9F)
ddi_dmae(9F)

Interrupt Handler Related Functions


Interrupts are a central part of the role of device drivers and one of the
things that's important to get right. Interrupts come in different types:
fixed, MSI, and MSI-X. The kinds that are available depend on the device
and the rest of the system. For example, MSI and MSI-X interrupts are
generally specific to PCI and PCI Express devices. To begin the interrupt
allocation process, the first thing a driver needs to do is to discover
what type of interrupts it supports with ddi_intr_get_supported_types(9F).
Then, the driver should work through the supported types, preferring MSI-X,
then MSI, and finally fixed interrupts, and try to allocate interrupts.
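The preference order can be modeled with a small helper. This is only a
sketch: the DEMO_INTR_TYPE_* values are local stand-ins for the
corresponding DDI_INTR_TYPE_* bits so that the example is self-contained,
and demo_next_intr_type() is a hypothetical function. A driver would
feed it the mask returned by ddi_intr_get_supported_types(9F) and record
each type for which allocation failed.

```c
/* Local stand-ins for the interrupt type bits in the DDI headers. */
#define	DEMO_INTR_TYPE_FIXED	0x1
#define	DEMO_INTR_TYPE_MSI	0x2
#define	DEMO_INTR_TYPE_MSIX	0x4

/*
 * Return the most preferred interrupt type that is supported and has not
 * yet been tried, or 0 when every supported type has been exhausted.
 */
int
demo_next_intr_type(int supported, int tried)
{
	static const int pref[] = {
		DEMO_INTR_TYPE_MSIX,
		DEMO_INTR_TYPE_MSI,
		DEMO_INTR_TYPE_FIXED
	};
	unsigned int i;

	for (i = 0; i < sizeof (pref) / sizeof (pref[0]); i++) {
		if ((supported & pref[i]) != 0 && (tried & pref[i]) == 0)
			return (pref[i]);
	}

	return (0);
}
```

A caller would loop, attempting allocation with each returned type and
adding it to the tried mask on failure, until either an allocation
succeeds or the helper returns 0.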

Drivers first need to know how many interrupts they require. For
example, a networking driver may want to have an interrupt made available
for each ring that it has. To discover the number of interrupts available,
the driver should call ddi_intr_get_navail(9F). If there are sufficient
interrupts, it can proceed to actually allocate the interrupts with
ddi_intr_alloc(9F). When allocating interrupts, callers need to check to
see how many interrupts the system actually gave them. Just because an
interrupt is allocated does not mean that it will fire or be ready to use;
there are a series of additional steps that the driver must take.

To enable the interrupt, the driver should first get the interrupt
capabilities with ddi_intr_get_cap(9F) and the priority of the interrupt
with ddi_intr_get_pri(9F). The priority must be used while creating
mutexes and related synchronization primitives that will be used during the
interrupt handler. At this point, the driver can go ahead
and register the functions that will be called with each allocated
interrupt with the ddi_intr_add_handler(9F) function. The arguments can
vary for each allocated interrupt. It is common to have an interrupt-
specific data structure passed in one of the arguments or an interrupt
number, while the other argument is generally the driver's instance-
specific data structure.

At this point, the last step for the interrupt to be made active from the
kernel's perspective is to enable it. This will use either the
ddi_intr_block_enable(9F) or ddi_intr_enable(9F) functions depending on the
interrupt's capabilities. The reason that these are different is because
some interrupt types (MSI) require that all interrupts in a group be
enabled and disabled at the same time. This is indicated with the
DDI_INTR_FLAG_BLOCK flag found in the interrupt's capabilities. Once the
appropriate enable function has been called, interrupts that are generated
by a device will be delivered to the registered function.

It's important to note that there is often device-specific interrupt setup
that is required. While the kernel takes care of updating any pieces of
the processor's interrupt controller, I/O crossbar, or the PCI MSI and MSI-
X capabilities, many devices have device-specific registers that are used
to manage, set up, and acknowledge interrupts. These registers or other
controls are often capable of separately masking interrupts and are
generally what should be used if there are times that you need to
separately enable or disable interrupts such as to poll an I/O ring.

When unwinding interrupts, one needs to work in the reverse order here.
Until ddi_intr_block_disable(9F) or ddi_intr_disable(9F) is called, one
should assume that their interrupt handler will be called. Due to cases
where an interrupt is shared between multiple devices, this can happen even
if the device is quiesced! Only after that is done is it safe to then free
the interrupts with a call to ddi_intr_free(9F).

ddi_add_intr(9F) ddi_add_softintr(9F)
ddi_get_iblock_cookie(9F) ddi_get_soft_iblock_cookie(9F)
ddi_intr_add_handler(9F) ddi_intr_add_softint(9F)
ddi_intr_alloc(9F) ddi_intr_block_disable(9F)
ddi_intr_block_enable(9F) ddi_intr_clr_mask(9F)
ddi_intr_disable(9F) ddi_intr_dup_handler(9F)
ddi_intr_enable(9F) ddi_intr_free(9F)
ddi_intr_get_cap(9F) ddi_intr_get_hilevel_pri(9F)
ddi_intr_get_navail(9F) ddi_intr_get_nintrs(9F)
ddi_intr_get_pending(9F) ddi_intr_get_pri(9F)
ddi_intr_get_softint_pri(9F) ddi_intr_get_supported_types(9F)
ddi_intr_hilevel(9F) ddi_intr_remove_handler(9F)
ddi_intr_remove_softint(9F) ddi_intr_set_cap(9F)
ddi_intr_set_mask(9F) ddi_intr_set_nreq(9F)
ddi_intr_set_pri(9F) ddi_intr_set_softint_pri(9F)
ddi_intr_trigger_softint(9F) ddi_remove_intr(9F)
ddi_remove_softintr(9F) ddi_trigger_softintr(9F)

Minor Nodes


For a device driver to be accessed by a program in user space (or with the
kernel layered device interface), it must create a minor node. Minor
nodes are created under /devices (devfs(4FS)) and are tied to the instance
of a device driver via its dev_info_t. The devfsadm(8) daemon and the /dev
file system (sdev, dev(4FS)) are responsible for creating a coherent set of
names that user programs access. Drivers create these minor nodes using
the ddi_create_minor_node(9F) function listed below.

In UNIX tradition, character, block, and STREAMS device special files are
identified by a major and minor number. All instances of a given driver
share the same major number, which means that a device driver must
coordinate the minor number space across all instances. While a minor node
is created with a fixed minor number, it is possible to change the minor
number while processing an open(9E) call, allowing subsequent character
device operations to uniquely identify a particular caller. This is
usually referred to as a driver that "clones".

When drivers aren't performing cloning, the minor number used when
creating the minor node is usually some fixed offset or multiple of the
driver's instance number. When a cloning driver needs to allocate and
manage a minor number space, it usually leverages an ID space whose IDs
are in the range from 0 through MAXMIN32. There are several different
strategies for tracking data structures as they relate to minor numbers.
Sometimes, the soft state functionality is used. Others might keep an AVL
tree around or tie the data to some other data structure. The method
chosen often depends on the specifics of the implementation and its broader
context.

The dev_t type represents the combined major and minor number. It can
be taken apart with the getmajor(9F) and getminor(9F) functions and then
reconstructed with the makedevice(9F) function.
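The relationship between instances, minor numbers, and the combined
device number can be modeled as below. This is a userland sketch with
hypothetical names throughout. In particular, DEMO_MINORS_PER_INST and
the 32-bit split used by demo_makedevice() are assumptions made for
illustration; drivers should rely on makedevice(9F), getmajor(9F), and
getminor(9F) rather than on any particular bit layout.

```c
#include <stdint.h>

/* Assumed number of minor nodes that each instance creates. */
#define	DEMO_MINORS_PER_INST	4U

/* Assumed width of the minor portion of the model device number. */
#define	DEMO_MINOR_BITS		32

typedef uint64_t demo_dev_t;

/* Minor number as a multiple of the instance plus a per-instance index. */
uint32_t
demo_minor(uint32_t inst, uint32_t idx)
{
	return (inst * DEMO_MINORS_PER_INST + idx);
}

/* Recover the instance number from a minor number. */
uint32_t
demo_minor_to_inst(uint32_t minor)
{
	return (minor / DEMO_MINORS_PER_INST);
}

/* Recover the per-instance index from a minor number. */
uint32_t
demo_minor_to_idx(uint32_t minor)
{
	return (minor % DEMO_MINORS_PER_INST);
}

/* Model of combining a major and minor number into one value. */
demo_dev_t
demo_makedevice(uint32_t major, uint32_t minor)
{
	return (((demo_dev_t)major << DEMO_MINOR_BITS) | minor);
}

/* Model of extracting the major number. */
uint32_t
demo_getmajor(demo_dev_t dev)
{
	return ((uint32_t)(dev >> DEMO_MINOR_BITS));
}

/* Model of extracting the minor number. */
uint32_t
demo_getminor(demo_dev_t dev)
{
	return ((uint32_t)(dev & 0xffffffffULL));
}
```

With this scheme, an open(9E) implementation could take the minor number
out of the device number it is handed and map it straight back to the
instance whose state it needs.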

ddi_create_minor_node(9F) ddi_remove_minor_node(9F)
getmajor(9F) getminor(9F)
devfs_clean(9F) makedevice(9F)

Accessing Time, Delays, and Periodic Events


The kernel provides a number of ways to understand time in the system. In
particular it provides a few different clocks and time measurements:

High-resolution monotonic time
The kernel provides access to a high-resolution monotonic clock
that is tracked in nanoseconds. This clock is perfect for
measuring durations and is accessed via gethrtime(9F). Unlike the
real-time clock, this clock is not subject to adjustments by a time
synchronization daemon and is the preferred clock that drivers
should be using for tracking events. The high-resolution clock is
consistent across CPUs, meaning that values returned by gethrtime(9F)
on one CPU are consistent with those returned on another, even if a
thread is migrated between CPUs.

The high-resolution clock is implemented using an architecture and
platform-specific means. For example, on x86 it is generally
backed by the TSC (time stamp counter).

Real-time
The real-time clock tracks time as humans perceive it. This clock
is accessed using ddi_get_time(9F). If the system is running a
time synchronization daemon that leverages the network time
protocol, then this time may be in sync with other systems (subject
to some amount of variance); however, it is critical that this is
not assumed.

In general, this time should not be used by drivers for any
purpose. It can jump around, drift, and most aspects in the kernel
are not based on the real-time clock. For any device timing
activities, the high-resolution clock should be used.

Tick-based monotonic time
The kernel has a running periodic function that fires based on the
rate dictated by the hz variable, generally operating at 100 or
1000 Hz. The current number of ticks since boot is accessible
through the ddi_get_lbolt(9F) function. When functions operate in
units of ticks, this is what they are tracking. This value can be
converted to and from microseconds using the drv_usectohz(9F) and
drv_hztousec(9F) functions.

In general, drivers should prefer the high-resolution monotonic
clock for tracking events internally.
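The arithmetic behind converting between ticks and microseconds can be
sketched as follows. These are hypothetical userland helpers, not the
kernel's implementation: the tick rate is passed in explicitly instead
of being read from the hz variable, and demo_usectohz() rounds up on the
assumption that a requested delay should never be shortened. The exact
rounding behavior of drv_usectohz(9F) and drv_hztousec(9F) is described
in their own manual pages.

```c
#include <stdint.h>

#define	DEMO_MICROSEC	1000000ULL

/* Convert microseconds to ticks at the given rate, rounding up. */
uint64_t
demo_usectohz(uint64_t usec, uint32_t hz)
{
	return ((usec * hz + DEMO_MICROSEC - 1) / DEMO_MICROSEC);
}

/* Convert ticks at the given rate back to microseconds. */
uint64_t
demo_hztousec(uint64_t ticks, uint32_t hz)
{
	return (ticks * DEMO_MICROSEC / hz);
}
```

At a rate of 100 Hz a tick is 10 milliseconds long, so any request
shorter than that still costs a full tick. This is one reason the
high-resolution clock is preferred when fine-grained timing matters.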

With these different timing mechanisms, the kernel provides a few different
ways to delay execution or to get a callback after some amount of time
passes.

The delay(9F) and drv_usecwait(9F) functions are used to block the
execution of the current thread. delay(9F) can be used in conditions where
sleeping and blocking is allowed, whereas drv_usecwait(9F) is a busy-wait,
which is appropriate for some device drivers, particularly when in high-
level interrupt context.

The kernel also allows a function to be called after some time has elapsed.
This callback occurs on a different thread and will be executed in kernel
context. A timeout can be scheduled in the future with the timeout(9F)
function and cancelled with the untimeout(9F) function. There is also a
STREAMS-specific version, the qtimeout(9F) function, that can be used if
the circumstances require it.

These are all considered one-shot events. That is, they will only happen
once after being scheduled. If instead, a driver requires periodic
behavior, such as needing something to occur every second, then it should
use the ddi_periodic_add(9F) function to establish that.

delay(9F) ddi_get_lbolt(9F)
ddi_get_lbolt64(9F) ddi_get_time(9F)
ddi_periodic_add(9F) ddi_periodic_delete(9F)
drv_hztousec(9F) drv_usectohz(9F)
drv_usecwait(9F) gethrtime(9F)
qtimeout(9F) quntimeout(9F)
timeout(9F) untimeout(9F)

Task Queues


A task queue provides an asynchronous processing mechanism that can be used
by drivers and the broader system. A task queue can be created with
ddi_taskq_create(9F) and sized with a given number of threads and a
relative priority of those threads. Once created, tasks can be dispatched
to the queue with ddi_taskq_dispatch(9F). The different functions and
arguments dispatched do not need to be the same and can vary from
invocation to invocation. However, it is the caller's responsibility to
ensure that any referenced memory is valid until the task queue is done
processing. It is possible to create a barrier for a task queue by using
the ddi_taskq_wait(9F) function.

While task queues are a flexible mechanism for handling and processing
events that occur in a well defined context, they do not have an inherent
backpressure mechanism built in. This means it is possible to add events
to a task queue faster than they can be processed. For high-volume events,
this must be considered before just dispatching an event. Do not rely on a
non-sleeping allocation in the task queue dispatch context.

ddi_taskq_create(9F) ddi_taskq_destroy(9F)
ddi_taskq_dispatch(9F) ddi_taskq_resume(9F)
ddi_taskq_suspend(9F) ddi_taskq_suspended(9F)
ddi_taskq_wait(9F)

Credential Management and Privileges


Not everything in the system has the same power to impact it. To determine
the permissions and context of a caller, the cred_t data structure
encapsulates a number of different things including the traditional user
and group IDs, but also the zone that the caller is operating in and the
associated privileges that the caller has. While this concept is more
often thought of due to userland processes being associated with specific
users, these same principles apply to different threads in the kernel. Not
all kernel threads are allowed to indiscriminately do what they want; they
can be constrained by the same privilege model that processes are, which is
discussed in privileges(7).

Most operations that device drivers implement are given a credential.
However, from within the kernel, a credential can be obtained that refers
to a specific zone, the current process, or a generic kernel credential.

It is up to drivers and the kernel writ large to check whether a given
credential is authorized to perform a given operation. This is
encapsulated by the various privilege checks that exist. The most common
check used is drv_priv(9F) which checks for PRIV_SYS_DEVICES.

CRED(9F) crdup(9F)
crfree(9F) crget(9F)
crgetgid(9F) crgetgroups(9F)
crgetngroups(9F) crgetrgid(9F)
crgetruid(9F) crgetsgid(9F)
crgetsuid(9F) crgetuid(9F)
crgetzoneid(9F) crhold(9F)
ddi_get_cred(9F) drv_priv(9F)
kcred(9F) priv_getbyname(9F)
priv_policy_choice(9F) priv_policy_only(9F)
priv_policy(9F) zone_kcred(9F)

Device ID Management


Device IDs are a means of establishing a unique ID for a device in the
kernel. These unique IDs are generally tied to something from the device's
hardware such as a serial number or related, but can also be fabricated and
stored on the device. These device IDs are used by other subsystems like
ZFS to record information about a device as the actual /devices path that a
device resides at may change because it is moved around in the system.

Device drivers, particularly those that represent block devices, should
first call ddi_devid_init(9F) to initialize the device ID data structure.
After that is done, it is then safe to call ddi_devid_register(9F) to
notify the kernel about the ID.

ddi_devid_compare(9F) ddi_devid_free(9F)
ddi_devid_get(9F) ddi_devid_init(9F)
ddi_devid_register(9F) ddi_devid_sizeof(9F)
ddi_devid_str_decode(9F) ddi_devid_str_encode(9F)
ddi_devid_str_free(9F) ddi_devid_unregister(9F)
ddi_devid_valid(9F)

Message Block Functions


The mblk_t data structure is used to chain together messages which are used
throughout the kernel by different subsystems including all of networking,
terminals, STREAMS, USB, and more.

Message blocks are chained together by a series of two different pointers:
b_cont and b_next. When a message is split across multiple data buffers,
they are linked by the b_cont pointer. However, multiple distinct messages
can be chained together and linked by the b_next pointer. Let's look at
this in the context of a series of networking packets. If we had a chain
of say 10 UDP packets that we were given, each UDP packet is considered an
independent message and would be linked from one to the next based on the
order they should be transmitted with the b_next pointer. However, an
individual message may be entirely in one message block, in which case its
b_cont pointer would be NULL, but if, say, the packet were split into a
100-byte data buffer that contained the headers and then a 1000-byte data
buffer that contained the actual packet data, those two would be linked
together by b_cont. A continued message would never have its b_next
pointer used to link it to a wholly different message. Visually you might
see this as:

+---------------+
| UDP Message 0 |
| Bytes 0-1100  |
| b_cont     ---+--> NULL
| b_next   +    |
+----------|----+
           |
           v
+---------------+    +----------------+
| UDP Message 1 |    | UDP Message 1+ |
| Bytes 0-100   |    | Bytes 100-1100 |
| b_cont     ---+--> | b_cont     ----+--> NULL
| b_next   +    |    | b_next     ----+--> NULL
+----------|----+    +----------------+
           |
          ...
           |
           v
+---------------+
| UDP Message 9 |
| Bytes 0-1100  |
| b_cont     ---+--> NULL
| b_next     ---+--> NULL
+---------------+
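The two kinds of linkage can also be shown with a small userland model.
The demo_mblk_t structure below is a deliberately reduced stand-in for
mblk_t that keeps only the two link pointers and the read and write
pointers, and both helper functions are hypothetical. In the kernel,
functions such as msgdsize(9F) provide this kind of traversal.

```c
#include <stddef.h>

/* Reduced model of a message block; the real mblk_t has more members. */
typedef struct demo_mblk {
	struct demo_mblk	*b_next;	/* next distinct message */
	struct demo_mblk	*b_cont;	/* rest of this message */
	unsigned char		*b_rptr;	/* first valid byte */
	unsigned char		*b_wptr;	/* one past the last valid byte */
} demo_mblk_t;

/* Total bytes in a single message, following only b_cont. */
size_t
demo_msg_size(const demo_mblk_t *mp)
{
	size_t total = 0;

	for (; mp != NULL; mp = mp->b_cont)
		total += (size_t)(mp->b_wptr - mp->b_rptr);

	return (total);
}

/* Number of distinct messages in a chain, following only b_next. */
size_t
demo_chain_count(const demo_mblk_t *mp)
{
	size_t count = 0;

	for (; mp != NULL; mp = mp->b_next)
		count++;

	return (count);
}
```

Note that the two walks never mix: sizing a message ignores b_next, and
counting messages ignores b_cont, which matches the rule that a
continued block is never used to link to a different message.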

Message blocks all have an associated data block which contains the actual
data that is present. Multiple message blocks can share the same data
block as well. The data block has a notion of a type, which is generally
M_DATA which signifies that they operate on data.

To allocate message blocks, one generally uses the allocb(9F) function to
create one; however, you can also create message blocks using your own
source of data through functions like desballoc(9F). This is generally
used when one wants to use memory that was originally used for DMA to pass
data back into the kernel, such as in a networking device driver. When
this happens, a callback function will be called once the last user of the
data block is done with it.

The functions listed below often end in either "msg" or "b". Those ending
in "msg" operate on an entire message and follow the b_cont pointer, while
those ending in "b" operate on a single message block.

adjmsg(9F) allocb(9F)
copyb(9F) copymsg(9F)
datamsg(9F) desballoc(9F)
desballoca(9F) dupb(9F)
dupmsg(9F) esballoc(9F)
esballoca(9F) freeb(9F)
freemsg(9F) linkb(9F)
mcopymsg(9F) msgdsize(9F)
msgpullup(9F) msgsize(9F)
pullupmsg(9F) rmvb(9F)
testb(9F) unlinkb(9F)

Upgradable Firmware Modules


The UFM (Upgradable Firmware Module) subsystem is used to grant the system
observability into firmware that exists persistently on a device. These
functions are intended for use by drivers that are participating in the
kernel's UFM framework, which is discussed in ddi_ufm(9E).

The ddi_ufm_init(9F) and ddi_ufm_fini(9F) functions are used to indicate
support of the subsystem to the kernel. The driver is required to use the
ddi_ufm_update(9F) function to indicate both that it is ready to receive
UFM requests and to indicate that any data that the kernel may have
previously received has changed. Once that's completed, then the other
functions listed here are generally used as part of implementing specific
callback functions that are registered.

ddi_ufm_fini(9F) ddi_ufm_image_set_desc(9F)
ddi_ufm_image_set_misc(9F) ddi_ufm_image_set_nslots(9F)
ddi_ufm_init(9F) ddi_ufm_slot_set_attrs(9F)
ddi_ufm_slot_set_imgsize(9F) ddi_ufm_slot_set_misc(9F)
ddi_ufm_slot_set_version(9F) ddi_ufm_update(9F)

Firmware Loading


Some hardware devices have firmware that is not stored as part of the
device itself and must instead be sent to the device each time it is
powered on. These routines help drivers that need to do this read such
data from well-known locations in the file system. To begin with, a
driver should call firmware_open(9F) to open a
handle to the firmware file. At that point, one can determine the size of
the file with the firmware_get_size(9F) function and allocate the
appropriate sized memory buffer to read it in. Callers should always check
what the size of the returned file is and should not just blindly pass that
size off to the kernel memory allocator. For example, if a file was over
100 MiB in size, then one should not assume that they're going to just
blindly allocate 100 MiB of kernel memory and should instead perform
incremental reads and sends to a device that are smaller in size.

A driver can then go through and perform arbitrary reads of the firmware
file through the firmware_read(9F) interface until they have read
everything that they need. Once complete, the corresponding handle needs
to be released through the firmware_close(9F) function.
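The incremental approach can be sketched as a loop over bounded chunks.
This is a userland model with hypothetical names: demo_for_each_chunk()
only computes the offsets and lengths, and the callback stands in for
the driver's own logic that would read one chunk with firmware_read(9F)
and send it to the device. The demo_count_chunk() callback exists purely
so that the loop can be exercised.

```c
#include <stddef.h>

/* Callback invoked for each chunk; a non-zero return stops the walk. */
typedef int (*demo_chunk_f)(void *arg, size_t off, size_t len);

/*
 * Walk a file of 'total' bytes in pieces of at most 'chunk' bytes.
 * Returns 0 on success, -1 for a zero chunk size, or the first non-zero
 * value returned by the callback.
 */
int
demo_for_each_chunk(size_t total, size_t chunk, demo_chunk_f cb, void *arg)
{
	size_t off = 0;

	if (chunk == 0)
		return (-1);

	while (off < total) {
		size_t len = total - off;
		int ret;

		if (len > chunk)
			len = chunk;
		if ((ret = cb(arg, off, len)) != 0)
			return (ret);
		off += len;
	}

	return (0);
}

/* Bookkeeping used by the example callback below. */
typedef struct demo_chunk_stats {
	size_t	dcs_nchunks;
	size_t	dcs_bytes;
	size_t	dcs_last_len;
} demo_chunk_stats_t;

/* Example callback that only records what it was asked to process. */
int
demo_count_chunk(void *arg, size_t off, size_t len)
{
	demo_chunk_stats_t *st = arg;

	(void) off;
	st->dcs_nchunks++;
	st->dcs_bytes += len;
	st->dcs_last_len = len;
	return (0);
}
```

Only a single buffer of the chunk size needs to be allocated for the
whole transfer, regardless of how large the firmware file turns out to
be.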

firmware_close(9F) firmware_get_size(9F)
firmware_open(9F) firmware_read(9F)

Fault Management Handling


These functions allow device drivers to harden themselves against errors
that might occur while interfacing with devices and tie into the broader
fault management architecture.

To begin, a driver must declare which capabilities it implements during its
attach(9E) function by calling ddi_fm_init(9F). The set of capabilities it
receives back may be less than what was requested because the capabilities
are dependent on the overall chain of drivers present.

If DDI_FM_EREPORT_CAPABLE was negotiated, then the driver is expected to
generate error events when certain conditions occur using the
ddi_fm_ereport_post(9F) function or the more specific pci_ereport_post(9F)
function. If a caller has negotiated DDI_FM_ACCCHK_CAPABLE, then it is
allowed to set up its register attributes to indicate that it will check
for errors on the register handle after using functions like ddi_get8(9F)
and ddi_put8(9F) by calling ddi_fm_acc_err_get(9F) and reacting
accordingly. Similarly, if a driver has negotiated DDI_FM_DMACHK_CAPABLE,
then it will use ddi_check_dma_handle(9F) to check the results of DMA
activity and handle the results appropriately. Similar to register
accesses, the DMA attributes must be updated to indicate that error
handling is anticipated on this handle. The ddi_fm_init(9F) manual page
has an
overview of the other types of flags that can be negotiated and how they
are used.

ddi_check_acc_handle(9F) ddi_check_dma_handle(9F)
ddi_dev_report_fault(9F) ddi_fm_acc_err_clear(9F)
ddi_fm_acc_err_get(9F) ddi_fm_capable(9F)
ddi_fm_dma_err_clear(9F) ddi_fm_dma_err_get(9F)
ddi_fm_ereport_post(9F) ddi_fm_fini(9F)
ddi_fm_handler_register(9F) ddi_fm_handler_unregister(9F)
ddi_fm_init(9F) ddi_fm_service_impact(9F)
pci_ereport_post(9F) pci_ereport_setup(9F)
pci_ereport_teardown(9F)

SCSI and SAS Device Driver Functions


These functions are for use by SCSI and SAS device drivers that leverage
the kernel's frameworks. Other device drivers should not use these. Some
of the general background concepts are discussed in iport(9), phymap(9),
and tgtmap(9).

Device drivers initially register with the kernel by using the
scsi_hba_init(9F) function and then, in their attach(9E) routine, register
specific instances using functions like scsi_hba_iport_register(9F) or,
alternatively, scsi_hba_tran_alloc(9F) and scsi_hba_attach_setup(9F). New
drivers are encouraged to use the target map and iports framework to
simplify the device driver writing process.

makecom_g0_s(9F) makecom_g0(9F)
makecom_g1(9F) makecom_g5(9F)
makecom(9F) sas_phymap_create(9F)
sas_phymap_destroy(9F) sas_phymap_lookup_ua(9F)
sas_phymap_lookup_uapriv(9F) sas_phymap_phy_add(9F)
sas_phymap_phy_rem(9F) sas_phymap_phy2ua(9F)
sas_phymap_phys_free(9F) sas_phymap_phys_next(9F)
sas_phymap_ua_free(9F) sas_phymap_ua2phys(9F)
sas_phymap_uahasphys(9F) scsi_abort(9F)
scsi_address_device(9F) scsi_alloc_consistent_buf(9F)
scsi_cname(9F) scsi_destroy_pkt(9F)
scsi_device_hba_private_get(9F) scsi_device_hba_private_set(9F)
scsi_device_unit_address(9F) scsi_dmafree(9F)
scsi_dmaget(9F) scsi_dname(9F)
scsi_errmsg(9F) scsi_ext_sense_fields(9F)
scsi_find_sense_descr(9F) scsi_free_consistent_buf(9F)
scsi_free_wwnstr(9F) scsi_get_device_type_scsi_options(9F)
scsi_get_device_type_string(9F) scsi_hba_attach_setup(9F)
scsi_hba_detach(9F) scsi_hba_fini(9F)
scsi_hba_init(9F) scsi_hba_iport_exist(9F)
scsi_hba_iport_find(9F) scsi_hba_iport_register(9F)
scsi_hba_iport_unit_address(9F) scsi_hba_iportmap_create(9F)
scsi_hba_iportmap_destroy(9F) scsi_hba_iportmap_iport_add(9F)
scsi_hba_iportmap_iport_remove(9F) scsi_hba_lookup_capstr(9F)
scsi_hba_pkt_alloc(9F) scsi_hba_pkt_comp(9F)
scsi_hba_pkt_free(9F) scsi_hba_probe(9F)
scsi_hba_tgtmap_create(9F) scsi_hba_tgtmap_destroy(9F)
scsi_hba_tgtmap_scan_luns(9F) scsi_hba_tgtmap_set_add(9F)
scsi_hba_tgtmap_set_begin(9F) scsi_hba_tgtmap_set_end(9F)
scsi_hba_tgtmap_set_flush(9F) scsi_hba_tgtmap_tgt_add(9F)
scsi_hba_tgtmap_tgt_remove(9F) scsi_hba_tran_alloc(9F)
scsi_hba_tran_free(9F) scsi_ifgetcap(9F)
scsi_ifsetcap(9F) scsi_init_pkt(9F)
scsi_log(9F) scsi_mname(9F)
scsi_pktalloc(9F) scsi_pktfree(9F)
scsi_poll(9F) scsi_probe(9F)
scsi_resalloc(9F) scsi_reset_notify(9F)
scsi_reset(9F) scsi_resfree(9F)
scsi_rname(9F) scsi_sense_asc(9F)
scsi_sense_ascq(9F) scsi_sense_cmdspecific_uint64(9F)
scsi_sense_info_uint64(9F) scsi_sense_key(9F)
scsi_setup_cdb(9F) scsi_slave(9F)
scsi_sname(9F) scsi_sync_pkt(9F)
scsi_transport(9F) scsi_unprobe(9F)
scsi_unslave(9F) scsi_validate_sense(9F)
scsi_vu_errmsg(9F) scsi_wwn_to_wwnstr(9F)
scsi_wwnstr_to_wwn(9F)

Block Device Buffer Handling


Block devices operate with a data structure called the struct buf which is
described in buf(9S). This structure is used to represent a given block
request and is used heavily in block devices, the SCSI/SAS framework, and
the blkdev framework. The functions described here are used to manipulate
these structures in various ways such as copying them around, indicating
error conditions, or indicating when the I/O operation is done. By
default, this memory is not mapped into the kernel's address space so
several functions such as bp_mapin(9F) are present to allow for that to
happen when required.

To initially obtain a struct buf, drivers should begin by calling
getrbuf(9F), at which point the caller can fill in the structure. Once
that is done, the physio(9F) function can be used to actually perform the
I/O and wait until it is complete.

bioclone(9F) biodone(9F)
bioerror(9F) biofini(9F)
bioinit(9F) biomodified(9F)
bioreset(9F) biosize(9F)
biowait(9F) bp_mapin(9F)
bp_mapout(9F) clrbuf(9F)
disksort(9F) freerbuf(9F)
geterror(9F) getrbuf(9F)
minphys(9F) physio(9F)

Networking Device Driver Functions


These functions are for networking device drivers that implement the MAC
(GLDv3) interfaces. The full framework and how to use it is described in
mac(9E).

mac_alloc(9F) mac_fini_ops(9F)
mac_free(9F) mac_hcksum_get(9F)
mac_hcksum_set(9F) mac_init_ops(9F)
mac_link_update(9F) mac_lso_get(9F)
mac_maxsdu_update(9F) mac_prop_info_set_default_fec(9F)
mac_prop_info_set_default_link_flowctrl(9F)
mac_prop_info_set_default_str(9F)
mac_prop_info_set_default_uint32(9F)
mac_prop_info_set_default_uint64(9F)
mac_prop_info_set_default_uint8(9F) mac_prop_info_set_perm(9F)
mac_prop_info_set_range_uint32(9F) mac_prop_info(9F)
mac_register(9F) mac_rx_ring(9F)
mac_rx(9F) mac_transceiver_info_set_present(9F)
mac_transceiver_info_set_usable(9F) mac_transceiver_info(9F)
mac_tx_ring_update(9F) mac_tx_update(9F)
mac_unregister(9F)

USB Device Driver Functions


These functions are designed for USB device drivers. To first initialize
with the kernel, a device driver must call usb_client_attach(9F) and then
usb_get_dev_data(9F). The latter call is required to get access to the
USB-level descriptors about the device which describe what kinds of USB
endpoints (control, bulk, interrupt, or isochronous) exist on the device as
well as how many different interfaces and configurations are present.
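
As a rough illustration of what the endpoint descriptors convey, the sketch below decodes the transfer type, direction, and endpoint number from the bmAttributes and bEndpointAddress fields of a standard USB endpoint descriptor. It is plain userland C. The ex_ep_* names are hypothetical helpers, not kernel functions; a real driver obtains the parsed descriptors through usb_get_dev_data(9F).

```c
#include <stdbool.h>
#include <stdint.h>

/* Transfer types as encoded in the low two bits of bmAttributes. */
typedef enum {
	EX_EP_CONTROL = 0,
	EX_EP_ISOCHRONOUS = 1,
	EX_EP_BULK = 2,
	EX_EP_INTERRUPT = 3
} ex_ep_type_t;

/* The low two bits of bmAttributes encode the transfer type. */
static ex_ep_type_t
ex_ep_type(uint8_t bmAttributes)
{
	return ((ex_ep_type_t)(bmAttributes & 0x03));
}

/* Bit 7 of bEndpointAddress is set for IN (device-to-host) endpoints. */
static bool
ex_ep_is_in(uint8_t bEndpointAddress)
{
	return ((bEndpointAddress & 0x80) != 0);
}

/* The low four bits of bEndpointAddress hold the endpoint number. */
static uint8_t
ex_ep_number(uint8_t bEndpointAddress)
{
	return (bEndpointAddress & 0x0f);
}
```

For example, an endpoint with bEndpointAddress 0x81 and bmAttributes 0x02 is bulk IN endpoint number 1.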

Once a given configuration, sometimes the default, is selected, the driver
can proceed to open what the USB architecture calls a pipe, which provides
a way to send requests to a specific USB endpoint. First, specific
endpoints can be looked up using the usb_lookup_ep_data(9F) function, which
gets information from the parsed descriptors. That information is then
filled into an extended descriptor with usb_ep_xdescr_fill(9F). With that
in hand, a pipe can be opened with usb_pipe_xopen(9F).

Once a pipe has been opened, which most often happens in a driver's
attach(9E) entry point, then requests can be allocated and submitted.
There is a different allocation for each type of request (e.g.
usb_alloc_bulk_req(9F)) and a different submission function for each type
as well. Each request structure has a corresponding page in section 9S
that describes the structure, its members, and how to work with it.

One other major concern for USB devices, which isn't as common with other
types of devices, is that they can be yanked out and reinserted at any
time. To help determine when this happens, the kernel offers the
usb_register_event_cbs(9F) function which allows a driver to register for
callbacks when a device is disconnected, reconnected, or around checkpoint
suspend/resume behavior.

usb_alloc_bulk_req(9F) usb_alloc_ctrl_req(9F)
usb_alloc_intr_req(9F) usb_alloc_isoc_req(9F)
usb_alloc_request(9F) usb_client_attach(9F)
usb_client_detach(9F) usb_clr_feature(9F)
usb_create_pm_components(9F) usb_ep_xdescr_fill(9F)
usb_free_bulk_req(9F) usb_free_ctrl_req(9F)
usb_free_descr_tree(9F) usb_free_dev_data(9F)
usb_free_intr_req(9F) usb_free_isoc_req(9F)
usb_get_addr(9F) usb_get_alt_if(9F)
usb_get_cfg(9F) usb_get_current_frame_number(9F)
usb_get_dev_data(9F) usb_get_if_number(9F)
usb_get_max_pkts_per_isoc_request(9F)
usb_get_status(9F)
usb_get_string_descr(9F) usb_handle_remote_wakeup(9F)
usb_lookup_ep_data(9F) usb_owns_device(9F)
usb_parse_data(9F) usb_pipe_bulk_xfer(9F)
usb_pipe_close(9F) usb_pipe_ctrl_xfer_wait(9F)
usb_pipe_ctrl_xfer(9F) usb_pipe_drain_reqs(9F)
usb_pipe_get_max_bulk_transfer_size(9F)
usb_pipe_get_private(9F)
usb_pipe_get_state(9F) usb_pipe_intr_xfer(9F)
usb_pipe_isoc_xfer(9F) usb_pipe_open(9F)
usb_pipe_reset(9F) usb_pipe_set_private(9F)
usb_pipe_stop_intr_polling(9F) usb_pipe_stop_isoc_polling(9F)
usb_pipe_xopen(9F) usb_print_descr_tree(9F)
usb_register_hotplug_cbs(9F) usb_reset_device(9F)
usb_set_alt_if(9F) usb_set_cfg(9F)
usb_unregister_hotplug_cbs(9F)

PCI Device Driver Functions


These functions are specific to PCI and PCI Express based device drivers
and are intended to be used to get access to PCI configuration space. For
normal PCI base address registers (BARs), see Register Setup and Access
instead.

To access PCI configuration space, a device driver should first call
pci_config_setup(9F). Generally, drivers will call this in their
attach(9E) entry point and then tear down the configuration space access
with the pci_config_teardown(9F) function in detach(9E). After setting up
access to configuration space, the returned handle can be used in all of
the various configuration space routines to get and set specific sized
values in configuration space.
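
To make the idea of sized accesses concrete, here is a userland model of configuration space as a 256-byte little-endian array. Everything prefixed with ex_ or EX_ is hypothetical: the ex_cfg_handle_t type merely stands in for the handle returned by pci_config_setup(9F), and the helpers only imitate the shape of the real get and put routines. The vendor ID and device ID offsets (0x00 and 0x02) are those of the standard PCI configuration header. Returning all ones for an out-of-range read is an assumption of the sketch.

```c
#include <stdint.h>

#define	EX_CFG_SIZE		256
#define	EX_PCI_CONF_VENID	0x00	/* 16-bit vendor ID */
#define	EX_PCI_CONF_DEVID	0x02	/* 16-bit device ID */

/* Hypothetical stand-in for a configuration space access handle. */
typedef struct ex_cfg_handle {
	uint8_t	ech_space[EX_CFG_SIZE];
} ex_cfg_handle_t;

/* Out-of-range reads return all ones in this sketch. */
static uint8_t
ex_cfg_get8(const ex_cfg_handle_t *hdl, uint16_t off)
{
	return (off < EX_CFG_SIZE ? hdl->ech_space[off] : 0xff);
}

/* Configuration space is little-endian: the low byte comes first. */
static uint16_t
ex_cfg_get16(const ex_cfg_handle_t *hdl, uint16_t off)
{
	return ((uint16_t)(ex_cfg_get8(hdl, off) |
	    (ex_cfg_get8(hdl, (uint16_t)(off + 1)) << 8)));
}

/* Out-of-range writes are silently dropped in this sketch. */
static void
ex_cfg_put8(ex_cfg_handle_t *hdl, uint16_t off, uint8_t val)
{
	if (off < EX_CFG_SIZE)
		hdl->ech_space[off] = val;
}

static void
ex_cfg_put16(ex_cfg_handle_t *hdl, uint16_t off, uint16_t val)
{
	ex_cfg_put8(hdl, off, (uint8_t)(val & 0xff));
	ex_cfg_put8(hdl, (uint16_t)(off + 1), (uint8_t)(val >> 8));
}
```

A driver using the real interfaces follows the same pattern: one handle, passed to whichever sized routine matches the width of the field being accessed.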

pci_config_get8(9F) pci_config_get16(9F)
pci_config_get32(9F) pci_config_get64(9F)
pci_config_put8(9F) pci_config_put16(9F)
pci_config_put32(9F) pci_config_put64(9F)
pci_config_setup(9F) pci_config_teardown(9F)
pci_report_pmcap(9F) pci_restore_config_regs(9F)
pci_save_config_regs(9F)

USB Host Controller Interface Functions


These routines are used for device drivers which implement the USB host
controller interfaces described in usba_hcdi(9E). Other types of device
drivers and modules should not call these functions. In particular, if one
is writing a device driver for a USB device, these are not the routines to
use; see USB Device Driver Functions instead. These are what the ehci(4D)
or xhci(4D) drivers use to provide services that USB drivers use via the
kernel USB architecture.

usba_alloc_hcdi_ops(9F) usba_free_hcdi_ops(9F)
usba_hcdi_cb(9F) usba_hcdi_dup_intr_req(9F)
usba_hcdi_dup_isoc_req(9F) usba_hcdi_get_device_private(9F)
usba_hcdi_register(9F) usba_hcdi_unregister(9F)
usba_hubdi_bind_root_hub(9F) usba_hubdi_cb_ops(9F)
usba_hubdi_close(9F) usba_hubdi_dev_ops(9F)
usba_hubdi_ioctl(9F) usba_hubdi_open(9F)
usba_hubdi_root_hub_power(9F) usba_hubdi_unbind_root_hub(9F)

Functions for PCMCIA Drivers


These functions exist for older PCMCIA device drivers. These should not
otherwise be used by the system.

csx_AccessConfigurationRegister(9F) csx_ConvertSize(9F)
csx_ConvertSpeed(9F) csx_CS_DDI_Info(9F)
csx_DeregisterClient(9F) csx_DupHandle(9F)
csx_Error2Text(9F) csx_Event2Text(9F)
csx_FreeHandle(9F) csx_Get16(9F)
csx_Get32(9F) csx_Get64(9F)
csx_Get8(9F) csx_GetEventMask(9F)
csx_GetFirstClient(9F) csx_GetFirstTuple(9F)
csx_GetHandleOffset(9F) csx_GetMappedAddr(9F)
csx_GetNextClient(9F) csx_GetNextTuple(9F)
csx_GetStatus(9F) csx_GetTupleData(9F)
csx_MakeDeviceNode(9F) csx_MapLogSocket(9F)
csx_MapMemPage(9F) csx_ModifyConfiguration(9F)
csx_ModifyWindow(9F) csx_Parse_CISTPL_BATTERY(9F)
csx_Parse_CISTPL_BYTEORDER(9F) csx_Parse_CISTPL_CFTABLE_ENTRY(9F)
csx_Parse_CISTPL_CONFIG(9F) csx_Parse_CISTPL_DATE(9F)
csx_Parse_CISTPL_DEVICE_A(9F) csx_Parse_CISTPL_DEVICE_OA(9F)
csx_Parse_CISTPL_DEVICE_OC(9F) csx_Parse_CISTPL_DEVICE(9F)
csx_Parse_CISTPL_DEVICEGEO_A(9F) csx_Parse_CISTPL_DEVICEGEO(9F)
csx_Parse_CISTPL_FORMAT(9F) csx_Parse_CISTPL_FUNCE(9F)
csx_Parse_CISTPL_FUNCID(9F) csx_Parse_CISTPL_GEOMETRY(9F)
csx_Parse_CISTPL_JEDEC_A(9F) csx_Parse_CISTPL_JEDEC_C(9F)
csx_Parse_CISTPL_LINKTARGET(9F) csx_Parse_CISTPL_LONGLINK_A(9F)
csx_Parse_CISTPL_LONGLINK_C(9F) csx_Parse_CISTPL_LONGLINK_MFC(9F)
csx_Parse_CISTPL_MANFID(9F) csx_Parse_CISTPL_ORG(9F)
csx_Parse_CISTPL_SPCL(9F) csx_Parse_CISTPL_SWIL(9F)
csx_Parse_CISTPL_VERS_1(9F) csx_Parse_CISTPL_VERS_2(9F)
csx_ParseTuple(9F) csx_Put16(9F)
csx_Put32(9F) csx_Put64(9F)
csx_Put8(9F) csx_RegisterClient(9F)
csx_ReleaseConfiguration(9F) csx_ReleaseIO(9F)
csx_ReleaseIRQ(9F) csx_ReleaseSocketMask(9F)
csx_ReleaseWindow(9F) csx_RemoveDeviceNode(9F)
csx_RepGet16(9F) csx_RepGet32(9F)
csx_RepGet64(9F) csx_RepGet8(9F)
csx_RepPut16(9F) csx_RepPut32(9F)
csx_RepPut64(9F) csx_RepPut8(9F)
csx_RequestConfiguration(9F) csx_RequestIO(9F)
csx_RequestIRQ(9F) csx_RequestSocketMask(9F)
csx_RequestWindow(9F) csx_ResetFunction(9F)
csx_SetEventMask(9F) csx_SetHandleOffset(9F)
csx_ValidateCIS(9F)

STREAMS related functions


These functions are meant to be used when interacting with STREAMS devices
or when implementing one. When a STREAMS driver is opened, it receives
messages on a queue which are then processed and can be sent back. As
different queues are often linked together, the most common thing is to
process a message and then pass the message onto the next queue using the
putnext(9F) function.
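
The process-and-pass-along pattern can be modeled in a few lines of userland C. This is a toy under heavy assumptions: ex_queue_t, ex_putnext(), and ex_incr_put() are hypothetical, a message is reduced to a single int, and none of the flow control or locking that real STREAMS queues involve is represented. It only shows how each queue's put procedure does its work and then hands the message to the next queue.

```c
#include <stddef.h>

typedef struct ex_queue ex_queue_t;
typedef void (*ex_put_fn)(ex_queue_t *, int *);

/* Hypothetical, greatly simplified queue: a put procedure and a link. */
struct ex_queue {
	ex_put_fn	eq_put;		/* this queue's put procedure */
	ex_queue_t	*eq_next;	/* next queue in the stream, or NULL */
};

/* Model of passing a message on: call the next queue's put procedure. */
static void
ex_putnext(ex_queue_t *q, int *msg)
{
	if (q->eq_next != NULL)
		q->eq_next->eq_put(q->eq_next, msg);
}

/* A module that processes the message (adds one) and then passes it on. */
static void
ex_incr_put(ex_queue_t *q, int *msg)
{
	(*msg)++;
	ex_putnext(q, msg);
}
```

With two such queues linked together, a message that enters the first queue with the value 0 leaves the second with the value 2.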

STREAMS messages are passed around using message blocks, which use the
mblk_t type. See Message Block Functions for more about the data structure
and the functions that manipulate message blocks.

These functions should generally not be used when implementing a networking
device driver today. See mac(9E) instead.

backq(9F) bcanput(9F)
bcanputnext(9F) canput(9F)
canputnext(9F) enableok(9F)
flushband(9F) flushq(9F)
freezestr(9F) getq(9F)
insq(9F) merror(9F)
mexchange(9F) noenable(9F)
put(9F) putbq(9F)
putctl(9F) putctl1(9F)
putnext(9F) putnextctl(9F)
putnextctl1(9F) putq(9F)
mt-streams(9F) qassociate(9F)
qenable(9F) qprocsoff(9F)
qprocson(9F) qreply(9F)
qsize(9F) qwait_sig(9F)
qwait(9F) qwriter(9F)
OTHERQ(9F) RD(9F)
rmvq(9F) SAMESTR(9F)
unfreezestr(9F) WR(9F)

STREAMS ioctls


The following functions are used when a STREAMS-based device driver is
processing its ioctl(9E) entry point. Unlike with character and block
devices, STREAMS ioctls are passed around in message blocks, and data
cannot be copied directly in and out of userland because STREAMS ioctls are
generally processed in kernel context. This means that the normal
functions like ddi_copyin(9F) and ddi_copyout(9F) cannot be used. Instead,
when a message block has a type of M_IOCTL, these routines can often be
used to convert the structure into one that asks for data to be copied in
or copied out, to acknowledge the ioctl as successful, or to terminate the
processing in error.

mcopyin(9F) mcopyout(9F)
mioc2ack(9F) miocack(9F)
miocnak(9F) miocpullup(9F)
mkiocb(9F)

chpoll(9E) Related Functions


These functions are present in service of the chpoll(9E) interface, which
is used to support the traditional poll(2) and select(3C) interfaces as
well as event ports through the port_get(3C) interface. See chpoll(9E) for
the specific cases in which these functions should be called. If a device
driver does not implement the chpoll(9E) character device entry point, then
these functions should not be used.

pollhead_clean(9F) pollwakeup(9F)

Kernel Statistics


The kernel statistics or kstat framework provides an easy way of exporting
statistical information to be consumed outside of the kernel. Users can
interface with this data via kstat(8) and the corresponding kstat library
discussed in kstat(3KSTAT).

Kernel statistics are grouped using a tuple of four identifiers, separated
by colons when using kstat(8). These are, in order, the statistic module
name, instance, a name which covers a group of statistics, and an
individual name for a statistic. In addition, kernel statistics have a
class which is used to group similar named groups of statistics together
across devices. When using kstat_create(9F), drivers specify the first
three parts of the tuple and the class. The naming of individual
statistics, the last part of the tuple, varies based upon the type of the
statistic. For the most part, drivers will use the kstat type
KSTAT_TYPE_NAMED, which allows multiple name-value pairs to exist within
the statistic. For example, the kernel's layer 2 networking framework,
mac(9E), creates a kstat with the driver's name and instance and names it
"mac". Within this named group, there are statistics for all of the
different individual stats that the kernel and devices track such as bytes
transmitted and received, the state and speed of the link, and advertised
and enabled capabilities.
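
The four-part naming scheme can be illustrated with a small userland helper that builds the colon-separated form. The function ex_kstat_ident() and the names passed to it are hypothetical; nothing here calls into the kstat framework itself.

```c
#include <stdio.h>

/*
 * Build "module:instance:name:statistic" in buf. Returns what snprintf(3C)
 * returns: the length that would have been written, or a negative value on
 * error.
 */
static int
ex_kstat_ident(char *buf, size_t len, const char *module, int instance,
    const char *name, const char *stat)
{
	return (snprintf(buf, len, "%s:%d:%s:%s", module, instance, name,
	    stat));
}
```

For a hypothetical driver called exdrv, instance 0, the group named "mac", and a statistic named ex_stat, this produces exdrv:0:mac:ex_stat. The driver supplies the first three parts and the class when creating the kstat, while the last part comes from the individual named entries.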

A device driver can initialize a kstat with the kstat_create(9F) function.
It will not be made accessible to users until the kstat_install(9F)
function is called. The device driver must perform additional
initialization of the kstat before proceeding and calling
kstat_install(9F). The kstat structure that drivers see is discussed in
kstat(9S).

kstat_create(9F) kstat_delete(9F)
kstat_install(9F) kstat_named_init(9F)
kstat_named_setstr(9F) kstat_queue(9F)
kstat_runq_back_to_waitq(9F) kstat_runq_enter(9F)
kstat_runq_exit(9F) kstat_waitq_enter(9F)
kstat_waitq_exit(9F) kstat_waitq_to_runq(9F)

NDI Events


These functions are used to allow a device driver to register for certain
events that might occur to its device or a parent in the tree and receive a
callback when they occur. A good example of this is when a device has been
removed from the system, such as someone pulling out a USB device or NVMe
U.2 device. The event handlers work by first getting a cookie that names
the type of event with ddi_get_eventcookie(9F) and then registering the
callback with ddi_add_event_handler(9F).

The ddi_cb_register(9F) function is used to register for other classes of
events, such as when participating in dynamic interrupt sharing.

ddi_add_event_handler(9F) ddi_cb_register(9F)
ddi_cb_unregister(9F) ddi_get_eventcookie(9F)
ddi_remove_event_handler(9F)

Layered Device Interfaces


The LDI (Layered Device Interface) provides a mechanism for a driver to
open up another device in the kernel and begin calling basic operations on
the device as though the calling driver were a normal user process.
Through the LDI, drivers can perform equivalents to the basic file read(2)
and write(2) calls, look up properties on the device, perform networking
style calls such as getmsg(2) and putmsg(2), and register callbacks to be
called when something happens to the underlying device. For example, the
ZFS file system uses the LDI to open and operate on block devices.

Before opening a device itself, callers must obtain a notion of their
identity which is used when making subsequent calls. The simplest form is
often to use the device's dev_info_t and call ldi_ident_from_dip(9F);
however, there are also methods available based upon having a dev_t or a
STREAMS struct queue.

Once that identity is established, there are several ways to open a device
such as ldi_open_by_dev(9F), ldi_open_by_devid(9F), or
ldi_open_by_name(9F). Once an LDI device has been opened, then all of the
other functions may be used to operate on the device; however, consumers of
the LDI must think carefully about what kind of device they are opening.
While a kernel pseudo-device driver cannot disappear while it is open, when
the device represents an actual piece of hardware, it is possible for it to
be physically removed and no longer be accessible. Consumers should not
assume that a layered device will always be present.

ldi_add_event_handler(9F) ldi_aread(9F)
ldi_awrite(9F) ldi_close(9F)
ldi_devmap(9F) ldi_dump(9F)
ldi_ev_finalize(9F) ldi_ev_get_cookie(9F)
ldi_ev_get_type(9F) ldi_ev_notify(9F)
ldi_ev_register_callbacks(9F) ldi_ev_remove_callbacks(9F)
ldi_get_dev(9F) ldi_get_devid(9F)
ldi_get_eventcookie(9F) ldi_get_minor_name(9F)
ldi_get_otyp(9F) ldi_get_size(9F)
ldi_getmsg(9F) ldi_ident_from_dev(9F)
ldi_ident_from_dip(9F) ldi_ident_from_stream(9F)
ldi_ident_release(9F) ldi_ioctl(9F)
ldi_open_by_dev(9F) ldi_open_by_devid(9F)
ldi_open_by_name(9F) ldi_poll(9F)
ldi_prop_exists(9F) ldi_prop_get_int(9F)
ldi_prop_get_int64(9F) ldi_prop_lookup_byte_array(9F)
ldi_prop_lookup_int_array(9F) ldi_prop_lookup_int64_array(9F)
ldi_prop_lookup_string_array(9F) ldi_prop_lookup_string(9F)
ldi_putmsg(9F) ldi_read(9F)
ldi_remove_event_handler(9F) ldi_strategy(9F)
ldi_write(9F)

Signal Manipulation


These utility functions all relate to understanding whether or not a
process can receive a signal and actually delivering one to a process from
a driver. This interface is specific to device drivers and should not be
used by the broader kernel. These interfaces are not recommended and
should only be used after consultation.

ddi_can_receive_sig(9F) proc_ref(9F)
proc_signal(9F) proc_unref(9F)

Getting at Surrounding Context


These functions allow a driver to better understand its current context.
For example, some drivers have to deal with providing polled I/O or take
special care as part of creating a kernel crash dump. These cases may need
to call the ddi_in_panic(9F) function. The other functions generally
provide a way to get at information such as the process ID or other
information from the system; however, this generally should not be needed
or used. Almost all values exposed by, say, drv_getparm(9F) have more
usable first-class methods of getting at the data.

ddi_get_kt_did(9F) ddi_get_pid(9F)
ddi_in_panic(9F) drv_getparm(9F)

Driver Memory Mapping


These functions are present for device drivers that implement the
devmap(9E) or segmap(9E) entry points. The ddi_umem_alloc(9F) routines are
used to allocate and lock memory that can later be passed to userland
through the mapping entry points.

ddi_devmap_segmap(9F) ddi_mmap_get_model(9F)
ddi_segmap_setup(9F) ddi_segmap(9F)
ddi_umem_alloc(9F) ddi_umem_free(9F)
ddi_umem_iosetup(9F) ddi_umem_lock(9F)
ddi_umem_unlock(9F) ddi_unmap_regs(9F)
devmap_default_access(9F) devmap_devmem_setup(9F)
devmap_do_ctxmgt(9F) devmap_load(9F)
devmap_set_ctx_timeout(9F) devmap_setup(9F)
devmap_umem_setup(9F) devmap_unload(9F)

UTF-8, UTF-16, UTF-32, and Code Set Utilities


These routines provide the ability to work with text in different encodings
and code sets. Generally, the kernel does not assume much about the
encoding of the text that it is operating on, though some subsystems will
require that the names of things be ASCII only.

The primary other locales that the system supports are generally UTF-8
based and so the kernel provides a set of routines to deal with UTF-8 and
Unicode normalization. However, there are still cases where different
character encodings are required or conversion between UTF-8 and some other
type is required. This is provided by the kernel iconv framework, which
provides a subset of the traditional userland iconv conversions.
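
As a small example of the kind of arithmetic that lies behind converting between Unicode encoding forms, the sketch below combines a UTF-16 surrogate pair into a single UTF-32 code point. It is standalone userland C. The function ex_u16_pair_to_u32() and the EX_INVALID_CP sentinel are hypothetical and are not part of the kernel's uconv interfaces, which operate on whole buffers and report errors differently.

```c
#include <stdint.h>

/* Hypothetical sentinel returned when the input is not a valid pair. */
#define	EX_INVALID_CP	0xFFFFFFFFu

/*
 * Combine a high surrogate (0xD800-0xDBFF) and a low surrogate
 * (0xDC00-0xDFFF) into the code point they encode, which always lies in
 * the range 0x10000-0x10FFFF.
 */
static uint32_t
ex_u16_pair_to_u32(uint16_t hi, uint16_t lo)
{
	if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
		return (EX_INVALID_CP);
	return (0x10000u + (((uint32_t)hi - 0xD800u) << 10) +
	    ((uint32_t)lo - 0xDC00u));
}
```

For instance, the pair 0xD83D 0xDE00 yields the code point 0x1F600.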

kiconv_close(9F) kiconv_open(9F)
kiconv(9F) kiconvstr(9F)
u8_strcmp(9F) u8_textprep_str(9F)
u8_validate(9F) uconv_u16tou32(9F)
uconv_u16tou8(9F) uconv_u32tou16(9F)
uconv_u32tou8(9F) uconv_u8tou16(9F)
uconv_u8tou32(9F)

Raw I/O Port Access


This group of functions provides raw access to I/O ports on architectures
that support them. These functions do not allow any coordination with
other callers nor is the validity of the port assured in any way. In
general, device drivers should use the normal register access routines to
access I/O ports. See Device Register Setup and Access for more
information on the preferred way to set up and access registers.

inb(9F) inw(9F)
inl(9F) outb(9F)
outw(9F) outl(9F)

Power Management


These functions are used to raise and lower the internal power levels of a
device driver or to indicate to the kernel that the device is busy and
therefore cannot have its power changed. See power(9E) for additional
information.

ddi_removing_power(9F) pm_busy_component(9F)
pm_idle_component(9F) pm_lower_power(9F)
pm_power_has_changed(9F) pm_raise_power(9F)
pm_trans_check(9F)

Network Packet Hooks


These functions are intended to be used by device drivers that wish to
inspect and potentially modify packets along their path through the
networking stack. The most common use case is for implementing something
like a network firewall. Otherwise, if looking to add support for a new
protocol or other network processing feature, one is better off more
directly integrating with the networking stack.

To get started, drivers generally will need to first use
net_protocol_lookup(9F) to get a handle to say that they're interested in
looking at IPv4 or IPv6 traffic and then can allocate an actual hook object
with hook_alloc(9F). After filling out the hook, the hook can be inserted
into the actual system with net_hook_register(9F).

Hooks operate in the context of a networking stack. Every networking stack
in the system is independent and therefore has its own set of interfaces,
routing tables, settings, and related state. Most zones have their own
networking stack. This is the exclusive-IP option that is described in
zoneadm(8).

Drivers can register to get a callback for every netstack in the system and
be notified when they are created and destroyed. This is done by calling
the net_instance_alloc(9F) function, filling out its data structure, and
then finally calling net_instance_register(9F). Like other callback
interfaces, the moment the callback functions are registered, drivers need
to expect that they're going to be called.

hook_alloc(9F) hook_free(9F)
net_event_notify_register(9F) net_event_notify_unregister(9F)
net_getifname(9F) net_getlifaddr(9F)
net_getmtu(9F) net_getnetid(9F)
net_getpmtuenabled(9F) net_hook_register(9F)
net_hook_unregister(9F) net_inject_alloc(9F)
net_inject_free(9F) net_inject(9F)
net_instance_alloc(9F) net_instance_free(9F)
net_instance_notify_register(9F) net_instance_notify_unregister(9F)
net_instance_protocol_unregister(9F)
net_instance_register(9F)
net_instance_unregister(9F) net_ispartialchecksum(9F)
net_isvalidchecksum(9F) net_kstat_create(9F)
net_kstat_delete(9F) net_lifgetnext(9F)
net_netidtozonid(9F) net_phygetnext(9F)
net_phylookup(9F) net_protocol_lookup(9F)
net_protocol_notify_register(9F) net_protocol_release(9F)
net_protocol_walk(9F) net_routeto(9F)
net_zoneidtonetid(9F) netinfo(9F)

SEE ALSO


Intro(2), Intro(9), Intro(9E), Intro(9S)

illumos Developer's Guide, https://www.illumos.org/books/dev/.

Writing Device Drivers, https://www.illumos.org/books/wdd/.

illumos July 17, 2023 illumos