Hardening illumos Drivers
Fault Management Architecture (FMA) I/O Fault Services enable driver developers to integrate fault management capabilities into I/O device drivers. The illumos I/O fault services framework defines a set of interfaces that enable all drivers to coordinate and perform basic error handling tasks and activities. The illumos FMA as a whole provides for error handling and fault diagnosis, in addition to response and recovery. FMA is a component of illumos's Predictive Self-Healing strategy.
A driver is considered hardened when it uses the defensive programming practices described in this document in addition to the I/O fault services framework for error handling and diagnosis. The driver hardening test harness tests that the I/O fault services and defensive programming requirements have been correctly fulfilled.
- illumos Fault Management Architecture I/O Fault Services provides a reference for driver developers who want to integrate fault management capabilities into I/O device drivers.
- Defensive Programming Techniques for illumos Device Drivers provides general information about how to defensively write an illumos device driver.
- Driver Hardening Test Harness is a driver development tool that injects simulated hardware faults when the driver under development accesses its hardware.
13.1. illumos Fault Management Architecture I/O Fault Services
This section explains how to integrate fault management error reporting, error handling, and diagnosis for I/O device drivers. This section provides an in-depth examination of the I/O fault services framework and how to utilize the I/O fault service APIs within a device driver.
- What Is Predictive Self-Healing? provides background and an overview of the illumos Fault Management Architecture.
- illumos Fault Manager describes additional background with a focus on a high-level overview of the illumos Fault Manager, fmd(1M).
- Error Handling is the primary section for driver developers. This section highlights the best practice coding techniques for high-availability and the use of I/O fault services in driver code to interact with the FMA.
- Diagnosing Faults describes how faults are diagnosed from the errors detected by drivers.
- Event Registry provides information on the illumos Event Registry.
13.1.1. What Is Predictive Self-Healing?
Traditionally, systems have exported hardware and software error information directly to human administrators and to management software in the form of syslog messages. Often, error detection, diagnosis, reporting, and handling were embedded in the code of each driver.
A system like the illumos OS predictive self-healing system is first and foremost self-diagnosing. Self-diagnosing means the system provides technology to automatically diagnose problems from observed symptoms, and the results of the diagnosis can then be used to trigger automated response and recovery. A fault in hardware or a defect in software can be associated with a set of possible observed symptoms called errors. The data generated by the system as the result of observing an error is called an error report or ereport.
In a system capable of self-healing, ereports are captured by the system and are encoded as a set of name-value pairs described by an extensible event protocol to form an ereport event. Ereport events and other data are gathered to facilitate self-healing, and are dispatched to software components called diagnosis engines designed to diagnose the underlying problems corresponding to the error symptoms observed by the system. A diagnosis engine runs in the background and silently consumes error telemetry until it can produce a diagnosis or predict a fault.
After processing sufficient telemetry to reach a conclusion, a diagnosis engine produces another event called a fault event. The fault event is then broadcast to all agents that are interested in the specific fault event. An agent is a software component that initiates recovery and responds to specific fault events. A software component known as the illumos Fault Manager, fmd(1M), manages the multiplexing of events between ereport generators, diagnosis engines, and agent software.
13.1.2. illumos Fault Manager
The illumos Fault Manager, fmd(1M), is responsible for dispatching in-bound error telemetry events to the appropriate diagnosis engines. The diagnosis engine is responsible for identifying the underlying hardware faults or software defects that are producing the error symptoms. The fmd(1M) daemon is the illumos implementation of a fault manager. It starts at boot time and loads all of the diagnosis engines and agents available on the system. The illumos Fault Manager also provides interfaces for system administrators and service personnel to observe fault management activity.
Diagnosis, Suspect Lists, and Fault Events
Once a diagnosis has been made, the diagnosis is output in the form of a list.suspect event. A list.suspect event is composed of one or more possible fault or defect events. Sometimes the diagnosis cannot narrow the cause of errors to a single fault or defect. For example, the underlying problem might be a broken wire connecting controllers to the main system bus. The problem might be with a component on the bus or with the bus itself. In this specific case, the list.suspect event will contain multiple fault events: one for each controller attached to the bus, and one for the bus itself.
Each fault event in the list identifies the following payload members:
- The resource is the component that was diagnosed as faulty. The fmdump(1M) command shows this payload member as “Problem in.”
- The Automated System Recovery Unit (ASRU) is the hardware or software component that must be disabled to prevent further error symptoms from occurring. The fmdump(1M) command shows this payload member as “Affects.”
- The Field Replaceable Unit (FRU) is the component that must be replaced or repaired to fix the underlying problem.
- The Label payload is a string that gives the location of the FRU in the same form as it is printed on the chassis or motherboard, for example next to a DIMM slot or PCI card slot. The fmdump command shows this payload member as “Location.”
For example, after receiving a certain number of ECC correctable errors in a given amount of time for a particular memory location, the CPU and memory diagnosis engine issues a diagnosis (list.suspect event) for a faulty DIMM.
# fmdump -v -u 38bd6f1b-a4de-4c21-db4e-ccd26fa8573c
TIME                 UUID                                 SUNW-MSG-ID
Oct 31 13:40:18.1864 38bd6f1b-a4de-4c21-db4e-ccd26fa8573c AMD-8000-8L
  100%  fault.cpu.amd.icachetag

        Problem in: hc:///motherboard=0/chip=0/cpu=0
           Affects: cpu:///cpuid=0
               FRU: hc:///motherboard=0/chip=0
          Location: SLOT 2
In this example, fmd(1M) has identified a problem in a resource, specifically a CPU (hc:///motherboard=0/chip=0/cpu=0). To suppress further error symptoms and to prevent an uncorrectable error from occurring, an ASRU (cpu:///cpuid=0) is identified for retirement. The component that needs to be replaced is the FRU (hc:///motherboard=0/chip=0).
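The idea of issuing a diagnosis only after a certain number of correctable errors arrive within a given amount of time can be illustrated with a small user-space sketch. This is not the code of any real diagnosis engine and not an fmd(1M) interface: the serd_t type, the serd_init and serd_record functions, and the SERD_MAX_EVENTS limit are all hypothetical names chosen for illustration, and timestamps are assumed to be non-decreasing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SERD_MAX_EVENTS 8   /* hypothetical capacity of the window */

/* Hypothetical sliding-window error counter; this is not an fmd interface. */
typedef struct serd {
    uint64_t times[SERD_MAX_EVENTS];    /* timestamps of recent errors */
    size_t count;                       /* number of valid timestamps */
    size_t threshold;                   /* fire after this many errors ... */
    uint64_t window;                    /* ... seen within this time span */
} serd_t;

void
serd_init(serd_t *s, size_t threshold, uint64_t window)
{
    assert(s != NULL);
    s->count = 0;
    /* Clamp so the threshold can always be reached with the fixed array. */
    s->threshold = (threshold > SERD_MAX_EVENTS) ? SERD_MAX_EVENTS : threshold;
    s->window = window;
}

/*
 * Record one correctable error observed at time "now". Returns 1 when
 * enough errors fall inside the window to justify a diagnosis, else 0.
 */
int
serd_record(serd_t *s, uint64_t now)
{
    size_t i, kept = 0;

    assert(s != NULL);

    /* Discard errors that have aged out of the window. */
    for (i = 0; i < s->count; i++) {
        if (now - s->times[i] < s->window)
            s->times[kept++] = s->times[i];
    }
    s->count = kept;

    /* Remember the new error if there is room. */
    if (s->count < SERD_MAX_EVENTS)
        s->times[s->count++] = now;

    return (s->count >= s->threshold);
}
```

With a threshold of 3 and a window of 10 time units, errors at times 0 and 5 followed by a long quiet period do not fire, because the old errors age out before a third one arrives. Three errors at times 20, 22, and 25 do fire.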
Response Agents
An agent is a software component that takes action in response to a
diagnosis or repair. For example, the CPU and memory retire agent is designed
to act on list.suspects that contain a fault.cpu.* event. The cpumem-retire
agent will attempt to off-line a CPU or retire a physical memory
page from service. If the agent is successful, an entry in the fault manager's
ASRU cache is added for the page or CPU that was successfully retired. The
fmadm(1M) utility, as shown in the
example below, shows an entry for a memory rank that has been diagnosed as
having a fault. ASRUs that the system does not have the ability to off-line,
retire, or disable, will also have an entry in the ASRU cache, but they will
be seen as degraded. Degraded means the resource associated with the ASRU
is faulty, but the ASRU is unable to be removed from service. Currently illumos
agent software cannot act upon I/O ASRUs (device instances). All faulty I/O
resource entries in the cache are in the degraded state.
# fmadm faulty
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///motherboard=0/chip=1/memory-controller=0/dimm=3/rank=0
         ccae89df-2217-4f5c-add4-d920f78b4faf
-------- ----------------------------------------------------------------------
The primary purpose of a retire agent is to isolate (safely remove from service) the piece of hardware or software that has been diagnosed as faulty.
Agents can also take other actions, such as the following:
- Send alerts via SNMP traps. This can translate a diagnosis into an alert for SNMP that plugs into existing software mechanisms.
- Post a syslog message. A message-specific agent (for example, the syslog message agent) can take the result of a diagnosis and translate it into a syslog message that administrators can use to take a specific action.
- Take other actions, such as updating the FRUID. Response agents can be platform-specific.
Message IDs and Dictionary Files
The syslog message agent takes the output of the diagnosis (the list.suspect event) and writes specific messages to the console or /var/adm/messages. Often console messages can be difficult to understand. FMA remedies this problem by providing a defined fault message structure that is generated every time a list.suspect event is delivered to the syslog message agent.
The syslog agent generates a message identifier (MSG ID). The event registry generates dictionary files (.dict files) that map a list.suspect event to a structured message identifier that should be used to identify and view the associated knowledge article. Message files (.po files) map the message ID to localized messages for every possible list of suspected faults that the diagnosis engine can generate. The following is an example of a fault message emitted on a test system.
SUNW-MSG-ID: AMD-8000-7U, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Jul 28 04:26:51 PDT 2006
PLATFORM: Sun Fire V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: add96f65-5473-69e6-dbe1-8b3d00d5c47b
DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-7U for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use fmdump -v -u <EVENT_ID> to identify the module.
System Topology
To identify where a fault might have occurred, diagnosis engines need to have the topology for a given software or hardware system represented. The fmd(1M) daemon provides diagnosis engines with a handle to a topology snapshot that can be used during diagnosis. Topology information is used to represent the resource, ASRU, and FRU found in each fault event. The topology can also be used to store the platform label, FRUID, and serial number identification.
The resource payload member in the fault event is always represented by the physical path location from the platform chassis outward. For example, a PCI controller function that is bridged from the main system bus to a PCI local bus is represented by its hc scheme path name:
hc:///motherboard=0/hostbridge=1/pcibus=0/pcidev=13/pcifn=0
The ASRU payload member in the fault event is typically represented by the illumos device tree instance name that is bound to a hardware controller, device, or function. FMA uses the dev scheme to represent the ASRU in its native format for actions that might be taken by a future implementation of a retire agent specifically designed for I/O devices:
dev:////pci@1e,600000/ide@d
The FRU payload representation in the fault event varies depending on the closest replaceable component to the I/O resource that has been diagnosed as faulty. For example, a fault event for a broken embedded PCI controller might name the motherboard of the system as the FRU that needs to be replaced:
hc:///motherboard=0
The label payload is a string that gives the location of the FRU in the same form as it is printed on the chassis or motherboard, for example next to a DIMM slot or PCI card slot:
Label: SLOT 2
13.1.3. Error Handling
This section describes how to use I/O fault services APIs to handle errors within a driver. This section discusses how drivers should indicate and initialize their fault management capabilities, generate error reports, and register the driver's error handler routine.
Excerpts are provided from source code examples that demonstrate the use of the I/O fault services API from the Broadcom 1Gb NIC driver, bge. Follow these examples as a model for how to integrate fault management capability into your own drivers. Take the following steps to study the complete bge driver code:
- Go to the illumos source browser.
- Enter bge in the File Path field.
- Select illumos-gate in the project(s) listing.
- Click the Search button.
Drivers that have been instrumented to provide FMA error report telemetry detect errors and determine the impact of those errors on the services provided by the driver. Following the detection of an error, the driver should determine when its services have been impacted and to what degree.
Appropriate responses include the following:
- Attempt recovery
- Retry an I/O transaction
- Attempt fail-over techniques
- Report the error to the calling application/stack
- If the error cannot be constrained any other way, then panic
Errors detected by the driver are communicated to the fault management daemon as an ereport. An ereport is a structured event defined by the FMA event protocol. The event protocol is a specification for a set of common data fields that must be used to describe all possible error and fault events, in addition to the list of suspected faults. Ereports are gathered into a flow of error telemetry and dispatched to the diagnosis engine.
Declaring Fault Management Capabilities
A hardened device driver must declare its fault management capabilities to the I/O Fault Management framework. Use the ddi_fm_init(9F) function to declare the fault management capabilities of your driver.
void ddi_fm_init(dev_info_t *dip, int *fmcap, ddi_iblock_cookie_t *ibcp)
The ddi_fm_init function can be called from kernel context in a driver attach(9E) or detach(9E) entry point. The ddi_fm_init function usually is called from the attach entry point. The ddi_fm_init function allocates and initializes resources according to fmcap. The fmcap parameter must be set to the bitwise-inclusive-OR of the following fault management capabilities:
- DDI_FM_EREPORT_CAPABLE - Driver is responsible for and capable of generating FMA protocol error events (ereports) upon detection of an error condition.
- DDI_FM_ACCCHK_CAPABLE - Driver is responsible for and capable of checking for errors upon completion of one or more access I/O transactions.
- DDI_FM_DMACHK_CAPABLE - Driver is responsible for and capable of checking for errors upon completion of one or more DMA I/O transactions.
- DDI_FM_ERRCB_CAPABLE - Driver has an error callback function.
A hardened leaf driver generally sets all these capabilities. However, if its parent nexus is not capable of supporting any one of the requested capabilities, the associated bit is cleared and returned as such to the driver. Before returning from ddi_fm_init(9F), the I/O fault services framework creates a set of fault management capability properties: fm-ereport-capable, fm-accchk-capable, fm-dmachk-capable, and fm-errcb-capable. The currently supported fault management capability level is observable by using the prtconf(1M) command.
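The way a requested capability mask is reduced to what the parent nexus supports can be modeled outside the kernel. The following is a minimal sketch, not the framework's implementation: mock_fm_init and negotiate_fm_caps are hypothetical functions, and the DDI_FM_* definitions are stand-ins so that the example compiles in user space. A real driver takes these definitions from the DDI headers and calls ddi_fm_init(9F).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in definitions so the sketch builds outside the kernel. */
#define DDI_FM_EREPORT_CAPABLE  0x1
#define DDI_FM_ACCCHK_CAPABLE   0x2
#define DDI_FM_DMACHK_CAPABLE   0x4
#define DDI_FM_ERRCB_CAPABLE    0x8

/* Stand-in capability test, modeled on how the bge excerpts test a mask. */
#define DDI_FM_EREPORT_CAP(cap) ((cap) & DDI_FM_EREPORT_CAPABLE)

/*
 * Hypothetical model of the negotiation: any requested bit that the
 * parent cannot support is cleared in the caller's mask.
 */
static void
mock_fm_init(int parent_caps, int *fmcap)
{
    assert(fmcap != NULL);
    *fmcap &= parent_caps;
}

/* Return the capabilities a driver ends up with after negotiation. */
int
negotiate_fm_caps(int requested, int parent_caps)
{
    int fmcap = requested;

    mock_fm_init(parent_caps, &fmcap);
    return (fmcap);
}
```

A driver that requests all four capabilities under a parent that supports only ereports and error callbacks gets back a mask with just those two bits set. The driver must then test the returned mask, not the requested one, before relying on access or DMA error checking.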
To make your driver support administrative selection of fault management capabilities, export and set the fault management capability level properties to the values described above in the driver.conf(4) file. The fm-capable properties must be set and read prior to calling ddi_fm_init with the desired capability list.
The following example from the bge driver shows the bge_fm_init function, which calls the ddi_fm_init(9F) function. The bge_fm_init function is called in the bge_attach function.
static void
bge_fm_init(bge_t *bgep)
{
    ddi_iblock_cookie_t iblk;
    /* Only register with IO Fault Services if we have some capability */
    if (bgep->fm_capabilities) {
        bge_reg_accattr.devacc_attr_access = DDI_FLAGERR_ACC;
        bge_desc_accattr.devacc_attr_access = DDI_FLAGERR_ACC;
        dma_attr.dma_attr_flags = DDI_DMA_FLAGERR;
        /*
         * Register capabilities with IO Fault Services
         */
        ddi_fm_init(bgep->devinfo, &bgep->fm_capabilities, &iblk);
        /*
         * Initialize pci ereport capabilities if ereport capable
         */
        if (DDI_FM_EREPORT_CAP(bgep->fm_capabilities) ||
            DDI_FM_ERRCB_CAP(bgep->fm_capabilities))
            pci_ereport_setup(bgep->devinfo);
        /*
         * Register error callback if error callback capable
         */
        if (DDI_FM_ERRCB_CAP(bgep->fm_capabilities))
            ddi_fm_handler_register(bgep->devinfo,
                bge_fm_error_cb, (void*) bgep);
    } else {
        /*
         * These fields have to be cleared of FMA if there are no
         * FMA capabilities at runtime.
         */
        bge_reg_accattr.devacc_attr_access = DDI_DEFAULT_ACC;
        bge_desc_accattr.devacc_attr_access = DDI_DEFAULT_ACC;
        dma_attr.dma_attr_flags = 0;
    }
}
Cleaning Up Fault Management Resources
The ddi_fm_fini(9F) function cleans up resources allocated to support fault management for dip.
void ddi_fm_fini(dev_info_t *dip)
The ddi_fm_fini function can be called from kernel context in a driver attach(9E) or detach(9E) entry point.
The following example from the bge driver shows the bge_fm_fini function, which calls the ddi_fm_fini(9F) function. The bge_fm_fini function is called in the bge_unattach function, which is called in both the bge_attach and bge_detach functions.
static void
bge_fm_fini(bge_t *bgep)
{
    /* Only unregister FMA capabilities if we registered some */
    if (bgep->fm_capabilities) {
        /*
         * Release any resources allocated by pci_ereport_setup()
         */
        if (DDI_FM_EREPORT_CAP(bgep->fm_capabilities) ||
            DDI_FM_ERRCB_CAP(bgep->fm_capabilities))
            pci_ereport_teardown(bgep->devinfo);
        /*
         * Un-register error callback if error callback capable
         */
        if (DDI_FM_ERRCB_CAP(bgep->fm_capabilities))
            ddi_fm_handler_unregister(bgep->devinfo);
        /*
         * Unregister from IO Fault Services
         */
        ddi_fm_fini(bgep->devinfo);
    }
}
Getting the Fault Management Capability Bit Mask
The ddi_fm_capable(9F) function returns the capability bit mask currently set for dip.
int ddi_fm_capable(dev_info_t *dip)
Reporting Errors
- Queueing an Error Event discusses how to queue error events.
- Detecting and Reporting PCI-Related Errors describes how to report PCI-related errors.
- Reporting Standard I/O Controller Errors describes how to report standard I/O controller errors.
- Service Impact Function discusses how to report whether an error has impacted the services provided by a device.
Queueing an Error Event
The ddi_fm_ereport_post(9F) function causes an ereport event to be queued for delivery to the fault manager daemon, fmd(1M).
void ddi_fm_ereport_post(dev_info_t *dip,
const char *error_class,
uint64_t ena,
int sflag, ...)
The sflag parameter indicates whether the caller is willing to wait for system memory and event channel resources to become available.
The ena parameter indicates the Error Numeric Association (ENA) for this error report. The ENA might have been initialized and obtained from another error detecting software module such as a bus nexus driver. If ena is set to 0, it will be initialized by ddi_fm_ereport_post.
The name-value pair (nvpair) variable argument list contains one or more name, type, value pointer nvpair tuples for non-array data_type_t types or one or more name, type, number of element, value pointer tuples for data_type_t array types. The nvpair tuples make up the ereport event payload required for diagnosis. The end of the argument list is specified by NULL.
The ereport class names and payloads described in Reporting Standard I/O Controller Errors for I/O controllers are used as appropriate for error_class. Other ereport class names and payloads can be defined, but they must be registered in the illumos event registry and accompanied by driver specific diagnosis engine software, or the Eversholt fault tree (eft) rules. For more information about the illumos event registry and about Eversholt fault tree rules, see the Fault Management community on OpenSolaris.
void
bge_fm_ereport(bge_t *bgep, char *detail)
{
    uint64_t ena;
    char buf[FM_MAX_CLASS];
    (void) snprintf(buf, FM_MAX_CLASS, "%s.%s", DDI_FM_DEVICE, detail);
    ena = fm_ena_generate(0, FM_ENA_FMT1);
    if (DDI_FM_EREPORT_CAP(bgep->fm_capabilities)) {
        ddi_fm_ereport_post(bgep->devinfo, buf, ena, DDI_NOSLEEP,
            FM_VERSION, DATA_TYPE_UINT8, FM_EREPORT_VERS0, NULL);
    }
}
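The NULL-terminated tuple convention used for the ereport payload can be tried out in user space. The sketch below is an illustration of the calling convention only, not the kernel's implementation: payload_format and the ptype_t tags are hypothetical stand-ins for ddi_fm_ereport_post and data_type_t, only scalar (non-array) members are modeled, and the payload is rendered as text rather than as an nvlist.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical type tags standing in for data_type_t values. */
typedef enum { T_UINT8, T_UINT32, T_STRING } ptype_t;

/*
 * Walk (name, type, value) tuples until a NULL name is seen, appending
 * "name=value;" to buf. Returns the number of members formatted.
 */
int
payload_format(char *buf, size_t len, ...)
{
    va_list ap;
    const char *name;
    size_t used = 0;
    int members = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    va_start(ap, len);
    while ((name = va_arg(ap, const char *)) != NULL) {
        int type = va_arg(ap, int);
        int n;

        switch (type) {
        case T_UINT8:   /* small integers are promoted to int in "..." */
            n = snprintf(buf + used, len - used, "%s=%d;",
                name, va_arg(ap, int));
            break;
        case T_UINT32:
            n = snprintf(buf + used, len - used, "%s=%u;",
                name, va_arg(ap, unsigned int));
            break;
        case T_STRING:
            n = snprintf(buf + used, len - used, "%s=%s;",
                name, va_arg(ap, const char *));
            break;
        default:
            n = -1;     /* unknown tag: stop parsing */
            break;
        }
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partial member */
            break;
        }
        used += (size_t)n;
        members++;
    }
    va_end(ap);
    return (members);
}
```

A call such as payload_format(buf, sizeof (buf), "version", T_UINT8, 0, "detail", T_STRING, "stall", (const char *)NULL) formats two members. The cast on the terminator keeps the sketch portable, because a bare NULL passed through a variable argument list is not guaranteed to have pointer type in C.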
Detecting and Reporting PCI-Related Errors
PCI-related errors, including PCI, PCI-X, and PCI-E, are automatically detected and reported when you use pci_ereport_post(9F).
void pci_ereport_post(dev_info_t *dip, ddi_fm_error_t *derr, uint16_t *xx_status)
Drivers do not need to generate driver-specific ereports for errors that occur in the PCI Local Bus configuration status registers. The pci_ereport_post function can report data parity errors, master aborts, target aborts, signaled system errors, and other errors recorded in those registers.
If pci_ereport_post is to be used by a driver, then pci_ereport_setup(9F) must have been previously called during the driver's attach(9E) routine, and pci_ereport_teardown(9F) must subsequently be called during the driver's detach(9E) routine.
The bge code samples below show the bge driver invoking the pci_ereport_post function from the driver's error handler. See also Registering an Error Handler.
/*
 * The I/O fault service error handling callback function
 */
/*ARGSUSED*/
static int
bge_fm_error_cb(dev_info_t *dip, ddi_fm_error_t *err, const void *impl_data)
{
    /*
     * as the driver can always deal with an error
     * in any dma or access handle, we can just return
     * the fme_status value.
     */
    pci_ereport_post(dip, err, NULL);
    return (err->fme_status);
}
Reporting Standard I/O Controller Errors
A standard set of device ereports is defined for commonly seen errors for I/O controllers. These ereports should be generated whenever one of the error symptoms described in this section is detected.
The ereports described in this section are dispatched for diagnosis to the eft diagnosis engine, which uses a common set of standard rules to diagnose them. Any other errors detected by device drivers must be defined as ereport events in the illumos event registry and must be accompanied by device specific diagnosis software or eft rules.
- DDI_FM_DEVICE_INVAL_STATE
  The driver has detected that the device is in an invalid state.
  A driver should post an error when it detects that the data it transmits or receives appear to be invalid. For example, in the bge code, the bge_chip_reset and bge_receive_ring routines generate the ereport.io.device.inval_state error when these routines detect invalid data.
      /*
       * The SEND INDEX registers should be reset to zero by the
       * global chip reset; if they're not, there'll be trouble
       * later on.
       */
      sx0 = bge_reg_get32(bgep, NIC_DIAG_SEND_INDEX_REG(0));
      if (sx0 != 0) {
          BGE_REPORT((bgep, "SEND INDEX - device didn't RESET"));
          bge_fm_ereport(bgep, DDI_FM_DEVICE_INVAL_STATE);
          return (DDI_FAILURE);
      }
      /* ... */
      /*
       * Sync (all) the receive ring descriptors
       * before accepting the packets they describe
       */
      DMA_SYNC(rrp->desc, DDI_DMA_SYNC_FORKERNEL);
      if (*rrp->prod_index_p >= rrp->desc.nslots) {
          bgep->bge_chip_state = BGE_CHIP_ERROR;
          bge_fm_ereport(bgep, DDI_FM_DEVICE_INVAL_STATE);
          return (NULL);
      }
- DDI_FM_DEVICE_INTERN_CORR
  The device has reported a self-corrected internal error. For example, a correctable ECC error has been detected by the hardware in an internal buffer within the device.
  This error flag is not used in the bge driver. See the nxge_fm.c file on the illumos source browser for examples that use this error. Take the following steps to study the nxge driver code:
  - Go to the illumos source browser.
  - Enter nxge in the File Path field.
  - Select illumos-gate in the project(s) listing.
  - Click the Search button.
- DDI_FM_DEVICE_INTERN_UNCORR
  The device has reported an uncorrectable internal error. For example, an uncorrectable ECC error has been detected by the hardware in an internal buffer within the device.
  This error flag is not used in the bge driver. See the nxge_fm.c file on the illumos source browser for examples that use this error.
- DDI_FM_DEVICE_STALL
  The driver has detected that data transfer has stalled unexpectedly.
  The bge_factotum_stall_check routine provides an example of stall detection.
      dogval = bge_atomic_shl32(&bgep->watchdog, 1);
      if (dogval < bge_watchdog_count)
          return (B_FALSE);
      BGE_REPORT((bgep, "Tx stall detected, watchdog code 0x%x", dogval));
      bge_fm_ereport(bgep, DDI_FM_DEVICE_STALL);
      return (B_TRUE);
- DDI_FM_DEVICE_NO_RESPONSE
  The device is not responding to a driver command.
      bge_chip_poll_engine(bge_t *bgep, bge_regno_t regno,
          uint32_t mask, uint32_t val)
      {
          uint32_t regval;
          uint32_t n;
          for (n = 200; n; --n) {
              regval = bge_reg_get32(bgep, regno);
              if ((regval & mask) == val)
                  return (B_TRUE);
              drv_usecwait(100);
          }
          bge_fm_ereport(bgep, DDI_FM_DEVICE_NO_RESPONSE);
          return (B_FALSE);
      }
- DDI_FM_DEVICE_BADINT_LIMIT
  The device has raised too many consecutive invalid interrupts.
  The bge_intr routine within the bge driver provides an example of stuck interrupt detection. The bge_fm_ereport function is a wrapper for the ddi_fm_ereport_post(9F) function. See the bge_fm_ereport example in Queueing an Error Event.
      if (bgep->missed_dmas >= bge_dma_miss_limit) {
          /*
           * If this happens multiple times in a row,
           * it means DMA is just not working. Maybe
           * the chip has failed, or maybe there's a
           * problem on the PCI bus or in the host-PCI
           * bridge (Tomatillo).
           *
           * At all events, we want to stop further
           * interrupts and let the recovery code take
           * over to see whether anything can be done
           * about it ...
           */
          bge_fm_ereport(bgep, DDI_FM_DEVICE_BADINT_LIMIT);
          goto chip_stop;
      }
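The bounded polling pattern behind the DDI_FM_DEVICE_NO_RESPONSE excerpt can be exercised without hardware. The following user-space sketch is an assumption-laden model rather than bge code: poll_until, reg_read_fn, fake_dev_t, and fake_read are hypothetical names, the register read is a callback instead of bge_reg_get32, and the delay between attempts is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register-read callback standing in for a device access. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/*
 * Poll a register until (value & mask) == val or the attempts run out.
 * Returns 1 on success. A return of 0 is the point at which a driver
 * would report that the device is not responding.
 */
int
poll_until(reg_read_fn read_reg, void *ctx, uint32_t mask, uint32_t val,
    unsigned tries)
{
    assert(read_reg != NULL);

    while (tries-- > 0) {
        if ((read_reg(ctx) & mask) == val)
            return (1);
        /* A driver would wait here between attempts. */
    }
    return (0);
}

/* Fake device: reports "ready" (bit 0x80) only after some reads. */
typedef struct fake_dev {
    unsigned reads_until_ready;
} fake_dev_t;

uint32_t
fake_read(void *ctx)
{
    fake_dev_t *d = ctx;

    if (d->reads_until_ready > 0) {
        d->reads_until_ready--;
        return (0);
    }
    return (0x80u);
}
```

Bounding the loop is the important design choice: a device that never sets the expected bit makes the function return 0 after a fixed number of attempts instead of hanging the caller, which gives the driver a well-defined place to post the ereport.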
Service Impact Function
A fault management capable driver must indicate whether or not an error has impacted the services provided by a device. Following detection of an error and, if necessary, a shutdown of services, the driver should invoke the ddi_fm_service_impact(9F) routine to reflect the current service state of the device instance. The service state can be used by diagnosis and recovery software to help identify or react to the problem.
The ddi_fm_service_impact routine should be called both when an error has been detected by the driver itself, and when the framework has detected an error and marked an access or DMA handle as faulty.
void ddi_fm_service_impact(dev_info_t *dip, int svc_impact)
The following service impact values (svc_impact) are accepted by ddi_fm_service_impact:
- DDI_SERVICE_LOST
  The service provided by the device is unavailable due to a device fault or software defect.
- DDI_SERVICE_DEGRADED
  The driver is unable to provide normal service, but the driver can provide a partial or degraded level of service. For example, the driver might have to make repeated attempts to perform an operation before it succeeds, or it might be running at less than its configured speed.
- DDI_SERVICE_UNAFFECTED
  The driver has detected an error, but the services provided by the device instance are unaffected.
- DDI_SERVICE_RESTORED
  All of the device's services have been restored.
The call to ddi_fm_service_impact generates the following ereports on behalf of the driver, based on the service impact argument to the service impact routine:
- ereport.io.service.lost
- ereport.io.service.degraded
- ereport.io.service.unaffected
- ereport.io.service.restored
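The correspondence between the service impact argument and the ereport class listed above can be written down as a small lookup. This is only an illustrative user-space sketch: service_impact_class is a hypothetical helper, not part of the DDI, and the DDI_SERVICE_* values in the enum are placeholders so that the example compiles on its own. A driver uses the definitions from the DDI headers and lets ddi_fm_service_impact(9F) generate the ereport.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder values so the sketch builds in user space. */
enum {
    DDI_SERVICE_LOST,
    DDI_SERVICE_DEGRADED,
    DDI_SERVICE_UNAFFECTED,
    DDI_SERVICE_RESTORED
};

/*
 * Hypothetical helper: return the ereport class that corresponds to a
 * service impact value, or NULL if the value is not one of the four.
 */
const char *
service_impact_class(int svc_impact)
{
    switch (svc_impact) {
    case DDI_SERVICE_LOST:
        return ("ereport.io.service.lost");
    case DDI_SERVICE_DEGRADED:
        return ("ereport.io.service.degraded");
    case DDI_SERVICE_UNAFFECTED:
        return ("ereport.io.service.unaffected");
    case DDI_SERVICE_RESTORED:
        return ("ereport.io.service.restored");
    default:
        return (NULL);
    }
}
```

A lookup like this can be handy in a test harness or log formatter that needs to predict which ereport class a given service state change should produce.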
In the following bge code, the driver determines that it is unable to successfully restart transmitting or receiving packets as the result of an error. The service state of the device transitions to DDI_SERVICE_LOST.
/*
 * All OK, reinitialize hardware and kick off GLD scheduling
 */
mutex_enter(bgep->genlock);
if (bge_restart(bgep, B_TRUE) != DDI_SUCCESS) {
    (void) bge_check_acc_handle(bgep, bgep->cfg_handle);
    (void) bge_check_acc_handle(bgep, bgep->io_handle);
    ddi_fm_service_impact(bgep->devinfo, DDI_SERVICE_LOST);
    mutex_exit(bgep->genlock);
    return (DDI_FAILURE);
}
The ddi_fm_service_impact function should not be called from the registered callback routine.
Access Attributes Structure
A DDI_FM_ACCCHK_CAPABLE device driver must set its access attributes to indicate that it is capable of handling programmed I/O (PIO) access errors that occur during a register read or write. The devacc_attr_access field in the ddi_device_acc_attr(9S) structure should be set as an indicator to the system that the driver is capable of checking for and handling data path errors. The ddi_device_acc_attr structure contains the following members:
ushort_t devacc_attr_version;
uchar_t devacc_attr_endian_flags;
uchar_t devacc_attr_dataorder;
uchar_t devacc_attr_access; /* access error protection */
Errors detected in the data path to or from a device can be processed by one or more of the device driver's nexus parents.
The devacc_attr_access field can be set to the following values:
- DDI_DEFAULT_ACC
  This flag indicates the system will take the default action (panic if appropriate) when an error occurs. This attribute cannot be used by DDI_FM_ACCCHK_CAPABLE drivers.
- DDI_FLAGERR_ACC
  This flag indicates that the system will attempt to handle and recover from an error associated with the access handle. The driver should use the techniques described in Defensive Programming Techniques for illumos Device Drivers and should use ddi_fm_acc_err_get(9F) to regularly check for errors before the driver allows data to be passed back to the calling application.
  The use of DDI_FLAGERR_ACC provides:
  - Error notification via the driver callback
  - An error condition observable via ddi_fm_acc_err_get(9F)
- DDI_CAUTIOUS_ACC
  The DDI_CAUTIOUS_ACC flag provides a high level of protection for each Programmed I/O access made by the driver.
  Use of this flag will cause a significant impact on the performance of the driver.
  The DDI_CAUTIOUS_ACC flag signifies that an error is anticipated by the accessing driver. The system attempts to handle and recover from an error associated with this handle as gracefully as possible. No error reports are generated as a result, but the handle's fme_status flag is set to DDI_FM_NONFATAL. This flag is functionally equivalent to ddi_peek(9F) and ddi_poke(9F).
  The use of DDI_CAUTIOUS_ACC provides:
  - Exclusive access to the bus
  - On trap protection (ddi_peek and ddi_poke)
  - Error notification through the driver callback registered with ddi_fm_handler_register(9F)
  - An error condition observable through ddi_fm_acc_err_get(9F)
Generally, drivers should check for data path errors at appropriate junctures in the code path to guarantee consistent data and to ensure that proper error status is presented in the I/O software stack.
DDI_FM_ACCCHK_CAPABLE device drivers must set their devacc_attr_access field to DDI_FLAGERR_ACC or DDI_CAUTIOUS_ACC.
DMA Attributes Structure
As with access handle setup, a DDI_FM_DMACHK_CAPABLE device driver must set the dma_attr_flags field of its ddi_dma_attr(9S) structure to the DDI_DMA_FLAGERR flag. The system attempts to recover from an error associated with a handle that has DDI_DMA_FLAGERR set. The ddi_dma_attr structure contains the following members:
uint_t dma_attr_version; /* version number */
uint64_t dma_attr_addr_lo; /* low DMA address range */
uint64_t dma_attr_addr_hi; /* high DMA address range */
uint64_t dma_attr_count_max; /* DMA counter register */
uint64_t dma_attr_align; /* DMA address alignment */
uint_t dma_attr_burstsizes; /* DMA burstsizes */
uint32_t dma_attr_minxfer; /* min effective DMA size */
uint64_t dma_attr_maxxfer; /* max DMA xfer size */
uint64_t dma_attr_seg; /* segment boundary */
int dma_attr_sgllen; /* s/g length */
uint32_t dma_attr_granular; /* granularity of device */
uint_t dma_attr_flags; /* Bus specific DMA flags */
Drivers that set the DDI_DMA_FLAGERR flag should use the techniques described in Defensive Programming Techniques for illumos Device Drivers and should use ddi_fm_dma_err_get(9F) to check for data path errors whenever DMA transactions are completed or at significant points within the code path. This ensures consistent data and proper error status presented to the I/O software stack.
The use of DDI_DMA_FLAGERR provides:
- Error notification via the driver callback registered with ddi_fm_handler_register
- An error condition observable by calling ddi_fm_dma_err_get
Getting Error Status
If a fault has occurred that affects the resource mapped by the handle, the error status structure is updated to reflect error information captured during error handling by a bus or other device driver in the I/O data path.
void ddi_fm_dma_err_get(ddi_dma_handle_t handle, ddi_fm_error_t *de, int version)
void ddi_fm_acc_err_get(ddi_acc_handle_t handle, ddi_fm_error_t *de, int version)
The ddi_fm_acc_err_get(9F) and ddi_fm_dma_err_get(9F) functions return the error status for an access or DMA handle, respectively. The version argument should be set to DDI_FME_VERSION.
An error for an access handle
means that an error has been detected that has affected PIO transactions to
or from the device using that access handle. Any data received by the driver,
for example via a recent ddi_get8(9F) call,
should be considered potentially corrupt. Any data sent to the device, for
example via a recent ddi_put32(9F) call
might also have been corrupted or might not have been received at all. The
underlying fault might, however, be transient, and the driver can therefore
attempt to recover by calling ddi_fm_acc_err_clear(9F),
resetting the device to get it back into a known state, and retrying any potentially
failed transactions.
If an error is indicated for a DMA handle, it implies that an error has been detected that has affected (or will affect) DMA transactions between the device and the memory currently bound to the handle (or most recently bound, if the handle is currently unbound). Possible causes include the failure of a component in the DMA data path, or an attempt by the device to make an invalid DMA access. The driver might be able to continue by retrying and reallocating memory. The contents of the memory currently (or previously) bound to the handle should be regarded as indeterminate and should be released back to the system. The fault indication associated with the current transaction is lost once the handle is bound or re-bound, but because the fault might persist, future DMA operations might not succeed.
Clearing Errors
These routines should be called when the driver wants to retry a request after an error was detected by the handle without needing to free and reallocate the handle first.
void ddi_fm_acc_err_clear(ddi_acc_handle_t handle, int version)
void ddi_fm_dma_err_clear(ddi_dma_handle_t handle, int version)
Registering an Error Handler
Error handling activity might begin at the time that the error is detected by the operating system via a trap or error interrupt. If the software responsible for handling the error (the error handler) cannot immediately isolate the device that was involved in the failed I/O operation, it must attempt to find a software module within the device tree that can perform the error isolation. The illumos device tree provides a structural means to propagate nexus driver error handling activities to children who might have a more detailed understanding of the error and can capture error state and isolate the problem device.
A driver can register an error handler callback with the I/O Fault
Services Framework. The error handler should be specific to the type of error
and subsystem where error detection has occurred. When the driver's error
handler routine is invoked, the driver must check for any outstanding errors
associated with device transactions and generate ereport events. The driver
must also return error handler status in its ddi_fm_error structure.
For example, if it has been determined that the system's integrity has been
compromised, the most appropriate action might be for the error handler to
panic the system.
The callback is invoked by a parent nexus driver when an error might be associated with a particular device instance. Device drivers that register error handlers must be DDI_FM_ERRCB_CAPABLE.
void ddi_fm_handler_register(dev_info_t *dip, ddi_err_func_t handler, void *impl_data)
The ddi_fm_handler_register(9F) routine registers an error handler callback with the I/O fault services framework. The ddi_fm_handler_register function should be called in the driver's attach(9E) entry point for callback registration following driver fault management initialization (ddi_fm_init).
The error handler callback function must:
-
Check for any outstanding hardware errors associated with device transactions, and generate ereport events for diagnosis. For a PCI, PCI-X, or PCI Express device this can generally be done using pci_ereport_post as described in Detecting and Reporting PCI-Related Errors.
-
Return error handler status in its ddi_fm_error structure:
-
DDI_FM_OK
-
DDI_FM_FATAL
-
DDI_FM_NONFATAL
-
DDI_FM_UNKNOWN
The error handler is passed the following arguments:
-
A pointer to a device instance (dip) under the driver's control
-
A data structure (ddi_fm_error) that contains common fault management data and status for error handling
-
A pointer to any implementation specific data (impl_data) specified at the time of the handler's registration
The error handler:
-
Must not hold locks
-
Must not sleep waiting for resources
The driver is responsible for:
-
Isolating the device instance that might have caused errors
-
Recovering transactions associated with errors
-
Reporting the service impact of errors
-
Scheduling device shutdown for errors considered fatal
These actions can be carried out within the error handler function.
However, because of the restrictions on locking and because the error handler
function does not always know the context of what the driver was doing at
the point where the fault occurred, it is more usual for these actions to
be carried out following inline calls to ddi_fm_acc_err_get(9F) and ddi_fm_dma_err_get(9F) within the normal paths of the driver as described previously.
/*
 * The I/O fault service error handling callback function
 */
/*ARGSUSED*/
static int
bge_fm_error_cb(dev_info_t *dip, ddi_fm_error_t *err, const void *impl_data)
{
	/*
	 * as the driver can always deal with an error
	 * in any dma or access handle, we can just return
	 * the fme_status value.
	 */
	pci_ereport_post(dip, err, NULL);
	return (err->fme_status);
}
Fault Management Data and Status Structure
Driver error handling callbacks are passed a pointer to a data structure that contains common fault management data and status for error handling.
The data structure ddi_fm_error
contains an
FMA protocol ENA for the current error, the status of the error handler callback,
an error expectation flag, and any potential access or DMA handles associated
with an error detected by the parent nexus.
fme_ena
-
This field is initialized by the calling parent nexus and might have been incremented along the error handling propagation chain before reaching the driver's registered callback routine. If the driver detects a related error of its own, it should increment this ENA prior to calling ddi_fm_ereport_post.
fme_acc_handle, fme_dma_handle
-
These fields contain a valid access or DMA handle if the parent was able to associate an error detected at its level to a handle mapped or bound by the device driver.
fme_flag
-
The fme_flag is set to DDI_FM_ERR_EXPECTED if the calling parent determines the error was the result of a DDI_CAUTIOUS_ACC protected operation. In this case, the fme_acc_handle is valid and the driver should check for and report only errors not associated with the DDI_CAUTIOUS_ACC protected operation. Otherwise, fme_flag is set to DDI_FM_ERR_UNEXPECTED and the driver must perform the full range of error handling tasks.
fme_status
-
Upon return from its error handler callback, the driver must set fme_status to one of the following values:
-
DDI_FM_OK – No errors were detected and the operational state of this device instance remains the same.
-
DDI_FM_FATAL – An error has occurred and the driver considers it to be fatal to the system. For example, a call to pci_ereport_post(9F) might have detected a system fatal error. In this case, the driver should report any additional error information it might have in the context of the driver.
-
DDI_FM_NONFATAL – An error has been detected by the driver but is not considered fatal to the system. The driver has identified the error and has either isolated the error or is committing that it will isolate the error.
-
DDI_FM_UNKNOWN – An error has been detected, but the driver is unable to isolate the device or determine the impact of the error on the operational state of the system.
13.1.4. Diagnosing Faults
The fault management daemon, fmd(1M), provides a programming interface for the development of diagnosis engine (DE) plug-in modules. A DE can be written to consume and diagnose any error telemetry or specific error telemetries. The eft DE was designed to diagnose any number of ereport classes based on diagnosis rules specified in the Eversholt language.
Standard Leaf Device Diagnosis
Most I/O subsystems use the eft DE and rules sets to diagnose device and device driver related problems. A standard set of ereports, listed in Reporting Standard I/O Controller Errors, has been specified for PCI leaf devices. Accompanying these ereports are eft diagnosis rules that take the telemetry and identify the associated device fault. Drivers that generate these ereports do not need to deliver any additional diagnosis software or eft rules.
The detection and generation of these ereports produces the following fault events:
fault.io.pci.bus-linkerr
-
A hardware fault on the PCI bus
fault.io.pci.device-interr
-
A hardware fault within the device
fault.io.pci.device-invreq
-
A hardware fault in the device or a defect in the driver that causes the device to send an invalid request
fault.io.pci.device-noresp
-
A hardware fault in the device that causes the device not to respond to a valid request
fault.io.pciex.bus-linkerr
-
A hardware fault on the link
fault.io.pciex.bus-noresp
-
The link going down so that a device cannot respond to a valid request
fault.io.pciex.device-interr
-
A hardware fault within the device
fault.io.pciex.device-invreq
-
A hardware fault in the device or a defect in the driver that causes the device to send an invalid request
fault.io.pciex.device-noresp
-
A hardware fault in the device causing it not to respond to a valid request
Specialized Device Diagnosis
Driver developers who want to generate additional ereports or provide more specialized diagnosis software or eft rules can do so by writing a C-based DE or an eft diagnosis rules set. See the Fault Management community on OpenSolaris for information.
13.1.5. Event Registry
The illumos event registry is the central repository of all class names, ereports, faults, defects, upsets, and suspect list (list.suspect) events. The event registry also contains the current definitions of all event member payloads, as well as important non-payload information such as internal documentation, suspect lists, dictionaries, and knowledge articles. For example, ereport.io and fault.io are two of the base class names that are of particular importance to I/O driver developers.
The FMA event protocol defines a base set of payload members that is supplied with each of the registered events. Developers can also define additional events that help diagnosis engines (or eft rules) to narrow a suspect list down to a specific fault.
13.1.6. Glossary
This section uses the following terms:
- Agent
-
A generic term used to describe fault manager modules that subscribe to fault.* or list.* events. Agents are used to retire faulty resources, communicate diagnosis results to administrators, and bridge to higher-level management frameworks.
- ASRU (Automated System Reconfiguration Unit)
-
The ASRU is a resource that can be disabled by software or hardware in order to isolate a problem in the system and suppress further error reports.
- DE (Diagnosis Engine)
-
A fault management module whose purpose is to diagnose problems by subscribing to one or more classes of incoming error events and using these events to solve cases associated with each problem on the system.
- ENA (Error Numeric Association)
-
An Error Numeric Association (ENA) is an encoded integer that uniquely identifies an error report within a given fault region and time period. The ENA also indicates the relationship of the error to previous errors as a secondary effect.
- Error
-
An unexpected condition, result, signal, or datum. An error is the symptom of a problem on the system. Each problem typically produces many different kinds of errors.
- ereport (Error Report)
-
The data captured with a particular error. Error report formats are defined in advance by creating a class naming the error report and defining a schema using the illumos event registry.
- ereport event (Error Event)
-
The data structure that represents an instance of an error report. Error events are represented as name-value pair lists.
- Fault
-
Malfunctioning behavior of a hardware component.
- Fault Boundary
-
Logical partition of hardware or software elements for which a specific set of faults can be enumerated.
- Fault Event
-
An instance of a fault diagnosis encoded in the protocol.
- Fault Manager
-
Software component responsible for fault diagnosis via one or more diagnosis engines and state management.
- FMRI (Fault Managed Resource Identifier)
-
An FMRI is a URL-like identifier that acts as the canonical name for a particular resource in the fault management system. Each FMRI includes a scheme that identifies the type of resource, and one or more values that are specific to the scheme. An FMRI can be represented as a URL-like string or as a name-value pair list data structure.
- FRU (Field Replaceable Unit)
-
The FRU is a resource that can be replaced in the field by a customer or service provider. FRUs can be defined for hardware (for example system boards) or for software (for example software packages or patches).
13.1.7. Resources
13.2. Defensive Programming Techniques for illumos Device Drivers
This section offers techniques for device drivers to avoid system panics and hangs, wasting system resources, and spreading data corruption. A driver is considered hardened when it uses these defensive programming practices in addition to the I/O fault services framework for error handling and diagnosis.
-
Each piece of hardware should be controlled by a separate instance of the device driver. See Device Configuration Concepts.
-
Programmed I/O (PIO) must be performed only through the DDI access functions, using the appropriate data access handle. See Device Access: Programmed I/O.
-
The device driver must assume that data that is received from the device might be corrupted. The driver must check the integrity of the data before the data is used.
-
The driver must avoid releasing bad data to the rest of the system.
-
Use only documented DDI functions and interfaces in your driver.
-
The driver must ensure that the device writes only into pages of memory in the DMA buffers (DDI_DMA_READ) that are controlled entirely by the driver. This technique prevents a DMA fault from corrupting an arbitrary part of the system's main memory.
-
The device driver must not be an unlimited drain on system resources if the device locks up. The driver should time out if a device claims to be continuously busy. The driver should also detect a pathological (stuck) interrupt request and take appropriate action.
-
The device driver must support hotplugging in illumos.
-
The device driver must use callbacks instead of waiting on resources.
-
The driver must free up resources after a fault. For example, the system must be able to close all minor devices and detach driver instances even after the hardware fails.
13.2.1. Using Separate Device Driver Instances
The illumos kernel allows multiple instances of a driver. Each instance has its own data space but shares the text and some global data with other instances. The device is managed on a per-instance basis. Drivers should use a separate instance for each piece of hardware unless the driver is designed to handle any failover internally. Multiple instances of a driver per slot can occur, for example, with multifunction cards.
13.2.2. Exclusive Use of DDI Access Handles
All PIO access by a driver must use illumos DDI access functions from the following families of routines:
-
ddi_getX
-
ddi_putX
-
ddi_rep_getX
-
ddi_rep_putX
The driver should not directly access the mapped registers by the address that is returned from ddi_regs_map_setup(9F). Avoid the ddi_peek(9F) and ddi_poke(9F) routines because these routines do not use access handles.
The DDI access mechanism is important because DDI access provides an opportunity to control how data is read into the kernel.
13.2.3. Detecting Corrupted Data
The following sections describe where data corruption can occur and how to detect corruption.
Corruption of Device Management and Control Data
The driver should assume that any data obtained from the device, whether by PIO or DMA, could have been corrupted. In particular, extreme care should be taken with pointers, memory offsets, and array indexes that are based on data from the device. Such values can be malignant, in that these values can cause a kernel panic if dereferenced. All such values should be checked for range and alignment (if required) before use.
Even a pointer that is not malignant can still be misleading. For example, a pointer can point to a valid but not correct instance of an object. Where possible, the driver should cross-check the pointer with the object to which it is pointing, or otherwise validate the data obtained through that pointer.
Other types of data can also be misleading, such as packet lengths, status words, or channel IDs. These data types should be checked to the extent possible. A packet length can be range-checked to ensure that the length is neither negative nor larger than the containing buffer. A status word can be checked for “impossible” bits. A channel ID can be matched against a list of valid IDs.
Where a value is used to identify a stream, the driver must ensure that the stream still exists. The asynchronous nature of processing STREAMS means that a stream can be dismantled while device interrupts are still outstanding.
The driver should not reread data from the device. The data should be read once, validated, and stored in the driver's local state. This technique avoids the hazard of data that is correct when initially read, but is incorrect when reread later.
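The checks described above can be sketched in plain C. The following is only an illustration under stated assumptions: the descriptor layout, the limits (RX_BUF_SIZE, STATUS_VALID_BITS, NUM_CHANNELS), and the function name are hypothetical and do not come from any real device or from the DDI. The sketch assumes the driver has already copied the descriptor out of device-visible memory exactly once, so the values cannot change between validation and use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits for an imaginary device. */
#define RX_BUF_SIZE        2048u        /* size of the receive buffer */
#define STATUS_VALID_BITS  0x0000000Fu  /* only these status bits are defined */
#define NUM_CHANNELS       4u           /* channel IDs 0..3 are valid */

/* Local snapshot of a descriptor, copied from the device exactly once. */
typedef struct rx_desc {
    uint32_t length;   /* packet length reported by the device */
    uint32_t status;   /* status word reported by the device */
    uint32_t channel;  /* channel ID reported by the device */
} rx_desc_t;

/*
 * Return true only if every field of the snapshot is plausible.
 * The length is unsigned, so a "negative" value from the device
 * appears as a very large number and fails the range check.
 */
static bool
rx_desc_is_sane(const rx_desc_t *d)
{
    if (d->length > RX_BUF_SIZE)
        return (false);     /* would overrun the containing buffer */
    if ((d->status & ~STATUS_VALID_BITS) != 0)
        return (false);     /* "impossible" status bits are set */
    if (d->channel >= NUM_CHANNELS)
        return (false);     /* not in the list of valid channels */
    return (true);
}
```

A driver following this pattern would discard or report a descriptor that fails the check rather than use any of its fields as an index or length.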
The driver should also ensure that all loops are bounded. For example,
a device that returns a continuous BUSY
status should not
be able to lock up the entire system.
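As a minimal sketch of a bounded loop, the following user-space model polls a status register a limited number of times. Everything here is an assumption made for illustration: the STATUS_BUSY bit, the MAX_POLLS budget, and the simulated device are hypothetical. A real driver would read the register through an access handle (for example with ddi_get32(9F)) and delay between polls.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_BUSY  0x1u    /* hypothetical BUSY bit */
#define MAX_POLLS    1000    /* hypothetical upper bound on polls */

/* Stand-in for a register read; not a DDI interface. */
typedef uint32_t (*read_status_fn)(void *arg);

/* Simulated device: busy_reads < 0 means "stuck BUSY forever". */
typedef struct fake_dev {
    int busy_reads;
} fake_dev_t;

static uint32_t
fake_read_status(void *arg)
{
    fake_dev_t *d = arg;

    if (d->busy_reads < 0)
        return (STATUS_BUSY);
    if (d->busy_reads > 0) {
        d->busy_reads--;
        return (STATUS_BUSY);
    }
    return (0);
}

/*
 * Poll until BUSY clears or the poll budget is exhausted.
 * Returns true if the device became ready, false on timeout.
 */
static bool
wait_not_busy(read_status_fn read_status, void *arg, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if ((read_status(arg) & STATUS_BUSY) == 0)
            return (true);
        /* A real driver would delay between polls here. */
    }
    return (false);    /* treat as a device fault; never spin forever */
}
```

On a false return, the caller can treat the device as faulty and fail the request instead of hanging the thread.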
Corruption of Received Data
Device errors can result in corrupted data being placed in receive buffers. Such corruption is indistinguishable from corruption that occurs beyond the domain of the device, for example, within a network. Typically, existing software is already in place to handle such corruption. One example is the integrity checks at the transport layer of a protocol stack. Another example is integrity checks within the application that uses the device.
If the received data is not to be checked for integrity at a higher layer, the data can be integrity-checked within the driver itself. Methods of detecting corruption in received data are typically device-specific. Checksums and CRC are examples of the kinds of checks that can be done.
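One possible in-driver integrity check is a CRC comparison. The sketch below computes the common reflected CRC-32 (polynomial 0xEDB88320) bit by bit and compares it with an expected value. This is an assumption for illustration only: the actual algorithm, and where the expected value comes from, depend on the device and protocol. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t
crc32_ieee(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return (crc ^ 0xFFFFFFFFu);
}

/* Return true if the payload matches the CRC that accompanied it. */
static bool
payload_is_intact(const uint8_t *payload, size_t len, uint32_t expected_crc)
{
    return (crc32_ieee(payload, len) == expected_crc);
}
```

A table-driven implementation would be faster; the bitwise form is used here only to keep the sketch short.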
13.2.4. DMA Isolation
A defective device might initiate an improper DMA transfer over the bus. This data transfer could corrupt good data that was previously delivered. A device that fails might generate a corrupt address that can contaminate memory that does not even belong to its own driver.
In systems with an IOMMU, a device can write only to pages mapped as writable for DMA. Therefore, such pages should be owned solely by one driver instance. These pages should not be shared with any other kernel structure. While the page in question is mapped as writable for DMA, the driver should be suspicious of data in that page. The page must be unmapped from the IOMMU before the page is passed beyond the driver, and before any validation of the data.
You can use ddi_umem_alloc(9F) to guarantee that a whole aligned page is allocated, or allocate multiple pages and ignore the memory below the first page boundary. You can find the size of an IOMMU page by using ddi_ptob(9F).
Alternatively, the driver can choose to copy the data into a safe part of memory before processing it. If this is done, the data must first be synchronized using ddi_dma_sync(9F).
Calls to ddi_dma_sync should specify DDI_DMA_SYNC_FORDEV before using DMA to transfer data to a device, and DDI_DMA_SYNC_FORCPU after using DMA to transfer data from the device to memory.
On some PCI-based systems with an IOMMU, devices can use PCI dual address cycles (64-bit addresses) to bypass the IOMMU. This capability gives the device the potential to corrupt any region of main memory. Device drivers must not attempt to use such a mode and should disable it.
13.2.5. Handling Stuck Interrupts
The driver must identify stuck interrupts because a persistently asserted interrupt severely affects system performance, almost certainly stalling a single-processor machine.
Sometimes the driver might have difficulty identifying a particular interrupt as invalid. For network drivers, if a receive interrupt is indicated but no new buffers have been made available, no work was needed. When this situation is an isolated occurrence, it is not a problem, since the actual work might already have been completed by another routine such as a read service.
On the other hand, continuous interrupts with no work for the driver to process can indicate a stuck interrupt line. For this reason, platforms allow a number of apparently invalid interrupts to occur before taking defensive action.
While appearing to have work to do, a hung device might be failing to update its buffer descriptors. The driver should defend against such repetitive requests.
In some cases, platform-specific bus drivers might be capable of identifying
a persistently unclaimed interrupt and can disable the offending device. However,
this relies on the driver's ability to identify the valid interrupts and return
the appropriate value. The driver should return a DDI_INTR_UNCLAIMED
result
unless the driver detects that the device legitimately asserted an interrupt.
The interrupt is legitimate only if the device actually requires the driver
to do some useful work.
The legitimacy of other, more incidental, interrupts is much harder to certify. An interrupt-expected flag is a useful tool for evaluating whether an interrupt is valid. Consider an interrupt such as descriptor free, which can be generated if all the device's descriptors had been previously allocated. If the driver detects that it has taken the last descriptor from the card, it can set an interrupt-expected flag. If this flag is not set when the associated interrupt is delivered, the interrupt is suspicious.
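The interrupt-expected flag can be modeled as a small state machine. The sketch below is a user-space illustration, not driver code: the structure, the result values, and the MAX_SUSPICIOUS threshold are hypothetical. In a real interrupt handler the first two results would correspond to returning DDI_INTR_CLAIMED and DDI_INTR_UNCLAIMED.

```c
#include <stdbool.h>

#define MAX_SUSPICIOUS  8    /* hypothetical tolerance before declaring a fault */

typedef enum {
    INTR_CLAIMED,      /* interrupt was expected and handled */
    INTR_UNCLAIMED,    /* interrupt was not expected; suspicious */
    INTR_FAULT         /* too many suspicious interrupts in a row */
} intr_result_t;

typedef struct intr_state {
    bool desc_free_expected;   /* set when the last descriptor is taken */
    int  suspicious_count;     /* consecutive unexpected interrupts */
} intr_state_t;

/* Called when the driver takes the last free descriptor from the card. */
static void
note_last_descriptor_taken(intr_state_t *s)
{
    s->desc_free_expected = true;
}

/* Called when a "descriptor free" interrupt is delivered. */
static intr_result_t
handle_desc_free_intr(intr_state_t *s)
{
    if (s->desc_free_expected) {
        s->desc_free_expected = false;
        s->suspicious_count = 0;
        return (INTR_CLAIMED);
    }
    if (++s->suspicious_count >= MAX_SUSPICIOUS)
        return (INTR_FAULT);
    return (INTR_UNCLAIMED);
}
```

Counting consecutive unexpected interrupts, rather than reacting to the first one, leaves room for the isolated harmless occurrences described earlier.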
Some informative interrupts might not be predictable, such as one that indicates that a medium has become disconnected or frame sync has been lost. The easiest method of detecting whether such an interrupt is stuck is to mask this particular source on first occurrence until the next polling cycle.
If the interrupt occurs again while disabled, the interrupt should be considered false. Some devices have interrupt status bits that can be read even if the mask register has disabled the associated source and might not be causing the interrupt. You can devise a more appropriate algorithm specific to your devices.
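The mask-until-next-poll approach can be sketched as follows. This is a simplified user-space model under stated assumptions: the structure and function names are hypothetical, and the model only tracks the driver's view of the mask. It does not account for devices whose status bits remain readable while masked, which the text above notes need device-specific handling.

```c
#include <stdbool.h>

typedef enum {
    INFO_HANDLED,    /* first occurrence: handled, source now masked */
    INFO_FALSE       /* arrived while masked: considered false */
} info_result_t;

typedef struct info_intr {
    bool masked;        /* driver has masked this interrupt source */
    int  false_count;   /* occurrences seen while masked */
} info_intr_t;

/* Called when the informative interrupt is delivered. */
static info_result_t
info_intr_arrived(info_intr_t *s)
{
    if (s->masked) {
        s->false_count++;
        return (INFO_FALSE);
    }
    s->masked = true;    /* mask the source until the next polling cycle */
    return (INFO_HANDLED);
}

/* Called from the periodic polling routine to re-enable the source. */
static void
info_intr_poll(info_intr_t *s)
{
    s->masked = false;
}
```

A driver could use false_count to decide when the source should be treated as stuck and left disabled.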
Avoid looping on interrupt status bits indefinitely. Break such loops if none of the status bits set at the start of a pass requires any real work.
13.2.6. Additional Programming Considerations
-
Thread interaction
-
Threats from top-down requests
-
Adaptive strategies
Thread Interaction
Kernel panics in a device driver are often caused by unexpected interaction of kernel threads after a device failure. When a device fails, threads can interact in ways that you did not anticipate.
If processing routines terminate early, the condition variable waiters are blocked because an expected signal is never given. Attempting to inform other modules of the failure or handling unanticipated callbacks can result in undesirable thread interactions. Consider the sequence of mutex acquisition and relinquishing that can occur during device failures.
Threads that originate in an upstream STREAMS module
can become involved in unfortunate paradoxes if those threads are used to
return to that module unexpectedly. Consider using alternative threads to
handle exception messages. For instance, a procedure might use a read-side service routine to communicate an M_ERROR, rather than handling the error directly with a read-side putnext(9F).
A failing STREAMS device that cannot be quiesced during close because of a fault can generate an interrupt after the stream has been dismantled. The interrupt handler must not attempt to use a stale stream pointer to try to process the message.
Threats From Top-Down Requests
While protecting the system from defective hardware, you also need to protect against driver misuse. Although the driver can assume that the kernel infrastructure is always correct (a trusted core), user requests passed to it can be potentially destructive.
For example, a user can request an action to be performed upon a user-supplied data block (M_IOCTL) that is smaller than the block size that is indicated in the control part of the message. The driver should never trust a user application.
Consider the construction of each type of ioctl that your driver can receive and the potential harm that the ioctl could cause. The driver should perform checks to ensure that it does not process a malformed ioctl.
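As an illustration of checking a request before acting on it, the sketch below validates a hypothetical request header against the amount of data actually supplied. The header layout, XYZ_CMD_MAX, MAX_REQ_LEN, and the function name are all assumptions made for this example; they are not part of the STREAMS or ioctl interfaces. The header is copied into a local structure first so that the checks run on a stable, properly aligned copy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XYZ_CMD_MAX  3u      /* hypothetical: commands 0..3 are defined */
#define MAX_REQ_LEN  512u    /* hypothetical: largest payload accepted */

/* Hypothetical control header that precedes the user data. */
typedef struct xyz_req_hdr {
    uint32_t cmd;         /* requested operation */
    uint32_t data_len;    /* payload length claimed by the caller */
} xyz_req_hdr_t;

/*
 * Return true only if the request is well formed: the header is
 * complete, the command is known, and the claimed payload length
 * does not exceed what was actually supplied.
 */
static bool
xyz_req_is_valid(const void *buf, size_t buf_len)
{
    xyz_req_hdr_t hdr;

    if (buf == NULL || buf_len < sizeof (hdr))
        return (false);
    memcpy(&hdr, buf, sizeof (hdr));    /* validate a private copy */

    if (hdr.cmd > XYZ_CMD_MAX)
        return (false);
    if (hdr.data_len > MAX_REQ_LEN)
        return (false);
    if (hdr.data_len > buf_len - sizeof (hdr))
        return (false);    /* claimed size exceeds supplied data */
    return (true);
}
```

The subtraction in the last check is safe because the function has already confirmed that buf_len is at least the header size.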
Adaptive Strategies
A driver can continue to provide service using faulty hardware. The driver can attempt to work around the identified problem by using an alternative strategy for accessing the device. Given that broken hardware is unpredictable and given the risk associated with additional design complexity, adaptive strategies are not always wise. At most, these strategies should be limited to periodic interrupt polling and retry attempts. Periodically retrying the device tells the driver when a device has recovered. Periodic polling can control the interrupt mechanism after a driver has been forced to disable interrupts.
Ideally, a system always has an alternative device to provide a vital system service. Service multiplexors in kernel or user space offer the best method of maintaining system services when a device fails. Such practices are beyond the scope of this section.
13.3. Driver Hardening Test Harness
The driver hardening test harness tests that the I/O fault services and defensive programming requirements have been correctly fulfilled. Hardened device drivers are resilient to potential hardware faults. You must test the resilience of device drivers as part of the driver development process. This type of testing requires that the driver handle a wide range of typical hardware faults in a controlled and repeatable way. The driver hardening test harness enables you to simulate such hardware faults in software.
The driver hardening test harness is an illumos device driver development tool. The test harness injects a wide range of simulated hardware faults when the driver under development accesses its hardware. This section describes how to configure the test harness, create error-injection specifications (referred to as errdefs), and execute the tests on your device driver.
The test harness intercepts calls from the driver to various DDI routines, then corrupts the result of the calls as if the hardware had caused the corruption. In addition, the harness allows for corruption of accesses to specific registers as well as definition of more random types of corruption.
The test harness can generate test scripts automatically by tracing all register accesses as well as direct memory access (DMA) and interrupt usage during the running of a specified workload. A script is generated that reruns that workload while injecting a set of faults into each access.
The driver tester should remove duplicate test cases from the generated scripts.
The test harness is implemented as a device driver called bofi, which stands for bus_ops fault injection, and two user-level utilities, th_define(1M) and th_manage(1M).
The test harness does the following:
-
Validates compliant use of illumos DDI services
-
Facilitates controlled corruption of programmed I/O (PIO) and DMA requests and interference with interrupts, thus simulating faults that occur in the hardware managed by the driver
-
Facilitates simulation of failures in the data path between the CPU and the device, which are reported from parent nexus drivers
-
Monitors a driver's access during a specified workload and generates fault-injection scripts
13.3.1. Fault Injection
The driver hardening test harness intercepts and, when requested, corrupts each access a driver makes to its hardware. This section provides information you should understand to create faults to test the resilience of your driver.
illumos devices are managed inside a tree-like structure called the device tree (devinfo tree). Each node of the devinfo tree stores information that relates to a particular instance of a device in the system. Each leaf node corresponds to a device driver, while all other nodes are called nexus nodes. Typically, a nexus represents a bus. A bus node isolates leaf drivers from bus dependencies, which enables architecturally independent drivers to be produced.
Many of the DDI functions, particularly the data access functions, result in upcalls to the bus nexus drivers. When a leaf driver accesses its hardware, it passes a handle to an access routine. The bus nexus understands how to manipulate the handle and fulfill the request. A DDI-compliant driver only accesses hardware through use of these DDI access routines. The test harness intercepts these upcalls before they reach the specified bus nexus. If the data access matches the criteria specified by the driver tester, the access is corrupted. If the data access does not match the criteria, it is given to the bus nexus to handle in the usual way.
A driver obtains an access handle by using the ddi_regs_map_setup(9F) function:
ddi_regs_map_setup(dip, rset, ma, offset, size, handle)
The arguments specify which “offboard” memory is to be mapped. The driver must use the returned handle when it references the mapped I/O addresses, since handles are meant to isolate drivers from the details of bus hierarchies. Therefore, do not directly use the returned mapped address, ma. Direct use of the mapped address destroys the current and future uses of the data access function mechanism.
-
I/O to Host:
ddi_getX(handle, ma)
ddi_rep_getX(handle, buf, ma, repcnt, flag)
-
Host to I/O:
ddi_putX(handle, ma, value)
ddi_rep_putX()
X is the bus transfer size in bits: 8, 16, 32, or 64. repcnt is the number of transfers of that size to perform.
DMA has a similar, yet richer, set of data access functions.
13.3.2. Setting Up the Test Harness
The driver hardening test harness is part of the Solaris Developer Cluster. If you have not installed this Solaris cluster, you must manually install the test harness packages appropriate for your platform.
Installing the Test Harness
To install the test harness packages (SUNWftduu and SUNWftdur), use the pkgadd(1M) command.
As superuser, go to the directory in which the packages are located and type:
# pkgadd -d . SUNWftduu SUNWftdur
Configuring the Test Harness
The test harness behavior is controlled by boot-time properties that are set in the /kernel/drv/bofi.conf configuration file. After the test harness is installed, set these properties to configure the harness to interact with your driver. When the harness configuration is complete, reboot the system to load the harness driver.
When the harness is first installed, enable the harness to intercept the DDI accesses to your driver by setting these properties:
bofi-nexus
-
Bus nexus type, such as the PCI bus
bofi-to-test
-
Name of the driver under test
For example, to test a PCI bus network driver called xyznetdrv, set the following property values:
bofi-nexus="pci"
bofi-to-test="xyznetdrv"
Other properties control how the harness checks the driver's use of the illumos DDI data access mechanisms: PIO reads and writes to peripherals, and DMA transfers to and from peripherals.
bofi-range-check
-
When this property is set, the test harness checks the consistency of the arguments that are passed to PIO data access functions.
bofi-ddi-check
-
When this property is set, the test harness verifies that the mapped address that is returned by ddi_regs_map_setup(9F) is not used outside of the context of the data access functions.
bofi-sync-check
-
When this property is set, the test harness verifies correct usage of DMA functions and ensures that the driver makes compliant use of ddi_dma_sync(9F).
13.3.3. Testing the Driver
This section describes how to create and inject faults by using the th_define(1M) and th_manage(1M) commands.
Creating Faults
The th_define utility provides an interface to the bofi device driver for defining errdefs. An errdef is a specification of how to corrupt a device driver's accesses to its hardware. The th_define command-line arguments determine the precise nature of the fault to be injected. If the supplied arguments define a consistent errdef, the th_define process stores the errdef with the bofi driver. The process then suspends itself until the criteria given by the errdef are satisfied. In practice, the suspension ends when the access counts go to zero (0).
Injecting Faults
The test harness operates at the level of data accesses. A data access has the following characteristics:
-
Type of hardware being accessed (driver name)
-
Instance of the hardware being accessed (driver instance)
-
Register set being tested
-
Subset of the register set that is targeted
-
Direction of the transfer (read or write)
-
Type of access (PIO or DMA)
The test harness intercepts data accesses that match an errdef and injects the specified fault. An errdef, specified with the th_define(1M) command, encodes the following information:
-
The driver instance and register set being tested (-n name, -i instance, and -r reg_number).
-
The subset of the register set eligible for corruption. This subset is indicated by providing an offset into the register set and a length from that offset (-l offset [len]).
-
The kind of access to be intercepted: log, pio, dma, pio_r, pio_w, dma_r, dma_w, intr (-a acc_types).
-
How many accesses should be faulted (-c count [failcount]).
-
The kind of corruption that should be applied to a qualifying access (-o operator [operand]):
-
Replace datum with a fixed value (EQUAL)
-
Perform a bitwise operation on the datum (AND, OR, XOR)
-
Ignore the transfer (for host to I/O accesses NO_TRANSFER)
-
Lose, delay, or inject spurious interrupts (LOSE, DELAY, EXTRA)
-
Use the -a acc_chk option to simulate framework faults in an errdef.
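As a rough illustration of the data operators listed above, the following user-space C sketch applies EQUAL, AND, OR, and XOR to a datum when an access falls inside the targeted subset of a register set. It is a model only, not the bofi driver's implementation: model_op_t, model_errdef_t, model_corrupt, and model_apply are hypothetical names, the struct holds only a few of the fields an errdef would carry, and NO_TRANSFER, the interrupt operators, and the access counts are left out.

```c
#include <stdint.h>

/* Data operators named in the text; the enum itself is hypothetical. */
typedef enum { OP_EQUAL, OP_AND, OP_OR, OP_XOR } model_op_t;

/* Hypothetical subset of an errdef: target window plus corruption. */
typedef struct model_errdef {
    uint64_t offset;    /* start of the targeted subset (-l offset) */
    uint64_t len;       /* length of the targeted subset (-l len) */
    model_op_t op;      /* corruption operator (-o operator) */
    uint64_t operand;   /* operator argument (-o operand) */
} model_errdef_t;

/* Apply one operator to one datum. */
uint64_t
model_corrupt(model_op_t op, uint64_t operand, uint64_t datum)
{
    switch (op) {
    case OP_EQUAL:
        return (operand);           /* replace with a fixed value */
    case OP_AND:
        return (datum & operand);
    case OP_OR:
        return (datum | operand);
    case OP_XOR:
        return (datum ^ operand);
    }
    return (datum);
}

/* Corrupt the datum only if the access offset lies inside the window. */
uint64_t
model_apply(const model_errdef_t *ed, uint64_t access_off, uint64_t datum)
{
    if (access_off < ed->offset || access_off - ed->offset >= ed->len)
        return (datum);             /* not a qualifying access */
    return (model_corrupt(ed->op, ed->operand, datum));
}
```

For instance, with a window of offset 4 and length 4 and an XOR operand of 0xff, a datum of 0x0f read at offset 4 comes back as 0xf0, while the same datum read at offset 8 is returned unchanged.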
Fault-Injection Process
-
Use the th_define(1M) command to create errdefs.
Create errdefs by passing test definitions to the bofi driver, which stores the definitions so they can be accessed by using the th_manage(1M) command.
-
Create a workload, then use the th_manage command to activate and manage the errdef.
The th_manage command is a user interface to the various ioctls that are recognized by the bofi harness driver. The th_manage command operates at the level of driver names and instances and includes these commands: get_handles to list access handles, start to activate errdefs, and stop to deactivate errdefs.
Activating an errdef causes qualifying data accesses to be faulted. The th_manage utility also supports these commands: broadcast to provide the current state of the errdef and clear_errors to clear the errdef.
See the th_define(1M) and th_manage(1M) man pages for more information.
Test Harness Warnings
You can configure the test harness to handle warning messages in the following ways:
-
Write warning messages to the console
-
Write warning messages to the console and then panic the system
Use the second method to help pinpoint the root cause of a problem.
When the bofi-range-check property value is set to warn, the harness prints the following messages (or panics if set to panic) when it detects a range violation of a DDI function by your driver:
ddi_getX() out of range addr %x not in %x
ddi_putX() out of range addr %x not in %x
ddi_rep_getX() out of range addr %x not in %x
ddi_rep_putX() out of range addr %x not in %x
X is 8, 16, 32, or 64.
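The messages above report an address that does not lie within a mapping. The following user-space C sketch shows one way such a containment test could be written. It is an assumption about the general shape of a range check, not the harness's actual code, and model_in_range and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if the access [addr, addr + size) lies entirely inside
 * the mapping [base, base + len).
 */
bool
model_in_range(uintptr_t base, size_t len, uintptr_t addr, size_t size)
{
    uintptr_t off;

    if (addr < base)
        return (false);
    off = addr - base;
    /* Compare against the remaining length so addr + size cannot wrap. */
    return (off <= len && size <= len - off);
}
```

With a mapping of 0x100 bytes at 0x1000, a 4-byte access at 0x10fc is accepted because it ends exactly at the end of the mapping, whereas a 4-byte access at 0x10fd is rejected because its last byte falls outside.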
When the harness has been requested to insert over 1000 extra interrupts, the following message is printed if the driver does not detect interrupt jabber:
undetected interrupt jabber - %s %d
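To avoid this message, the driver has to notice when it is being flooded with interrupts for which the device has no work. The following user-space C sketch models one possible bookkeeping scheme: count consecutive interrupts that found nothing to do and latch a jabber condition once a limit is exceeded. The names model_intr_state_t and model_intr_record and the limit of 1000 are assumptions made for this illustration. A real driver chooses its own threshold and, on detection, would report the fault and stop servicing the device, which is outside the scope of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_JABBER_LIMIT 1000u    /* hypothetical threshold */

typedef struct model_intr_state {
    uint32_t unclaimed; /* consecutive interrupts that found no work */
    bool jabber;        /* latched once the limit has been exceeded */
} model_intr_state_t;

/*
 * Record one interrupt. device_had_work says whether the handler found
 * anything to service. Returns true once jabber has been detected.
 */
bool
model_intr_record(model_intr_state_t *st, bool device_had_work)
{
    if (st->jabber)
        return (true);          /* stays latched */
    if (device_had_work) {
        st->unclaimed = 0;      /* a genuine interrupt resets the run */
        return (false);
    }
    if (++st->unclaimed > MODEL_JABBER_LIMIT)
        st->jabber = true;
    return (st->jabber);
}
```

Resetting the counter on every genuine interrupt keeps an occasional spurious interrupt from accumulating into a false detection over a long uptime.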
13.3.4. Using Scripts to Automate the Test Process
You can create fault-injection test scripts by using the logging access type of the th_define(1M) utility:
# th_define -n name -i instance -a log [-e fixup_script]
The th_define command takes the instance offline and brings it back online. Then th_define runs the workload that is described by the fixup_script and logs I/O accesses that are made by the driver instance.
The fixup_script is called twice with the set of optional arguments. The script is called once just before the instance is taken offline, and it is called again after the instance has been brought online.
The following variables are passed into the environment of the called executable:
- DRIVER_PATH
-
Device path of the instance
- DRIVER_INSTANCE
-
Instance number of the driver
- DRIVER_UNCONFIGURE
-
Set to 1 when the instance is about to be taken offline
- DRIVER_CONFIGURE
-
Set to 1 when the instance has just been brought online
Typically, the fixup_script ensures that the device under test is in a suitable state to be taken offline (unconfigured) or in a suitable state for error injection (for example, configured, error free, and servicing a workload). The following script is a minimal script for a network driver:
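#!/bin/ksh
# Minimal fixup script for the example network driver xyznetdrv.
driver=xyznetdrv
ifnum=$driver$DRIVER_INSTANCE
if [[ $DRIVER_CONFIGURE = 1 ]]; then
    # The instance has just been brought online: configure the
    # interface and start the workload.
    ifconfig $ifnum plumb
    ifconfig $ifnum ...
    ifworkload start $ifnum
elif [[ $DRIVER_UNCONFIGURE = 1 ]]; then
    # The instance is about to be taken offline: stop the workload
    # and release the interface.
    ifworkload stop $ifnum
    ifconfig $ifnum down
    ifconfig $ifnum unplumb
fi
exit $?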
The ifworkload command should initiate the workload as a background task. The fault injection occurs after the fixup_script configures the driver under test and brings it online (DRIVER_CONFIGURE is set to 1).
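Because the fixup_script is simply an executable that receives these variables in its environment, the same dispatch can also be written in C. The following sketch shows only the decision step and assumes the variable semantics described above. The names fixup_phase_t, fixup_phase_from, and fixup_phase are hypothetical, and the actions a real program would take in each phase are omitted.

```c
#include <stdlib.h>
#include <string.h>

typedef enum {
    FIXUP_NONE,         /* neither variable is set to 1 */
    FIXUP_CONFIGURE,    /* instance has just been brought online */
    FIXUP_UNCONFIGURE   /* instance is about to be taken offline */
} fixup_phase_t;

/* Decide the phase from the two variable values; NULL means unset. */
fixup_phase_t
fixup_phase_from(const char *configure, const char *unconfigure)
{
    if (configure != NULL && strcmp(configure, "1") == 0)
        return (FIXUP_CONFIGURE);
    if (unconfigure != NULL && strcmp(unconfigure, "1") == 0)
        return (FIXUP_UNCONFIGURE);
    return (FIXUP_NONE);
}

/* Read the variables from the environment of the running program. */
fixup_phase_t
fixup_phase(void)
{
    return (fixup_phase_from(getenv("DRIVER_CONFIGURE"),
        getenv("DRIVER_UNCONFIGURE")));
}
```

Splitting the decision from the getenv calls keeps the logic easy to exercise without modifying the process environment.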
If the -e fixup_script option is present, it must be the last option on the command line. If the -e option is not present, a default script is used. The default script repeatedly attempts to bring the device under test offline and online. Thus the workload consists of the driver's attach and detach paths.
The resulting log is converted into a set of executable scripts that are suitable for running unassisted fault-injection tests. These scripts are created in a subdirectory of the current directory with the name driver.test.id. The scripts inject faults, one at a time, into the driver while running the workload that is described by the fixup_script.
The driver tester has substantial control over the errdefs that are produced by the test automation process. See the th_define(1M) man page.
If the tester chooses a suitable range of workloads for the test scripts, the harness gives good coverage of the hardening aspects of the driver. However, to achieve full coverage, the tester might need to create additional test cases manually. Add these cases to the test scripts. To ensure that testing completes in a timely manner, you might need to manually delete duplicate test cases.
Automated Test Process
-
Identify the aspects of the driver to be tested.
-
Attach and detach
-
Plumb and unplumb under a stack
-
Normal data transfer
-
Documented debug modes
A separate workload script (fixup_script) must be generated for each mode of use.
-
For each mode of use, prepare an executable program (fixup_script) that configures and unconfigures the device, and creates and terminates a workload.
-
Run the th_define(1M) command with the errdefs, together with an access type of -a log.
-
Wait for the logs to fill.
The logs contain a dump of the bofi driver's internal buffers. This data is included at the front of the script.
Because it can take from a few seconds to several minutes to create the logs, use the th_manage broadcast command to check the progress.
-
Change to the created test directory and run the master test script.
The master script runs each generated test script in sequence. Separate test scripts are generated per register set.
-
Store the results for analysis.
Successful test results, such as success (corruption reported) and success (corruption undetected), show that the driver under test is behaving properly. The results are reported as failure (no service impact reported) if the harness detects that the driver has failed to report the service impact after reporting a fault, or if the driver fails to detect that an access or DMA handle has been marked as faulted.
It is fine for a few test not triggered failures to appear in the output. However, several such failures indicate that the test is not working properly. These failures can appear when the driver does not access the same registers as when the test scripts were generated.
-
Run the test on multiple instances of the driver concurrently to test the multithreading of error paths.
For example, each th_define command creates a separate directory that contains test scripts and a master script:
# th_define -n xyznetdrv -i 0 -a log -e script
# th_define -n xyznetdrv -i 1 -a log -e script
Once created, run the master scripts in parallel.
The generated scripts produce only simulated fault injections that are based on what was logged during the time the logging errdef was active. When you define a workload, ensure that the required results are logged. Also analyze the resulting logs and fault-injection specifications. Verify that the hardware access coverage that the resulting test scripts created is what is required.