Drivers for Block Devices
This chapter describes the structure of block device drivers. The kernel views a block device as a set of randomly accessible logical blocks. The file system uses a list of buf(9S) structures to buffer the data blocks between a block device and the user space. Only block devices can support a file system.
This chapter provides information on the following subjects:
16.1. Block Driver Structure Overview
Block Driver Roadmap shows data structures and routines that define the structure of a block device driver. Device drivers typically include the following elements:
-
Device-loadable driver section
-
Device configuration section
-
Device access section
The shaded device access section in the following figure illustrates entry points for block drivers.

Associated with each device driver is a dev_ops(9S) structure, which in turn refers to a cb_ops(9S) structure. See Driver Autoconfiguration for details on driver data structures.
Block device drivers provide these entry points:
Some of the entry points can be replaced by nodev(9F) or nulldev(9F) as appropriate.
16.2. File I/O
A file system is a tree-structured hierarchy of directories and files. Some file systems, such as the UNIX File System (UFS), reside on block-oriented devices. File systems are created by format(1M) and newfs(1M).
When an application issues a read(2) or write(2) system call to an ordinary file on the UFS file system, the file system can call the device driver strategy(9E) entry point for the block device on which the file system resides. The file system code can call strategy(9E) several times for a single read(2) or write(2) system call.
The file system code determines the logical device address, or logical block number, for each ordinary file block. A block I/O request is then built in the form of a buf(9S) structure directed at the block device. The driver strategy(9E) entry point then interprets the buf(9S) structure and completes the request.
16.3. Block Device Autoconfiguration
attach(9E) should perform the common initialization tasks for each instance of a device:
-
Allocating per-instance state structures
-
Mapping the device's registers
-
Registering device interrupts
-
Initializing mutex and condition variables
-
Creating power manageable components
-
Creating minor nodes
Block device drivers create minor nodes of type S_IFBLK. As a result, a block special file that represents the node appears in the /devices hierarchy.
Logical device names for block devices appear in the /dev/dsk directory, and consist of a controller number, bus-address number, disk number, and slice number. These names are created by the devfsadm(1M) program if the node type is set to DDI_NT_BLOCK or DDI_NT_BLOCK_CHAN. DDI_NT_BLOCK_CHAN should be specified if the device communicates on a channel, that is, a bus with an additional level of addressability. SCSI disks are a good example. DDI_NT_BLOCK_CHAN causes a bus-address field (tN) to appear in the logical name. DDI_NT_BLOCK should be used for most other devices.
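The naming rule can be illustrated with a small user-space sketch. It is not how devfsadm(1M) builds names; the function format_dsk_name and its convention of passing a negative target for a DDI_NT_BLOCK node are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a /dev/dsk logical name into buf (hypothetical helper).
 * A non-negative target models a DDI_NT_BLOCK_CHAN node, which has a
 * bus-address (tN) field. A negative target models a DDI_NT_BLOCK
 * node, which has none.
 * Returns the name length, or -1 if buf is too small.
 */
int
format_dsk_name(char *buf, size_t len, int ctrl, int target, int disk,
    int slice)
{
    int n;

    assert(buf != NULL);
    if (target >= 0) {
        n = snprintf(buf, len, "c%dt%dd%ds%d",
            ctrl, target, disk, slice);
    } else {
        n = snprintf(buf, len, "c%dd%ds%d", ctrl, disk, slice);
    }
    return ((n < 0 || (size_t)n >= len) ? -1 : n);
}
```

With these assumptions, controller 0, target 3, disk 0, slice 2 yields c0t3d0s2, while the same call with a negative target omits the tN field.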
A minor device refers to a partition on the disk. For each minor device, the driver must create an nblocks or Nblocks property. This integer property gives the number of blocks supported by the minor device expressed in units of DEV_BSIZE, that is, 512 bytes. The file system uses the nblocks and Nblocks properties to determine device limits. Nblocks is the 64-bit version of nblocks. Nblocks should be used with storage devices that can hold over 1 Tbyte of storage per disk. See Device Properties for more information.
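The 1-Tbyte boundary follows from the arithmetic: 2^40 bytes divided by 512-byte blocks is 2^31 blocks, one more than a signed 32-bit integer can hold. The following user-space sketch shows the calculation. The helper names are hypothetical, and a driver can simply always create Nblocks.

```c
#include <assert.h>
#include <stdint.h>

#define DEV_BSIZE 512   /* block size stated in the text */

/* Convert a capacity in bytes to whole DEV_BSIZE blocks. */
uint64_t
bytes_to_blocks(uint64_t bytes)
{
    return (bytes / DEV_BSIZE);
}

/* Return 1 if the block count fits in a signed 32-bit nblocks value. */
int
fits_in_nblocks(uint64_t blocks)
{
    /* Nblocks itself is a signed 64-bit property value. */
    assert(blocks <= (uint64_t)INT64_MAX);
    return (blocks <= (uint64_t)INT32_MAX);
}
```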
Block Driver attach Routine shows a typical attach(9E) entry point with emphasis on creating the device's minor node and the Nblocks property. Note that because this example uses Nblocks and not nblocks, ddi_prop_update_int64(9F) is called instead of ddi_prop_update_int(9F).
As a side note, this example shows the use of makedevice(9F) to create a device number for ddi_prop_update_int64. The major number that is passed to makedevice comes from ddi_driver_major(9F), which generates a major number from a pointer to a dev_info_t structure. Using ddi_driver_major is similar to using getmajor(9F), which extracts the major number from a dev_t device number.
static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
    int instance = ddi_get_instance(dip);
    struct xxstate *xsp;    /* per-instance state, allocated below */
    major_t maj_number;

    switch (cmd) {
    case DDI_ATTACH:
        /*
         * allocate a state structure (xsp) and initialize it
         * map the device's registers
         * add the device driver's interrupt handler(s)
         * initialize any mutexes and condition variables
         * read label information if the device is a disk
         * create power manageable components
         *
         * Create the device minor node. Note that the node_type
         * argument is set to DDI_NT_BLOCK.
         */
        if (ddi_create_minor_node(dip, "minor_name", S_IFBLK,
            instance, DDI_NT_BLOCK, 0) == DDI_FAILURE) {
            /* free resources allocated so far */
            /* Remove any previously allocated minor nodes */
            ddi_remove_minor_node(dip, NULL);
            return (DDI_FAILURE);
        }
        /*
         * Create driver properties like "Nblocks". If the device
         * is a disk, the Nblocks property is usually calculated from
         * information in the disk label. Use "Nblocks" instead of
         * "nblocks" to ensure the property works for large disks.
         */
        xsp->Nblocks = size;
        /* size is the size of the device in 512 byte blocks */
        maj_number = ddi_driver_major(dip);
        if (ddi_prop_update_int64(makedevice(maj_number, instance), dip,
            "Nblocks", xsp->Nblocks) != DDI_PROP_SUCCESS) {
            cmn_err(CE_CONT, "%s: cannot create Nblocks property\n",
                ddi_get_name(dip));
            /* free resources allocated so far */
            return (DDI_FAILURE);
        }
        xsp->open = 0;
        xsp->nlayered = 0;
        /* ... */
        return (DDI_SUCCESS);
    case DDI_RESUME:
        /* For information, see Chapter 12, "Power Management," in this book. */
        /* FALLTHROUGH */
    default:
        return (DDI_FAILURE);
    }
}
16.4. Controlling Device Access
This section describes the entry points for open and close functions in block device drivers. See Drivers for Character Devices for more information on open(9E) and close(9E).
16.4.1. open Entry Point (Block Drivers)
The open(9E) entry point is used to gain access to a given device. The open(9E) routine of a block driver is called when a user thread issues an open(2) or mount(2) system call on a block special file associated with the minor device, or when a layered driver calls open(9E). See File I/O for more information.
The open entry point should check for the following conditions:
-
The device can be opened, that is, the device is online and ready.
-
The device can be opened as requested. The device supports the operation. The device's current state does not conflict with the request.
-
The caller has permission to open the device.
The following example demonstrates a block driver open(9E) entry point.
static int
xxopen(dev_t *devp, int flags, int otyp, cred_t *credp)
{
    minor_t instance;
    struct xxstate *xsp;

    instance = getminor(*devp);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL)
        return (ENXIO);
    mutex_enter(&xsp->mu);
    /*
     * only honor FEXCL. If a regular open or a layered open
     * is still outstanding on the device, the exclusive open
     * must fail.
     */
    if ((flags & FEXCL) && (xsp->open || xsp->nlayered)) {
        mutex_exit(&xsp->mu);
        return (EAGAIN);
    }
    switch (otyp) {
    case OTYP_LYR:
        xsp->nlayered++;
        break;
    case OTYP_BLK:
        xsp->open = 1;
        break;
    default:
        mutex_exit(&xsp->mu);
        return (EINVAL);
    }
    mutex_exit(&xsp->mu);
    return (0);
}
The otyp argument is used to specify the type of open on the device. OTYP_BLK is the typical open type for a block device. A device can be opened several times with otyp set to OTYP_BLK. close(9E) is called only once when the final close of type OTYP_BLK has occurred for the device. otyp is set to OTYP_LYR if the device is being used as a layered device. For every open of type OTYP_LYR, the layering driver issues a corresponding close of type OTYP_LYR. The example keeps track of each type of open so the driver can determine when the device is not being used in close(9E).
16.4.2. close Entry Point (Block Drivers)
The close(9E) entry point uses the same arguments as open(9E) with one exception. dev is the device number rather than a pointer to the device number.
The close routine should verify otyp in the same way as was described for the open(9E) entry point. In the following example, close must determine when the device can really be closed. Closing is affected by the number of block opens and layered opens.
static int
xxclose(dev_t dev, int flag, int otyp, cred_t *credp)
{
    minor_t instance;
    struct xxstate *xsp;

    instance = getminor(dev);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL)
        return (ENXIO);
    mutex_enter(&xsp->mu);
    switch (otyp) {
    case OTYP_LYR:
        xsp->nlayered--;
        break;
    case OTYP_BLK:
        xsp->open = 0;
        break;
    default:
        mutex_exit(&xsp->mu);
        return (EINVAL);
    }
    if (xsp->open || xsp->nlayered) {
        /* not done yet */
        mutex_exit(&xsp->mu);
        return (0);
    }
    /* cleanup (rewind tape, free memory, etc.) */
    /* wait for I/O to drain */
    mutex_exit(&xsp->mu);
    return (0);
}
16.4.3. strategy Entry Point
The strategy(9E) entry point is used to read and write data buffers to and from a block device. The name strategy refers to the fact that this entry point might implement some optimal strategy for ordering requests to the device.
strategy(9E) can be written to process one request at a time, that is, a synchronous transfer. strategy can also be written to queue multiple requests to the device, as in an asynchronous transfer. When choosing a method, the abilities and limitations of the device should be taken into account.
The strategy(9E) routine is passed a pointer to a buf(9S) structure. This structure describes the transfer request, and contains status information on return. buf(9S) and strategy(9E) are the focus of block device operations.
16.4.4. buf Structure
The following buf structure members are important to block drivers:
int b_flags; /* Buffer Status */
struct buf *av_forw; /* Driver work list link */
struct buf *av_back; /* Driver work list link */
size_t b_bcount; /* # of bytes to transfer */
union {
caddr_t b_addr; /* Buffer's virtual address */
} b_un;
daddr_t b_blkno; /* Block number on device */
diskaddr_t b_lblkno; /* Expanded block number on device */
size_t b_resid; /* # of bytes not transferred after error */
int b_error; /* Expanded error field */
void *b_private; /* “opaque” driver private area */
dev_t b_edev; /* expanded dev field */
where:
av_forw and av_back
-
Pointers that the driver can use to manage a list of buffers. See Asynchronous Data Transfers (Block Drivers) for a discussion of the av_forw and av_back pointers.
b_bcount
-
Specifies the number of bytes to be transferred by the device.
b_un.b_addr
-
The kernel virtual address of the data buffer. This address is valid only after a call to bp_mapin(9F).
b_blkno
-
The starting 32-bit logical block number on the device for the data transfer, which is expressed in 512-byte DEV_BSIZE units. The driver should use either b_blkno or b_lblkno but not both.
b_lblkno
-
The starting 64-bit logical block number on the device for the data transfer, which is expressed in 512-byte DEV_BSIZE units. The driver should use either b_blkno or b_lblkno but not both.
b_resid
-
Set by the driver to indicate the number of bytes that were not transferred because of an error. See Block Driver Routine for Asynchronous Interrupts for an example of setting b_resid. The b_resid member is overloaded. b_resid is also used by disksort(9F).
b_error
-
Set to an error number by the driver when a transfer error occurs. b_error is set in conjunction with the b_flags B_ERROR bit. See the Intro(9E) man page for details about error values. Drivers should use bioerror(9F) rather than setting b_error directly.
b_flags
-
Flags with status and transfer attributes of the buf structure. If B_READ is set, the buf structure indicates a transfer from the device to memory. Otherwise, this structure indicates a transfer from memory to the device. If the driver encounters an error during data transfer, the driver should set the B_ERROR field in the b_flags member. In addition, the driver should provide a more specific error value in b_error. Drivers should use bioerror(9F) rather than setting B_ERROR. Drivers should never clear b_flags.
b_private
-
For exclusive use by the driver to store driver-private data.
b_edev
-
Contains the device number of the device that was used in the transfer.
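The relationship between b_flags, B_ERROR, and b_error can be sketched with a user-space model. The structure and functions below are hypothetical stand-ins rather than the kernel's buf(9S) or bioerror(9F), and the flag bit values are placeholders, not the values from the kernel headers. The model assumes that passing an error of 0 clears the error indication, which is how the bioerror(9F) man page describes retrying a request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_B_ERROR 0x1   /* placeholder bit, not the kernel value */
#define MODEL_B_READ  0x2   /* placeholder bit, not the kernel value */

struct model_buf {
    int    b_flags;
    size_t b_bcount;
    size_t b_resid;
    int    b_error;
};

/* Mark a buffer as failed, or with error == 0, clear the failure. */
void
model_bioerror(struct model_buf *bp, int error)
{
    assert(bp != NULL);
    if (error != 0)
        bp->b_flags |= MODEL_B_ERROR;
    else
        bp->b_flags &= ~MODEL_B_ERROR;
    bp->b_error = error;
}

/* Return 1 for a device-to-memory transfer, 0 for memory-to-device. */
int
model_is_read(const struct model_buf *bp)
{
    assert(bp != NULL);
    return ((bp->b_flags & MODEL_B_READ) != 0);
}
```

Note that the model changes only the error bit and leaves the other flags, such as the transfer direction, untouched.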
bp_mapin Structure
A buf structure pointer can be passed into the device driver's strategy(9E) routine. However, the data buffer referred to by b_un.b_addr is not necessarily mapped in the kernel's address space. Therefore, the driver cannot directly access the data. Most block-oriented devices have DMA capability and therefore do not need to access the data buffer directly. Instead, these devices use the DMA mapping routines to enable the device's DMA engine to do the data transfer. For details about using DMA, see Direct Memory Access (DMA).
If a driver needs to access the data buffer directly, that driver must first map the buffer into the kernel's address space by using bp_mapin(9F). bp_mapout(9F) should be used when the driver no longer needs to access the data directly.
bp_mapout(9F) should only be called on buffers that have been allocated and are owned by the device driver. bp_mapout must not be called on buffers that are passed to the driver through the strategy(9E) entry point, such as buffers from a file system. bp_mapin(9F) does not keep a reference count, so bp_mapout(9F) removes any kernel mapping on which a layer over the device driver might rely.
16.5. Synchronous Data Transfers (Block Drivers)
This section presents a simple method for performing synchronous I/O transfers. This method assumes that the hardware is a simple disk device that can transfer only one data buffer at a time by using DMA. Another assumption is that the disk can be spun up and spun down by software command. The device driver's strategy(9E) routine waits for the current request to be completed before accepting a new request. The device interrupts when the transfer is complete. The device also interrupts if an error occurs.
The steps for performing a synchronous data transfer for a block driver are as follows:
-
Check for invalid buf(9S) requests.
Check the buf(9S) structure that is passed to strategy(9E) for validity. All drivers should check the following conditions:
-
The request begins at a valid block. The driver converts the b_blkno field to the correct device offset and then determines whether the offset is valid for the device.
-
The request does not go beyond the last block on the device.
-
Device-specific requirements are met.
If an error is encountered, the driver should indicate the appropriate error with bioerror(9F). The driver should then complete the request by calling biodone(9F). biodone notifies the caller of strategy(9E) that the transfer is complete. In this case, the transfer has stopped because of an error.
-
Check whether the device is busy.
Synchronous data transfers allow single-threaded access to the device. The device driver enforces this access in two ways:
-
The driver maintains a busy flag that is guarded by a mutex.
-
The driver waits on a condition variable with cv_wait(9F) when the device is busy.
If the device is busy, the thread waits until the interrupt handler indicates that the device is no longer busy. The available status can be indicated by either the cv_broadcast(9F) or the cv_signal(9F) function. See Multithreading for details on condition variables.
When the device is no longer busy, the strategy(9E) routine marks the device as busy. strategy then prepares the buffer and the device for the transfer.
-
Set up the buffer for DMA.
Prepare the data buffer for a DMA transfer by using ddi_dma_alloc_handle(9F) to allocate a DMA handle. Use ddi_dma_buf_bind_handle(9F) to bind the data buffer to the handle. For information on setting up DMA resources and related data structures, see Direct Memory Access (DMA).
-
Begin the transfer.
At this point, a pointer to the buf(9S) structure is saved in the state structure of the device. The interrupt routine can then complete the transfer by calling biodone(9F).
The device driver then accesses device registers to initiate a data transfer. In most cases, the driver should protect the device registers from other threads by using mutexes. In this case, because strategy(9E) is single-threaded, guarding the device registers is not necessary. See Multithreading for details about data locks.
When the executing thread has started the device's DMA engine, the driver can return execution control to the calling routine, as follows:
static int
xxstrategy(struct buf *bp)
{
    struct xxstate *xsp;
    struct device_reg *regp;
    minor_t instance;
    ddi_dma_cookie_t cookie;

    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL) {
        bioerror(bp, ENXIO);
        biodone(bp);
        return (0);
    }
    /* validate the transfer request */
    if ((bp->b_blkno >= xsp->Nblocks) || (bp->b_blkno < 0)) {
        bioerror(bp, EINVAL);
        biodone(bp);
        return (0);
    }
    /*
     * Hold off all threads until the device is not busy.
     */
    mutex_enter(&xsp->mu);
    while (xsp->busy) {
        cv_wait(&xsp->cv, &xsp->mu);
    }
    xsp->busy = 1;
    mutex_exit(&xsp->mu);
    /*
     * If the device has power manageable components,
     * mark the device busy with pm_busy_component(9F),
     * and then ensure that the device
     * is powered up by calling pm_raise_power(9F).
     *
     * Set up DMA resources with ddi_dma_alloc_handle(9F) and
     * ddi_dma_buf_bind_handle(9F).
     */
    xsp->bp = bp;
    regp = xsp->regp;
    ddi_put32(xsp->data_access_handle, &regp->dma_addr,
        cookie.dmac_address);
    ddi_put32(xsp->data_access_handle, &regp->dma_size,
        (uint32_t)cookie.dmac_size);
    ddi_put8(xsp->data_access_handle, &regp->csr,
        ENABLE_INTERRUPTS | START_TRANSFER);
    return (0);
}
-
Handle the interrupting device.
When the device finishes the data transfer, the device generates an interrupt, which eventually results in the driver's interrupt routine being called. Most drivers specify the state structure of the device as the argument to the interrupt routine when registering interrupts. See the ddi_add_intr(9F) man page and Registering Interrupts. The interrupt routine can then access the buf(9S) structure being transferred, plus any other information that is available from the state structure.
The interrupt handler should check the device's status register to determine whether the transfer completed without error. If an error occurred, the handler should indicate the appropriate error with bioerror(9F). The handler should also clear the pending interrupt for the device and then complete the transfer by calling biodone(9F).
As the final task, the handler clears the busy flag. The handler then calls cv_signal(9F) or cv_broadcast(9F) on the condition variable, signaling that the device is no longer busy. This notification enables other threads waiting for the device in strategy(9E) to proceed with the next data transfer.
The following example shows a synchronous interrupt routine.
static u_int
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    uint8_t status;

    mutex_enter(&xsp->mu);
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
        mutex_exit(&xsp->mu);
        return (DDI_INTR_UNCLAIMED);
    }
    /* Get the buf responsible for this interrupt */
    bp = xsp->bp;
    xsp->bp = NULL;
    /*
     * This example is for a simple device which either
     * succeeds or fails the data transfer, indicated in the
     * command/status register.
     */
    if (status & DEVICE_ERROR) {
        /* failure */
        bp->b_resid = bp->b_bcount;
        bioerror(bp, EIO);
    } else {
        /* success */
        bp->b_resid = 0;
    }
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
        CLEAR_INTERRUPT);
    /* The transfer has finished, successfully or not */
    biodone(bp);
    /*
     * If the device has power manageable components that were
     * marked busy in strategy(9E), mark them idle now with
     * pm_idle_component(9F).
     * Release any resources used in the transfer, such as DMA
     * resources (ddi_dma_unbind_handle(9F) and
     * ddi_dma_free_handle(9F)).
     *
     * Let the next I/O thread have access to the device.
     */
    xsp->busy = 0;
    cv_signal(&xsp->cv);
    mutex_exit(&xsp->mu);
    return (DDI_INTR_CLAIMED);
}
16.6. Asynchronous Data Transfers (Block Drivers)
This section presents a method for performing asynchronous I/O transfers. The driver queues the I/O requests and then returns control to the caller. Again, the assumption is that the hardware is a simple disk device that allows one transfer at a time. The device interrupts when a data transfer has completed. An interrupt also takes place if an error occurs. The basic steps for performing asynchronous data transfers are:
-
Check for invalid buf(9S) requests.
-
Enqueue the request.
-
Start the first transfer.
-
Handle the interrupting device.
16.6.1. Checking for Invalid buf Requests
As in the synchronous case, the device driver should check the buf(9S) structure passed to strategy(9E) for validity. See Synchronous Data Transfers (Block Drivers) for more details.
16.6.2. Enqueuing the Request
Unlike synchronous data transfers, a driver does not wait for an asynchronous request to complete. Instead, the driver adds the request to a queue. The head of the queue can be the current transfer. The head of the queue can also be a separate field in the state structure for holding the active request, as in Enqueuing Data Transfer Requests for Block Drivers.
If the queue is initially empty, then the hardware is not busy and strategy(9E) starts the transfer before returning. Otherwise, if a transfer completes with a non-empty queue, the interrupt routine begins a new transfer. Enqueuing Data Transfer Requests for Block Drivers places the decision of whether to start a new transfer into a separate routine for convenience.
The driver can use the av_forw and the av_back members of the buf(9S) structure to manage a list of transfer requests. A single pointer can be used to manage a singly linked list, or both pointers can be used together to build a doubly linked list. The device hardware specification specifies which type of list management, such as insertion policies, is used to optimize the performance of the device. The transfer list is a per-device list, so the head and tail of the list are stored in the state structure.
The following example provides multiple threads with access to the driver shared data, such as the transfer list. You must identify the shared data and must protect the data with a mutex. See Multithreading for more details about mutex locks.
static int
xxstrategy(struct buf *bp)
{
    struct xxstate *xsp;
    minor_t instance;

    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    /* ... */
    /* validate transfer request */
    /* ... */
    /*
     * Add the request to the end of the queue. Depending on the device, a sorting
     * algorithm, such as disksort(9F) can be used if it improves the
     * performance of the device.
     */
    mutex_enter(&xsp->mu);
    bp->av_forw = NULL;
    if (xsp->list_head) {
        /* Non-empty transfer list */
        xsp->list_tail->av_forw = bp;
        xsp->list_tail = bp;
    } else {
        /* Empty Transfer list */
        xsp->list_head = bp;
        xsp->list_tail = bp;
    }
    mutex_exit(&xsp->mu);
    /* Start the transfer if possible */
    (void) xxstart((caddr_t)xsp);
    return (0);
}
16.6.3. Starting the First Transfer
Device drivers that implement queuing usually have a start routine. start dequeues the next request and starts the data transfer to or from the device. In this example, start processes all requests regardless of the state of the device, whether busy or free.
start must be written to be called from any context. start can be called by both the strategy routine in kernel context and the interrupt routine in interrupt context.
start is called by strategy(9E) every time strategy queues a request so that an idle device can be started. If the device is busy, start returns immediately.
start is also called by the interrupt handler before the handler returns from a claimed interrupt so that a nonempty queue can be serviced. If the queue is empty, start returns immediately.
Because start is a private driver routine, start can take any arguments and can return any type. The following code sample is written to be used as a DMA callback, although that portion is not shown. Accordingly, the example must take a caddr_t as an argument and return an int. See Handling Resource Allocation Failures for more information about DMA callback routines.
static int
xxstart(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    ddi_dma_cookie_t cookie;    /* filled in by the DMA setup below */

    mutex_enter(&xsp->mu);
    /*
     * If there is nothing more to do, or the device is
     * busy, return.
     */
    if (xsp->list_head == NULL || xsp->busy) {
        mutex_exit(&xsp->mu);
        return (0);
    }
    xsp->busy = 1;
    /* Get the first buffer off the transfer list */
    bp = xsp->list_head;
    /* Update the head and tail pointer */
    xsp->list_head = xsp->list_head->av_forw;
    if (xsp->list_head == NULL)
        xsp->list_tail = NULL;
    bp->av_forw = NULL;
    mutex_exit(&xsp->mu);
    /*
     * If the device has power manageable components,
     * mark the device busy with pm_busy_component(9F),
     * and then ensure that the device
     * is powered up by calling pm_raise_power(9F).
     *
     * Set up DMA resources with ddi_dma_alloc_handle(9F) and
     * ddi_dma_buf_bind_handle(9F).
     */
    xsp->bp = bp;
    ddi_put32(xsp->data_access_handle, &xsp->regp->dma_addr,
        cookie.dmac_address);
    ddi_put32(xsp->data_access_handle, &xsp->regp->dma_size,
        (uint32_t)cookie.dmac_size);
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
        ENABLE_INTERRUPTS | START_TRANSFER);
    return (0);
}
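The interaction between enqueuing, the start routine, and completion can be traced with a single-threaded user-space model. This is a sketch under stated assumptions: all names are hypothetical, the mutex is omitted because nothing runs concurrently, and model_complete stands in for the interrupt handler.

```c
#include <assert.h>
#include <stddef.h>

struct model_buf {
    struct model_buf *av_forw;  /* next request in the list */
    int id;                     /* hypothetical request tag */
};

struct model_state {
    struct model_buf *list_head;
    struct model_buf *list_tail;
    struct model_buf *active;   /* request the "hardware" owns */
    int busy;
};

/* Append a request to the tail of the transfer list. */
void
model_enqueue(struct model_state *sp, struct model_buf *bp)
{
    assert(sp != NULL && bp != NULL);
    bp->av_forw = NULL;
    if (sp->list_head != NULL)
        sp->list_tail->av_forw = bp;
    else
        sp->list_head = bp;
    sp->list_tail = bp;
}

/* Start the next request if the device is idle. Returns 1 if started. */
int
model_start(struct model_state *sp)
{
    struct model_buf *bp;

    assert(sp != NULL);
    if (sp->list_head == NULL || sp->busy)
        return (0);
    sp->busy = 1;
    bp = sp->list_head;
    sp->list_head = bp->av_forw;
    if (sp->list_head == NULL)
        sp->list_tail = NULL;
    bp->av_forw = NULL;
    sp->active = bp;
    return (1);
}

/* Model the interrupt: retire the active request, then restart. */
struct model_buf *
model_complete(struct model_state *sp)
{
    struct model_buf *done;

    assert(sp != NULL);
    done = sp->active;
    sp->active = NULL;
    sp->busy = 0;
    (void) model_start(sp);
    return (done);
}
```

Calling model_start after every enqueue and after every completion mirrors the two call sites described above: the first call finds an idle device and starts it, and later calls return immediately until a completion frees the device.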
16.6.4. Handling the Interrupting Device
The interrupt routine is similar to the synchronous version, with the addition of the call to start and the removal of the call to cv_signal(9F).
static u_int
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    uint8_t status;

    mutex_enter(&xsp->mu);
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
        mutex_exit(&xsp->mu);
        return (DDI_INTR_UNCLAIMED);
    }
    /* Get the buf responsible for this interrupt */
    bp = xsp->bp;
    xsp->bp = NULL;
    /*
     * This example is for a simple device which either
     * succeeds or fails the data transfer, indicated in the
     * command/status register.
     */
    if (status & DEVICE_ERROR) {
        /* failure */
        bp->b_resid = bp->b_bcount;
        bioerror(bp, EIO);
    } else {
        /* success */
        bp->b_resid = 0;
    }
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
        CLEAR_INTERRUPT);
    /* The transfer has finished, successfully or not */
    biodone(bp);
    /*
     * If the device has power manageable components that were
     * marked busy in strategy(9E), mark them idle now with
     * pm_idle_component(9F).
     * Release any resources used in the transfer, such as DMA
     * resources (ddi_dma_unbind_handle(9F) and
     * ddi_dma_free_handle(9F)).
     *
     * Let the next I/O thread have access to the device.
     */
    xsp->busy = 0;
    mutex_exit(&xsp->mu);
    (void) xxstart((caddr_t)xsp);
    return (DDI_INTR_CLAIMED);
}
16.7. dump and print Entry Points
This section discusses the dump(9E) and print(9E) entry points.
16.7.1. dump Entry Point (Block Drivers)
The dump(9E) entry point is used to copy a portion of virtual address space directly to the specified device in the case of a system failure. dump is also used to copy the state of the kernel out to disk during a checkpoint operation. See the cpr(7) and dump(9E) man pages for more information. The entry point must be capable of performing this operation without the use of interrupts, because interrupts are disabled during the checkpoint operation.
int dump(dev_t dev, caddr_t addr, daddr_t blkno, int nblk)
where:
- dev
-
Device number of the device to receive the dump.
- addr
-
Base kernel virtual address at which to start the dump.
- blkno
-
Block at which the dump is to start.
- nblk
-
Number of blocks to dump.
The dump depends upon the existing driver working properly.
16.7.2. print Entry Point (Block Drivers)
int print(dev_t dev, char *str)
The print(9E) entry point is called by the system to display a message about an exception that has been detected. print(9E) should call cmn_err(9F) to post the message to the console on behalf of the system. The following example demonstrates a typical print entry point.
static int
xxprint(dev_t dev, char *str)
{
    cmn_err(CE_CONT, "xx: %s\n", str);
    return (0);
}
16.8. Disk Device Drivers
Disk devices represent an important class of block device drivers.
16.8.1. Disk ioctls
illumos disk drivers need to support a minimum set of ioctl commands specific to illumos disk drivers. These I/O controls are specified in the dkio(7I) manual page. Disk I/O controls transfer disk information to or from the device driver. An illumos disk device is supported by disk utility commands such as format(1M) and newfs(1M). The mandatory illumos disk I/O controls are as follows:
DKIOCINFO
-
Returns information that describes the disk controller
DKIOCGAPART
-
Returns a disk's partition map
DKIOCSAPART
-
Sets a disk's partition map
DKIOCGGEOM
-
Returns a disk's geometry
DKIOCSGEOM
-
Sets a disk's geometry
DKIOCGVTOC
-
Returns a disk's Volume Table of Contents
DKIOCSVTOC
-
Sets a disk's Volume Table of Contents
16.8.2. Disk Performance
The illumos DDI/DKI provides facilities to optimize I/O transfers for improved file system performance. A mechanism manages the list of I/O requests so as to optimize disk access for a file system. See Asynchronous Data Transfers (Block Drivers) for a description of enqueuing an I/O request.
The diskhd structure is used to manage a linked list of I/O requests.
struct diskhd {
    long b_flags;                    /* not used, needed for consistency */
    struct buf *b_forw, *b_back;     /* queue of unit queues */
    struct buf *av_forw, *av_back;   /* queue of bufs for this unit */
    long b_bcount;                   /* active flag */
};
The diskhd data structure has two buf pointers that the driver can manipulate. The av_forw pointer points to the first active I/O request. The second pointer, av_back, points to the last active request on the list.
A pointer to this structure is passed as an argument to disksort(9F), along with a pointer to the current buf structure being processed.
The disksort routine sorts the buf requests to optimize disk seek. The routine then inserts the buf pointer into the diskhd list. The disksort routine uses the value that is in b_resid of the buf structure as a sort key. The driver is responsible for setting this value. Most illumos disk drivers use the cylinder group as the sort key. This approach optimizes the file system read-ahead accesses.
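The idea of keeping the request list ordered by a driver-supplied key can be illustrated with a plain ascending insertion. This sketch is not the algorithm that disksort(9F) implements, which this chapter does not describe. The structures are hypothetical user-space stand-ins that borrow the member names av_forw, av_back, and b_resid.

```c
#include <assert.h>
#include <stddef.h>

struct model_buf {
    struct model_buf *av_forw;
    unsigned long b_resid;      /* sort key set by the driver */
};

struct model_diskhd {
    struct model_buf *av_forw;  /* first queued request */
    struct model_buf *av_back;  /* last queued request */
};

/*
 * Insert bp so that the list stays in ascending key order.
 * Requests with equal keys keep their arrival order.
 */
void
model_sorted_insert(struct model_diskhd *dp, struct model_buf *bp)
{
    struct model_buf **pp;

    assert(dp != NULL && bp != NULL);
    /* Walk to the first queued request with a larger key. */
    for (pp = &dp->av_forw; *pp != NULL; pp = &(*pp)->av_forw) {
        if ((*pp)->b_resid > bp->b_resid)
            break;
    }
    bp->av_forw = *pp;
    *pp = bp;
    if (bp->av_forw == NULL)
        dp->av_back = bp;   /* bp is the new tail */
}
```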
When data has been added to the diskhd list, the device needs to transfer the data. If the device is not busy processing a request, the xxstart routine pulls the first buf structure off the diskhd list and starts a transfer.
If the device is busy, the driver should return from the xxstrategy entry point. When the hardware is done with the data transfer, an interrupt is generated. The driver's interrupt routine is then called to service the device. After servicing the interrupt, the driver can then call the start routine to process the next buf structure in the diskhd list.