1. Introduction

1.1. Project/Component Working Name:
     ZFS (Zettabyte Filesystem)

1.2. Name of Document Author/Supplier:
     Jeff Bonwick

1.3. Date of This Document:
     04/26/2002

1.4. Name of Major Document Customer(s)/Consumer(s):

1.4.1. The Steering Committees you expect to review your project:
       SOESC, ONSC

1.4.2. The ARC(s) you expect to review your project:
       PSARC

1.4.3. The Director/VP who is "Sponsoring" this project:
       Glenn W

1.4.4. The name of your business unit:
       SSG

2. Project Summary

2.1. Project Description

ZFS is a new filesystem technology that provides immense capacity (128-bit), provable data integrity (no silent data corruption), always-consistent on-disk format (so data is always available), self-optimizing performance, and real-time remote replication to allow instant recovery from disasters of geographic scope.

ZFS departs from traditional filesystems by eliminating the concept of volumes. Instead, ZFS filesystems share a common storage pool. This allows customers to organize their data logically rather than along device/volume boundaries. The storage pool allocator (SPA) provides space allocation, replication, compression, encryption, checksums, and resource controls (quotas, reservations) for ZFS.

ZFS is sufficiently automated that it can be used in embedded applications, e.g. as "Solaris firmware" for storage appliances.

2.2. Risks and Assumptions

We assume that storage capacity will continue its exponential growth. The largest existing filesystems (e.g. SGI's XFS) are 64-bit capable. That will soon be inadequate: 1PB datasets are plausible today, which means that customers will hit the 64-bit limit (16EB) after just 14 more doublings in capacity. We therefore accept the cost of slightly larger metadata to make ZFS 128-bit capable from day one.

We assume that open, file-level data sharing (as practiced today with NFS and HTTP, and as envisioned in the future by N1) will become increasingly important. ZFS is designed to serve files both locally and over file-level protocols like NFS and DAFS. We do not assume anything about the I/O transport (Fibre Channel, InfiniBand, iSCSI, etc.); anything that can move bits will suffice.

Migration is not mandatory. ZFS can peacefully coexist with UFS, QFS, VxFS, and all other filesystems. We do not intend to EOL UFS until ZFS has had several years to become established and trusted in customers' minds.

The initial delivery of ZFS will include basic administrative tools to create and mount ZFS filesystems, configure devices in a storage pool, and so on; this is discussed in more detail in the ZFS white paper [1]. More sophisticated tools will be the subject of separate one-pagers.

The most obvious risk is that serious bugs could cause highly visible, catastrophic data loss at key customer sites. This risk is inherent in any new storage technology, or even in bug fixes to existing technology. We intend to mitigate this risk by extensive testing, fault injection, internal exposure, and external beta. The risk of *not* undertaking this effort is that Sun will fall further behind in storage technology, which will either cost us sales or make those sales contingent on third-party software.

3. Business Summary

3.1. Problem Area

ZFS provides a POSIX-compliant general-purpose filesystem with integrated storage management. The ZFS project has four key goals:

(1) Provide a fast, scalable, extremely reliable filesystem to power Sun servers and storage appliances.

(2) Eliminate Sun's dependence on Veritas.
(3) Make open, file-level protocols -- NFS and N1 in particular -- more competitive by dramatically improving performance.

(4) Make administration so automated that ZFS can be used in embedded applications, e.g. as "Solaris firmware" for storage appliances.

3.6. How will you know when you are done?

A detailed requirements document is being drafted. At a minimum, the project team makes the following requirements of itself:

(1) ZFS must prove that its data integrity model is correctly implemented by surviving a battery of abusive stress tests. It must survive thousands of crashes, disk failures, and injected faults without ever losing on-disk consistency, returning bad data, or leaking a single block.

(2) ZFS must outperform UFS on the benchmarks Sun considers important at the time, e.g. TPC and SPEC-SFS.

(3) ZFS must be fully POSIX-compliant.

(4) ZFS must have extensive internal exposure and external beta before general availability.

(5) There must be no known bugs that affect data integrity.

4. Technical Description

"Extraordinary claims require extraordinary proof." -- Carl Sagan

This section goes into greater detail than a typical one-pager because the strong claims made for ZFS do indeed require strong proof. Readers uninterested in technical details should just skim the subsection headers (4.x.y) to get a feel for ZFS's capabilities.

The ZFS architecture is based on three organizing principles:

(1) all storage is pooled
(2) ZFS is object-based, modular, and extensible
(3) all operations are copy-on-write, transactional, and checksummed

ZFS's reliability, performance, and ease of administration all flow from these three principles, so we'll describe them in some detail. We begin with a brief overview of the overall modular structure.

4.1. Overview

There are four major components to the ZFS filesystem:

(1) ZFS POSIX Layer (ZPL): provides standard POSIX semantics, e.g. permission checking, file modes, timestamps, and so on. The ZPL translates incoming vnode and vfs ops into operations that read or write a storage object, which is a linearly addressable set of bytes (a.k.a. sparse array or flat file).

(2) ZFS Directory Service (ZDS): provides concurrent, constant-time directory services (create, delete, lookup) for the ZPL.

(3) Data Management Unit (DMU): provides transactions, data caching and object translation (described in Section 4.3).

(4) Storage Pool Allocator (SPA): provides space allocation, replication, checksums, compression, encryption, resource controls and fault management.

The ZPL and ZDS implement the filesystem's "upper half", which translates all filesystem operations into a set of transactional reads and writes on storage objects. The DMU and SPA implement the filesystem's "lower half", which is a reliable object store.

The ZFS architecture is highly modular and extensible. The ZPL can be replaced by non-POSIX interfaces such as native database APIs; the ZDS can be replaced by completely different directory algorithms; the SPA employs pluggable, layerable virtual devices to implement mirroring, striping, RAID-5, and so on; and the SPA employs pluggable modules for checksums, compression, and encryption.

4.2. Pooled Storage

Pooled storage simply means that disk space is shared among many filesystems. In addition to reducing fragmentation, this also simplifies administration because the data's logical structure isn't constrained by physical device boundaries.
In the same way that directories can be freely created and deleted within a traditional filesystem, ZFS filesystems can be freely created and deleted within a storage pool.

Consider jurassic as an example. Jurassic currently has about 20 filesystems providing home directories for 1,000 users. Why 20? A single UFS filesystem would be impossible because jurassic has more than 1TB of data, and 1,000 filesystems (one for each user) would be impractical because the administrative model doesn't scale and none of the free disk space could be shared between users.

The 128-bit pooled storage model eliminates both constraints. Obviously all users could easily fit into a single filesystem (128-bit capacity = 3 x 10^26 TB). Alternatively, jurassic could provide a separate filesystem for each user. This is attractive because it would allow the sysadmin to provide per-user backups, reservations (to guarantee each user a given amount of space), and quotas (to limit each user's space consumption).

An existence proof of the viability of pooled storage can be found in a surprising place: tmpfs. All tmpfs mounts consume space from the operating system's global swap pool. There is no "raw device" for a tmpfs mount; rather, tmpfs filesystems are named only by their mount points, i.e. their logical names. ZFS operates on the same principle. ZFS filesystems use a common storage pool and are named by their default mount points.

4.3. ZFS is object-based

Traditional filesystems are either block-based or extent-based. In either case the filesystem performs two basic operations: it translates incoming vnode ops into operations that read or write an inode, and then it translates inode read/write into block I/O. These block I/O requests then go to a volume manager for a third translation from logical blocks to physical blocks. Finally, the volume manager issues block I/O requests to actual physical devices.

By contrast, ZFS is an object-based filesystem. The ZPL translates incoming vnode ops into operations that read or write a storage object, which is a linearly addressable set of bytes. The ZPL then sends these object read/write requests to the DMU (Data Management Unit), which translates them directly to physical data locations. There is no need for a volume manager, so all the overhead of the logical block layer is eliminated.

ZFS stores *everything* in DMU objects: user data, znodes, directories, even DMU and SPA metadata. Consequently, ZFS metadata can be managed just like user data; it requires no special out-of-band treatment. We will see in the next section that this principle is not merely a design simplification, but a cornerstone of the reliability model.

4.4. All operations are copy-on-write, transactional, and checksummed

Principles (1) and (2) reduce most of the complexity of storage management to a single problem: fast, reliable reading and writing of DMU objects. This is the heart of the filesystem, so we'll describe it in some detail.

4.4.1. Copy-on-write transaction model

Copy-on-write and transactional semantics are intimately related. Many operations that modify filesystem state must modify more than one block to do so. This poses a problem: there is no way to write multiple disk blocks atomically, yet these writes *must* be atomic to keep the filesystem consistent on disk. The solution is twofold: first, the SPA never modifies blocks that are currently in use; instead, it allocates new blocks and writes the new contents to them.
It can write an arbitrary number of blocks this way without affecting on-disk consistency because no committed DMU translation refers to any new block. Second, the DMU groups related work into transactions so that related blocks are always committed together. For efficiency the SPA collects many DMU transactions into larger transaction groups and commits each group as a whole.

The entire storage pool can be viewed as a giant tree with leaf nodes containing data and interior nodes containing metadata. The root of the tree is a single disk block called the uberblock. To commit a transaction group the DMU first writes out all the modified leaf nodes, then the interior nodes that point to them, and so on up the tree; finally the SPA issues a single write that atomically changes the uberblock to refer to the new tree. All data and all metadata are stored in DMU objects, and all DMU objects are subordinate to the uberblock, so the entire storage pool is always self-consistent.

One key assumption is that the write that updates the uberblock is atomic. Modern disks make strong guarantees about single-write atomicity: if the disk loses power, some drives can actually tap the spindle's rotational energy to generate enough current to complete any in-flight writes; others take advantage of NVRAM write caching [6, 9]. Just to be safe, the uberblock is checksummed and replicated anyway. In the event of a partial disk write the SPA will detect a checksum error and reread the uberblock from a different replica.

Finally, ZFS implements intent logging to ensure that returning from a system call (e.g. write(2) or unlink(2)) always implies that the operation is committed. The intent log is not required for on-disk consistency; it exists only to ensure that the synchronous semantics required by O_DSYNC operations and the NFS protocol are always met. Please see the ZFS white paper [1] for further details.

The current ZFS prototype has survived over 20,000 simulated crashes without ever losing on-disk consistency or leaking a single block.

4.4.2. Snapshots

The copy-on-write transaction model enables constant-time snapshots almost as a side effect. ZFS can snapshot the entire storage pool by simply making a copy of the uberblock. All subsequent writes cause new blocks to be allocated (because everything is COW), so the blocks that comprise the snapshot are never modified. Similarly, ZFS can snapshot any subtree of the storage pool (e.g. a single filesystem) by making a copy of its root block.

The only additional support required is that each block pointer must record the block's time of birth (i.e. the transaction group in which it was allocated) so the DMU can determine when a block can be freed; a block remains active until all snapshots that reference it have been deleted.

Constant-time snapshots provide a convenient, scalable data recovery service and provide a stable image for file-based backup utilities. Unlike UFS and WAFL, ZFS supports an unlimited number of snapshots. The more disk space you have, the more history you can keep.

4.4.3. Checksums and Data Integrity

The SPA provides a powerful checksum system to ensure data integrity. Each interior node in the tree described above contains an array of block pointers that describe its children. Each block pointer contains the child's DVA (Data Virtual Address), birthday (the transaction group in which it was born), and 64-bit checksum.
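To make the block pointer concept concrete, here is a hypothetical C sketch of the fields just described. The field names and widths are illustrative assumptions, not the actual ZFS on-disk format.

    #include <stdint.h>

    /*
     * Hypothetical sketch of a block pointer as described above.
     * Field names and widths are illustrative assumptions only;
     * they do not define the ZFS on-disk format.
     */
    typedef struct dva {
        uint64_t dva_vdev;      /* which virtual device holds the block */
        uint64_t dva_offset;    /* offset of the block on that vdev */
    } dva_t;

    typedef struct blkptr {
        dva_t    bp_dva;        /* Data Virtual Address of the child */
        uint64_t bp_birth;      /* transaction group that allocated it */
        uint64_t bp_checksum;   /* 64-bit checksum of the child's contents */
    } blkptr_t;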
There are several advantages to keeping the checksum in the block pointer:

(1) it eliminates the need for a separate I/O to fetch the checksum;

(2) it ensures that the checksum and the data are physically separated, which improves fault isolation; and

(3) it provides checksums on the checksums, because each indirect block must pass its parent's checksum test.

[The careful reader will note that this scheme breaks down at the uberblock, which has no parent. The uberblock therefore has a field to store its own checksum.]

The SPA provides a variety of checksum functions. By default the SPA applies 64-bit second-order Fletcher checksums [10] to all user data and 64-bit fourth-order Fletcher checksums to all metadata. Both weaker and stronger checksums are available, including the null checksum for those who seek maximum speed at any price. The checksums are versioned at block granularity to allow the SPA to upgrade to faster or stronger checksums as they become available. Current research in high-speed software cryptography is generating a wealth of fast, strong hashes that look promising as checksums. In fact, the final reduction of Fletcher's four 64-bit accumulators to a single 64-bit result relies on the NH universal hash developed for UMAC message authentication [11]. (A sketch of the fourth-order Fletcher computation appears at the end of this section.) The SPA's standard 64-bit checksums provide a 99.99999999999999999% ("nineteen nines") probability of detecting data corruption.

Finally, we note that there are many levels in the I/O stack at which checksums can be computed and verified. The SPA level is ideal because it is both necessary and sufficient to detect *and correct* any of the five major classes of error:

(1) Bit rot (media errors): in this case the checksum fails on read, so the SPA reads from another replica and repairs the damaged one.

(2) Misdirected read (disk firmware reads the wrong block): the SPA behaves as in (1), but the first replica isn't actually damaged, so the "repair" step is unnecessary (but harmless).

(3) Phantom write (disk claims it wrote the data, but really didn't): unless all devices silently fail in unison, at least one replica will have a good copy of the data. When the SPA reads from any replica that was not updated due to a previous phantom write, the checksum fails, case (1) applies, and the damage is repaired.

(4) Misdirected write (disk wrote the data, but to the wrong block): this is equivalent to a phantom write (the intended block wasn't written) plus bit rot (another block was written erroneously). Both cases are handled as described in cases (1) and (3) above.

(5) User error: if the system administrator accidentally uses an active ZFS device for some other purpose, the SPA will detect this as a storm of checksum errors. As long as the device is mirrored, ZFS can actually survive this without any loss of data.

There is one class of errors that no amount of magic can fix: silent in-core corruption, e.g. undetected parity errors in main memory. If the hardware corrupts memory *after* it's been checksummed, ZFS can't tell that it happened. It is not clear that there is any way around this other than to build highly reliable memory systems.
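As a concrete illustration of the default metadata checksum mentioned above, the following is a minimal sketch of a fourth-order Fletcher checksum with four 64-bit accumulators. It operates on 64-bit words for simplicity and omits the final NH-style reduction to a single 64-bit result; it is not the SPA's actual implementation.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Minimal sketch of a fourth-order Fletcher checksum (Section 4.4.3).
     * Four running sums are updated per word; the SPA's final reduction
     * of the four accumulators to one 64-bit result is omitted here.
     */
    static void
    fletcher4(const uint64_t *data, size_t words, uint64_t acc[4])
    {
        uint64_t a = 0, b = 0, c = 0, d = 0;

        for (size_t i = 0; i < words; i++) {
            a += data[i];   /* first-order sum of the data words */
            b += a;         /* second order */
            c += b;         /* third order */
            d += c;         /* fourth order */
        }
        acc[0] = a; acc[1] = b; acc[2] = c; acc[3] = d;
    }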
4.5. Storage Pools vs. Volumes

In some ways the SPA resembles a volume manager: it organizes a collection of physical devices (e.g. disks) into virtual devices (e.g. mirrors) to provide replication, striping, and so on. But the actual services provided by volume managers and storage pools are very different.

A volume manager exports a fixed-size disk abstraction with no semantic knowledge of the filesystem above it. This puts a substantial burden on the volume manager because, for example, it must constantly work to keep both halves of a mirror in sync. That is, the volume manager must maintain its own data integrity model *in addition* to the filesystem's data integrity model. This leads not only to duplication of effort, but also to higher costs for each individual integrity model because neither one can leverage semantic information about the other. This lack of semantic information shows up in many forms: for example, filesystems cannot share space across volumes because the volume manager has no idea which blocks the filesystem considers active. These and many other flaws are inherent in the FS/volume interface.

The DMU and SPA work together to provide a shared storage pool with a unified data integrity model (as described in Section 4.4 above). Each ZFS filesystem acts as a client of the storage pool, so space is shared among many filesystems. The administrative task of "growing a filesystem" does not exist in ZFS; ZFS filesystems grow and shrink dynamically as they create and delete DMU objects. The storage pool itself grows and shrinks dynamically as devices are added or removed.

The SPA supports multiple storage pools for qualitatively different storage needs. For example, databases and home directories contain precious data that is typically replicated for maximum availability, whereas news spools and Web caches contain read-only data that can easily be downloaded again if the local copy is lost to disk failure. Thus it may make sense for a general-purpose server to have two or more storage pools, e.g. one with replication for databases, mail, and other valuable data, and another without replication for caches, temporary files, and other transient data.

4.6. Always-Available Data

4.6.1. Elimination of fsck(1M) and all other offline administration

We saw in Section 4.4 that the storage pool is always consistent. Therefore, the ZFS filesystems it contains are always mountable. There is no fsck(1M) for ZFS, no "clean bit", and no offline administration of any kind. Barring a true catastrophe, e.g. physical failure of all replicas, ZFS data is always available.

The elimination of fsck(1M) is not just an optimization anymore; it's a necessity. Running fsck(1M) on a terabyte of data requires many hours of downtime; running fsck(1M) on a petabyte would take almost a year! The fsck-less, always-mountable property of ZFS is an absolute requirement for high availability.

4.6.2. Self-Healing Data

In replicated (mirrored) configurations the SPA does more than just detect corruption: it repairs the damage automatically. If the SPA detects a checksum error when reading a block it tries all other replicas to find a valid copy of the data. The SPA uses the valid copy both to satisfy the original request and to repair the damaged replica. With the current ZFS prototype we can actually 'dd' random garbage over an entire mirror of an active filesystem without experiencing any errors or interruption in service.
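The read path that makes this self-healing possible can be sketched as follows. This is a self-contained illustration of the logic in Section 4.6.2, not the SPA's interface; the simple byte-sum checksum stands in for whichever pluggable checksum module is configured.

    #include <stdint.h>
    #include <string.h>

    #define NREPLICAS   2           /* e.g. a two-way mirror */
    #define BLKSIZE     512

    /* Stand-in checksum; the SPA would use one of its pluggable checksums. */
    static uint64_t
    checksum(const uint8_t *buf)
    {
        uint64_t a = 0, b = 0;

        for (int i = 0; i < BLKSIZE; i++) {
            a += buf[i];
            b += a;
        }
        return (b);
    }

    /*
     * Try each replica until one matches the checksum recorded in the
     * parent block pointer; use the good copy to satisfy the read and
     * to repair any replicas that failed.  Returns -1 only if every
     * replica is damaged.
     */
    static int
    self_healing_read(uint8_t replicas[NREPLICAS][BLKSIZE],
        uint64_t expected, uint8_t *out)
    {
        for (int r = 0; r < NREPLICAS; r++) {
            if (checksum(replicas[r]) != expected)
                continue;                       /* damaged copy */
            memcpy(out, replicas[r], BLKSIZE);
            for (int i = 0; i < NREPLICAS; i++) /* repair bad replicas */
                if (checksum(replicas[i]) != expected)
                    memcpy(replicas[i], out, BLKSIZE);
            return (0);
        }
        return (-1);
    }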
4.6.3. Failure Prediction and Disk Scrubbing

The SPA takes proactive steps to keep data healthy and available. If a storage pool contains 20 devices grouped into 10 mirrors, and one of the devices starts to report errors or return bad data (which the 64-bit checksums will detect), the SPA will automatically migrate data from the compromised mirror to the nine good mirrors.

The SPA also performs "disk scrubbing" (systematic reads of all blocks during idle time) to detect latent errors ("bit rot") while they are still recoverable. In replicated configurations the SPA automatically repairs these errors using the self-healing approach described above.

4.6.4. Hot Space

The SPA does not require dedicated hot spares. Instead, it spreads "hot space" across all devices to serve the same purpose. This approach increases the available I/O bandwidth because all devices can be used. It also improves reliability: with hot spares, a disk that hasn't been used for months is suddenly called into service, so its own health is unknown. The hot-space approach keeps all devices busy, improving our ability to monitor the devices' health and predict the onset of failure. The hot-space approach also increases drive utilization, which will become crucial as we approach the "access density wall" over the next few years [12].

4.6.5. Real-Time Remote Replication

The DMU supports real-time remote replication (RTRR). Unlike traditional daily backups or batched remote replication, RTRR ensures that the remote copy of the data is always consistent and never more than a few seconds out of date.

The ZFS architecture naturally supports remote replication. All data and metadata are stored in objects, and the only operations on objects are read, write, and free; this makes the protocol very simple. Reads don't need to be sent over the wire, of course, and writes and frees can be sent asynchronously within transaction group boundaries. The only explicit synchronization required is a once-per-transaction-group (i.e. every few seconds) commit. Thus, remote replication of DMU objects is not latency-sensitive; all it requires is sufficient bandwidth, which is getting cheaper by the day.

4.6.6. User Undo

Finally, while all this technology is helpful, the most common cause of data loss today is user error. Any serious attempt at improving data availability *must* address this. Therefore, in addition to the snapshot facility described earlier, ZFS provides "User Undo". This allows end users to quickly and easily recover recently deleted or overwritten files without sysadmin intervention.

4.7. Capacity

ZFS filesystems and storage pools are 128-bit capable, which means that they can support up to 340 undecillion bytes (about 3 x 10^26 TB). Individual files are also 128-bit capable, but are currently limited to 64-bit access (16EB) because neither the operating system nor file-level protocols like NFS support larger files yet. When these limitations are removed, ZFS will be ready: all on-disk structures are already 128-bit.

4.8. Performance

4.8.1. All Writes Are Sequential Writes

In a traditional filesystem, random file writes cause random disk writes because existing data blocks must be overwritten. In ZFS, however, the copy-on-write model gives the SPA complete control over block placement. The SPA preferentially allocates free blocks sequentially so that most disk writes are sequential. Thus ZFS provides sequential write performance even for random write workloads (Oracle being one important example).

In the event that an object becomes fragmented, the DMU can make it contiguous again (defragment it) by simply declaring all of its blocks to be dirty. This will cause the SPA to allocate new blocks (because everything is copy-on-write) and, as described above, these new blocks will be as contiguous as possible.
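A minimal sketch of why copy-on-write allocation makes most writes sequential: if every write allocates a fresh block and allocation simply advances a cursor through a free region, blocks written back-to-back land next to each other on disk regardless of their file offsets. This is an illustration only, not the SPA's actual allocator.

    #include <stdint.h>

    typedef struct alloc_cursor {
        uint64_t ac_next;   /* next free byte offset in the region */
        uint64_t ac_limit;  /* end of the free region */
    } alloc_cursor_t;

    /*
     * Hand out the next free chunk.  Because nothing is ever overwritten
     * in place, consecutive allocations -- and therefore consecutive
     * disk writes -- are adjacent, even if the logical writes that
     * triggered them were random.
     */
    static uint64_t
    alloc_block(alloc_cursor_t *ac, uint64_t size)
    {
        if (ac->ac_next + size > ac->ac_limit)
            return (UINT64_MAX);    /* region full; pick another region */

        uint64_t offset = ac->ac_next;
        ac->ac_next += size;
        return (offset);
    }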
4.8.2. Dynamic Striping

Traditional striping improves bandwidth by spreading data across multiple disks at some fixed "stripe width". For example, if a file were striped across 4 disks with a stripe width of one block, the file's blocks would be laid out as follows:

    disk0   disk1   disk2   disk3
    -----   -----   -----   -----
      0       1       2       3
      4       5       6       7
      8       9      10      11
                ...

In general, the nth file block maps to disk (n % 4), block (n / 4). The downside of this approach is that it's very static: the file's block offsets are a function of the number of disks in the stripe, so changing the configuration of an active stripe is impossible.

The SPA takes a different approach: dynamic striping. If there are 20 disks paired into 10 mirrors, the SPA can spread writes across all 10 mirrors. If more disks are added, the SPA can immediately use them. The location of all previously written data is specified by its DMU translations, not by arithmetic based on the number of disks, so there is no "stripe geometry" to change. It just works.

The SPA also provides a traditional striping module for customers who wish to maintain explicit control of stripe parameters. There will always be certain carefully-coded applications that can take advantage of such explicit control.

4.8.3. Sustained Bandwidth: Parallel Three-Phase Transaction Groups

Each transaction group goes through four states: open (accepting new work), quiescing (waiting for accepted transactions to complete), syncing (pushing all changes to disk), and closed (uberblock was successfully written). The "closed" state is really just a formalism because nothing actually happens in that state.

If transaction groups were strictly serialized, performance would be bursty: the load on the system would alternate between CPU-bound, idle, and disk-bound as each transaction group went through its open, quiescing, and syncing phases, respectively. Therefore the SPA allows up to three transaction groups to be active at the same time -- one each in the open, quiescing, and syncing state. This ensures that the SPA can always accept new work, and can always keep both CPUs and disks busy.

4.8.4. Intelligent Prefetch

The SPA is designed so that all I/O goes through a common code path and retains semantic information about its source (e.g. filesystem, object, offset, thread). This allows the SPA to monitor data access patterns so it can provide profile-directed feedback to its prefetch engine. This enables the SPA to perform more intelligent prefetch than traditional read-ahead.

4.8.5. Multiple Block Sizes

The SPA uses a variant of the slab allocator [13] to divide large regions of disk space into blocks of various sizes (512B - 1MB). Small blocks allow efficient storage of small files and metadata. Large blocks allow more efficient handling of large files by reducing both the number of DMU translations and the depth of the indirect block tree (larger blocks have greater fanout).
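To illustrate the fanout effect, here is a back-of-the-envelope sketch of how block size affects indirect-tree depth. The 128-byte block pointer size used in the example is an assumed figure for illustration, not a committed on-disk constant.

    #include <stdint.h>

    /*
     * Depth of the indirect-block tree needed to map a file of the given
     * size: each indirect block holds (blocksize / bpsize) block pointers.
     */
    static int
    indirect_tree_depth(uint64_t filesize, uint64_t blocksize, uint64_t bpsize)
    {
        uint64_t blocks = (filesize + blocksize - 1) / blocksize;
        uint64_t fanout = blocksize / bpsize;
        uint64_t span = 1;      /* data blocks addressable at this depth */
        int depth = 0;

        while (span < blocks) {
            span *= fanout;
            depth++;
        }
        return (depth);
    }

    /*
     * Example: for a 1TB file with assumed 128-byte block pointers,
     * 128K blocks give a fanout of 1024 and a tree depth of 3, while
     * 8K blocks give a fanout of only 64 and a tree depth of 5.
     */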
4.8.6. Synchronous Semantics with Async Performance

ZFS is always consistent on disk. If the system crashes, however, the filesystem won't contain any transactions since the last transaction group was committed; that's what the intent log is for.

Logically, there are two common ways to use an intent log:

(1) log metadata only: loses recent writes (like UFS, VxFS, and QFS)
(2) log all data: recovers everything (required for NFS or O_DSYNC)

Physically, there are three common ways to implement such a log:

(A) log to disk: adds latency of one disk write (e.g. UFS logging)
(B) log to solid-state disk: faster, but still goes through the I/O path
(C) log to NVRAM: very fast (e.g. NetApp filers)

The ideal configuration is 2C (log everything to NVRAM), so that ZFS provides synchronous semantics with async performance. The ZFS team will work with Sun's hardware teams to develop an NVRAM strategy. In the interim, options 1A, 2A, 1B, and 2B are all viable.

4.8.7. Concurrent, Constant-Time Directory Ops ("The /var/mail Problem")

Large directories like /var/mail pose major performance problems for most existing filesystems (including UFS, QFS, and VxFS) because the time required to create or delete a file is proportional to the number of files. Moreover, existing directory locking schemes only allow one such operation at a time.

The ZDS uses a variant of Extendible Hashing [14] to perform all directory operations -- create, lookup, and delete -- in constant time. Moreover, the ZDS uses fine-grained locking so that many operations can proceed in parallel. This is a huge win for mail servers, news spools, and other create/delete-intensive workloads.

4.8.8. Byte-Range Locking for Concurrent Writes ("The Oracle Problem")

When files are used to store database records, it is essential to scalability that different records can be written concurrently. POSIX, however, requires that overlapping writes must not interleave; it must appear as though the writes occurred in some definite order. This problem is difficult to solve well, and existing filesystems solve it poorly: they put a single write lock around the entire file and then offer a mount option to bypass the lock when running Oracle (marketed by Veritas as "QuickIO"). This means that the end user can have Oracle performance or POSIX compliance, but not both.

ZFS supports concurrent writes to different parts of a file without breaking POSIX overlapping-write semantics. The ZPL provides byte-range locking so that overlapping writes are always serialized, but non-competing writes (the common case in general, and always the case for Oracle) proceed in parallel. Thus Oracle performance can peacefully coexist with POSIX semantics.

4.9. Compression

The SPA's support for multiple block sizes makes block-level data compression almost trivial. The idea is simple: when given 8K of data to write, the SPA compresses it to see how big the result is. If the compression ratio is 2:1 or better it can use a 4K block; if it's 4:1 or better it can use a 2K block, and so on. Block-level compression is appealing because, unlike whole-file compression, it's completely transparent to higher-level software.

The SPA supports pluggable compression modules so that new compression algorithms can be introduced in the future. Compression can be turned on or off at any granularity (storage pool, filesystem, or individual object) at any time.
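A minimal sketch of the block-size selection just described: compress the logical block, then round the compressed size up to the smallest supported power-of-two block. The compressed size is assumed to come from whichever pluggable compression module is configured; this is an illustration, not the SPA's code.

    #include <stdint.h>

    #define MIN_BLOCKSIZE   512     /* smallest SPA block size (see 4.8.5) */

    /*
     * Pick the physical block size for a logical block that compressed
     * from logical_size down to compressed_size bytes.  If compression
     * doesn't save at least one power-of-two size class, store the block
     * uncompressed.
     */
    static uint64_t
    physical_block_size(uint64_t logical_size, uint64_t compressed_size)
    {
        uint64_t psize = MIN_BLOCKSIZE;

        while (psize < compressed_size)
            psize <<= 1;

        if (psize >= logical_size)
            return (logical_size);  /* not worth compressing */
        return (psize);
    }

    /* Example: an 8K block that compresses to 3000 bytes is stored in 4K. */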
4.10. Encryption and Data Security

ZFS supports pluggable encryption modules. Like compression, encryption is a block-level operation. ZFS encryption works with any symmetric block cipher; examples include DES, AES, IDEA, RC6, Blowfish, SEAL, and OCB. The encryption keys can be per-object (to protect a single file), per-filesystem (to protect one person's or project's data), or per-storage pool (to protect all corporate data).

An effective data security model requires more than just encryption. We will update the ZFS white paper [1] as this component of the project gets fleshed out.

4.11. New Technologies Enabled by ZFS

4.11.1. New Storage APIs

POSIX is no longer the only game in town. Emerging storage paradigms such as integrated database support will require radically different programming models. However, all storage APIs do have one thing in common: they need a persistent, transactional object store. The DMU and SPA provide a transactional object storage service that can be programmed directly to create higher-level storage APIs. The ZPL and ZDS are a case in point: these relatively simple modules program the DMU to implement a complete POSIX filesystem (ZFS).

Another example is Oracle, whose consistency model does not map well to POSIX semantics. Oracle's current use of AIO on O_DSYNC files generates a great deal of I/O to maintain consistency. By contrast, the DMU's interface to create, write, and commit transactions provides an efficient programming model that Oracle could use directly to obtain better database performance. In fact, because the DMU and SPA convert random object writes to sequential disk writes, Oracle on the DMU should be even faster than Oracle on raw devices.

4.11.2. Object-Based Appliances

The DMU/SPA transactional object model provides a natural protocol for a storage appliance. The protocol is very simple and can support both filesystems and databases efficiently.

4.11.3. High-Availability File Services

ZFS provides a strong transactional foundation for HA file services such as clusters, partner-pairs, and N1-type configurations. This topic will be covered in detail in subsequent one-pagers.

4.11.4. Embedded Applications

The interfaces between ZFS and the supporting operating system are relatively small and clearly defined, so ZFS can be made to work on other operating systems such as VxWorks. This allows ZFS to be used in embedded applications such as array products where it can provide strong data integrity, block remapping, and so on.

4.11.5. Foundation Classes for Storage Programming

The DMU and SPA provide long-overdue foundation classes for storage programming. Building new storage APIs on the DMU/SPA foundation will simplify development by providing a library of common services; improve performance by providing more powerful interfaces than the block-based volume API; simplify storage management by providing a single administrative point (the storage pool) that supports many higher-level protocols; and provide standard tools and technologies to support faster deployment of new technology.

5. Reference Documents

[1] See __internal web site__ for the latest information on ZFS.

[2] Hitz et al. "File System Design for an NFS File Server Appliance." Proceedings of the 1994 Winter USENIX Conference.

[3] Gingell, R. CTO Filesystem Report, March 2000.

[4] Sweeney et al. "Scalability in the XFS File System." Proceedings of the 1996 USENIX Technical Conference.

[5] Harriet Coverston and Brian Wong, personal communications.

[6] Tweedie, Stephen. "The EXT3 Journaling Filesystem." http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html

[7] See http://www.pcguide.com/ref/hdd/file/ntfs.

[8] See http://www-106.ibm.com/developerworks/library/jfs.html.

[9] See http://www.pcguide.com/ref/hdd/op.

[10] Fletcher, J. "An Arithmetic Checksum for Serial Transmissions." IEEE Transactions on Communications 30, 1 (Jan 1982), pp. 247-253.

[11] Rogaway et al. "UMAC: Fast and Provably Secure Message Authentication." Advances in Cryptology, CRYPTO '99.

[12] See, for example, http://www.storagesearch.com/3dram.html.

[13] Bonwick, J. "The Slab Allocator: An Object-Caching Kernel Memory Allocator." Proceedings of the 1994 Summer USENIX Conference.
[14] Fagin et al. "Extendible Hashing -- A Fast Access Method for Dynamic Files." ACM Transactions on Database Systems, 4(3):315-344, September 1979.

6. Resources and Schedule

6.1. Projected Availability

     Q1CY2002: user-level prototype (available now)
     Q3CY2002: kernel prototype
     Q1CY2003: internal alpha (build machines, small servers)
     Q3CY2003: internal beta (jurassic)
     Q1CY2004: external beta/EA
     Q3CY2004: general availability

6.2. Cost of Effort

     ~15 staff-years

6.3. Cost of Capital Resources

     ~$xxx (most testing will use existing resources)

6.5. ARC review type: Standard

7. Prototype Availability

7.1. Prototype Availability

     April 2002

7.2. Prototype Cost

     20 staff-months