This is a discussion on 7000 Series Takeover and Failback - Solaris Rss ; The two fundamental operations most people associate with clusters are takeover (assuming control of resources from a failed peer) and failback (relinquishing control of those resources to that peer after it has resumed operation). It's very important to understand what ...
The two fundamental operations most people associate with clusters are takeover (assuming control of resources from a failed peer) and failback (relinquishing control of those resources to that peer after it has resumed operation). It's very important to understand what these operations mean in the context of the Sun Storage 7310 and 7410 NAS appliances. A number of important changes have been made in the recently-released 2009.Q3 software release that affect how these operations work - all of them, we obviously believe, for the better - but the administrative interfaces are unchanged. The product documentation (PDF not yet updated for Q3 as of this writing) has also been greatly enhanced to better describe the clustering model and administrative operations that apply to these products, and I would strongly recommend that you avail yourself of that resource; you'll want to be familiar with it to understand some of the terminology I'll use here. But many of our customers and partners have asked us questions around takeover performance, and in order to address those questions I will need to go into greater detail about the implementation of these operations. So let's take a look under the hood. It's important to understand that nearly everything discussed here is an implementation detail that may change without notice in future software releases, and applies very specifically to the new 2009.Q3 software.
A First-Order Look
Takeover and failback consist of operations performed on each resource eligible for the operation. The selection of eligible resources is described in the product documentation and depends on the resource type and, if applicable, which cluster peer has been assigned ownership of it. For the sake of simplicity, we will define a fairly typical cluster configuration in which there is a single pool consisting of the disks and log devices in 4 J4400 JBODs, a pair of network interfaces by which clients will access that storage via NFS, and a network interface private to each head for administration only. We will refer to the heads as simply A and B. The pool and service interfaces are assigned to head A.
In this configuration, the following resources of interest will exist on both heads:
- Eight resources of the form ak:/diskset/(uuid)
In addition, head A will have ak:/net/nge2 and head B will have ak:/net/nge3, representing the private administrative network interfaces. There will also be a very large number of other resources representing components with upgradeable firmware, service configuration, users and roles, and other configuration state replicated between heads. Because these resources are replicas and do not normally have any activity associated with them at takeover or failback time, we will not consider them further. The resources of interest must be acted upon in dependency order; for example, we must open the ZFS pool and mount and share all of its shares before we can safely bring any network interfaces up that clients expect to use to access those shares. Otherwise clients could attempt to access the shares before they are available, receiving a response that would result in stale filehandle errors. The above listing of resources reflects the dependency order for import; as you would expect, the opposite order must be observed when exporting.
It is worth noting that most of these resources do not appear in the management UIs. This is deliberate: the nas, shadow, and smb resource classes are symbiotes; that is, they always have the same resource identifier and ownership as their respective masters but have distinct locations in the dependency chain and distinct actions required for import or export. This allows us finer-grained dependency control and makes the implementation simpler and more modular by separating subsystems from one another. This becomes very important when discussing takeover and failback performance: the symbiote resources must be exported and/or imported, and these operations take time.
When head A fails (let us assume it has been powered off accidentally by our data center personnel), head B will initiate takeover once it detects that no heartbeats have arrived from head A. This timeout period is currently 500ms, and takeover is initiated as soon as the timeout elapses. At the beginning of takeover, an arbitration process is performed that protects user data in the case in which all cluster I/O has failed but both heads are still functioning. This arbitration process consists of attempting to take the zone lock (defined by SAS-2) on each SAS expander present in the storage fabric. These locks are acquired and dropped in a defined order by any head attempting to enter the OWNER state. The locks are held with a fixed timeout period, so that if the holder of the locks does not continually reacquire them they will eventually be dropped. A thread on the OWNER head does this at an interval significantly less than the timeout period. Therefore, if a head attempting to enter the OWNER state fails to acquire the locks, it will wait until the timeout interval expires and try again. If it again fails, it will reboot, allowing its peer (which must still be functioning) to take control of shared resources. This process prevents both heads from attempting to perform simultaneous write access to shared storage, which would destroy or corrupt data. The timeout interval is set to 5 seconds in current products, meaning that this step can take up to 5 seconds to complete plus the time to contact all expanders (typically around 1-2s even in the largest configurations). Since the zone locks are not held when in the CLUSTERED state, this will normally take less than 2 seconds overall. Only in the cases where arbitration is actually necessary - such as when all three cluster I/O links are disconnected between two functioning heads - or when taking over directly from the STRIPPED state can the additional 5 second penalty apply.
After acquiring the zone locks, the surviving head will evaluate each resource in the resource map in dependency order. If the resource is not already imported, its class's import function will be invoked. Since head B does not own any of the singleton resources listed above, none of them nor any of their symbiotes will be imported. We will therefore invoke the diskset class's import function for each of the disksets, then the zfs class's import function for pool-0, then the nas class's import function for pool-0, then the net class's import function for nge0, and so on until we have attempted to import all of these resources. Note that if a failure occurs we will simply mark the resource faulted and proceed: our peer is down so we must make a best effort. When head A resumes functioning, it will rejoin the cluster, meaning that the current list and state of all resources will be transferred from head B over the intracluster I/O subsystem. Head A will not, however, take control of any of the singleton resources or their symbiotes; it will import only its own private resource ak:/net/nge2 as it transitions into the STRIPPED state following rejoin. This behaviour prevents ping-ponging and allows the administrator to verify that the restored head has had any hardware issues addressed before returning it to service.
Now that head A has rejoined, a failback can be initiated from head B. During failback, head B will walk the list of resources in reverse dependency order, invoking the resource's class's export function for each resource that is not owned by head B. If any of these functions fails, head B will generate an alert and reboot itself. This is done to ensure that the cluster is in a consistent, well-defined state: it would not be safe for head A to import a resource that is still under the control of head B, nor would it be possible for head A to enter a defined cluster state without importing all of the resources assigned to it. Likewise, if head B attempted to re-import the resource that could not be exported, that operation or some other re-import required by it could (and likely would) fail as well, making matters worse. Therefore B's reboot will trigger a takeover by A and consistency is maintained. Assuming a successful export, head B will now perform an intracluster RPC to head A instructing it to begin importing its resources. In response, head A will walk the list of resources in dependency order, invoking each resource's class's import function for each resource assigned to it (but not any assigned to head B). If any of these functions fails, head A will generate an alert and reboot itself, again triggering takeover from head B and maintaining consistency.
A Closer Look
Since failure detection and zone lock acquisition together take at most a few seconds, it is clear that we will need to understand the performance characteristics of each import and export function in order to understand overall takeover and failback performance. What exactly does each of them do?
A diskset is exactly what its name implies: a collection of disks managed together. A few simple rules govern disksets: disks in a diskset are always part of the same storage pool, or no storage pool; disks in a diskset are always located in the same physical enclosure such as a J4400; and disks in a diskset are always imported or exported together. The mapping of disksets onto the slots or bays in a storage enclosure is defined by metadata delivered as part of the appliance software and may vary by product or enclosure type. Administrators do not manage disksets directly; they are handled automatically by the software when storage pools are created and destroyed.
In the abstract, disksets would be merely an engineering convenience, containers used to track the allocation of disks. Unfortunately, the need to support ATA disks necessitates a far more complex implementation. Because the ATA protocol does not support communication between multiple initiators and a single target, the SAS standard defines the notion of an affiliation, a mapping between a single initiator and an ATA target. Only the initiator that owns the affiliation can communicate with the target; any attempt by another initiator to do so will fail. Ownership of the affiliation is tracked by the SAS expander in which the STP bridge port associated with the ATA target is located, and an affiliation is claimed automatically the first time an initiator performs an I/O operation on a given target. Note that I/O operations in this context are not limited to those that affect the media: it is not possible to obtain even basic information about the device without claiming the affiliation. The process of obtaining that information and using it to create a device node used by ZFS and other software to interface with the disk is known as enumeration, a process that is normally performed by default by Solaris and other operating systems on every disk visible to the system's HBAs. However, as we can see, attaching two initiators to the same expander and performing automatic enumeration will result in chaos if there are ATA disks behind that expander: each system will claim some subset of the affiliations during enumeration but hang for an extended time attempting to enumerate those disks whose affiliations were claimed by its peer. The net result would be extremely long boot times and some random subset of disks visible to each system. Clearly this is untenable.
Disksets are a solution to this problem. By disabling automatic enumeration of ATA disks, we can control when the enumeration process is performed, limiting it to circumstances in which it is known to be safe: storage configuration and those times when we know our peer is not attempting to access the disks; e.g., during takeover and failback. Therefore the diskset import function must, for each disk, cause the operating system to enumerate that disk via each possible path. The export routine, likewise, must cause the operating system to "forget about" the disk and relinquish its affiliations for each initiator.
In previous software releases, diskset import time usually dominated the takeover and failback processes. While the 12 disks in each diskset were enumerated in parallel, fundamental problems in the kernel and an inability to process disks from multiple disksets in parallel limited the parallelism that could be exploited. Each diskset typically took 15 to 30 seconds to import, and, worse, could take much longer in certain error paths, especially if, during takeover, the expander had not yet torn down the defunct initiator's affiliations. The current software release improves the situation considerably: all disksets can be imported in a single invocation of the diskset import function (known as "vector import"), and up to 96 disks can be enumerated in parallel, up from 12. In addition, improvements in error handling and timeouts have greatly reduced the worst-case import time when disk or affiliation errors occur. Overall, configurations such as our example above will typically see reductions in diskset import time on the order of 4-6x, with an accompanying large decrease in variance. That is, we might reasonably expect all 8 disksets in our example configuration to be imported in 30s. Because most of the overall benefits come from increased parallelism, smaller configurations will see somewhat smaller improvements. Diskset export is not, and has never been, a significant contributor to failback times; undoing the enumeration process is typically measured in milliseconds for each disk. This means that the relationship between takeover and failback times depends mainly on which contributing factors dominate each activity; i.e., the configuration and uses of the system.
Importing a ZFS pool resource simply means opening the pool, reading the labels from each disk, and creating the attachment points for any zvols (used to provide block storage) contained in the pool. Reading of labels takes constant time as it is performed in parallel, but the second portion of this activity requires walking all datasets in the pool, which is done sequentially. The time taken here is therefore proportional to the sum of the number of projects, filesystem shares, and LU shares the pool contains. It is, however, usually much less than the mounting and sharing activities, which we will investigate next.
The NAS symbiote of the ZFS pool is responsible for mounting and sharing all of the ZFS datasets, including zvols used as backing stores for block devices. This activity therefore takes time proportional to the number of shared filesystems (NFS, CIFS, FTP, HTTP, SFTP, FTPS) and block devices (iSCSI). Tests have shown that NFS shares contribute between 5ms and 15ms each to this process but, because the meaning of "sharing" depends on the protocol, it is difficult to provide an overall estimate of the constants associated with this activity. Likewise, export requires unsharing and then unmounting the filesystems and LUs, which is also linear and requires variable time that is protocol-dependent. Tests have shown that each NFS share contributes a similar time increment to the export process as it does to the import process.
The net class's resources represent the state of a network interface, which will already have been plumbed and configured on both heads. When this resource is imported, the subsystem informs the kernel that the addresses on the interface should be brought up. This activity is performed sequentially for each address and the time taken is therefore linear in the number of addresses configured for the interface. However, the time taken for each is miniscule and it is unusual to assign more than a few addresses to an interface. The entire operation normally completes in a second or two. Exporting is directly analogous and takes a comparable length of time.
This resource class, new in the 2009.Q3 software, manages shadow migration destinations associated with the pool. Because we cannot necessarily mount the shadow migration sources until the network interfaces are imported, this symbiote of the pool resource is imported after all net resources. It is responsible for activating each shadow mount, which will cause the source filesystems to be mounted. This occurs sequentially, and is therefore linear in the number of shadow sources. Of course, shadow sources that are local will take very little time to mount while NFS client mounts can take a significant amount of time. Exporting is the complement, and is normally very fast in all cases.
Each net resource has an smb symbiote, responsible for notifying the CIFS subsystem that an additional network interface should be used to provide service to CIFS clients. This operation is effectively irrelevant as it usually takes less than a second.
Putting It Together
As we've seen, there are many moving pieces involved in takeover and failback. Each resource class has its own set of operations for import and export; some take effectively constant time while others depend on the number of shares and projects or the number of disks. Even where a clear dependency in a particular variable can be characterised, the actual time taken to perform each individual suboperation may not be known or even constant; for example, sharing a filesystem can take a different amount of time depending on the protocol used and even the parameters associated with that particular share. For all these reasons, I strongly encourage anyone who is especially sensitive to takeover or failback time to perform some tests based on their own real-world configurations. This will become even more important as overall performance improves: for example, the recent improvements to diskset import time make the number and type of shares much more relevant to total takeover time. Many configurations may achieve 4x or better overall takeover time improvement as a result of that work, but a configuration with, for example, 3000 shares on a pool consisting of a single diskset, may see little or no change. As with any benchmarking activity, there is no substitute for testing your own configuration, but I hope the above description of the process and rough guidelines will be helpful in establishing expectations going into that testing process so that anomalous behaviour can be identified and tracked down.
In a future post I'll talk about a few of the remaining opportunities for improvement. Until then, ALL HAIL CLUSTRON!