
Thread: Open source storage

  1. Open source storage

    So in the past few months there have been some interesting moves
    towards Open Source Storage: ZFS on Solaris, and Nexenta's software
    appliance.

    Has anyone out there deployed it to where it actually does anything
    useful? The cost savings are phenomenal, but nothing is truly free,
    you pay for it one way or another. On the flip side, it's the LAST part
    of the stack which is still proprietary, and a part of me thinks it's
    inevitable.

    SC

  2. Re: Open source storage

    S writes:

    > So in the past few months there have been some interesting moves
    > towards Open Source Storage: ZFS on Solaris, and Nexenta's software
    > appliance.


    Of course, Linux has had more sophisticated file systems, including
    several clustered file systems, available as open source for some
    time....

    > Has anyone out there deployed it to where it actually does anything
    > useful? The cost savings are phenomenal, but nothing is truly free,
    > you pay for it one way or another.


    Fundamentally, you pay by doing the support and maintenance yourself,
    and by not having as much focused tuning expertise, formal testing,
    and relationships with database, operating system, and backup vendors.

    Of course, you also take on the responsibility of making sure that
    whatever disks/tapes you buy work reliably with the controllers,
    motherboard, and operating system. (Does that SYNCHRONIZE CACHE
    command really work?)
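
    (If you want to poke at that question directly on Linux, a minimal
    sketch along the following lines pushes a SYNCHRONIZE CACHE(10) command
    at a drive through the sg driver and prints the status it returns. The
    /dev/sg0 path is just a placeholder, it needs root, and a clean status
    only shows the drive *accepted* the command - it can't prove the
    firmware actually put the cached data on the platters.)

        /* Sketch: issue SCSI SYNCHRONIZE CACHE(10) via the Linux sg driver. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <scsi/sg.h>

        int main(void)
        {
            unsigned char cdb[10] = { 0x35 };   /* SYNCHRONIZE CACHE(10), whole device */
            unsigned char sense[32];
            struct sg_io_hdr io;
            int fd = open("/dev/sg0", O_RDWR);  /* placeholder device node */

            if (fd < 0) { perror("open"); return 1; }

            memset(&io, 0, sizeof(io));
            io.interface_id    = 'S';
            io.cmd_len         = sizeof(cdb);
            io.cmdp            = cdb;
            io.dxfer_direction = SG_DXFER_NONE; /* no data phase */
            io.sbp             = sense;
            io.mx_sb_len       = sizeof(sense);
            io.timeout         = 20000;         /* milliseconds */

            if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); return 1; }

            printf("scsi status 0x%x, host 0x%x, driver 0x%x\n",
                   io.status, io.host_status, io.driver_status);
            close(fd);
            return 0;
        }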

    If you're saving very much, you've probably also lost the hardware
    redundancy that's built into a hardware RAID system -- dual-ported
    access to disks, independent buses (not sharing a controller chip),
    etc.

    > On the flip side, it's the LAST part of the stack which is still
    > proprietary, and a part of me thinks it's inevitable.


    Actually it's not; the firmware on the controllers is proprietary
    in nearly all cases, and the firmware in the drives is as well.

    -- Anton

  3. Re: Open source storage

    In article , Anton Rang wrote:
    >
    >Of course, Linux has had more sophisticated file systems, including
    >several clustered file systems, available as open source for some
    >time....
    >


    More sophisticated than ZFS?



  4. Re: Open source storage

    the wharf rat wrote:
    > In article , Anton Rang wrote:
    >> Of course, Linux has had more sophisticated file systems, including
    >> several clustered file systems, available as open source for some
    >> time....
    >>

    >
    > More sophisticated than ZFS?


    ReiserFS (especially Reiser4) is beyond question more sophisticated than
    ZFS - not only in concept (generic data-clustering ability, for example)
    but in execution (e.g., it incorporates batch-update mechanisms somewhat
    similar to ZFS's without losing sight of the importance of on-disk file
    contiguity).

    Extent-based XFS also does a significantly better job of promoting
    on-disk contiguity than ZFS does (even leaving aside the additional
    depredations caused by ZFS's brain-damaged 'RAID-Z' design) - and
    contributed the concept of allocate-on-write to ZFS (and Reiser) IIRC.

    GFS (and perhaps GPFS) support concurrent device sharing among the
    clustered systems that Anton mentioned (last I knew ZFS had no similar
    capability).

    ZFS is something of a one-trick pony. Its small-write performance is
    very good (at least when RAID-Z is not involved), but with access
    patterns that create fragmented files its medium-to-large read
    performance is just not competitive - and last I knew it didn't even
    have a defragmenter to alleviate that situation (defragmenting becomes
    awkward when you perform snapshots at the block level).

    And despite its hype about eliminating the LVM layer, as soon as you need
    to incorporate redundancy in your storage, up it pops again in the form
    of device groups - so there's relatively little net gain in that respect
    over a well-designed LVM interface (not that a ZFS-like approach
    *couldn't* have done a better job of eliminating LVM-level management,
    mind you).

    I wouldn't be so critical of ZFS if its marketeers and accompanying
    zealots hadn't hyped it to the moon and back: it's a refreshing change
    from the apparent complete lack of corporate interest in file-system
    development over the last decade or so, even if its design leaves a bit
    to be desired and its implementation is less than full-fledged - and it
    should be very satisfactory for environments that don't expose its
    weaknesses.

    (And yes, I do like its integral integrity checksums, but their
    importance has been over-hyped as well - given the number of
    significantly higher-probability hazards that data is subject to.)
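
    (For anyone unfamiliar with the mechanism: a checksum of each block is
    computed at write time and stored away from the block itself - ZFS keeps
    it in the parent block pointer - so a corrupted read can be detected
    and, given redundancy, repaired. A toy illustration of the idea using a
    Fletcher-style sum of the sort ZFS employs; this is not ZFS code or its
    on-disk format, just the concept:)

        #include <stdint.h>
        #include <stddef.h>
        #include <stdio.h>

        struct cksum { uint64_t a, b, c, d; };

        /* Fletcher-style running sums over the 32-bit words of a block. */
        static struct cksum fletcher4(const void *buf, size_t size)
        {
            const uint32_t *p = buf;
            struct cksum k = { 0, 0, 0, 0 };
            for (size_t i = 0; i < size / sizeof(uint32_t); i++) {
                k.a += p[i]; k.b += k.a; k.c += k.b; k.d += k.c;
            }
            return k;
        }

        int main(void)
        {
            uint32_t block[1024] = { 1, 2, 3 };      /* pretend disk block */
            struct cksum stored = fletcher4(block, sizeof(block));

            block[7] ^= 0x4000;                      /* simulate silent corruption */

            struct cksum now = fletcher4(block, sizeof(block));
            if (now.a != stored.a || now.b != stored.b ||
                now.c != stored.c || now.d != stored.d)
                puts("checksum mismatch: reconstruct from the other copy");
            else
                puts("block verified");
            return 0;
        }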

    - bill

  5. Re: Open source storage

    One might argue that Reiser has done a pretty slick job of marketing
    his FS as well. I have heard that he hasn't really run his FS on any
    enterprise class storage. Understandable considering he's a small
    shop. Maybe this has changed.

    XFS has had its own issues. Yes you have on-disk contiguity, but if
    you lose power while XFS is building its extents, you've got data
    corruption.

    I don't have any direct experience with ZFS...I'm trying to talk one
    of my buddies into letting me play with it on a system he has on his
    site though.

    So I really think the issues stopping people from deploying open
    source storage are:
    1. Lack of snapshots, which may not be an issue if ZFS gains traction.
    2. No coherent DR strategy. I don't consider rsync a mirroring
    solution if it needs to walk the tree each time.
    3. It seems like storage admins still need to have that support
    hotline printed out and pinned next to their workstation :-)

    Anyone think any different?




  6. Re: Open source storage

    S wrote:
    > One might argue that Reiser has done a pretty slick job of marketing
    > his FS as well.


    Yes, he has - but there's more relative substance behind that marketing
    than there is behind ZFS's (after all, when you promote yourself as "The
    Last Word In File Systems" it's easy to fall quite embarrassingly short).

    > I have heard that he hasn't really run his FS on any
    > enterprise class storage.


    The subject was not breadth of existing deployment but sophistication.

    ....

    > XFS has had its own issues. Yes you have on-disk contiguity, but if
    > you lose power while XFS is building its extents, you've got data
    > corruption.


    I'd like to see a credible reference for that allegation (unless you're
    simply referring to the potential inconsistency that virtually all
    update-in-place file systems have when *updating* - rather than writing
    for the first time - multiple sectors at once).

    ....

    > So I really think the issues stopping people from deploying open
    > source storage are:
    > 1. Lack of snapshots, which may not be an issue if ZFS gains traction.


    My impression is that snapshots have been available in Linux, BSD, and
    for that matter Solaris itself for many years in various forms
    associated with LVMs and/or file systems.

    > 2. No coherent DR strategy. I don't consider rsync a mirroring
    > solution if it needs to walk the tree each time.


    Synchronous mirroring at the driver level has been available for ages,
    and is entirely feasible across distances of at least 100 miles - enough
    to survive any disaster which your business is likely to survive as long
    as your remote site is reasonably robust. If write performance
    requirements can be relaxed a bit distances can be significantly
    greater. I haven't looked recently, so I don't know how well those
    facilities deal with temporary link interruptions and subsequent
    catch-up (if you've got dedicated fiber to a robust back-up site that
    may not be too likely to occur, but in other circumstances it would be
    very desirable).

    - bill

  7. Re: Open source storage

    On Feb 18, 4:52 pm, Bill Todd wrote:
    > S wrote:
    > > One might argue that Reiser has done a pretty slick job of marketing
    > > his FS as well.

    >
    > Yes, he has - but there's more relative substance behind that marketing
    > than there is behind ZFS's (after all, when you promote yourself as "The
    > Last Word In File Systems" it's easy to fall quite embarrassingly short).


    That's pretty funny and I would have to agree :-)

    > > I have heard that he hasn't really run his FS on any
    > > enterprise class storage.

    >
    > The subject was not breadth of existing deployment but sophistication.


    Right, but if Reiser hasn't run his FS on any enterprise-class storage
    how can we assume it's ready for prime-time, enterprise-class
    deployment?

    > > XFS has had its own issues. Yes you have on-disk contiguity, but if
    > > you lose power while XFS is building its extents, you've got data
    > > corruption.

    >
    > I'd like to see a credible reference for that allegation (unless you're
    > simply referring to the potential inconsistency that virtually all
    > update-in-place file systems have when *updating* - rather than writing
    > for the first time - multiple sectors at once).


    See section 6.1: Delaying allocation

    http://oss.sgi.com/projects/xfs/pape...nix/index.html

    I remember reading another paper with detailed descriptions of causing
    data corruption on XFS through power manipulation but of course I
    can't find it anymore.

    > > So I really think the issues stopping people from deploying open
    > > source storage are:
    > > 1. Lack of snapshots, which may not be an issue if ZFS gains traction.

    >
    > My impression is that snapshots have been available in Linux, BSD, and
    > for that matter Solaris itself for many years in various forms
    > associated with LVMs and/or file systems.


    I believe you can only have 1 snapshot at a time in LVM. Nowhere near
    the sophistication of WAFL snapshots.

    > > 2. No coherent DR strategy. I don't consider rsync a mirroring
    > > solution if it needs to walk the tree each time.

    >
    > Synchronous mirroring at the driver level has been available for ages,
    > and is entirely feasible across distances of at least 100 miles - enough
    > to survive any disaster which your business is likely to survive as long
    > as your remote site is reasonably robust. If write performance
    > requirements can be relaxed a bit distances can be significantly
    > greater. I haven't looked recently, so I don't know how well those
    > facilities deal with temporary link interruptions and subsequent
    > catch-up (if you've got dedicated fiber to a robust back-up site that
    > may not be too likely to occur, but in other circumstances it would be
    > very desirable)


    Can you name some examples of synchronous mirroring at the driver
    level? Is it open source? Easy to deploy?

    Bottom line: I'd like to see people deploy Open Source Storage in
    their data centers. I'm just wondering why it hasn't happened yet and
    offering possible reasons.

    S


    >
    > - bill



  8. Re: Open source storage

    S wrote:

    ....

    > if Reiser hasn't run his FS on any enterprise-class storage
    > how can we assume it's ready for prime-time, enterprise-class
    > deployment?


    Because any failure of enterprise-class storage to faithfully mimic
    (e.g.) SCSI behavior should be considered to be an enterprise-storage
    bug rather than any problem with the file system?

    >
    >>> XFS has had its own issues. Yes you have on-disk contiguity, but if
    >>> you lose power while XFS is building its extents, you've got data
    >>> corruption.

    >> I'd like to see a credible reference for that allegation (unless you're
    >> simply referring to the potential inconsistency that virtually all
    >> update-in-place file systems have when *updating* - rather than writing
    >> for the first time - multiple sectors at once).

    >
    > See section 6.1: Delaying allocation
    >
    > http://oss.sgi.com/projects/xfs/pape...nix/index.html


    There's nothing there that even remotely hints at data corruption on
    power loss: the defined semantics of any normal Unix-style file system
    (including ZFS) specifies that any user data that hasn't been explicitly
    flushed to disk may or may not be on the disk, in whole or in part,
    should power fail (that's what write-back caching is all about: if you
    want atomic on-disk persistence, you use fsync or per-request
    write-through - though even those won't necessarily guarantee
    full-request, let alone multi-request, atomicity beyond the individual
    file block level should power fail before the request completes, even on
    ZFS; about the only difference with ZFS is that individual file block
    disk writes are guaranteed to be atomic rather than just the
    near-guarantee that disks provide that individual sector writes will be
    atomic).
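
    (To make the fsync / per-request write-through distinction concrete, a
    minimal sketch - the file name is invented, and note that neither call
    buys multi-request atomicity; they only bound how much unflushed data
    can be lost:)

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            const char msg[] = "important record\n";

            /* Normal write-back caching, with an explicit flush when it matters. */
            int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
            if (fd < 0) { perror("open"); return 1; }
            if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }
            if (fsync(fd) < 0) { perror("fsync"); return 1; }  /* push data to stable storage */
            close(fd);

            /* Per-request write-through: every write() returns only after the
             * data has been handed to stable storage. */
            fd = open("journal.dat", O_WRONLY | O_APPEND | O_SYNC);
            if (fd < 0) { perror("open"); return 1; }
            if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }
            close(fd);
            return 0;
        }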

    It's been many years since I read that paper, though, and it provided a
    pleasant trip down memory lane. XFS did a lot of interesting things for
    the early '90s, even if not all of them were necessarily optimal.

    ....

    >>> So I really think the issues stopping people from deploying open
    >>> source storage are:
    >>> 1. Lack of snapshots, which may not be an issue if ZFS gains traction.

    >> My impression is that snapshots have been available in Linux, BSD, and
    >> for that matter Solaris itself for many years in various forms
    >> associated with LVMs and/or file systems.

    >
    > I believe you can only have 1 snapshot at a time in LVM. Nowhere near
    > the sophistication of WAFL snapshots.


    But that's all you need to do an on-line backup, one of the most important
    consumers of snapshot technology. Other uses of snapshots tend to be
    more like inferior substitutes for 'continuous data protection'
    facilities, though the advent of writable snapshots (clones) has opened
    up new uses (at least new imaginable uses: how much actual utility they
    have I'm not sure).

    The old Solaris fssnap mechanism may have been limited to a single
    snapshot. Peter Braam et al. produced alpha and beta releases of a more
    general snapshot facility called snapfs in 2001, which I thought either
    got further developed or was replaced by another product of the same name,
    but I didn't find further information on it. The Linux LVM and LVM2
    support snapshots (the latter including writable snapshots) - and a
    quick glance at the documentation didn't seem to indicate that they
    supported only one at a time.

    >
    >>> 2. No coherent DR strategy. I don't consider rsync a mirroring
    >>> solution if it needs to walk the tree each time.

    >> Synchronous mirroring at the driver level has been available for ages,
    >> and is entirely feasible across distances of at least 100 miles - enough
    >> to survive any disaster which your business is likely to survive as long
    >> as your remote site is reasonably robust. If write performance
    >> requirements can be relaxed a bit distances can be significantly
    >> greater. I haven't looked recently, so I don't know how well those
    >> facilities deal with temporary link interruptions and subsequent
    >> catch-up (if you've got dedicated fiber to a robust back-up site that
    >> may not be too likely to occur, but in other circumstances it would be
    >> very desirable)

    >
    > Can you name some examples of synchronous mirroring at the driver
    > level? Is it open source? Easy to deploy?


    I'm not all that familiar with the offerings, but my impression is that
    DRBD may be the current Linux standard in this area; a 2003 description
    can be found at http://www.linux-mag.com/id/1502 , and it's still being
    developed (just Google it). You may have been able to roll your own
    remote replication before DRBD by using a remote disk paired
    (RAID-1-style) with a local disk under local LVM facilities.

    - bill

  9. Re: Open source storage



    Bill Todd wrote:

    > There's nothing there that even remotely hints at data corruption on
    > power loss: the defined semantics of any normal Unix-style file
    > system (including ZFS) specifies that any user data that hasn't been
    > explicitly flushed to disk may or may not be on the disk, in whole or
    > in part, should power fail (that's what write-back caching is all
    > about: ...



    Sorry to go off on a tangent but I think it is somewhat relevant since S
    was talking about Enterprise storage: How common is it for enterprise
    storage vendors to have disks with firmware that makes it impossible to
    enable the write-back cache? We have an SGI NAS (IS4500) where this is
    the case and it took me a little by surprise, although it does make a
    lot of sense when you have 100 TB storage. Does most or all enterprise
    storage permanently disable write-back cache?

    Thanks,

    Steve


  10. Re: Open source storage

    Steve Cousins wrote:
    >
    >
    > Bill Todd wrote:
    >
    >> There's nothing there that even remotely hints at data corruption on
    >> power loss: the defined semantics of any normal Unix-style file
    >> system (including ZFS) specifies that any user data that hasn't been
    >> explicitly flushed to disk may or may not be on the disk, in whole or
    >> in part, should power fail (that's what write-back caching is all
    >> about: ...

    >
    >
    > Sorry to go off on a tangent but I think it is somewhat relevant since S
    > was talking about Enterprise storage: How common is it for enterprise
    > storage vendors to have disks with firmware that makes it impossible to
    > enable the write-back cache? We have an SGI NAS (IS4500) where this is
    > the case and it took me a little by surprise, although it does make a
    > lot of sense when you have 100 TB storage. Does most or all enterprise
    > storage permanently disable write-back cache?
    >
    > Thanks,
    >
    > Steve
    >


    Ha, I'm more shocked that anything from SGI is still in use.

  11. Re: Open source storage

    On Feb 19, 9:50 pm, Bill Todd wrote:
    > S wrote:
    >
    > ...
    >
    > > if Reiser hasn't run his FS on any enterprise-class storage
    > > how can we assume it's ready for prime-time, enterprise-class
    > > deployment?

    >
    > Because any failure of enterprise-class storage to faithfully mimic
    > (e.g.) SCSI behavior should be considered to be an enterprise-storage
    > bug rather than any problem with the file system?


    :-)

    > >>> XFS has had its own issues. Yes you have on-disk contiguity, but if
    > >>> you lose power while XFS is building its extents, you've got data
    > >>> corruption.
    > >> I'd like to see a credible reference for that allegation (unless you're
    > >> simply referring to the potential inconsistency that virtually all
    > >> update-in-place file systems have when *updating* - rather than writing
    > >> for the first time - multiple sectors at once).

    >
    > > See section 6.1: Delaying allocation

    >
    > > http://oss.sgi.com/projects/xfs/pape...nix/index.html

    >
    > There's nothing there that even remotely hints at data corruption on
    > power loss: the defined semantics of any normal Unix-style file system
    > (including ZFS) specifies that any user data that hasn't been explicitly
    > flushed to disk may or may not be on the disk, in whole or in part,
    > should power fail (that's what write-back caching is all about: if you
    > want atomic on-disk persistence, you use fsync or per-request
    > write-through - though even those won't necessarily guarantee
    > full-request, let alone multi-request, atomicity beyond the individual
    > file block level should power fail before the request completes, even on
    > ZFS; about the only difference with ZFS is that individual file block
    > disk writes are guaranteed to be atomic rather than just the
    > near-guarantee that disks provide that individual sector writes will be
    > atomic).


    You're right semantically. I understand the difference between sync
    and async, but it seems like the XFS designers almost went out of
    their way to ensure your data got corrupted when you lost power. The
    early SGI systems were designed with special hardware to shutdown
    gracefully in case of power loss, so maybe XFS was designed on the
    assumption that this would always be the case.

    I'll take boring old ext3 anytime over XFS or ReiserFS, I don't like
    to live life on the edge when it comes to my data or worse, other
    people's data.

    > It's been many years since I read that paper, though, and it provided a
    > pleasant trip down memory lane. XFS did a lot of interesting things for
    > the early '90s, even if not all of them were necessarily optimal.


    I also like it because it's a very clearly written paper. The fact that
    you and I remember it after all these years is a testament to that.

    > ...
    >
    > >>> So I really think the issues stopping people from deploying open
    > >>> source storage are:
    > >>> 1. Lack of snapshots, which may not be an issue if ZFS gains traction.
    > >> My impression is that snapshots have been available in Linux, BSD, and
    > >> for that matter Solaris itself for many years in various forms
    > >> associated with LVMs and/or file systems.

    >
    > > I believe you can only have 1 snapshot at a time in LVM. Nowhere near
    > > the sophistication of WAFL snapshots.

    >
    > But all that you need to do an on-line backup, one of the most important
    > consumers of snapshot technology. Other uses of snapshots tend to be
    > more like inferior substitutes for 'continuous data protection'
    > facilities, though the advent of writable snapshots (clones) has opened
    > up new uses (at least new imaginable uses: how much actual utility they
    > have I'm not sure).
    >
    > The old Solaris fssnap mechanism may have been limited to a single
    > snapshot. Peter Braam et al. produced alpha and beta releases of a more
    > general snapshot facility called snapfs in 2001 which I thought either
    > got further developed or replaced with another product of the same name,
    > but I didn't find further information on it. The Linux LVM and LVM2
    > support snapshots (the latter including writable snapshots) - and a
    > quick glance at the documentation didn't seem to indicate that they
    > supported only one at a time.


    I looked at the LVM docs again, and realized why I thought you were
    limited to only one. The LVM guys advise against running too many
    because they can slow down your FS; at some point I must have
    made a mental note to myself never to run more than one snapshot.

    You can have hundreds of snapshots in WAFL, without significant
    degradation of performance. I agree this is a substitute for CDP, but
    people love this feature, and want it. Again, maybe not having enough
    snapshots is a barrier keeping admins from adopting open source storage.

    > >>> 2. No coherent DR strategy. I don't consider rsync a mirroring
    > >>> solution if it needs to walk the tree each time.
    > >> Synchronous mirroring at the driver level has been available for ages,
    > >> and is entirely feasible across distances of at least 100 miles - enough
    > >> to survive any disaster which your business is likely to survive as long
    > >> as your remote site is reasonably robust. If write performance
    > >> requirements can be relaxed a bit distances can be significantly
    > >> greater. I haven't looked recently, so I don't know how well those
    > >> facilities deal with temporary link interruptions and subsequent
    > >> catch-up (if you've got dedicated fiber to a robust back-up site that
    > >> may not be too likely to occur, but in other circumstances it would be
    > >> very desirable)

    >
    > > Can you name some examples of synchronous mirroring at the driver
    > > level? Is it open source? Easy to deploy?

    >
    > I'm not all that familiar with the offerings, but my impression is that
    > DRBD may be the current Linux standard in this area; a 2003 description
    > can be found at http://www.linux-mag.com/id/1502 , and it's still being
    > developed (just Google it). You may have been able to roll your own
    > remote replication before DRBD by using a remote disk paired
    > (RAID-1-style) with a local disk under local LVM facilities.


    OK, I'll be sure to check it out. Another solution of sorts is NBD (is
    that what you meant by a remote disk pair?). Pretty neat stuff... but
    it's a start.


    >
    > - bill



  12. Re: Open source storage

    Steve Cousins wrote:
    >
    >
    > Bill Todd wrote:
    >
    >> There's nothing there that even remotely hints at data corruption on
    >> power loss: the defined semantics of any normal Unix-style file
    >> system (including ZFS) specifies that any user data that hasn't been
    >> explicitly flushed to disk may or may not be on the disk, in whole or
    >> in part, should power fail (that's what write-back caching is all
    >> about: ...

    >
    >
    > Sorry to go off on a tangent but I think it is somewhat relevant since S
    > was talking about Enterprise storage: How common is it for enterprise
    > storage vendors to have disks with firmware that makes it impossible to
    > enable the write-back cache?


    I'd guess pretty common (whether in the disk firmware or via array
    firmware that explicitly tells the disks to disable it): array
    software/firmware really wants to know when data has made it to the
    platter, and while, if sufficiently intelligent, it can juggle enabled
    disk-level write-back cache by using explicit disk-level 'force unit
    access' request options and/or cache-flush commands (some ATA drives
    have been known to ignore the latter without notice, but presumably the
    array vendor wouldn't qualify such disks for use in the array), there's
    little reason to do so, since typically the (enterprise) controller has
    considerably more non-volatile write-back cache than the disks have
    local volatile write-back cache.

    However, that has nothing to do with what I was talking about - which
    was file-system-level write-back caching and the resulting
    file-system-defined persistence semantics.

    - bill

  13. Re: Open source storage

    S wrote:

    ....

    >>>>> XFS has had its own issues. Yes you have on-disk contiguity, but if
    >>>>> you lose power while XFS is building its extents, you've got data
    >>>>> corruption.
    >>>> I'd like to see a credible reference for that allegation (unless you're
    >>>> simply referring to the potential inconsistency that virtually all
    >>>> update-in-place file systems have when *updating* - rather than writing
    >>>> for the first time - multiple sectors at once).
    >>> See section 6.1: Delaying allocation
    >>> http://oss.sgi.com/projects/xfs/pape...nix/index.html

    >> There's nothing there that even remotely hints at data corruption on
    >> power loss: the defined semantics of any normal Unix-style file system
    >> (including ZFS) specifies that any user data that hasn't been explicitly
    >> flushed to disk may or may not be on the disk, in whole or in part,
    >> should power fail (that's what write-back caching is all about: if you
    >> want atomic on-disk persistence, you use fsync or per-request
    >> write-through - though even those won't necessarily guarantee
    >> full-request, let alone multi-request, atomicity beyond the individual
    >> file block level should power fail before the request completes, even on
    >> ZFS; about the only difference with ZFS is that individual file block
    >> disk writes are guaranteed to be atomic rather than just the
    >> near-guarantee that disks provide that individual sector writes will be
    >> atomic).

    >
    > You're right semantically. I understand the difference between sync
    > and async, but it seems like the XFS designers almost went out of
    > their way to ensure your data got corrupted when you lost power.


    I see your point, but it strikes me as one of the 'a little bit
    pregnant' variety: absent explicit write-through or cache-flush
    control, *any* Unix file system will tend to produce data
    inconsistencies after interruption, the only question being just how
    many (not whether there will be any at all).

    What the XFS designers went out of their way to do was to avoid writing
    data that never needed to be written (files that got deleted before ever
    making it to disk) and avoid fragmenting data that did get written (by
    deferring allocation and writing as long as feasible). As a by-product,
    dirty data in the cache didn't get flushed out as often as in more
    primitive file system environments where flushing data older than (e.g.)
    30 seconds (ZFS uses 5 seconds as its default IIRC) didn't have any real
    down-side.

    To put it another way, arbitrarily making data persistent frequently for
    an application or user that isn't sufficiently interested to have taken
    the appropriate steps to do so penalizes those applications and users
    that *have* taken such steps (by consuming system resources
    unnecessarily). And since you can never completely protect such
    negligent applications/users (unless you make every write synchronous),
    going to the opposite extreme (and thereby encouraging them actually to
    address the issue rather than merely hope that it won't bite them too
    frequently) has merit.

    That said, a different design might have achieved more up-to-date
    persistence with minimal impact on system resource consumption (e.g., by
    dumping small user data updates lazily into the log temporarily).

    > The early SGI systems were designed with special hardware to shutdown
    > gracefully in case of power loss, so maybe XFS was designed on the
    > assumption that this would always be the case.


    Could be, but I kind of doubt it: with potentially gigabytes of
    discontiguous dirty data in system cache, you'd need a full-blown UPS to
    guarantee persistence in such a case (and since even UPSs have been
    known to fail, a pair of them suitably wired for redundancy).

    > I'll take boring old ext3 anytime over XFS or ReiserFS, I don't like
    > to live life on the edge when it comes to my data or worse, other
    > people's data.


    Then you really should consider a system like VMS, where at least many
    writes are synchronous by default: Unix file systems *always* 'live on
    the edge' in the sense that you describe - the only question being just
    how sharp the edge is.

    - bill

  14. Re: Open source storage

    Steve Cousins wrote:
    >
    >
    > Bill Todd wrote:
    >
    >> There's nothing there that even remotely hints at data corruption on
    >> power loss: the defined semantics of any normal Unix-style file
    >> system (including ZFS) specifies that any user data that hasn't been
    >> explicitly flushed to disk may or may not be on the disk, in whole or
    >> in part, should power fail (that's what write-back caching is all
    >> about: ...

    >
    >
    > Sorry to go off on a tangent but I think it is somewhat relevant since S
    > was talking about Enterprise storage: How common is it for enterprise
    > storage vendors to have disks with firmware that makes it impossible to
    > enable the write-back cache? We have an SGI NAS (IS4500) where this is
    > the case and it took me a little by surprise, although it does make a
    > lot of sense when you have 100 TB storage. Does most or all enterprise
    > storage permanently disable write-back cache?


    In the drives themselves? It is easier to handle your own write/read
    caching strategy with enterprise grade hardware than it is to trust what
    a given disk vendor might offer. Also makes it easier to multi-source
    drives.

  15. Re: Open source storage

    On Feb 20, 5:08 pm, Bill Todd wrote:
    > S wrote:
    > >>>>> XFS has had its own issues. Yes you have on-disk contiguity, but if
    > >>>>> you lose power while XFS is building its extents, you've got data
    > >>>>> corruption.
    > >>>> I'd like to see a credible reference for that allegation (unless you're
    > >>>> simply referring to the potential inconsistency that virtually all
    > >>>> update-in-place file systems have when *updating* - rather than writing
    > >>>> for the first time - multiple sectors at once).
    > >>> See section 6.1: Delaying allocation
    > >>> http://oss.sgi.com/projects/xfs/pape...nix/index.html
    > >> There's nothing there that even remotely hints at data corruption on
    > >> power loss: the defined semantics of any normal Unix-style file system
    > >> (including ZFS) specifies that any user data that hasn't been explicitly
    > >> flushed to disk may or may not be on the disk, in whole or in part,
    > >> should power fail (that's what write-back caching is all about: if you
    > >> want atomic on-disk persistence, you use fsync or per-request
    > >> write-through - though even those won't necessarily guarantee
    > >> full-request, let alone multi-request, atomicity beyond the individual
    > >> file block level should power fail before the request completes, even on
    > >> ZFS; about the only difference with ZFS is that individual file block
    > >> disk writes are guaranteed to be atomic rather than just the
    > >> near-guarantee that disks provide that individual sector writes will be
    > >> atomic).

    >
    > > You're right semantically. I understand the difference between sync
    > > and async, but it seems like the XFS designers almost went out of
    > > their way to ensure your data got corrupted when you lost power.

    >
    > I see your point, but it strikes me as one of the 'a little bit
    > pregnant' variety: absent explicit write-through or cache-flush
    > control, *any* Unix file system will tend to produce data
    > inconsistencies after interruption, the only question being just how
    > many (not whether there will be any at all).


    For a change, I agree somewhat with your initial analogy. Using XFS is
    like having sex without a condom: it's great, but very unsafe if you
    don't know what you're doing. Not something I'd use in an enterprise
    scenario.

    > What the XFS designers went out of their way to do was to avoid writing
    > data that never needed to be written (files that got deleted before ever
    > making it to disk) and avoid fragmenting data that did get written (by
    > deferring allocation and writing as long as feasible). As a by-product,
    > dirty data in the cache didn't get flushed out as often as in more
    > primitive file system environments where flushing data older than (e.g.)
    > 30 seconds (ZFS uses 5 seconds as its default IIRC) didn't have any real
    > down-side.
    >
    > To put it another way, arbitrarily making data persistent frequently for
    > an application or user that isn't sufficiently interested to have taken
    > the appropriate steps to do so penalizes those applications and users
    > that *have* taken such steps (by consuming system resources
    > unnecessarily). And since you can never completely protect such
    > negligent applications/users (unless you make every write synchronous),
    > going to the opposite extreme (and thereby encouraging them actually to
    > address the issue rather than merely hope that it won't bite them too
    > frequently) has merit.


    Again, your points make great sense in theory, but wouldn't fly in the
    real world. Obviously you wouldn't want to make every write
    synchronous, but there is a gray area between that and the XFS/
    ReiserFS approach which just says hey, you're on your own.

    > That said, a different design might have achieved more up-to-date
    > persistence with minimal impact on system resource consumption (e.g., by
    > dumping small user data updates lazily into the log temporarily).


    This I agree with 100%.

    >
    > > The early SGI systems were designed with special hardware to shutdown
    > > gracefully in case of power loss, so maybe XFS was designed on the
    > > assumption that this would always be the case.

    >
    > Could be, but I kind of doubt it: with potentially gigabytes of
    > discontiguous dirty data in system cache, you'd need a full-blown UPS to
    > guarantee persistence in such a case (and since even UPSs have been
    > known to fail, a pair of them suitably wired for redundancy).


    Check these out:
    http://www.ibm.com/developerworks/li...ry/l-fs11.html
    http://linuxmafia.com/faq/Filesystems/reiserfs.html

    > > I'll take boring old ext3 anytime over XFS or ReiserFS, I don't like
    > > to live life on the edge when it comes to my data or worse, other
    > > people's data.

    >
    > Then you really should consider a system like VMS, where at least many
    > writes are synchronous by default: Unix file systems *always* 'live on
    > the edge' in the sense that you describe - the only question being just
    > how sharp the edge is.


    Come on. Boring is one thing, dead is another.

    S

    >
    > - bill



  16. Re: Open source storage



    Lon wrote:

    > Steve Cousins wrote:
    >
    >> Sorry to go off on a tangent but I think it is somewhat relevant
    >> since S was talking about Enterprise storage: How common is it for
    >> enterprise storage vendors to have disks with firmware that makes it
    >> impossible to enable the write-back cache? We have an SGI NAS
    >> (IS4500) where this is the case and it took me a little by surprise,
    >> although it does make a lot of sense when you have 100 TB storage.
    >> Does most or all enterprise storage permanently disable write-back
    >> cache?

    >
    >
    > In the drives themselves? It is easier to handle your own write/read
    > caching strategy with enterprise grade hardware than it is to trust
    > what a given disk vendor might offer. Also makes it easier to
    > multi-source drives.



    Yes. They have drives from Seagate with custom firmware. I had thought
    that performance would go down the tubes with this but doing tests with
    bonnie++ I got 50 MB/sec writes instead of 70 MB/sec for an individual
    drive (comparing it with a non-SGI version of the same drive). While
    this is noticeable it isn't as bad as I had expected.


  17. Re: Open source storage

    S wrote:
    [SNIP]
    >
    > For a change, I agree somewhat with your initial analogy. Using XFS is
    > like having sex without a condom: it's great, but very unsafe if you
    > don't know what you're doing. Not something I'd use in an enterprise
    > scenario.

    Catch up with the real world - XFS has been in use in production sites
    for quite a few years now, not just the various super-computer sites
    that run SGI's Altix systems, but several others, lots of geo and oil
    stuff, not to mention the various media houses using CXFS, and, no
    doubt, the various mil and gov sites that SGI can't talk about.

    [SNIP]
    > Again, your points make great sense in theory, but wouldn't fly in the
    > real world. Obviously you wouldn't want to make every write
    > synchronous, but there is a gray area between that and the XFS/
    > ReiserFS approach which just says hey, you're on your own.

    If you want it safe, use the "wsync" mount option on XFS, and all
    modifications are synchronous just like on VMS. Safe, but slow.
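
    (Roughly equivalent to "mount -o wsync /dev/sdb1 /mnt/data"; the device
    and mount point below are placeholders. Note that wsync makes the
    namespace/metadata operations synchronous - applications that also want
    their data written through still have to ask for that themselves.)

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            /* "wsync": XFS performs namespace operations synchronously. */
            if (mount("/dev/sdb1", "/mnt/data", "xfs", 0, "wsync") != 0) {
                perror("mount");
                return 1;
            }
            return 0;
        }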

    Cheers,
    Gary B-)

  18. Re: Open source storage

    S wrote:
    > On Feb 20, 5:08 pm, Bill Todd wrote:
    >> S wrote:
    >>>>>>> XFS has had its own issues. Yes you have on-disk contiguity, but if
    >>>>>>> you lose power while XFS is building its extents, you've got data
    >>>>>>> corruption.
    >>>>>> I'd like to see a credible reference for that allegation (unless you're
    >>>>>> simply referring to the potential inconsistency that virtually all
    >>>>>> update-in-place file systems have when *updating* - rather than writing
    >>>>>> for the first time - multiple sectors at once).
    >>>>> See section 6.1: Delaying allocation
    >>>>> http://oss.sgi.com/projects/xfs/pape...nix/index.html
    >>>> There's nothing there that even remotely hints at data corruption on
    >>>> power loss: the defined semantics of any normal Unix-style file system
    >>>> (including ZFS) specifies that any user data that hasn't been explicitly
    >>>> flushed to disk may or may not be on the disk, in whole or in part,
    >>>> should power fail (that's what write-back caching is all about: if you
    >>>> want atomic on-disk persistence, you use fsync or per-request
    >>>> write-through - though even those won't necessarily guarantee
    >>>> full-request, let alone multi-request, atomicity beyond the individual
    >>>> file block level should power fail before the request completes, even on
    >>>> ZFS; about the only difference with ZFS is that individual file block
    >>>> disk writes are guaranteed to be atomic rather than just the
    >>>> near-guarantee that disks provide that individual sector writes will be
    >>>> atomic).
    >>> You're right semantically. I understand the difference between sync
    >>> and async, but it seems like the XFS designers almost went out of
    >>> their way to ensure your data got corrupted when you lost power.

    >> I see your point, but it strikes me as one of the 'a little bit
    >> pregnant' variety: absent explicit write-through or cache-flush
    >> control, *any* Unix file system will tend to produce data
    >> inconsistencies after interruption, the only question being just how
    >> many (not whether there will be any at all).

    >
    > For a change, I agree somewhat with your initial analogy. Using XFS is
    > like having sex without a condom: it's great, but very unsafe if you
    > don't know what you're doing. Not something I'd use in an enterprise
    > scenario.


    Whereas using a more conventional Unix file system is like having sex
    while using a condom with a hole in it: somewhat lower risk if you
    don't know what you're doing, but nowhere nearly as low as using an
    uncompromised condom (or knowing what you're doing) would be.

    >
    >> What the XFS designers went out of their way to do was to avoid writing
    >> data that never needed to be written (files that got deleted before ever
    >> making it to disk) and avoid fragmenting data that did get written (by
    >> deferring allocation and writing as long as feasible). As a by-product,
    >> dirty data in the cache didn't get flushed out as often as in more
    >> primitive file system environments where flushing data older than (e.g.)
    >> 30 seconds (ZFS uses 5 seconds as its default IIRC) didn't have any real
    >> down-side.
    >>
    >> To put it another way, arbitrarily making data persistent frequently for
    >> an application or user that isn't sufficiently interested to have taken
    >> the appropriate steps to do so penalizes those applications and users
    >> that *have* taken such steps (by consuming system resources
    >> unnecessarily). And since you can never completely protect such
    >> negligent applications/users (unless you make every write synchronous),
    >> going to the opposite extreme (and thereby encouraging them actually to
    >> address the issue rather than merely hope that it won't bite them too
    >> frequently) has merit.

    >
    > Again, your points make great sense in theory, but wouldn't fly in the
    > real world.


    They fly just fine in the real world: they just don't fly as well with
    people who don't know what they're doing (who get burned regardless, but
    get burned a bit more when using a system that doesn't cater to them at
    the expense of the more competent).


    > Obviously you wouldn't want to make every write
    > synchronous, but there is a gray area between that and the XFS/
    > ReiserFS approach which just says hey, you're on your own.


    You still just don't get it: anything less than fully synchronous
    writes says hey, you're on your own - you're just not likely to get
    burned *as often* if you don't bother worrying about it if there are
    more frequent cache flushes.

    >
    >> That said, a different design might have achieved more up-to-date
    >> persistence with minimal impact on system resource consumption (e.g., by
    >> dumping small user data updates lazily into the log temporarily).

    >
    > This I agree with 100%.
    >
    >> The
    >>
    >>> early SGI systems were designed with special hardware to shutdown
    >>> gracefully in case of power loss, so maybe XFS was designed on the
    >>> assumption that this would always be the case.

    >> Could be, but I kind of doubt it: with potentially gigabytes of
    >> discontiguous dirty data in system cache, you'd need a full-blown UPS to
    >> guarantee persistence in such a case (and since even UPSs have been
    >> known to fail, a pair of them suitably wired for redundancy).

    >
    > Check these out:
    > http://www.ibm.com/developerworks/li...ry/l-fs11.html


    This reference (to one of a series of Web pages which included a lot of
    interesting material - thanks, even though it proved to be a major
    time-sink as well as leading me serendipitously through a tortuous
    series of intermediate links too numerous to remember to
    http://artlung.com/smorgasborg/C_R_Y..._I_C_O_N.shtml , an
    interesting piece of historical computing philosophy which I got sucked
    into because I like the author's novels) cites a bug in XFS V1.0 which
    was largely fixed in V1.1, in which case it's primarily of historical
    interest for that specific interval (prior to April, 2002).

    However, SGI's XFS FAQ verifies that corner cases remained until
    2.6.22-rc1 (May 8, 2007) - when they may have addressed the issue at its
    root in the manner that Ted describes for ext3 in your second reference
    (that's certainly the way I've designed that mechanism and it's
    something of a mystery why they didn't do so in XFS V1.0, since their
    recovery procedure already was able to identify and partially deal with
    the problem there).

    In any event, note that any corruption due to such a bug has relatively
    little to do with *how long* modified data is kept around in the cache
    before being updated in place (your original apparent concern with XFS,
    as reflected in the citation which you provided back then): it would
    happen nearly as frequently (though likely affect fewer files when it
    did) with ext3 if the same bug were present there.

    > http://linuxmafia.com/faq/Filesystems/reiserfs.html


    Ted's observation that more loosely-structured (in terms of predictable
    placement of critical metadata) file systems are harder to fsck is
    entirely valid. Such systems should compensate for this by ensuring
    that critical metadata is written redundantly even if user data is not
    (in order to make the need to fsck a once-in-a-lifetime event) - and in
    fact this is something that ZFS does right. In file systems that
    protect metadata updates with write-ahead logging the additional
    overhead can be minimal, since the eventual in-place (mirrored) updates
    can by definition be applied lazily; the same can be true for designs
    that perform flexible batch updates as ZFS and Reiser4 do.

    As for XFS and ReiserFS journaling, either Ted doesn't understand how
    XFS and/or ReiserFS implement write-ahead logging, or XFS and/or
    ReiserFS don't understand how to do write-ahead logging robustly (a look
    at the code, which I don't have time for, could determine which).

    There's nothing wrong with doing logical logging as long as you don't
    overlook the fact that when it finally *does* come time to update the
    on-disk copy of the metadata (rather than just continue to accumulate
    updates to it in memory while making them persistent in the log) you
    first have to dump the entire metadata block image into the log before
    updating it in place to avoid the problem that Ted describes. So either
    he's wrong (e.g., the conversation which he had with the XFS engineer
    might have referred to SGI's EFS implementation: Ted's reference to the
    XFS 1.0 problems described in your other link 32 months after XFS 1.1
    had been released suggests that his acquaintance with the file system
    may be somewhat casual), or XFS and/or ReiserFS have a bug (not a design
    flaw: it's readily rectifiable) in their logging implementation.
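
    (A minimal sketch of that ordering, with invented file names standing in
    for the journal and the metadata area; the only point being illustrated
    is that the full block image must be durable in the log before the
    in-place overwrite begins, so recovery can always redo it:)

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/types.h>
        #include <unistd.h>

        #define BLKSZ 4096

        static int write_block_logged(int logfd, int fsfd, off_t blkno,
                                      const unsigned char img[BLKSZ])
        {
            /* 1. Append the whole new block image to the write-ahead log... */
            if (write(logfd, img, BLKSZ) != BLKSZ) return -1;
            /* 2. ...and make it stable before touching the in-place copy. */
            if (fsync(logfd) != 0) return -1;
            /* 3. Only now overwrite the block in place; a crash here is
             *    recoverable by replaying the image from the log. */
            if (pwrite(fsfd, img, BLKSZ, blkno * BLKSZ) != BLKSZ) return -1;
            return fsync(fsfd);
        }

        int main(void)
        {
            unsigned char img[BLKSZ] = { 0 };   /* pretend metadata block image */
            int logfd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
            int fsfd  = open("metadata.img", O_RDWR | O_CREAT, 0644);
            if (logfd < 0 || fsfd < 0) { perror("open"); return 1; }
            if (write_block_logged(logfd, fsfd, 3, img) != 0) { perror("update"); return 1; }
            return 0;
        }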

    Once again, note that any corruption due to such a bug has nothing
    whatsoever to do with *how long* modified data is kept around in the
    cache before being updated in place: it's simply a matter of whether
    the eventual metadata in-place update happens to be interrupted.

    >
    >>> I'll take boring old ext3 anytime over XFS or ReiserFS, I don't like
    >>> to live life on the edge when it comes to my data or worse, other
    >>> people's data.

    >> Then you really should consider a system like VMS, where at least many
    >> writes are synchronous by default: Unix file systems *always* 'live on
    >> the edge' in the sense that you describe - the only question being just
    >> how sharp the edge is.

    >
    > Come on. Boring is one thing, dead is another.


    There's still a great deal of critical data being safely maintained in
    VMS environments, even though VMS itself sees minimal development
    activity these days. Perhaps you're just too solidly wedded to your
    illusion that (other issues being equal, of course) ext3 provides some
    qualitative difference in safety over less-frequently-flushed approaches
    to find *real* safety interesting. (Of course, if you enable user-data
    journaling in ext3 - or ReiserFS - that changes the situation
    significantly, and some of the little that I've read about JFS suggests
    that it may do so by default, and synchronously.)

    - bill

  19. Re: Open source storage

    On Feb 22, 11:21 am, Bill Todd wrote:
    > S wrote:
    > > On Feb 20, 5:08 pm, Bill Todd wrote:
    > >> S wrote:
    > >>>>>>> XFS has had its own issues. Yes you have on-disk contiguity, but if
    > >>>>>>> you lose power while XFS is building its extents, you've got data
    > >>>>>>> corruption.
    > >>>>>> I'd like to see a credible reference for that allegation (unless you're
    > >>>>>> simply referring to the potential inconsistency that virtually all
    > >>>>>> update-in-place file systems have when *updating* - rather than writing
    > >>>>>> for the first time - multiple sectors at once).
    > >>>>> See section 6.1: Delaying allocation
    > >>>>> http://oss.sgi.com/projects/xfs/pape...nix/index.html
    > >>>> There's nothing there that even remotely hints at data corruption on
    > >>>> power loss: the defined semantics of any normal Unix-style file system
    > >>>> (including ZFS) specifies that any user data that hasn't been explicitly
    > >>>> flushed to disk may or may not be on the disk, in whole or in part,
    > >>>> should power fail (that's what write-back caching is all about: if you
    > >>>> want atomic on-disk persistence, you use fsync or per-request
    > >>>> write-through - though even those won't necessarily guarantee
    > >>>> full-request, let alone multi-request, atomicity beyond the individual
    > >>>> file block level should power fail before the request completes, even on
    > >>>> ZFS; about the only difference with ZFS is that individual file block
    > >>>> disk writes are guaranteed to be atomic rather than just the
    > >>>> near-guarantee that disks provide that individual sector writes will be
    > >>>> atomic).
    > >>> You're right semantically. I understand the difference between sync
    > >>> and async, but it seems like the XFS designers almost went out of
    > >>> their way to ensure your data got corrupted when you lost power.
    > >> I see your point, but it strikes me as one of the 'a little bit
    > >> pregnant' variety: absent explicit write-through or cache-flush
    > >> control, *any* Unix file system will tend to produce data
    > >> inconsistencies after interruption, the only question being just how
    > >> many (not whether there will be any at all).

    >
    > > For a change, I agree somewhat with your initial analogy. Using XFS is
    > > like having sex without a condom: it's great, but very unsafe if you
    > > don't know what you're doing. Not something I'd use in an enterprise
    > > scenario.

    >
    > Whereas using a more conventional Unix file system is like having sex
    > while using a condom with a hole in it: somewhat lower risk if you
    > don't know what you're doing, but nowhere nearly as low as using an
    > uncompromised condom (or knowing what you're doing) would be.


    :-) I knew you would have a witty comeback to that one.

    > >> What the XFS designers went out of their way to do was to avoid writing
    > >> data that never needed to be written (files that got deleted before ever
    > >> making it to disk) and avoid fragmenting data that did get written (by
    > >> deferring allocation and writing as long as feasible). As a by-product,
    > >> dirty data in the cache didn't get flushed out as often as in more
    > >> primitive file system environments where flushing data older than (e.g.)
    > >> 30 seconds (ZFS uses 5 seconds as its default IIRC) didn't have any real
    > >> down-side.
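
    The scratch-file case that optimization targets looks roughly like this in
    C (file name invented): with allocation deferred, a file created, used, and
    unlinked between flush intervals never costs a single data write.

        /* Sketch only: a short-lived scratch file.  Under delayed allocation
         * (XFS-style), if this runs between flush intervals the data blocks
         * are never allocated or written at all. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            const char *path = "/var/tmp/scratch.tmp";   /* hypothetical name */
            char buf[4096] = { 0 };
            int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

            if (fd < 0) { perror("open"); return 1; }
            if (write(fd, buf, sizeof buf) < 0)   /* dirty some cached data,  */
                perror("write");
            unlink(path);                         /* then discard it: under   */
            close(fd);                            /* delayed allocation, none */
            return 0;                             /* of it need reach disk    */
        }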

    >
    > >> To put it another way, arbitrarily making data persistent frequently for
    > >> an application or user that isn't sufficiently interested to have taken
    > >> the appropriate steps to do so penalizes those applications and users
    > >> that *have* taken such steps (by consuming system resources
    > >> unnecessarily). And since you can never completely protect such
    > >> negligent applications/users (unless you make every write synchronous),
    > >> going to the opposite extreme (and thereby encouraging them actually to
    > >> address the issue rather than merely hope that it won't bite them too
    > >> frequently) has merit.

    >
    > > Again, your points make great sense in theory, but wouldn't fly in the
    > > real world.

    >
    > They fly just fine in the real world: they just don't fly as well with
    > people who don't know what they're doing (who get burned regardless, but
    > get burned a bit more when using a system that doesn't cater to them at
    > the expense of the more competent).
    >
    > > Obviously you wouldn't want to make every write
    > > synchronous, but there is a gray area between that and the XFS/
    > > ReiserFS approach which just says hey..you're on your own.

    >
    > You still just don't get it: anything less than fully synchronous
    > writes says hey, you're on your own - if you don't bother worrying
    > about it, more frequent cache flushes just mean you won't get burned
    > *as often*.


    I do get it, but IMHO the *as often* makes ALL of the difference in
    perception.

    >
    >
    > >> That said, a different design might have achieved more up-to-date
    > >> persistence with minimal impact on system resource consumption (e.g., by
    > >> dumping small user data updates lazily into the log temporarily).

    >
    > > This I agree with 100%.

    >
    > >>> The early SGI systems were designed with special hardware to shut down
    > >>> gracefully in case of power loss, so maybe XFS was designed on the
    > >>> assumption that this would always be the case.
    > >> Could be, but I kind of doubt it: with potentially gigabytes of
    > >> discontiguous dirty data in system cache, you'd need a full-blown UPS to
    > >> guarantee persistence in such a case (and since even UPSs have been
    > >> known to fail, a pair of them suitably wired for redundancy).

    >
    > > Check these out:
    > > http://www.ibm.com/developerworks/li...ry/l-fs11.html

    >
    > This reference (to one of a series of Web pages which included a lot of
    > interesting material - thanks, even though it proved to be a major
    > time-sink as well as leading me serendipitously through a tortuous
    > series of intermediate links too numerous to remember to http://artlung.com/smorgasborg/C_R_Y_P_T_O_N_O_M_I_C_O_N.shtml, an
    > interesting piece of historical computing philosophy which I got sucked
    > into because I like the author's novels) cites a bug in XFS V1.0 which
    > was largely fixed in V1.1, in which case it's primarily of historical
    > interest for that specific interval (prior to April, 2002).
    >
    > However, SGI's XFS FAQ verifies that corner cases remained until
    > 2.6.22-rc1 (May 8, 2007) - when they may have addressed the issue at its
    > root in the manner that Ted describes for ext3 in your second reference
    > (that's certainly the way I've designed that mechanism and it's
    > something of a mystery why they didn't do so in XFS V1.0, since their
    > recovery procedure already was able to identify and partially deal with
    > the problem there).
    >
    > In any event, note that any corruption due to such a bug has relatively
    > little to do with *how long* modified data is kept around in the cache
    > before being updated in place (your original apparent concern with XFS,
    > as reflected in the citation which you provided back then): it would
    > happen nearly as frequently (though likely affect fewer files when it
    > did) with ext3 if the same bug were present there.


    Correct, I am aware that these are 2 different issues.

    > > http://linuxmafia.com/faq/Filesystems/reiserfs.html

    >
    > Ted's observation that more loosely-structured (in terms of predictable
    > placement of critical metadata) file systems are harder to fsck is
    > entirely valid. Such systems should compensate for this by ensuring
    > that critical metadata is written redundantly even if user data is not
    > (in order to make the need to fsck a once-in-a-lifetime event) - and in
    > fact this is something that ZFS does right. In file systems that
    > protect metadata updates with write-ahead logging the additional
    > overhead can be minimal, since the eventual in-place (mirrored) updates
    > can by definition be applied lazily; the same can be true for designs
    > that perform flexible batch updates as ZFS and Reiser4 do.
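
    A rough C sketch of what that "redundant critical metadata" amounts to
    (block size, offsets, and the single-device layout are all invented): the
    same block image goes to two independent locations, and both copies are
    forced out before anything depends on them.

        #include <sys/types.h>
        #include <unistd.h>

        #define BLOCK_SIZE 4096

        /* Sketch only: a poor man's "ditto block".  A real file system would
         * put the two copies on different spindles and checksum them; here
         * both land at two hypothetical offsets on a single device. */
        static int write_metadata_redundantly(int dev_fd, const void *block,
                                              off_t primary, off_t mirror)
        {
            if (pwrite(dev_fd, block, BLOCK_SIZE, primary) != BLOCK_SIZE)
                return -1;
            if (pwrite(dev_fd, block, BLOCK_SIZE, mirror) != BLOCK_SIZE)
                return -1;
            return fsync(dev_fd);   /* both copies stable before use */
        }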
    >
    > As for XFS and ReiserFS journaling, either Ted doesn't understand how
    > XFS and/or ReiserFS implement write-ahead logging, or XFS and/or
    > ReiserFS don't understand how to do write-ahead logging robustly (a look
    > at the code which I don't have time for could determine which).
    >
    > There's nothing wrong with doing logical logging as long as you don't
    > overlook the fact that when it finally *does* come time to update the
    > on-disk copy of the metadata (rather than just continue to accumulate
    > updates to it in memory while making them persistent in the log) you
    > first have to dump the entire metadata block image into the log before
    > updating it in place to avoid the problem that Ted describes. So either
    > he's wrong (e.g., the conversation which he had with the XFS engineer
    > might have referred to SGI's EFS implementation: Ted's reference to the
    > XFS 1.0 problems described in your other link 32 months after XFS 1.1
    > had been released suggests that his acquaintance with the file system
    > may be somewhat casual), or XFS and/or ReiserFS have a bug (not a design
    > flaw: it's readily rectifiable) in their logging implementation.
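
    To spell out that rule, here's a minimal C sketch of physical
    (full-block-image) write-ahead logging - descriptors, offsets, and the log
    format are all invented, and error handling is pared down:

        #include <stdint.h>
        #include <string.h>
        #include <sys/types.h>
        #include <unistd.h>

        #define BLOCK_SIZE 4096

        struct log_record {
            uint64_t block_offset;        /* where the block lives in place */
            uint8_t  image[BLOCK_SIZE];   /* full after-image of the block  */
        };

        /* Sketch only: the complete new image is forced into the log *before*
         * the in-place write, so recovery can always replay a consistent copy
         * even if that in-place write is torn by a power cut. */
        static int journal_block_update(int log_fd, int fs_fd,
                                        off_t block_offset, const void *image)
        {
            struct log_record rec;

            rec.block_offset = (uint64_t)block_offset;
            memcpy(rec.image, image, BLOCK_SIZE);

            if (write(log_fd, &rec, sizeof rec) != (ssize_t)sizeof rec)
                return -1;
            if (fsync(log_fd) < 0)        /* log image must be stable first */
                return -1;

            if (pwrite(fs_fd, image, BLOCK_SIZE, block_offset) != BLOCK_SIZE)
                return -1;                /* in-place flush itself may be lazy */
            return 0;
        }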
    >
    > Once again, note that any corruption due to such a bug has nothing
    > whatsoever to do with *how long* modified data is kept around in the
    > cache before being updated in place: it's simply a matter of whether
    > the eventual metadata in-place update happens to be interrupted.
    >
    >
    >
    > >>> I'll take boring old ext3 anytime over XFS or ReiserFS, I don't like
    > >>> to live life on the edge when it comes to my data or worse, other
    > >>> people's data.
    > >> Then you really should consider a system like VMS, where at least many
    > >> writes are synchronous by default: Unix file systems *always* 'live on
    > >> the edge' in the sense that you describe - the only question being just
    > >> how sharp the edge is.

    >
    > > Come on. Boring is one thing, dead is another.

    >
    > There's still a great deal of critical data being safely maintained in
    > VMS environments, even though VMS itself sees minimal development
    > activity these days. Perhaps you're just too solidly wedded to your
    > illusion that (other issues being equal, of course) ext3 provides some
    > qualitative difference in safety over less-frequently-flushed approaches
    > to find *real* safety interesting. (Of course, if you enable user-data
    > journaling in ext3 - or ReiserFS - that changes the situation
    > significantly, and some of the little that I've read about JFS suggests
    > that it may do so by default, and ...



    Very cool. I've enjoyed this discussion.

    So I did notice that a lot of folks have been wondering about the same
    thing re: Open Source Storage:

    http://storagemojo.com/2007/10/26/wo...e-oss-storage/

    S

  20. Re: Open source storage

    On Feb 21, 7:55 pm, "Gary R. Schmidt" wrote:
    > S wrote:
    >
    > [SNIP]
    >
    > > For a change, I agree somewhat with your initial analogy. Using XFS is
    > > like having sex without a condom, it's great, but very unsafe if you
    > > don't know what you're doing. Not something I'd use in an enterprise
    > > scenario.

    >
    > Catch up with the real world - XFS has been in use in production sites
    > for quite a few years now, not just the various super-computer sites
    > that run SGI's Altix systems, but several others, lots of geo and oil
    > stuff, not to mention the various media houses using CXFS, and, no
    > doubt, the various mil and gov sites that SGI can't talk about.
    >


    Not sure if mil/gov sites qualify as the real world. Oil and Gas/Media
    were SGI's mainstay back in the day, so it wouldn't surprise me that
    they went with XFS on Linux now. To me, the real world is a bank, an
    Internet company, or a CAD engineering firm.


    > [SNIP]
    > > Again, your points make great sense in theory, but wouldn't fly in the
    > > real world. Obviously you wouldn't want to make every write
    > > synchronous, but there is a gray area between that and the XFS/
    > > ReiserFS approach which just says hey..you're on your own.

    >
    > If you want it safe, use the "wsync" mount option on XFS, and all
    > modifications are synchronous just like on VMS. Safe, but slow.
    >
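
    For anyone who wants to try that, a hedged illustration via the Linux
    mount(2) call (device and mount point invented); note that, as I understand
    it, wsync makes namespace/metadata operations synchronous rather than every
    data write.

        /* Sketch only: mounting XFS with the wsync option from C.
         * Equivalent to `mount -o wsync /dev/sdb1 /mnt/data` from a shell. */
        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            if (mount("/dev/sdb1", "/mnt/data", "xfs", 0, "wsync") != 0) {
                perror("mount");
                return 1;
            }
            return 0;
        }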


    Continuing with the above analogy, running VMS would be like having
    sex with your close relative. Boring, kinda familiar and possibly
    illegal in most states except Alabama.

    Later,
    S

    > Cheers,
    > Gary B-)


+ Reply to Thread