Big volumes, small files - Storage


Thread: Big volumes, small files

  1. Big volumes, small files

    Hello,

    I'd like to plan a new storage solution for a system currently in
    production.

    The system's storage is based on code which writes many files to the
    file system, with overall storage needs currently around 40TB and
    expected to reach hundreds of TBs. The average file size of the system
    is ~100K, which translates to ~500 million files today, and billions
    of files in the future. This storage is accessed over NFS by a rack of
    40 Linux blades, and is mostly read-only (99% of the activity is
    reads). While I realize calling this sub-optimal system design is
    probably an understatement, the design of the system is beyond my
    control and isn't likely to change in the near future.
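
    (For reference, here's the arithmetic behind those numbers as a quick
    Python sketch; the 300 TB value is just an arbitrary stand-in for
    "hundreds of TBs".)

        # Rough file-count estimate from the sizes quoted above.
        KB, TB = 1024, 1024 ** 4
        avg_file = 100 * KB              # ~100 KB average file size
        now, later = 40 * TB, 300 * TB   # 40 TB today; 300 TB stand-in figure

        print(f"files today: ~{now / avg_file / 1e6:.0f} million")
        # -> ~429 million, i.e. in the ~500 million ballpark
        print(f"files later: ~{later / avg_file / 1e9:.1f} billion")
        # -> ~3.2 billion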

    The system's current storage is based on 4 VxFS filesystems, created
    on SVM meta-devices each ~10TB in size. A 2-node Sun Cluster serves
    the filesystems, 2 filesystems per node. Each of the filesystems
    undergoes growfs as more storage is made available. We're looking for
    an alternative solution in an attempt to improve performance and our
    ability to recover from disasters (fsck on 2^42 files isn't practical,
    and that worries me - even the smallest filesystem inconsistency would
    leave me with lots of useless bits).

    Question is - can someone with experience with large filesystems and
    many small files share their stories? Is it practical to base such a
    solution on a few (8) large volumes, each with a single large filesystem
    in it?

    Many thanks in advance for any advice,
    - Yaniv


  2. Re: Big volumes, small files

    On Apr 18, 9:48 am, the.ak...@gmail.com wrote:
    > [Yaniv's original post quoted in full - trimmed; see post #1]


    The best bet would be to go to a NAS appliance, i.e. EMC or NetApp.
    Their NAS devices can handle this load better than any Veritas
    solution. The NSX model from EMC will let you go to 32 TB per file
    system per data mover. It also allows for backups via snapshots. Do
    not use ATA storage; try to use low-cost Fibre Channel drives. They
    have a higher run rate than standard ATA. If you used a DMX3 and an
    NSX you would be able to handle 2 to 3 years' worth of growth within a
    two-unit environment.


  3. Re: Big volumes, small files

    On 18 Apr 2007 06:48:30 -0700, the.aknin@gmail.com wrote:

    >[Yaniv's original post quoted in full - trimmed; see post #1]


    For your particular situation there is a universal statement that
    applies; you're screwed.

    The simple answer is there is nothing out there yet that will handle
    lots of small files well, except maybe a RamSAN.

    I agree with carmelomcc that a proprietary NAS may be a good fit, but
    I disagree that EMC makes a NAS. It's a pile of crap.
    Depending on budget, NetApp may do fine. There's also BlueArc and
    Agami. I don't think clustered storage is going to help you in any
    way.

    However, if you are mostly doing reads, what is keeping you from using
    a boatload of cache?

    Backups are going to be painful, no way around it. The best thing I
    can think of given your current environment is using cache devices for
    client performance enhancement and FlashBackup for backups and
    recovery.
    FlashBackup will take the entire image as a backup and not spend eons
    mapping blocks to files. I'm not exactly sure how it works but people
    swear by it.

    ~F

  4. Re: Big volumes, small files

    Faeandar wrote:
    > On 18 Apr 2007 06:48:30 -0700, the.aknin@gmail.com wrote:
    >
    >> [Yaniv's original post quoted in full - trimmed; see post #1]

    >
    > For your particular situation there is a universal statement that
    > applies; you're screwed.
    >
    > The simple answer is there is nothing out there yet that will handle
    > lots of small files well


    Though I have no direct experience with it, my impression is that this
    may be a workload which ZFS could handle well (I don't know what level
    of maturity ZFS has attained by now, but Apple's recent embrace of it
    suggests that it may be pretty solid). Its maximum block size goes to
    128KB, so many/most files could fit in a single block. It grows
    dynamically as required, without the 16TB (?) limit of ext[2|3]fs on
    Linux (though other mature Linux file systems like JFS and XFS might be
    worth considering - possibly even ReiserFS if V3 is sufficiently
    stable). Sun may support ZFS as a cluster file system by now (IIRC
    plans were in place to).

    Any mature journaling file system with snapshots should address the fsck
    and backup issues (one of the nice things about ZFS is that its
    background integrity-checking and increased metadata-replication
    mechanisms reduce even further the chances that the system will ever get
    sufficiently hosed that fsck would be required). If directories must be
    large (are all the files in just one?), you'd want a file system with
    b-tree or hash-indexed directories (I think everything I mentioned above
    qualifies).

    Good luck,

    - bill

  5. Re: Big volumes, small files

    On Fri, 20 Apr 2007 00:40:13 -0400, Bill Todd
    wrote:

    >Faeandar wrote:
    >> [Yaniv's original post and Faeandar's reply quoted in full - trimmed;
    >> see posts #1 and #3]
    >
    > [Bill Todd's reply quoted in full - trimmed; see post #4]



    ZFS is great in concept and I think they are on the right path,
    however it's not yet ready for primetime imo.

    The integrated integrity checking is extremely CPU intensive. It does
    not cluster yet, at least not as of 2 weeks ago. Many file systems
    grow dynamically so I would make that a check in ZFS's column. No
    practical TB limit is a win if you need to go beyond 16TB in a single
    FS.
    I'm not sure I see how snapshots or journaling helps with backups. It
    still has to map blocks to files, which is the long part of a backup.
    I know when NetApp backups occur it takes the snapshot and then tries
    to do a dump. If you have millions of files it can be hours before
    data is actually transferred, I believe ZFS is no different.

    Since the OP's IO pattern is mostly reads the cpu load may not be an
    issue but writes suffer a serious penalty if you are not cpu-rich.
    I've spoken with people who ran an Oracle db on ZFS and said they had
    to move back until they had a T2000 or so.

    ~F

  6. Re: Big volumes, small files

    Faeandar wrote:

    ....

    > ZFS is great in concept and I think they are on the right path,
    > however it's not yet ready for primetime imo.


    Though (as I already noted) I don't have any direct experience with it,
    my impression is that people are using it in production systems
    successfully - so a description of your specific reservations would be
    useful.

    >
    > The integrated integrity checking is extremely CPU intensive.


    I suspect that you're mistaken: IIRC it occurs as part of an
    already-existing data copy operation at a very low level in the disk
    read/write routines, and at close to memory-streaming speeds (i.e.,
    mostly using CPU cycles that are being used anyway just to copy the data).

    > It does not cluster yet, at least not as of 2 weeks ago.


    It was not clear that this was a requirement in this case - but since
    the OP mentioned clustering, I mentioned the soon-to-arrive capability.

    > Many file systems grow dynamically so I would make that a check in
    > ZFS's column.


    I'm not sure they grow dynamically quite as painlessly as ZFS does:
    usually, you first have to arrange to expand the underlying disk storage
    at the volume-manager level, and then have to incorporate the increase
    in volume size into the file system.

    > No practical TB limit is a win if you need to go beyond 16TB in a
    > single FS.
    > I'm not sure I see how snapshots or journaling helps with backups.


    I should have added the word 'respectively', I guess: journaling helps
    avoid the need for fsck, and snapshots help expedite backups (by
    avoiding any need for down-time while making them).

    > It still has to map blocks to files, which is the long part of a
    > backup.
    > I know when NetApp backups occur it takes the snapshot and then tries
    > to do a dump. If you have millions of files it can be hours before
    > data is actually transferred, I believe ZFS is no different.


    Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
    WAFL IIRC, though if WAFL does a good job of defragmenting files the
    difference may not be too substantial). With the OP's 100 KB file
    sizes, this means that each file can be accessed (backed up) with a
    single disk access, yielding a fairly respectable backup bandwidth of
    about 6 MB/sec (assuming that such an access takes about 16 ms. for a
    7200 rpm drive, including transfer time, and that the associated
    directory accesses can be batched during the scan).
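
    (The arithmetic behind that ~6 MB/sec figure, as a rough sketch - the
    16 ms whole-file access time is the assumption stated above:)

        # One ~100 KB file per ~16 ms disk access, per spindle.
        avg_file = 100 * 1024      # bytes
        access_time = 0.016        # seconds (seek + rotation + transfer)

        per_disk = avg_file / access_time / 1e6
        print(f"~{per_disk:.1f} MB/s per disk")   # -> ~6.4 MB/s

        # A full sweep of 40 TB at that rate from a single spindle:
        total = 40 * 1024 ** 4
        days = total / (avg_file / access_time) / 86400
        print(f"~{days:.0f} days for one disk")   # ~80 days; striping over
                                                  # many spindles divides this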

    >
    > Since the OP's IO pattern is mostly reads the cpu load may not be an
    > issue but writes suffer a serious penalty if you are not cpu-rich.


    I'm not sure why that would be the case even if the integrity-checking
    *were* CPU-intensive, since the overhead to check the integrity on a
    read should be just about the same as the overhead to generate the
    checksum on a write. True, one must generate it all the way back up to
    the system superblock for a write (one reason why I prefer a
    log-oriented implementation that can defer and consolidate such
    activity), but below the root unless you've got many of the
    intermediate-level blocks cached you have to access and validate them on
    each read (and with on the order of a billion files, my guess is that
    needed directory data will quite frequently not be cached).

    > I've spoken with people who ran an Oracle db on ZFS and said they had
    > to move back until they had a T2000 or so.


    Now in *that* application I suspect that a lot of the intermediate
    blocks *are* often cached on reads, which does drive up relative write
    overhead substantially (not so much due to integrity-checking per se -
    since as I already noted I think that it piggybacks on a copy operation
    - as due to the need to write back the entire block-tree path on each
    update).

    - bill

  7. Re: Big volumes, small files

    On 2007-04-18, the.aknin@gmail.com wrote:

    [saw your question on the GPFS forum, but prefer news..]

    > The system's storage is based on code which writes many files to the
    > file system, with overall storage needs currently around 40TB and
    > expected to reach hundreds of TBs. The average file size of the system
    > is ~100K, which translates to ~500 million files today, and billions
    > of files in the future. This storage is accessed over NFS by a rack of
    > 40 Linux blades, and is mostly read-only (99% of the activity is
    > reads). While I realize calling this sub-optimal system design is
    > probably an understatement, the design of the system is beyond my
    > control and isn't likely to change in the near future.


    > Question is - can someone with experience with large filesystems and
    > many small files share their stories? Is it practical to base such a
    > solution on a few (8) large volumes, each with a single large filesystem
    > in it?


    My largest GPFS system is 10 TB usable, 700 GB currently used, with an
    average file size of 70-80 KB and 9M inodes used. Not quite as large as
    your current system, but it might have some of the same properties. It's
    a Maildir-based mailserver cluster, and is working very well. The only
    issue we've had is that writing to directories with tens of thousands of
    files can be too expensive. We will probably need to move metadata to
    separate volumes, to give them more cache than the data volumes, to fix
    this.

    If I were to build your huge system with GPFS, I would first try to
    spread the files over as many directories as possible, and also across
    as many separate file systems as possible. Spread over many directories
    because I think GPFS does directory-level locking for some operations
    (adding new files?), and over as many file systems as possible to reduce
    fsck time (GPFS does do online fsck, but not everything can be fixed
    while online) and to make sure that a catastrophic file system error
    doesn't take down everything.
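
    (As a concrete illustration of the directory spreading - a minimal
    Python sketch; the two-level fan-out and the md5 prefix are arbitrary
    choices, nothing GPFS-specific:)

        # Spread files over a wide directory tree by hashing the file name.
        import hashlib
        import os

        LEVELS = 2   # two levels of two-hex-digit dirs: 256*256 = 65,536 leaves

        def placed_path(root, name):
            """Map a flat file name to root/<aa>/<bb>/<name> via an md5 prefix."""
            digest = hashlib.md5(name.encode()).hexdigest()
            parts = [digest[2 * i : 2 * i + 2] for i in range(LEVELS)]
            return os.path.join(root, *parts, name)

        # A billion files over 65,536 leaves is ~15,000 files per directory,
        # well under the "tens of thousands per directory" pain point above.
        print(placed_path("/gpfs/data", "object-0001234567.bin"))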

    I would also try to avoid NFS if possible. Having the clients mount the
    fs's as GPFS clients (tcpip or SAN) will probably be much better, and
    will avoid the bottlenecks and SPOFs of NFS.



    -jf

  8. Re: Big volumes, small files

    Jan-Frode Myklebust wrote:
    > [Yaniv's original post and the first part of Jan-Frode's reply quoted
    > in full - trimmed; see posts #1 and #7]
    >
    > I would also try to avoid NFS if possible. Having the clients mount the
    > fs's as GPFS clients (tcpip or SAN) will probably be much better, and
    > will avoid the bottlenecks and SPOFs of NFS.
    >


    What features does GPFS have that will make it better than NFS? GPFS is
    a good filesystem with cluster support, but I don't see anything special
    that will help with the large numbers of files the OP is trying to deal
    with.

    Pete

  9. Re: Big volumes, small files

    In article <1176904109.982831.76900@b75g2000hsg.googlegroups.com>,
    wrote:
    >Hello,
    >
    >Question is - can someone with experience with large filesystems and
    >many small files share their stories? Is it practical to base such a
    >solution on a few (8) large volumes, each with a single large filesystem
    >in it?


    You've already got Sun. Why not just migrate to ZFS? ZFS
    operations are constant speed regardless of size of file system or
    size of directory. Lots of other neat stuff too.


  10. Re: Big volumes, small files

    On Apr 22, 7:59 am, w...@panix.com (the wharf rat) wrote:
    > In article <1176904109.982831.76...@b75g2000hsg.googlegroups.com>,
    >
    > wrote:
    > >Hello,

    >
    > >Question is - can someone with experience with large filesystems and
    > >many small files share their stories? Is it practical to base such a
    > >solution on a few (8) large volumes, each with a single large filesystem
    > >in it?

    >
    > You've already got Sun. Why not just migrate to ZFS? ZFS
    > operations are constant speed regardless of size of file system or
    > size of directory. Lots of other neat stuff too.


    We have seen unexplained performance issues with NFS/ZFS. We've tried
    several configurations in which we ran the SPEC SFS test against
    identical systems, one exporting NFS over ZFS, the other over UFS.
    UFS's performance was an order of magnitude better.
    While there may be several explanations, we're still pretty worried
    about switching such a large installation to a relatively new
    filesystem, especially with the performance questions hovering above
    its head. We've followed several pieces of performance-tuning advice,
    including setting zil_disable, to no avail. UFS beat ZFS every time (in
    other tests, not SFS- and NFS-based, ZFS was superior).
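
    (If anyone wants a crude stand-in for what we're measuring without
    SPEC SFS, something like this quick Python sketch - the mount point
    and sample counts are of course made up - gives a rough small-file
    read rate to compare between the two exports:)

        # Time random reads of small files under a mount point; compare the
        # ZFS-backed and UFS-backed NFS exports (drop caches between runs).
        import os, random, time

        def small_file_read_rate(mountpoint, samples=1000, max_files=100_000):
            files = []
            for root, _dirs, names in os.walk(mountpoint):
                files.extend(os.path.join(root, n) for n in names)
                if len(files) >= max_files:   # don't walk the whole namespace
                    break
            picks = random.sample(files, min(samples, len(files)))
            start, total = time.time(), 0
            for path in picks:
                with open(path, "rb") as f:
                    total += len(f.read())
            return total / (time.time() - start) / 1e6   # MB/s

        # e.g. print(small_file_read_rate("/mnt/zfs_export"))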

    Anyone else with information about ZFS/NFS is very much invited to
    share his or her experience.


  11. Re: Big volumes, small files

    On 2007-04-21, Pete wrote:
    >>
    >> I would also try to avoid NFS if possible. Having the clients mount the
    >> fs's as GPFS clients (tcpip or SAN) will probably be much better, and
    >> will avoid the bottlenecks and SPOFs of NFS.

    >
    > What features does GPFS have that will make it better than NFS? GPFS is
    > a good filesystem with cluster support, but I don't see anything special
    > that will help with the large numbers of files the OP is trying to deal
    > with.


    Comparing GPFS to NFS is a bit apples to oranges, in that one is an access
    method to an underlying fs, while the other is a real fs. But, besides my
    own experience that NFS is slow at serving small files, I would point
    at a few features making it better than NFS for the OP's problem:

    With GPFS you have two modes of giving access to disk. Either you give
    all nodes direct access through the SAN, or you let only a subset of the
    nodes access it through the SAN and have them serve the disks as (in GPFS
    speak) Network Shared Disks (NSDs). These NSDs can be accessed directly
    on the SAN by the nodes that see them there, or over TCP/IP via a node
    that can access them on the SAN. A single NSD will typically have a
    primary and a secondary node serving it, to avoid a SPOF. So already here
    we have higher availability than NFS, in that there is no single node the
    client depends on.

    Further, once you have more than one NSD in the same file system, the nodes
    will typically load-balance the I/O over several NSD-serving nodes.

    Further, the NSD-serving nodes won't be busy with file system operations;
    as I believe the NSDs are more like network block devices, GPFS has
    distributed the filesystem operations away from the single NFS server
    and out to the clients. I would assume this will work especially well for
    the OP's problem, in that he's mostly doing reads, and so won't have
    to worry about overloading the lock manager.
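
    (A toy model of that primary/secondary NSD routing, purely to
    illustrate the idea - this is not the GPFS API, and the server and NSD
    names are made up:)

        # Each NSD has a primary and a secondary serving node; clients use
        # the primary unless it is down, and spread I/O across the NSDs.
        import random

        nsds = {
            "nsd1": ("server-a", "server-b"),
            "nsd2": ("server-b", "server-a"),
            "nsd3": ("server-a", "server-b"),
        }
        down = {"server-a"}     # pretend one serving node has failed

        def route(nsd):
            primary, secondary = nsds[nsd]
            return primary if primary not in down else secondary

        for nsd in random.choices(list(nsds), k=5):   # reads spread over NSDs
            print(f"read {nsd} via {route(nsd)}")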

    A random result from Google comparing GPFS/NSD to NFS:
    http://www.nus.edu.sg/comcen/svu/pub...erformance.pdf

    But finding benchmark results for small-file I/O is not easy; filesystems
    seem to be too focused on high-throughput, streaming I/O...


    -jf

  12. Re: Big volumes, small files

    On Fri, 20 Apr 2007 22:24:31 -0400, Bill Todd
    wrote:

    >Faeandar wrote:
    >
    >...
    >
    >> ZFS is great in concept and I think they are on the right path,
    >> however it's not yet ready for primetime imo.

    >
    >Though (as I already noted) I don't have any direct experience with it,
    >my impression is that people are using it in production systems
    >successfully - so a description of your specific reservations would be
    >useful.
    >
    >>
    >> The integrated integrity checking is extremely CPU intensive.

    >
    >I suspect that you're mistaken: IIRC it occurs as part of an
    >already-existing data copy operation at a very low level in the disk
    >read/write routines, and at close to memory-streaming speeds (i.e.,
    >mostly using CPU cycles that are being used anyway just to copy the data).


    According to Sun, the integrity check and file system self-healing
    process is a permanent background process as well as the foreground
    checks you mention. In the case of a system that is completely idle
    of actual IO the system hung at around 40% performing these
    consistency checks. When IO is going on it backs off to some extent
    but it's still a hog.
    There is no disk IO that is close to memory speeds. The consistency
    checks and verifications involve checking data on platter.

    >
    >> It does not cluster yet, at least not as of 2 weeks ago.

    >
    >It was not clear that this was a requirement in this case - but since
    >the OP mentioned clustering, I mentioned the soon-to-arrive capability.


    Soon-to-arrive means 1.0. It's worth noting points like that. While
    ZFS is great in design it is still new.

    >
    >> Many file systems grow dynamically so I would make that a check in
    >> ZFS's column.

    >
    >I'm not sure they grow dynamically quite as painlessly as ZFS does:
    >usually, you first have to arrange to expand the underlying disk storage
    >at the volume-manager level, and then have to incorporate the increase
    >in volume size into the file system.


    It depends on the system, but these days those tasks are fairly
    simple. ZFS gets this extreme ease of use by not having a RAID
    controller between itself and the disks, which means a jbod (not
    everyone is keen on that yet). If you put a RAID controller between
    them then Sun recommends turning off the consistency checking. A lot
    of what ZFS is depends on direct control of blocks.

    >
    >> No practical TB limit is a win if you need to go beyond 16TB in a
    >> single FS.
    >> I'm not sure I see how snapshots or journaling helps with backups.

    >
    >I should have added the word 'respectively', I guess: journaling helps
    >avoid the need for fsck, and snapshots help expedite backups (by
    >avoiding any need for down-time while making them).


    True, but my example of the NetApp filer demonstrates that even though
    you don't need downtime to do the backup, it can still be extremely
    painful in an environment like the one the OP describes.

    >
    >> It still has to map blocks to files, which is the long part of a
    >> backup.
    >> I know when NetApp backups occur it takes the snapshot and then tries
    >> to do a dump. If you have millions of files it can be hours before
    >> data is actually transferred, I believe ZFS is no different.

    >
    >Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
    >WAFL IIRC, though if WAFL does a good job of defragmenting files the
    >difference may not be too substantial). With the OP's 100 KB file
    >sizes, this means that each file can be accessed (backed up) with a
    >single disk access, yielding a fairly respectable backup bandwidth of
    >about 6 MB/sec (assuming that such an access takes about 16 ms. for a
    >7200 rpm drive, including transfer time, and that the associated
    >directory accesses can be batched during the scan).


    It's not the transfer I was referring to but rather the mapping (phase
    I and II of a dump). I believe ZFS still has to map the files to
    blocks even if it's a one to one ratio. At millions of files this can
    be painful. Once those phases are done the transfer rates are
    probably full pipe.
    Also, in the 100KB file to 128KB block ratio you lose what, 20% of
    your capacity? Big trade off in some environments.
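
    (Working that ratio out, on the assumption above that every ~100 KB
    file occupies a full 128 KB record - actual ZFS behaviour may differ:)

        file_size = 100 * 1024
        record_size = 128 * 1024

        slack = record_size - file_size
        print(f"{slack // 1024} KB wasted per file, "
              f"{slack / record_size:.0%} of each record, "
              f"{slack / file_size:.0%} extra raw capacity")
        # -> 28 KB wasted per file, 22% of each record, 28% extra raw capacity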

    >
    >>
    >> Since the OP's IO pattern is mostly reads the cpu load may not be an
    >> issue but writes suffer a serious penalty if you are not cpu-rich.

    >
    >I'm not sure why that would be the case even if the integrity-checking
    >*were* CPU-intensive, since the overhead to check the integrity on a
    >read should be just about the same as the overhead to generate the
    >checksum on a write. True, one must generate it all the way back up to
    >the system superblock for a write (one reason why I prefer a
    >log-oriented implementation that can defer and consolidate such
    >activity), but below the root unless you've got many of the
    >intermediate-level blocks cached you have to access and validate them on
    >each read (and with on the order of a billion files, my guess is that
    >needed directory data will quite frequently not be cached).


    In this case ZFS would also be doing the raid. If you're using a raid
    controller the rules change, as do the features.

    ~F

  13. Re: Big volumes, small files

    Faeandar wrote:
    > On Fri, 20 Apr 2007 22:24:31 -0400, Bill Todd
    > wrote:
    >
    >> Faeandar wrote:
    >>
    >> ...
    >>
    >>> ZFS is great in concept and I think they are on the right path,
    >>> however it's not yet ready for primetime imo.

    >> Though (as I already noted) I don't have any direct experience with it,
    >> my impression is that people are using it in production systems
    >> successfully - so a description of your specific reservations would be
    >> useful.
    >>
    >>> The integrated integrity checking is extremely CPU intensive.

    >> I suspect that you're mistaken: IIRC it occurs as part of an
    >> already-existing data copy operation at a very low level in the disk
    >> read/write routines, and at close to memory-streaming speeds (i.e.,
    >> mostly using CPU cycles that are being used anyway just to copy the data).

    >
    > According to Sun, the integrity check and file system self-healing
    > process is a permanent background process as well as the foreground
    > checks you mention.


    Yes, but there's no reason for that to take up very much in the way of
    resources (e.g., the last study I saw in this area indicated that a full
    integrity sweep once every couple of months was more than adequate to
    cut the incidence of latent errors - unnoticed corruption that jumps up
    to bite you after the *good* copy dies - down by at least an order of
    magnitude).

    > In the case of a system that is completely idle of actual IO the
    > system hung at around 40% performing these consistency checks.


    That's a ridiculous amount to use as the default (well, at least for
    production software - if they're still using pure idle time heavily to
    reassure customers due to ZFS's newness that might explain it), and I
    would be very surprised if it weren't at least tunable to a much lesser
    amount.

    > When IO is going on it backs off to some extent but it's still a hog.
    > There is no disk IO that is close to memory speeds. The consistency
    > checks and verifications involve checking data on platter.


    Of course they do, and I never suggested otherwise. What can move at
    close to memory speeds is the *CPU* overhead involved in the checks, and
    it can piggyback on a memory-to-memory data move that is happening
    anyway (such that few *extra* CPU cycles beyond what would already be
    consumed in the move are required).

    >
    >>> It does not cluster yet, at least not as of 2 weeks ago.

    >> It was not clear that this was a requirement in this case - but since
    >> the OP mentioned clustering, I mentioned the soon-to-arrive capability.

    >
    > Soon-to-arrive means 1.0. It's worth noting points like that. While
    > ZFS is great in design it is still new.


    Everything starts off new. The question is when a product becomes
    usable in production, and that's something that's measured far more by
    customer experience than by a clock.

    My impression is that *some* customers have workloads for which ZFS has
    already proven very stable, while others push corner cases that are
    still uncovering bugs (I haven't heard of any for a while that involve
    actual data corruption, but I haven't been paying close attention,
    either).

    >
    >>> Many file systems grow dynamically so I would make that a check in
    >>> ZFS's column.

    >> I'm not sure they grow dynamically quite as painlessly as ZFS does:
    >> usually, you first have to arrange to expand the underlying disk storage
    >> at the volume-manager level, and then have to incorporate the increase
    >> in volume size into the file system.

    >
    > It depends on the system, but these days those tasks are fairly
    > simple. ZFS gets this extreme ease of use by not having a RAID
    > controller between itself and the disks, which means a jbod (not
    > everyone is keen on that yet).


    Their loss, unless they need the raw single-operation low-latency
    write-through performance that NVRAM hardware assist can give to a
    hardware RAID box.

    ....

    >>> It still has to map blocks to files, which is the long part of a
    >>> backup.
    >>> I know when NetApp backups occur it takes the snapshot and then tries
    >>> to do a dump. If you have millions of files it can be hours before
    >>> data is actually transferred, I believe ZFS is no different.

    >> Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
    >> WAFL IIRC, though if WAFL does a good job of defragmenting files the
    >> difference may not be too substantial). With the OP's 100 KB file
    >> sizes, this means that each file can be accessed (backed up) with a
    >> single disk access, yielding a fairly respectable backup bandwidth of
    >> about 6 MB/sec (assuming that such an access takes about 16 ms. for a
    >> 7200 rpm drive, including transfer time, and that the associated
    >> directory accesses can be batched during the scan).

    >
    > It's not the transfer I was referring to but rather the mapping (phase
    > I and II of a dump). I believe ZFS still has to map the files to
    > blocks even if it's a one to one ratio.


    The one-to-one ratio is what makes the difference (at least in this
    particular case, and even in general the ratio is considerably better
    than a non-extent-based file system that uses a 4 KB block size).

    > At millions of files this can be painful.


    Not with ZFS in this instance, unless one constructs a pathological case
    with a deep directory structure and only one or two files mapped per
    deep path traversal: otherwise, the mapping can proceed at less than
    one mapping access per 100 KB file (if each leaf directory has multiple
    files to be mapped), plus the eventual transfer access itself.

    > Once those phases are done the transfer rates are probably full pipe.
    > Also, in the 100KB file to 128KB block ratio you lose what, 20% of
    > your capacity? Big trade off in some environments.


    But likely not in this one: it's just not that large a system, nor are
    the disks very expensive if they're SATA.

    >
    >>> Since the OP's IO pattern is mostly reads the cpu load may not be an
    >>> issue but writes suffer a serious penalty if you are not cpu-rich.

    >> I'm not sure why that would be the case even if the integrity-checking
    >> *were* CPU-intensive, since the overhead to check the integrity on a
    >> read should be just about the same as the overhead to generate the
    >> checksum on a write. True, one must generate it all the way back up to
    >> the system superblock for a write (one reason why I prefer a
    >> log-oriented implementation that can defer and consolidate such
    >> activity), but below the root unless you've got many of the
    >> intermediate-level blocks cached you have to access and validate them on
    >> each read (and with on the order of a billion files, my guess is that
    >> needed directory data will quite frequently not be cached).

    >
    > In this case ZFS would also be doing the raid. If you're using a raid
    > controller the rules change, as do the features.


    I have no idea how your comment is meant to relate to the material it's
    responding to above.

    - bill
