stat benchmark - Kernel


  1. stat benchmark

    At

    http://www.daimi.au.dk/~sandmann/stat-benchmark.c

    there is a simple program that will measure the time it takes to stat
    every file in the current directory with a cold cache.

    This is essentially what applications do in a number of common cases
    such as "ls -l", nautilus opening a directory, or an "open file"
    dialog being shown.
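
    (For reference, the heart of the benchmark is essentially the
    following -- a minimal sketch of the readdir()-then-stat() pattern,
    not the actual stat-benchmark.c, with error handling omitted:)

        #include <dirent.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/stat.h>
        #include <sys/time.h>

        static double now (void)
        {
            struct timeval tv;
            gettimeofday (&tv, NULL);
            return tv.tv_sec + tv.tv_usec / 1000000.0;
        }

        int main (void)
        {
            DIR *dir = opendir (".");
            struct dirent *de;
            char **names = NULL;
            int n = 0, i;
            double t = now ();

            /* Phase 1: read all directory entries */
            while ((de = readdir (dir)) != NULL)
            {
                names = realloc (names, (n + 1) * sizeof *names);
                names[n++] = strdup (de->d_name);
            }
            printf ("Time to readdir(): %f s\n", now () - t);

            /* Phase 2: stat each file, in readdir() order */
            t = now ();
            for (i = 0; i < n; i++)
            {
                struct stat st;
                stat (names[i], &st);
            }
            printf ("Time to stat %d files: %f s\n", n, now () - t);

            closedir (dir);
            return 0;
        }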

    Unfortunately, performance of that operation kinda sucks. On my system
    (ext3), it produces:

    c-24-61-65-93:~% sudo ./a.out
    Time to readdir(): 0.307671 s
    Time to stat 2349 files: 8.203693 s

    8 seconds is about 80 times slower than what a user perceives as
    "instantly" and slow enough that we really should display a progress
    bar if it can't be fixed.

    So I am looking for ways to improve this.

    Under the theory that disk seeks are killing us, one idea is to add a
    'multistat' system call that would allow statting of many files at a
    time, which would give the disk scheduler more to work with.

    Possibly the same thing would need to be done for the getxattr
    information.

    Does this sound like a reasonable idea?


    Thanks,
    Soren

  2. Re: stat benchmark

    On Thu, Apr 24, 2008 at 10:59 PM, Soeren Sandmann wrote:
    [ about programs reading all inodes after readdir ]

    > Unfortunately, performance of that operation kinda sucks. On my system
    > (ext3), it produces:
    >
    > c-24-61-65-93:~% sudo ./a.out
    > Time to readdir(): 0.307671 s
    > Time to stat 2349 files: 8.203693 s
    >
    > 8 seconds is about 80 times slower than what a user perceives as
    > "instantly" and slow enough that we really should display a progress
    > bar if it can't be fixed.
    >
    > So I am looking for ways to improve this.
    >
    > Under the theory that disk seeks are killing us, one idea is to add a
    > 'multistat' system call that would allow statting of many files at a
    > time, which would give the disk scheduler more to work with.


    I have experimented with the same problem, and another idea is to
    reorder the results from readdir(), which has given me good results.

    This works because:
    - For most filesystems there is a high correlation between the inode
    number and the sector on the disk.

    - Most programs like your example handle the files in the order that
    they are returned from readdir

    - The time spent sorting is very small compared to the disk seeks

    There are several possible ways to implement this:

    - reorder the dirents in the kernel for each getdents call

    - reorder the dirents in user space, for example by running
    qsort in a libc wrapper (see the sketch below)

    - in the file system, optimize the order before writing back a dirty directory

    This applies not only to programs that merely stat files, but also to
    programs that read them, such as file indexers, backups (tar), and
    Nautilus generating thumbnails from JPEGs.
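
    A rough sketch of the user-space variant (it assumes d_ino correlates
    with on-disk location, which holds for ext2/3 but not necessarily for
    every filesystem; error handling omitted):

        #include <dirent.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/stat.h>

        struct ent { ino_t ino; char name[256]; };

        static int cmp_ino (const void *a, const void *b)
        {
            const struct ent *x = a, *y = b;
            return (x->ino > y->ino) - (x->ino < y->ino);
        }

        static void stat_sorted (const char *path)
        {
            DIR *dir = opendir (path);
            struct dirent *de;
            struct ent *ents = NULL;
            size_t i, n = 0;

            /* collect the entries... */
            while ((de = readdir (dir)) != NULL)
            {
                ents = realloc (ents, (n + 1) * sizeof *ents);
                ents[n].ino = de->d_ino;
                strncpy (ents[n].name, de->d_name,
                         sizeof ents[n].name - 1);
                ents[n].name[sizeof ents[n].name - 1] = '\0';
                n++;
            }

            /* ...sort them by inode number... */
            qsort (ents, n, sizeof *ents, cmp_ino);

            /* ...and stat in sorted order, relative to the directory */
            for (i = 0; i < n; i++)
            {
                struct stat st;
                fstatat (dirfd (dir), ents[i].name, &st, 0);
            }

            closedir (dir);
            free (ents);
        }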

    --
    Carl Henrik

  3. Re: stat benchmark


    On Thursday 2008-04-24 22:59, Soeren Sandmann wrote:
    >
    > http://www.daimi.au.dk/~sandmann/stat-benchmark.c
    >
    >there is a simple program that will measure the time it takes to stat
    >every file in the current directory with a cold cache.
    >
    >This is essentially what applications do in a number of common cases
    >such as "ls -l", nautilus opening a directory, or an "open file"
    >dialog being showed.


    Mh, somewhat. Some programs/libraries stat unnecessarily -- though
    nautilus is excluded from that criticism.

    >
    >Unfortunately, performance of that operation kinda sucks. On my system
    >(ext3), it produces:
    >
    > c-24-61-65-93:~% sudo ./a.out
    > Time to readdir(): 0.307671 s
    > Time to stat 2349 files: 8.203693 s
    >
    >8 seconds is about 80 times slower than what a user perceives as
    >"instantly" and slow enough that we really should display a progress
    >bar if it can't be fixed.
    >
    >So I am looking for ways to improve this.
    >
    >Under the theory that disk seeks are killing us, one idea is to add a
    >'multistat' system call that would allow statting of many files at a
    >time, which would give the disk scheduler more to work with.
    >
    >Possibly the same thing would need to be done for the getxattr
    >information.
    >
    >Does this sound like a reasonable idea?
    >


    XFS for example already has bulkstat, but it is not really
    exported to userspace :-/

  4. Re: stat benchmark

    Jan Engelhardt wrote:
    >
    > XFS for example already has bulkstat, but it is not really
    > exported to userspace :-/


    Perhaps this could be in VFS, and used if present?

    -justinb

    --
    Justin Banks
    BakBone Software
    justinb@bakbone.com

  5. Re: stat benchmark

    On Thu, Apr 24, 2008 at 11:44:04PM +0200, Jan Engelhardt wrote:
    > XFS for example already has bulkstat, but it is not really
    > exported to userspace :-/


    It is available through an ugly ioctl. It doesn't really help in this
    use-case, because bulkstat iterates over all inodes of a given filesystem
    and not the inodes of a directory.
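
    For the curious, the call looks roughly like this (an untested sketch
    following xfsprogs' xfs_fs.h; the structures may differ between
    versions, and note again that it walks the whole filesystem, not one
    directory):

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <xfs/xfs_fs.h>   /* from xfsprogs */

        static void bulkstat_all (const char *mountpoint)
        {
            int fd = open (mountpoint, O_RDONLY);
            struct xfs_bstat buf[128];
            struct xfs_fsop_bulkreq req;
            __u64 lastino = 0;
            __s32 count, i;

            req.lastip  = &lastino;  /* resume cursor, updated by kernel */
            req.icount  = 128;       /* max entries returned per call */
            req.ubuffer = buf;
            req.ocount  = &count;

            while (ioctl (fd, XFS_IOC_FSBULKSTAT, &req) == 0 && count > 0)
            {
                for (i = 0; i < count; i++)
                    printf ("ino %llu size %lld\n",
                            (unsigned long long) buf[i].bs_ino,
                            (long long) buf[i].bs_size);
            }
            close (fd);
        }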


  6. Re: stat benchmark

    On Thu, Apr 24, 2008 at 10:59 PM, Soeren Sandmann wrote:
    [...]
    > Under the theory that disk seeks are killing us, one idea is to add a
    > 'multistat' system call that would allow statting of many files at a
    > time, which would give the disk scheduler more to work with.


    Instead of a specific system call for "multistat" it would be interesting
    to look at syslets. I'm not sure what the current state of that work is;
    the last reference I can find is http://lkml.org/lkml/2007/12/6/336 and,
    more interestingly, http://lkml.org/lkml/2008/1/9/327

    I guess it would be difficult to get close to the optimal disk schedule
    using syslets: if a directory contains 1000 files, that would require
    1000 concurrent syslets plus a good I/O scheduler, which seems unlikely
    to be feasible.

    I've added Zach Brown to Cc because he authored the last syslet patchset.

    --
    Carl Henrik

  7. Re: stat benchmark

    Theodore Tso writes:

    > On Thu, Apr 24, 2008 at 10:59:10PM +0200, Soeren Sandmann wrote:
    > >
    > > Under the theory that disk seeks are killing us, one idea is to add a
    > > 'multistat' system call that would allow statting of many files at a
    > > time, which would give the disk scheduler more to work with.

    >
    > Why don't you try this version of your stat-benchmark first? If you
    > give it the -s option, it will sort the files by inode number first.
    > I think you will find this should make a significant difference.


    Yeah, Carl suggested this as well.

    Sorting by inode is a major improvement. The numbers are less stable,
    but consistently much lower:

    Time to readdir(): 0.238737 s
    Time to stat 2366 files: 1.338904 s

    compared to

    Time to readdir(): 0.227599 s
    Time to stat 2366 files: 7.981752 s

    Of course, 1.3 seconds is still far from instant, but it may be the
    best we can get given the realities of ext3 disk layout.

    One thing that surprised me is that lstat is slightly slower than
    stat. With lstat() instead of stat(), I get:

    Time to stat 2366 files: 1.472115 s
    Time to readdir(): 1.735542 s

    I naively thought that stat() was a superset of lstat(), but
    apparently not.

    > If it works, something that would be really great if someone were to
    > make a generic library which could be used instead of readdir(). I
    > have something which works as an LD_PRELOAD, but apparently it's been
    > blowing up on 64-bit systems, and I haven't had time to debug it.
    > It's probably better to do it as a library which userspace
    > applications link against, anyway. Would you or someone you know be
    > interested in maybe taking this idea and running with it?


    After I read Carl's mail I looked at what glib - which is used by both
    Nautilus and the gtk+ file open dialog - does, and in fact the latest
    version does actually read chunks of 1000 dirents and sorts them by
    inode. The version I had installed when I wrote the benchmark just
    stat'ed in readdir() order.

    For a directory of ~2360 files, chunks of 1000 files are actually
    surprisingly worse than statting all of the files at once:

    Time to stat 1000 files: 1.008735 s
    Time to stat 1000 files: 0.738936 s
    Time to stat 366 files: 0.217002 s

    I guess this just shows that seeks really are pretty much all that
    matters. Glib should maybe use a larger chunk size.

    I don't know if a general library outside glib would be useful. It
    seems that just telling people to "sort by inode before statting"
    would be just as effective as telling them "use this optimized
    library".



    Thanks,
    Soren

  8. Re: stat benchmark

    On Mon, Apr 28, 2008 at 1:29 AM, Soeren Sandmann wrote:
    [...]
    > For a directory of ~2360 files, chunks of 1000 files are actually
    > surprisingly worse than statting all of the files at once:
    >
    > Time to stat 1000 files: 1.008735 s
    > Time to stat 1000 files: 0.738936 s
    > Time to stat 366 files: 0.217002 s
    >
    > I guess this just shows that seeks really are pretty much all that
    > matters. Glib should maybe use a larger chunk size.


    I agree; if I remember correctly, I did not find a directory on my local
    disk where sorting a chunk gave a better result than sorting the complete
    directory.

    > I don't know if a general library outside glib would be useful. It
    > seems that just telling people to "sort by inode before statting"
    > would be just as effective as telling them "use this optimized
    > library".


    A library (like glib) could disable the feature for solid-state drives
    and perhaps implement an alternative strategy for filesystems without
    any inode/sector correlation.
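
    (Recent kernels expose a rotational flag in sysfs that a library could
    use for this -- a sketch, assuming /sys/block/<dev>/queue/rotational is
    present and the file-to-device mapping is handled elsewhere:)

        #include <stdio.h>

        /* Returns 1 if the device looks like a spinning disk, so
         * sorting by inode is worthwhile; defaults to 1 when the
         * flag is unavailable. */
        static int is_rotational (const char *dev)   /* e.g. "sda" */
        {
            char path[256];
            int c = '1';
            FILE *f;

            snprintf (path, sizeof path,
                      "/sys/block/%s/queue/rotational", dev);
            f = fopen (path, "r");
            if (f)
            {
                c = fgetc (f);
                fclose (f);
            }
            return c == '1';
        }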

    So I think we should tell people to use glib. :-)

    --
    Carl Henrik

  9. Re: stat benchmark

    On Mon, Apr 28, 2008 at 01:29:52AM +0200, Soeren Sandmann wrote:
    > Sorting by inode is a major improvement. The numbers are less stable,
    > but consistently much lower:
    >
    > Time to readdir(): 0.238737 s
    > Time to stat 2366 files: 1.338904 s
    >
    > compared to
    >
    > Time to readdir(): 0.227599 s
    > Time to stat 2366 files: 7.981752 s
    >
    > Of course, 1.3 seconds is still far from instant, but it may be the
    > best we can get given the realities of ext3 disk layout.


    Out of curiosity, what was the directory that you were statting? If it
    took you 1.3 seconds to stat 2366 files, the directory must have inodes
    scattered all over the disk, or the disk must be very slow. On my laptop
    disk, I can stat 9543 files in 1.1 seconds (from a Maildir directory).

    Also, why does the application need to stat all of the files? Is it
    just to get the file type? (i.e., regular file vs. directory) If so,
    maybe you can use the d_type field in the directory entry returned by
    readdir().
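
    (Something like the following -- a sketch; note that d_type can be
    DT_UNKNOWN on some filesystems, so a stat() fallback is still needed:)

        #include <dirent.h>
        #include <sys/stat.h>

        /* Is this entry a directory? Use d_type when the filesystem
         * fills it in, and fall back to fstatat() when it doesn't. */
        static int is_dir (int dfd, const struct dirent *de)
        {
            struct stat st;

            if (de->d_type != DT_UNKNOWN)
                return de->d_type == DT_DIR;
            if (fstatat (dfd, de->d_name, &st, 0) == 0)
                return S_ISDIR (st.st_mode);
            return 0;
        }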

    > I don't know if a general library outside glib would be useful. It
    > seems that just telling people to "sort by inode before statting"
    > would be just as effective as telling them "use this optimized
    > library".


    Well, the question is what we would need to do in order to make it
    really easy for people to drop that into their code. Programmers are
    fundamentally lazy, after all, and if it's too much work to create an
    interim data structure and then qsort it, they won't. But maybe the
    glib interface is that convenient interface, and all we need to do is
    change glib to sort with a much larger chunk size.

    We do need to get similar changes into find, ls, and many other
    programs that might not be so interested in linking against glib,
    though.

    - Ted

  10. Re: stat benchmark

    On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann wrote:
    > So I am looking for ways to improve this.


    Aside from what has already been proposed, there is also the
    readdirplus() route. Unfortunately the people behind this and related
    proposals vanished after the last discussions. I was hoping they would
    come back with a revised proposal, but perhaps not. Maybe it's time to
    pick up the ball myself.

    As a reminder, readdirplus() is an extended readdir() which also
    returns (a subset of) the stat information for the file at the same
    time. The subset part is needed to account for the different
    information contained in the inodes. For most applications the subset
    should be sufficient and therefore all that's needed is a single
    iteration over the directory.
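
    (Purely illustrative, such an interface might look like this -- the
    names and the particular stat subset below are made up; no such
    syscall exists:)

        #include <dirent.h>
        #include <stddef.h>
        #include <sys/types.h>

        /* HYPOTHETICAL -- a sketch of what a readdirplus() interface
         * might look like, not an actual proposal or existing API. */
        struct dirent_plus
        {
            struct dirent d;         /* the usual name/inode/type */
            unsigned int  dp_valid;  /* bitmask: which fields below
                                        the filesystem filled in */
            mode_t        dp_mode;
            off_t         dp_size;
            time_t        dp_mtime;
        };

        /* Fill up to 'count' entries from the open directory 'dfd';
         * returns the number of entries filled, 0 at end of
         * directory, -1 on error. */
        int readdirplus (int dfd, struct dirent_plus *buf, size_t count);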

  11. Re: stat benchmark

    On Sun, Apr 27, 2008 at 09:43:05PM -0700, Ulrich Drepper wrote:
    > On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann wrote:
    > > So I am looking for ways to improve this.

    >
    > Aside from what has already been proposed, there is also the
    > readdirplus() route. Unfortunately the people behind this and related
    > proposals vanished after the last discussions. I was hoping they would
    > come back with a revised proposal, but perhaps not. Maybe it's time to
    > pick up the ball myself.
    >
    > As a reminder, readdirplus() is an extended readdir() which also
    > returns (a subset of) the stat information for the file at the same
    > time. The subset part is needed to account for the different
    > information contained in the inodes. For most applications the subset
    > should be sufficient and therefore all that's needed is a single
    > iteration over the directory.


    I'm not sure this would help in the cold cache case, which is what
    Soeren originally complained about.[1] The problem is that whatever
    information the user might need won't be stored in the directory, so
    the filesystem would end up having to stat the file anyway, incurring a
    disk seek, which was what the user was complaining about. A
    readdirplus() would save a whole bunch of system calls if the inode
    was already cached, yes, but I'm not sure it would be worth the
    effort given how small Linux's system call overhead is. But in
    the cold cache case, you end up seeking all over the disk, and the
    only thing you can do is to try to keep the inodes close to each
    other, and to have either readdir() or the caller of readdir() sort
    all of the returned directory entries by inode number to avoid seeking
    all over the disk.

    - Ted


    [1] At least, not without making filesystem modifications; the problem
    is that if you're going to store the inode information (or part of the
    inode information) in the directory, it has to get updated each time
    the inode gets updated, and that gets painful with hardlinks. There
    was a Usenix paper which explored this idea over ten years ago, and
    it's possible --- but then all directory entries need to be on a linked
    list so that you can either (a) find the "master" directory entry
    which contains the inode information and be able to migrate inode
    information to another directory entry when the original "master"
    dirent gets deleted, or (b) you keep the subset of the inode
    information in all of the directory entries of the hardlink, and have
    to update them all each time the inode information changes.



  12. Re: stat benchmark

    Theodore Tso wrote:
    >> Aside from what has already been proposed, there is also the
    >> readdirplus() route. Unfortunately the people behind this and related
    >> proposals vanished after the last discussions. I was hoping they would
    >> come back with a revised proposal, but perhaps not. Maybe it's time to
    >> pick up the ball myself.
    >>
    >> As a reminder, readdirplus() is an extended readdir() which also
    >> returns (a subset of) the stat information for the file at the same
    >> time. The subset part is needed to account for the different
    >> information contained in the inodes. For most applications the subset
    >> should be sufficient and therefore all that's needed is a single
    >> iteration over the directory.
    >>

    >
    > I'm not sure this would help in the cold cache case, which is what
    > Soeren originally complained about.[1] The problem is that whatever
    > information the user might need won't be stored in the directory, so
    > the filesystem would end up having to stat the file anyway, incurring a
    > disk seek, which was what the user was complaining about. A
    > readdirplus() would save a whole bunch of system calls if the inode
    > was already cached, yes, but I'm not sure it would be worth the
    > effort given how small Linux's system call overhead is. But in
    > the cold cache case, you end up seeking all over the disk, and the
    > only thing you can do is to try to keep the inodes close to each
    > other, and to have either readdir() or the caller of readdir() sort
    > all of the returned directory entries by inode number to avoid seeking
    > all over the disk.
    >


    A readdirplus() could sort the inodes according to the filesystem's
    layout, and additionally issue the stats in parallel, so if you have
    multiple spindles you get significant additional speedup.

    --
    error compiling committee.c: too many arguments to function


  13. Re: stat benchmark

    On Mon, Apr 28, 2008 at 02:59:26PM +0300, Avi Kivity wrote:
    >
    > A readdirplus() could sort the inodes according to the filesystem's layout,
    > and additionally issue the stats in parallel, so if you have multiple
    > spindles you get significant additional speedup.
    >


    Well, sorting inodes is something readdir() could do, but it takes
    potentially an unbounded amount of (non-swappable) kernel memory.
    That's why it's messy.

    And it wouldn't be hard to have flags (set by fcntl()) on the fd
    returned by readdir() which request inode cache prefetching,
    which would have the same net result, without needing to design a new
    system call. So before designing a new system call, it might be worth
    it to do some benchmarks to see how much of the cost is really in the
    number of stat() system calls, in the warm cache case.

    - Ted

  14. Re: stat benchmark

    On Mon, Apr 28, 2008 at 07:53:22AM -0400, Theodore Tso wrote:
    > On Sun, Apr 27, 2008 at 09:43:05PM -0700, Ulrich Drepper wrote:
    > > On Thu, Apr 24, 2008 at 1:59 PM, Soeren Sandmann wrote:
    > > > So I am looking for ways to improve this.

    > >
    > > Aside from what has already been proposed, there is also the
    > > readdirplus() route. Unfortunately the people behind this and related
    > > proposals vanished after the last discussions. I was hoping they would
    > > come back with a revised proposal, but perhaps not. Maybe it's time to
    > > pick up the ball myself.
    > >
    > > As a reminder, readdirplus() is an extended readdir() which also
    > > returns (a subset of) the stat information for the file at the same
    > > time. The subset part is needed to account for the different
    > > information contained in the inodes. For most applications the subset
    > > should be sufficient and therefore all that's needed is a single
    > > iteration over the directory.

    >
    > I'm not sure this would help in the cold cache case, which is what
    > Soeren originally complained about.[1] The problem is that whatever
    > information the user might need won't be stored in the directory, so
    > the filesystem would end up having to stat the file anyway, incurring a
    > disk seek, which was what the user was complaining about. A
    > readdirplus() would save a whole bunch of system calls if the inode
    > was already cached, yes, but I'm not sure it would be worth the
    > effort given how small Linux's system call overhead is. But in
    > the cold cache case, you end up seeking all over the disk, and the
    > only thing you can do is to try to keep the inodes close to each
    > other, and to have either readdir() or the caller of readdir() sort
    > all of the returned directory entries by inode number to avoid seeking
    > all over the disk.


    The other reason for something like a readdirplus or a bulk stat is to
    provide an opportunity for parallelism.

    As my favorite example: a cold-cache "git diff" of a linux tree on my
    desktop (with an nfs-mounted /home) takes about 12 seconds. That's
    mainly just a sequential stat of about 24000 files. Patching git
    to issue the stats in parallel, I could get that down to about 3.5
    seconds. (Still not great. I don't know whether it's disk seeks on the
    server or something else that is the limiting factor.)
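
    (The shape of the change, not the actual git patch -- a sketch that
    issues the stats from a fixed pool of threads so the NFS client sees
    many requests in flight at once:)

        #include <pthread.h>
        #include <sys/stat.h>

        #define NTHREADS 16

        static char **g_paths;
        static int g_npaths;
        static int g_next;
        static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Each worker grabs the next unclaimed index and stats it. */
        static void *worker (void *arg)
        {
            (void) arg;
            for (;;)
            {
                struct stat st;
                int i;

                pthread_mutex_lock (&g_lock);
                i = g_next++;
                pthread_mutex_unlock (&g_lock);
                if (i >= g_npaths)
                    return NULL;
                stat (g_paths[i], &st);  /* result collection omitted */
            }
        }

        static void stat_parallel (char **paths, int n)
        {
            pthread_t tid[NTHREADS];
            int i;

            g_paths = paths;
            g_npaths = n;
            g_next = 0;
            for (i = 0; i < NTHREADS; i++)
                pthread_create (&tid[i], NULL, worker, NULL);
            for (i = 0; i < NTHREADS; i++)
                pthread_join (tid[i], NULL);
        }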

    In the case of git, it's looking just for files that it tracks--it's not
    reading whole directories--so I don't know if readdirplus() specifically
    would help.

    --b.

  15. Re: stat benchmark


    > I guess it would be difficult to get close to the optimal disk schedule
    > using syslets: if a directory contains 1000 files, that would require
    > 1000 concurrent syslets plus a good I/O scheduler, which seems unlikely
    > to be feasible.


    It wouldn't even get to the I/O scheduler. The VFS stat path (see
    real_lookup()) is synchronous and serialized. Each stat will hold the
    i_mutex of the parent directory while it waits for the file system's
    lookup method to populate the inode from disk.

    - z

  16. Re: stat benchmark

    On Mon, 2008-04-28 at 02:13 +0200, Carl Henrik Lunde wrote:
    > On Mon, Apr 28, 2008 at 1:29 AM, Soeren Sandmann wrote:
    > [...]
    > > For a directory of ~2360 files, chunks of 1000 files are actually
    > > surprisingly worse than statting all of the files at once:
    > >
    > > Time to stat 1000 files: 1.008735 s
    > > Time to stat 1000 files: 0.738936 s
    > > Time to stat 366 files: 0.217002 s
    > >
    > > I guess this just shows that seeks really are pretty much all that
    > > matters. Glib should maybe use a larger chunk size.

    >
    > I agree; if I remember correctly, I did not find a directory on my local
    > disk where sorting a chunk gave a better result than sorting the complete
    > directory.


    I don't think that is expected either. The reason for the chunking is to
    avoid unlimited memory use on large directories.

