
Thread: do_sync() and XFSQA test 182 failures....

  1. do_sync() and XFSQA test 182 failures....

    Folks,

    I think I've finally found the bug(s) that is causing XFSQA test 182
    to fail. Test 182 writes a bunch of files, then runs sync, then
    shuts the filesystem down. It then unmounts and remounts the fs and
    checks that all the files are the correct length and have the
    correct contents. i.e. that sync semantics have been correctly
    observed.

    The cause of the failure is that log recovery is replaying inode
    allocation and overwriting the last inode writes that contained
    unlogged changes (i.e. the inode size update that occurs at I/O
    completion).

    The problem is that we've never been able to work out why recovery
    is doing this. What has been nagging at the back of my mind for
    quite some time is the fact that we do actually write these inodes
    to disk and that should allow the tail of the log to move forward
    past the inode allocation transaction and hence it should not be
    replayed during recovery.

    A solution that has been proposed in the past (by Lachlan) is to log
    the inode size updates instead of writing the inode to disk. In
    that case, recovery also replays the inode modification transactions
    and so we don't lose anything. It is a solution that would fix the
    problem. However, always logging inodes instead of writing unlogged
    changes has other performance implications that we'd prefer to avoid
    (i.e. the number of extra transactions it will cause).

    This solution also seemed to me to be papering over the real problem,
    which we hadn't yet found, because it did not explain why we were
    replaying an allocation that we should not need to. Hence the
    problem has gone unfixed since Lachlan first discovered it, despite
    several attempts to get to the bottom of it.
    Now I think I finally have.

    I started by instrumenting the sync code and the inode dirtying and
    writeback code to confirm the order of data, inode and sync
    operations, with a view to understanding why the tail of the log was
    not moving forwards when the inode clusters were written out during
    the sync. To start with, let's look at what do_sync() does:

    static void do_sync(unsigned long wait)
    {
            wakeup_pdflush(0);
            sync_inodes(0);         /* All mappings, inodes and their blockdevs */
            DQUOT_SYNC(NULL);
            sync_supers();          /* Write the superblocks */
            sync_filesystems(0);    /* Start syncing the filesystems */
            sync_filesystems(wait); /* Waitingly sync the filesystems */
            sync_inodes(wait);      /* Mappings, inodes and blockdevs, again. */
            if (!wait)
                    printk("Emergency Sync complete\n");
            if (unlikely(laptop_mode))
                    laptop_sync_completion();
    }

    Let's translate this into what XFS does:

    wakeup_pdflush(0) [*]  - run a concurrent background sync of the
                             fs via pdflush.

    sync_inodes(0)         - walks the superblock dirty inode list
                             doing an async flush of inodes and their
                             data.

    sync_supers()          - writes the superblock, forces the log
                             to disk.

    sync_filesystems(0)    - non-blocking filesystem sync. XFS writes
                             the superblock.

    sync_filesystems(1)    - XFS writes all dirty data to disk and
                             waits for it. Dirties the superblock and
                             the log. Does not write inodes.

    sync_inodes(1)         - walks the superblock dirty inode list
                             *twice*, first doing an async flush of
                             dirty data and inodes, then a sync flush
                             of the remaining dirty data and inodes.

    [*] Starting pdflush to sync data in the background when we are
    about to start flushing ourselves is self-defeating. Instead of
    having a single thread doing optimal writeout patterns, we now
    have two threads trying to sync the same filesystems, competing
    with each other to write out dirty inodes. This actually causes
    bugs in sync because pdflush does async flushes. Hence, if pdflush
    races and wins during the sync-flush part of the sync process,
    sync_inodes(1) will return before all the data/metadata is on
    disk, because it can't be found to be waited on.

    Now the sync is _supposedly_ complete. But we still have a dirty
    log and superblock thanks to delayed allocation that may have
    occurred after the sync_supers() call. Hence we can immediately
    see that we cannot *ever* do a proper sync of an XFS filesystem
    in Linux without modifying do_sync() to do more callouts.

    Worse, XFS can also still have *dirty inodes* because sync_inodes(1)
    will remove inodes from the dirty list in the async pass, but they
    can get dirtied a short time later (if they had dirty data) when the
    data I/O completes. Hence if the second sync pass completes before
    the inode is dirtied again we'll miss flushing it. This will mean we
    don't write inode size updates during sync. This is the same race
    that pdflush running in the background can trigger.

    Clearly this is broken, but this particular problem is an XFS bug
    and is fixed by XFS marking the inode dirty before I/O dispatch if
    the end offset of the I/O is beyond the current EOF so there is no
    window where the inode is temporarily clean. This, however, does
    not fix the race condition between the sync thread and pdflush,
    just the async-then-sync problem within the flush thread.

    Back to do_sync(), the order of operations we need to reliably sync
    a journalling filesystem that uses delayed allocation and updates
    metadata on data I/O completion is effectively as follows:

    - flush all dirty data
    - wait for all metadata updates caused by data flush to
    complete
    - force unwritten async transactions to disk to unpin dirty metadata
    - flush all dirty metadata
    - write the superblock

    In generic speak, this effectively requires:

    sync_filesystems(0) [**]
    sync_filesystems(1)
    sync_supers()
    sync_inodes(1) [***]
    sync_supers()

    [**] async flush optimisation
    [***] async flush optimisation is implemented internally to
    sync_inodes() for sync flushes.

    This leads to the following callouts and the behaviour that XFS
    would need for the callouts:

    sync_filesystems(0)
        ->sync_fs()       - async flush all dirty data
    sync_filesystems(1)
        ->sync_fs()       - sync flush remaining dirty data
    sync_supers()
        ->write_super()   - write super, force the log
    sync_inodes(1) [****]
        sync_inodes_sb(0) - async flush of dirty inodes
        sync_inodes_sb(1) - sync flush of remaining inodes
    sync_supers()
        ->write_super()   - write sb, force the log.

    [****] sync_inodes() really needs to fall down to a ->sync_inodes()
    callout for the filesystem to be able to implement an optimal
    inode flushing strategy.

    However, even with this order in place, test 182 still fails.

    So I looked at the filesystem prior to log recovery (mount -o
    ro,norecovery) and saw that all the data is on disk, all the inode
    sizes are correct, the superblock is up to date and everything looks
    OK. That is, the sync did everything it was supposed to and the
    above order of writing out the filesystem is working correctly.

    As soon as I ran recovery, though, I saw a small number of inodes
    go back to having an inode size of zero - they regress. The reason
    for this is that the log tail location (l_tail_lsn) at the end of
    the sync was not updated on disk, and hence recovery is replaying
    transactions.

    At this point I wondered if the log covering code was not working
    properly. I'd never really looked at it in any detail, and as soon
    as I read the description I knew that it was not working. The
    problem log covering is supposed to solve is as follows (from
    fs/xfs/xfs_log_priv.h):

    /*
     * These states are used to insert dummy log entries to cover
     * space allocation transactions which can undo non-transactional
     * changes after a crash. Writes to a file with space already
     * allocated do not result in any transactions. Allocations might
     * include space beyond the EOF. So if we just push the EOF a
     * little, the last transaction for the file could contain the
     * wrong size. If there is no file system activity, after an
     * allocation transaction, and the system crashes, the allocation
     * transaction will get replayed and the file will be truncated.
     * This could be hours/days/... after the allocation occurred.
     */

    Immediately it was obvious that we're seeing the above problem,
    and that log covering is a method for ensuring that the state of
    the log on disk is the same as the in-memory state at the end of
    a sync.

    Hence, as the last part of the sync, we need to try to cover the
    log with a dummy transaction that updates the real location of the
    log tail in the log. Then we will no longer replay the inode
    allocation transactions, because the tail in the log matches the
    in-memory state after the inodes have been flushed.

    With the current do_sync() code, we have no callout once the inodes
    are written in which to issue a dummy transaction to cover the log
    correctly. The do_sync() process needs to end with a sync_supers()
    to get the correct callout to XFS to allow this to happen. i.e.
    whenever we try to write the superblock we should also be trying to
    initiate the log covering process, and we can't do this right now.
    Once the log is covered, the recovery-overwriting-inodes problem
    goes away because recovery is not needed.

    Everyone understand the problem now?



    FWIW, XFS has had this log covering code since, well, forever. It
    came from Irix, and it worked on Irix. I don't think it has ever
    worked on Linux, though, because of the lack of a sync_supers()
    call at the end of do_sync(1). We just never noticed it until we
    fixed the infamous NULL files problem in 2.6.22, which had hidden
    this particular cause of file size mismatches after a crash.

    With a bunch of hacks in place, test 182 now passes and sync(1) on
    XFS finally does what it is supposed to. I'm not going to post the
    hacky, full-of-garbage, debuggy patch I have that I used to discover
    this - I'll clean it up first to just have the bits needed to fix
    the problem, then post it. That'll be tomorrow....

    However, I have a problem - I'm an expert in XFS, not the tens of
    other Linux filesystems, so I can't begin to guess what the impact
    of changing do_sync() would be on all of them. How many filesystems
    would such a change break? Indeed - how many are broken right now
    by having dirty inodes and superblocks slip through sync(1)?

    And then the big question - how the hell does one test such a change?

    I can test XFS easily enough because it has shutdown ioctls that
    effectively simulate a power failure - that's what test 182 uses. I
    don't think any other filesystem has such an ioctl, though, and I
    don't have the time or hardware to repeatedly crash test every
    filesystem out there to prove that a change to do_sync() doesn't
    negatively impact them.

    What are the alternatives? do_sync() operates above any particular
    filesystem, so it's hard to provide a filesystem-specific ->do_sync
    method to avoid changing sync order for all filesystems. Do we
    change do_sync() to completely sync a superblock at a time, instead
    of doing each operation across all superblocks before moving onto
    the next operation? Is there any particular reason (e.g.
    performance, locking) for the current method that would prevent
    changing to a completely-sync-a-superblock iteration algorithm so
    we can provide a custom ->do_sync method?

    Are there any other ways that we can get a custom ->do_sync
    method for XFS? I'd prefer a custom method so we don't have to
    revalidate every Linux filesystem, especially as XFS already has
    everything it needs to provide its own sync method (used for
    freezing) and a test suite to validate it is working correctly.....

    Are there any other options for solving this?

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: do_sync() and XFSQA test 182 failures....

    On Thu, Oct 30, 2008 at 07:50:20PM +1100, Dave Chinner wrote:
    >[*] Starting pdflush to sync data in the background when we are
    > about to start flushing ourselves is self-defeating. instead of
    > having a single thread doing optimal writeout patterns, we now
    > have two threads trying to sync the same filesystems and
    > competing with each other to write out dirty inodes. This
    > actually causes bugs in sync because pdflush is doing async
    > flushes. Hence if pdflush races and wins during the sync flush
    > part of the sync process, sync_inodes(1) will return before all
    > the data/metadata is on disk because it can't be found to be
    > waited on.


    Note that this is an XFS special. Every other filesystem (at least the
    major ones) rely on the VFS to write out data.

    > Now the sync is _supposedly_ complete. But we still have a dirty
    > log and superblock thanks to delayed allocation that may have
    > occurred after the sync_supers() call. Hence we can immediately
    > see that we cannot *ever* do a proper sync of an XFS filesystem
    > in Linux without modifying do_sync() to do more callouts.
    >
    > Worse, XFS can also still have *dirty inodes* because sync_inodes(1)
    > will remove inodes from the dirty list in the async pass, but they
    > can get dirtied a short time later (if they had dirty data) when the
    > data I/O completes. Hence if the second sync pass completes before
    > the inode is dirtied again we'll miss flushing it. This will mean we
    > don't write inode size updates during sync. This is the same race
    > that pdflush running in the background can trigger.


    This is a common problem that would hit any filesystem trying to have
    some sort of ordered data or journaled data mode.

    > However, I have a problem - I'm an expert in XFS, not the other tens
    > of Linux filesystems so I can't begin to guess what the impact of
    > changing do_sync() would be on those many filesystems. How many
    > filesystems would such a change break? Indeed - how many are broken
    > right now by having dirty inodes and superblocks slip through
    > sync(1)?


    I would guess more are broken now than a change in order would
    break. Then again, a pure change in order would still leave this
    code fragile as hell.

    > What are the alternatives? do_sync() operates above any particular
    > filesystem, so it's hard to provide a filesystem specific ->do_sync
    > method to avoid changing sync order for all filesystems. Do we
    > change do_sync() to completely sync a superblock at a time instead
    > of doing each operation across all superblocks before moving onto
    > the next operation? Is there any particular reason (e.g. performance, locking) for the current
    > method that would prevent changing to completely-sync-a-superblock
    > iteration algorithm so we can provide a custom ->do_sync method?


    Locking can't be the reason as we should never hold locks while
    returning from one of the VFS operations. I think it's performance,
    or rather alleged performance, as I think it doesn't really matter.
    If it matters, however, there is an easy way to make it perform
    just as well with a proper callout - just spawn a thread for every
    filesystem to perform it in parallel.

    > Are there any other ways that we can get a custom ->do_sync
    > method for XFS? I'd prefer a custom method so we don't have to
    > revalidate every linux filesystem, especially as XFS already has
    > everything it needs to provide it's own sync method (used for
    > freezing) and a test suite to validate it is working correctly.....


    I think having a method for this would be useful. And given that
    a proper sync should be exactly the same as a filesystem freeze,
    we should maybe use one method for both of those and use the
    chance to give filesystems better control over the freeze process?


  3. Re: do_sync() and XFSQA test 182 failures....

    On Thu, Oct 30, 2008 at 06:46:25PM -0400, Christoph Hellwig wrote:
    > On Thu, Oct 30, 2008 at 07:50:20PM +1100, Dave Chinner wrote:
    > >[*] Starting pdflush to sync data in the background when we are
    > > about to start flushing ourselves is self-defeating. instead of
    > > having a single thread doing optimal writeout patterns, we now
    > > have two threads trying to sync the same filesystems and
    > > competing with each other to write out dirty inodes. This
    > > actually causes bugs in sync because pdflush is doing async
    > > flushes. Hence if pdflush races and wins during the sync flush
    > > part of the sync process, sync_inodes(1) will return before all
    > > the data/metadata is on disk because it can't be found to be
    > > waited on.

    >
    > Note that this is an XFS special. Every other filesystem (at least the
    > major ones) rely on the VFS to write out data.


    The race is based on which thread removes the remaining
    dirty inodes from the sb_s_dirty list - I don't think that is XFS
    specific.

    > > Now the sync is _supposedly_ complete. But we still have a dirty
    > > log and superblock thanks to delayed allocation that may have
    > > occurred after the sync_supers() call. Hence we can immediately
    > > see that we cannot *ever* do a proper sync of an XFS filesystem
    > > in Linux without modifying do_sync() to do more callouts.
    > >
    > > Worse, XFS can also still have *dirty inodes* because sync_inodes(1)
    > > will remove inodes from the dirty list in the async pass, but they
    > > can get dirtied a short time later (if they had dirty data) when the
    > > data I/O completes. Hence if the second sync pass completes before
    > > the inode is dirtied again we'll miss flushing it. This will mean we
    > > don't write inode size updates during sync. This is the same race
    > > that pdflush running in the background can trigger.

    >
    > This is a common problem that would hit any filesystem trying to have
    > some sort of ordered data or journaled data mode.
    >
    > > However, I have a problem - I'm an expert in XFS, not the other tens
    > > of Linux filesystems so I can't begin to guess what the impact of
    > > changing do_sync() would be on those many filesystems. How many
    > > filesystems would such a change break? Indeed - how many are broken
    > > right now by having dirty inodes and superblocks slip through
    > > sync(1)?

    >
    > I would guess more are broken now then a change in order would break.
    > Then again purely a change in order would still leave this code
    > fragile as hell.


    Right, which is why I want a custom ->do_sync method so I can
    *guarantee* that sync works on XFS without fear of breaking other
    filesystems...

    > > What are the alternatives? do_sync() operates above any particular
    > > filesystem, so it's hard to provide a filesystem specific ->do_sync
    > > method to avoid changing sync order for all filesystems. Do we
    > > change do_sync() to completely sync a superblock at a time instead
    > > of doing each operation across all superblocks before moving onto
    > > the next operation? Is there any particular reason (e.g. performance, locking) for the current
    > > method that would prevent changing to completely-sync-a-superblock
    > > iteration algorithm so we can provide a custom ->do_sync method?

    >
    > Locking can't be the reason as we should never hold locks while
    > returning from one of the VFS operations. I think it's performance
    > or rather alledged performance as I think it doesn't really matter.


    Ok.

    > If it matters however there is an easy method to make it perform just
    > as well with a proper callout - just spawn a thread for every filesystem
    > to perform it in parallel.


    Sure, that will be faster as long as the filesystems are on
    separate devices.

    As it is, once we have custom ->do_sync, XFS will probably grow
    multiple threads per filesystem to sync AGs on independent spindles
    in parallel.....

    > > Are there any other ways that we can get a custom ->do_sync
    > > method for XFS? I'd prefer a custom method so we don't have to
    > > revalidate every linux filesystem, especially as XFS already has
    > > everything it needs to provide it's own sync method (used for
    > > freezing) and a test suite to validate it is working correctly.....

    >
    > I think having a method for this would be useful. And given that
    > a proper sync should be exactly the same as a filesysytem freeze
    > we should maybe use one method for both of those and use the chance
    > to give filesystem better control over the freeze process?


    Right - that's exactly where we should be going with this, I think.
    I'd suggest two callouts, perhaps: ->sync_data and ->sync_metadata.
    The freeze code can then still operate in two stages, and we can
    also use them for separating data and inode writeback in pdflush....

    FWIW, I mentioned doing this sort of thing here:

    http://xfs.org/index.php/Improving_i...c_pdflush_Code

    I think I'll look at redoing do_sync() to provide a custom sync
    method before trying to fix XFS....

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com

  4. Re: do_sync() and XFSQA test 182 failures....

    On Fri, Oct 31, 2008 at 11:12:49AM +1100, Dave Chinner wrote:
    > On Thu, Oct 30, 2008 at 06:46:25PM -0400, Christoph Hellwig wrote:
    > > On Thu, Oct 30, 2008 at 07:50:20PM +1100, Dave Chinner wrote:
    > > > Are there any other ways that we can get a custom ->do_sync
    > > > method for XFS? I'd prefer a custom method so we don't have to
    > > > revalidate every linux filesystem, especially as XFS already has
    > > > everything it needs to provide it's own sync method (used for
    > > > freezing) and a test suite to validate it is working correctly.....

    > >
    > > I think having a method for this would be useful. And given that
    > > a proper sync should be exactly the same as a filesysytem freeze
    > > we should maybe use one method for both of those and use the chance
    > > to give filesystem better control over the freeze process?

    >
    > Right - that's exactly where we should be going with this, I think.
    > I'd suggest two callouts, perhaps: ->sync_data and ->sync_metadata.
    > The freeze code can then still operate in two stages, and we can
    > also use then for separating data and inode writeback in pdflush....
    >
    > FWIW, I mentioned doing this sort of thing here:
    >
    > http://xfs.org/index.php/Improving_i...c_pdflush_Code
    >
    > I think I'll look at redoing do_sync() to provide a custom sync
    > method before trying to fix XFS....


    As it is, here's the somewhat cleaned up patch that demonstrates
    the changes needed to make xfsqa test 182 pass....

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com

    XFS, RFC: demonstration fix of test 182

    This patch shows the various changes needed to ensure sync(1)
    leaves the filesystem in a consistent state. It shows the different
    format of do_sync(), the changes to mark the inode dirty before
    data I/O, and the changes to xfs_fs_write_super() and the
    functions it calls to ensure the log gets covered at the
    end of the sync.

    This is in no way a real fix for the problem, merely a demonstration
    of the problem. The real fix is to make the XFS sync code use
    the same flush algorithm as freeze, but that first requires VFS
    level changes to provide a method for doing so.
    ---
    fs/sync.c | 7 +++----
    fs/xfs/linux-2.6/xfs_aops.c | 38 ++++++++++++++++++++++++++++----------
    fs/xfs/linux-2.6/xfs_super.c | 29 +++++++++++------------------
    fs/xfs/linux-2.6/xfs_sync.c | 41 ++++++++++++++++++++++++++---------------
    fs/xfs/linux-2.6/xfs_sync.h | 2 +-
    5 files changed, 69 insertions(+), 48 deletions(-)

    diff --git a/fs/sync.c b/fs/sync.c
    index 137b550..69f40b7 100644
    --- a/fs/sync.c
    +++ b/fs/sync.c
    @@ -23,13 +23,12 @@
    */
    static void do_sync(unsigned long wait)
    {
    - wakeup_pdflush(0);
    - sync_inodes(0); /* All mappings, inodes and their blockdevs */
    - DQUOT_SYNC(NULL);
    - sync_supers(); /* Write the superblocks */
    sync_filesystems(0); /* Start syncing the filesystems */
    sync_filesystems(wait); /* Waitingly sync the filesystems */
    + DQUOT_SYNC(NULL);
    + sync_supers(); /* Write the superblocks */
    sync_inodes(wait); /* Mappings, inodes and blockdevs, again. */
    + sync_supers(); /* Write the superblocks, again */
    if (!wait)
    printk("Emergency Sync complete\n");
    if (unlikely(laptop_mode))
    diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
    index f8fa620..cb79a21 100644
    --- a/fs/xfs/linux-2.6/xfs_aops.c
    +++ b/fs/xfs/linux-2.6/xfs_aops.c
    @@ -143,19 +143,37 @@ xfs_destroy_ioend(
    }

    /*
    + * If the end of the current ioend is beyond the current EOF,
    + * return the new EOF value, otherwise zero.
    + */
    +STATIC xfs_fsize_t
    +xfs_ioend_new_eof(
    + xfs_ioend_t *ioend)
    +{
    + xfs_inode_t *ip = XFS_I(ioend->io_inode);
    + xfs_fsize_t isize;
    + xfs_fsize_t bsize;
    +
    + bsize = ioend->io_offset + ioend->io_size;
    + isize = MAX(ip->i_size, ip->i_new_size);
    + isize = MIN(isize, bsize);
    + return isize > ip->i_d.di_size ? isize : 0;
    +}
    +
    +/*
    * Update on-disk file size now that data has been written to disk.
    * The current in-memory file size is i_size. If a write is beyond
    * eof i_new_size will be the intended file size until i_size is
    * updated. If this write does not extend all the way to the valid
    * file size then restrict this update to the end of the write.
    */
    +
    STATIC void
    xfs_setfilesize(
    xfs_ioend_t *ioend)
    {
    xfs_inode_t *ip = XFS_I(ioend->io_inode);
    xfs_fsize_t isize;
    - xfs_fsize_t bsize;

    ASSERT((ip->i_d.di_mode & S_IFMT) == S_IFREG);
    ASSERT(ioend->io_type != IOMAP_READ);
    @@ -163,19 +181,13 @@ xfs_setfilesize(
    if (unlikely(ioend->io_error))
    return;

    - bsize = ioend->io_offset + ioend->io_size;
    -
    xfs_ilock(ip, XFS_ILOCK_EXCL);
    -
    - isize = MAX(ip->i_size, ip->i_new_size);
    - isize = MIN(isize, bsize);
    -
    - if (ip->i_d.di_size < isize) {
    + isize = xfs_ioend_new_eof(ioend);
    + if (isize) {
    ip->i_d.di_size = isize;
    ip->i_update_size = 1;
    xfs_mark_inode_dirty_sync(ip);
    }
    -
    xfs_iunlock(ip, XFS_ILOCK_EXCL);
    }

    @@ -366,10 +378,16 @@ xfs_submit_ioend_bio(
    struct bio *bio)
    {
    atomic_inc(&ioend->io_remaining);
    -
    bio->bi_private = ioend;
    bio->bi_end_io = xfs_end_bio;

    + /*
    + * if the I/O is beyond EOF we mark the inode dirty immediately
    + * but don't update the inode size until I/O completion.
    + */
    + if (xfs_ioend_new_eof(ioend))
    + xfs_mark_inode_dirty_sync(XFS_I(ioend->io_inode));
    +
    submit_bio(WRITE, bio);
    ASSERT(!bio_flagged(bio, BIO_EOPNOTSUPP));
    bio_put(bio);
    diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
    index c5cfc1e..bbac9e3 100644
    --- a/fs/xfs/linux-2.6/xfs_super.c
    +++ b/fs/xfs/linux-2.6/xfs_super.c
    @@ -1118,8 +1118,7 @@ xfs_fs_write_super(
    struct super_block *sb)
    {
    if (!(sb->s_flags & MS_RDONLY))
    - xfs_sync_fsdata(XFS_M(sb), 0);
    - sb->s_dirt = 0;
    + xfs_sync_fsdata(XFS_M(sb), SYNC_WAIT);
    }

    STATIC int
    @@ -1131,22 +1130,16 @@ xfs_fs_sync_super(
    int error;

    /*
    - * Treat a sync operation like a freeze. This is to work
    - * around a race in sync_inodes() which works in two phases
    - * - an asynchronous flush, which can write out an inode
    - * without waiting for file size updates to complete, and a
    - * synchronous flush, which wont do anything because the
    - * async flush removed the inode's dirty flag. Also
    - * sync_inodes() will not see any files that just have
    - * outstanding transactions to be flushed because we don't
    - * dirty the Linux inode until after the transaction I/O
    - * completes.
    + * Treat a blocking sync operation like a data freeze.
    + * They are effectively the same thing - both need to
    + * get all the data on disk and wait for it to complete.
    + *
    + * Also, no point marking the superblock clean - we'll
    + * dirty metadata flushing data out....
    */
    - if (wait || unlikely(sb->s_frozen == SB_FREEZE_WRITE))
    - error = xfs_quiesce_data(mp);
    - else
    - error = xfs_sync_fsdata(mp, 0);
    - sb->s_dirt = 0;
    + if (unlikely(sb->s_frozen == SB_FREEZE_WRITE))
    + wait = 1;
    + error = xfs_quiesce_data(mp, wait);

    if (unlikely(laptop_mode)) {
    int prev_sync_seq = mp->m_sync_seq;
    @@ -1283,7 +1276,7 @@ xfs_fs_remount(

    /* rw -> ro */
    if (!(mp->m_flags & XFS_MOUNT_RDONLY) && (*flags & MS_RDONLY)) {
    - xfs_quiesce_data(mp);
    + xfs_quiesce_data(mp, 1);
    xfs_quiesce_attr(mp);
    mp->m_flags |= XFS_MOUNT_RDONLY;
    }
    diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
    index d12d31b..975534b 100644
    --- a/fs/xfs/linux-2.6/xfs_sync.c
    +++ b/fs/xfs/linux-2.6/xfs_sync.c
    @@ -64,8 +64,6 @@ xfs_sync_inodes_ag(
    int last_error = 0;
    int fflag = XFS_B_ASYNC;

    - if (flags & SYNC_DELWRI)
    - fflag = XFS_B_DELWRI;
    if (flags & SYNC_WAIT)
    fflag = 0; /* synchronous overrides all */

    @@ -127,6 +125,8 @@ xfs_sync_inodes_ag(
    /*
    * If we have to flush data or wait for I/O completion
    * we need to hold the iolock.
    + *
    + * XXX: this doesn't catch I/O in progress
    */
    if ((flags & SYNC_DELWRI) && VN_DIRTY(inode)) {
    xfs_ilock(ip, XFS_IOLOCK_SHARED);
    @@ -202,11 +202,15 @@ xfs_sync_inodes(
    STATIC int
    xfs_commit_dummy_trans(
    struct xfs_mount *mp,
    - uint log_flags)
    + uint flags)
    {
    struct xfs_inode *ip = mp->m_rootip;
    struct xfs_trans *tp;
    int error;
    + int log_flags = XFS_LOG_FORCE;
    +
    + if (flags & SYNC_WAIT)
    + log_flags |= XFS_LOG_SYNC;

    /*
    * Put a dummy transaction in the log to tell recovery
    @@ -224,13 +228,12 @@ xfs_commit_dummy_trans(
    xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
    xfs_trans_ihold(tp, ip);
    xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
    - /* XXX(hch): ignoring the error here.. */
    error = xfs_trans_commit(tp, 0);
    -
    xfs_iunlock(ip, XFS_ILOCK_EXCL);

    + /* the log force ensures this transaction is pushed to disk */
    xfs_log_force(mp, 0, log_flags);
    - return 0;
    + return error;
    }

    int
    @@ -270,6 +273,7 @@ xfs_sync_fsdata(
    */
    if (XFS_BUF_ISPINNED(bp))
    xfs_log_force(mp, 0, XFS_LOG_FORCE);
    + xfs_flush_buftarg(mp->m_ddev_targp, 1);
    }


    @@ -278,7 +282,13 @@ xfs_sync_fsdata(
    else
    XFS_BUF_ASYNC(bp);

    - return xfs_bwrite(mp, bp);
    + error = xfs_bwrite(mp, bp);
    + if (!error && xfs_log_need_covered(mp)) {
    + error = xfs_commit_dummy_trans(mp, (flags & SYNC_WAIT));
    + if (flags & SYNC_WAIT)
    + mp->m_super->s_dirt = 0;
    + }
    + return error;

    out_brelse:
    xfs_buf_relse(bp);
    @@ -305,18 +315,21 @@ xfs_sync_fsdata(
    */
    int
    xfs_quiesce_data(
    - struct xfs_mount *mp)
    + struct xfs_mount *mp,
    + int wait)
    {
    int error;

    /* push non-blocking */
    - xfs_sync_inodes(mp, SYNC_DELWRI|SYNC_BDFLUSH);
    + xfs_sync_inodes(mp, SYNC_DELWRI);
    XFS_QM_DQSYNC(mp, SYNC_BDFLUSH);
    - xfs_filestream_flush(mp);

    - /* push and block */
    - xfs_sync_inodes(mp, SYNC_DELWRI|SYNC_WAIT|SYNC_IOWAIT);
    - XFS_QM_DQSYNC(mp, SYNC_WAIT);
    + /* push and block till complete */
    + if (wait) {
    + xfs_sync_inodes(mp, SYNC_DELWRI|SYNC_WAIT|SYNC_IOWAIT);
    + XFS_QM_DQSYNC(mp, SYNC_WAIT);
    + xfs_filestream_flush(mp);
    + }

    /* write superblock and hoover up shutdown errors */
    error = xfs_sync_fsdata(mp, 0);
    @@ -480,8 +493,6 @@ xfs_sync_worker(
    /* dgc: errors ignored here */
    error = XFS_QM_DQSYNC(mp, SYNC_BDFLUSH);
    error = xfs_sync_fsdata(mp, SYNC_BDFLUSH);
    - if (xfs_log_need_covered(mp))
    - error = xfs_commit_dummy_trans(mp, XFS_LOG_FORCE);
    }
    mp->m_sync_seq++;
    wake_up(&mp->m_wait_single_sync_task);
    diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
    index 5f6de1e..5586f7a 100644
    --- a/fs/xfs/linux-2.6/xfs_sync.h
    +++ b/fs/xfs/linux-2.6/xfs_sync.h
    @@ -39,7 +39,7 @@ void xfs_syncd_stop(struct xfs_mount *mp);
    int xfs_sync_inodes(struct xfs_mount *mp, int flags);
    int xfs_sync_fsdata(struct xfs_mount *mp, int flags);

    -int xfs_quiesce_data(struct xfs_mount *mp);
    +int xfs_quiesce_data(struct xfs_mount *mp, int wait);
    void xfs_quiesce_attr(struct xfs_mount *mp);

    void xfs_flush_inode(struct xfs_inode *ip);
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: do_sync() and XFSQA test 182 failures....

    Dave Chinner wrote:
    > Folks,
    >
    > I think I've finally found the bug(s) that is causing XFSQA test 182
    > to fail. Test 182 writes a bunch of files, then runs sync, then
    > shuts the filesystem down. It then unmounts and remounts the fs and
    > checks that all the files are the correct length and have the
    > correct contents. i.e. that sync semantics have been correctly
    > observed.
    >
    > The cause of the failure is that log recovery is replaying inode
    > allocation and overwriting the last inode writes that contained
    > unlogged changes (i.e. the inode size update that occurs at I/O
    > completion).
    >
    > The problem is that we've never been able to work out why recovery
    > is doing this. What has been nagging at the back of my mind for
    > quite some time is the fact that we do actually write these inodes
    > to disk and that should allow the tail of the log to move forward
    > past the inode allocation transaction and hence it should not be
    > replayed during recovery.
    >
    > A solution that has been proposed in the past (by Lachlan) is to log
    > the inode size updates instead of writing the inode to disk. In
    > that case, recovery also replays the inode modification transactions
    > and so we don't lose anything. It is a solution that would fix the
    > problem. However, always logging inodes instead of writing unlogged
    > changes has other performance implications that we'd prefer to avoid
    > (i.e. the number of extra transactions it will cause).

    Logging the inode every time a file size update occurred increased log
    traffic by about 11% for the load that test 182 generates.

    Logging the inode during inode writeout if, and only if, there are
    unlogged changes (ie i_update_core is set) has a negligible impact on
    log traffic or performance.

    I thought I understood this problem yet your description is a lot more
    detailed than I expected. I agree with you that sync is updating
    everything on disk and if it wasn't for log replay messing things up
    then everything would be good. So although the inode on disk is up to
    date the history of changes to that inode in the log is missing the
    last changes involving file size updates. So when the log is replayed
    the inode loses the file size update. And it's all because of the
    inode cluster buffer initialisation log entry that resets the cluster
    (and all the inodes in it) so even the di_flushiter hack can't save us.

    The obvious solution to me was to complete the history of the inode in
    the log so if replay wants to start from scratch it will make all the
    necessary changes to bring the inode to a state that matches what is
    on disk. Logging unlogged changes during sync will achieve this.

    Avoiding log replay altogether is even better but that solution is
    only going to work if we've run sync just before our system crashes.
    If we haven't run sync before our system crashes then we'll still hit
    this problem with regressing inodes but if we haven't run sync there
    are no guarantees, right?

    >
    > This solution also seemed to me to be papering over the real problem
    > which we hadn't yet found because it did not explain why we were
    > replaying an allocation that we should not need to. Hence the
    > problem has gone unfixed since Lachlan first discovered it despite
    > trying several times to get to the bottom of the problem.
    > Now I think I finally have.
    >
    > I started by instrumenting the sync code and the inode dirtying and
    > writeback code to confirm the order of data, inode and sync
    > operations, with a view to understanding why the tail of the log was
    > not moving forwards when the inode clusters were written out during
    > the sync. To start with, let's look at what do_sync() does:
    >
    > 24 static void do_sync(unsigned long wait)
    > 25 {
    > 26 wakeup_pdflush(0);
    > 27 sync_inodes(0); /* All mappings, inodes and their blockdevs */
    > 28 DQUOT_SYNC(NULL);
    > 29 sync_supers(); /* Write the superblocks */
    > 30 sync_filesystems(0); /* Start syncing the filesystems */
    > 31 sync_filesystems(wait); /* Waitingly sync the filesystems */
    > 32 sync_inodes(wait); /* Mappings, inodes and blockdevs, again. */
    > 33 if (!wait)
    > 34 printk("Emergency Sync complete\n");
    > 35 if (unlikely(laptop_mode))
    > 36 laptop_sync_completion();
    > 37 }
    >
    > Let's translate this into what XFS does:
    >
    > wakeup_pdflush(0)[*] - run a concurrent background
    > sync of the fs via pdflush.
    >
    > sync_inodes(0) - walks the superblock dirty inode
    > list doing an async flush of
    > inodes and their data.
    >
    > sync_supers() - writes the superblock, forces the
    > log to disk
    >
    > sync_filesystems(0) - non block filesystem sync. XFS
    > writes the superblock
    >
    > sync_filesystems(1) - XFS writes all dirty data to disk
    > and waits for it. Dirties the
    > superblock and the log. Does not
    > write inodes.
    >
    > sync_inodes(1) - walk the superblock dirty inode
    > list *twice*, first doing an async
    > flush of dirty data and inodes, secondly
    > doing a sync flush of remaining
    > dirty data and inodes.
    >
    >[*] Starting pdflush to sync data in the background when we are
    > about to start flushing ourselves is self-defeating. Instead of
    > having a single thread doing optimal writeout patterns, we now
    > have two threads trying to sync the same filesystems and
    > competing with each other to write out dirty inodes. This
    > actually causes bugs in sync because pdflush is doing async
    > flushes. Hence if pdflush races and wins during the sync flush
    > part of the sync process, sync_inodes(1) will return before all
    > the data/metadata is on disk because it can't be found to be
    > waited on.
    >
    > Now the sync is _supposedly_ complete. But we still have a dirty
    > log and superblock thanks to delayed allocation that may have
    > occurred after the sync_supers() call. Hence we can immediately
    > see that we cannot *ever* do a proper sync of an XFS filesystem
    > in Linux without modifying do_sync() to do more callouts.
    >
    > Worse, XFS can also still have *dirty inodes* because sync_inodes(1)
    > will remove inodes from the dirty list in the async pass, but they
    > can get dirtied a short time later (if they had dirty data) when the
    > data I/O completes. Hence if the second sync pass completes before
    > the inode is dirtied again we'll miss flushing it. This will mean we
    > don't write inode size updates during sync. This is the same race
    > that pdflush running in the background can trigger.
    >
    > Clearly this is broken, but this particular problem is an XFS bug
    > and is fixed by XFS marking the inode dirty before I/O dispatch if
    > the end offset of the I/O is beyond the current EOF so there is no
    > window where the inode is temporarily clean. This, however, does
    > not fix the race condition between the sync thread and pdflush,
    > just the async-then-sync problem within the flush thread.
    >
    > Back to do_sync(), the order of operations we need to reliably sync
    > a journalling filesystem that uses delayed allocation and updates
    > metadata on data I/O completion is effectively as follows:
    >
    > - flush all dirty data
    > - wait for all metadata updates caused by data flush to
    > complete
    > - force unwritten async transactions to disk to unpin dirty metadata
    > - flush all dirty metadata
    > - write the superblock
    >
    > In generic speak, this effectively requires:
    >
    > sync_filesystems(0) [**]
    > sync_filesystems(1)
    > sync_supers()
    > sync_inodes(1) [***]
    > sync_supers()
    >
    > [**] async flush optimisation
    > [***] async flush optimisation is implemented internally to
    > sync_inodes() for sync flushes.
    >
    > This leads to the following callouts and the behaviour that XFS
    > would need for the callouts:
    >
    > sync_filesystems(0)
    > ->sync_fs() - async flush all dirty data
    > sync_filesystems(1)
    > ->sync_fs() - sync flush remaining dirty data
    > sync_supers()
    > ->write_super() - write super, force the log
    > sync_inodes(1) [****]
    > sync_inodes_sb(0) - async flush of dirty inodes
    > sync_inodes_sb(1) - sync flush of remaining inodes
    > sync_supers()
    > ->write_super() - write sb, force the log.
    >
    > [****] sync_inodes() really needs to fall down to a ->sync_inodes()
    > callout for the filesystem to be able to implement an optimal
    > inode flushing strategy.
    >
    > However, even with this order in place, test 182 still fails.
    >
    > So I looked at the filesystem prior to log recovery (mount -o
    > ro,norecovery) and saw that all the data is on disk, all the inode
    > sizes are correct, the superblock is up to date and everything looks
    > OK. That is, the sync did everything it was supposed to and the
    > above order of writing out the filesystem is working correctly.
    >
    > As soon as I ran recovery, though, I saw a small number of inodes
    > go back to having an inode size of zero - they regress. The reason
    > for this is that the log tail location (l_tail_lsn) at the end of
    > the sync was not updated on disk and
    > hence recovery is replaying transactions.
    >
    > At this point I wondered if the log covering code was not working
    > properly. I'd never really looked at it in any detail, and as soon
    > as I read the description I knew that it was not working. The
    > problem log covering is supposed to solve is as follows (from
    > fs/xfs/xfs_log_priv.h):
    >
    > 161 * These states are used to insert dummy log entries to cover
    > 162 * space allocation transactions which can undo non-transactional changes
    > 163 * after a crash. Writes to a file with space
    > 164 * already allocated do not result in any transactions. Allocations
    > 165 * might include space beyond the EOF. So if we just push the EOF a
    > 166 * little, the last transaction for the file could contain the wrong
    > 167 * size. If there is no file system activity, after an allocation
    > 168 * transaction, and the system crashes, the allocation transaction
    > 169 * will get replayed and the file will be truncated. This could
    > 170 * be hours/days/... after the allocation occurred.
    >
    > Immediately it was obvious that we're seeing the above problem
    > and that log covering is a method for ensuring that the state of the
    > log on disk is the same as that in memory at the end of a sync.
    >
    > Hence, as the last part of the sync we need to try to cover the log
    > with a dummy transaction to update the real location of the log tail
    > in the log. Therefore we will no longer replay the inode allocation
    > transactions because the tail in the log matches the in memory state
    > after the inodes have been flushed.
    >
    > With the current do_sync() code, we have no callout once the inodes
    > are written to issue a dummy transactions to cover the log
    > correctly. The do_sync() process needs to end with a sync_supers()
    > to get the correct callout to XFS to allow this to happen. i.e.
    > whenever we try to write the superblock we also should be trying to
    > initiate the log covering process, and we can't do this right now.
    > Once the log is covered, the recovery-overwriting-inodes problem
    > goes away because recovery is not needed.
    >
    > Everyone understand the problem now?
    >
    >
    >
    > FWIW, XFS has had this log covering code since, well, forever. It
    > came from Irix and it worked on Irix. I don't think that it has ever
    > worked on Linux, though, because of the lack of a sync_supers() call
    > at the end of do_sync(1). We've just never noticed it until we
    > corrected the infamous NULL files problems in 2.6.22 which hid this
    > particular cause of file size mismatches after a crash.
    >
    > With a bunch of hacks in place, test 182 now passes and sync(1) on
    > XFS finally does what it is supposed to. I'm not going to post the
    > hacky, full-of-garbage, debuggy patch I have that I used to discover
    > this - I'll clean it up first to just have the bits needed to fix
    > the problem, then post it. That'll be tomorrow....
    >
    > However, I have a problem - I'm an expert in XFS, not the other tens
    > of Linux filesystems so I can't begin to guess what the impact of
    > changing do_sync() would be on those many filesystems. How many
    > filesystems would such a change break? Indeed - how many are broken
    > right now by having dirty inodes and superblocks slip through
    > sync(1)?
    >
    > And then the big question - how the hell does one test such change?
    >
    > I can test XFS easily enough because it has shutdown ioctls that
    > effectively simulate a power failure - that what test 182 uses. I
    > don't think any other filesystem has such an ioctl, though, and I
    > don't have the time or hardware to repeatedly crash test every
    > filesystem out there to prove that a change to do_sync() doesn't
    > negatively impact them.
    >
    > What are the alternatives? do_sync() operates above any particular
    > filesystem, so it's hard to provide a filesystem specific ->do_sync
    > method to avoid changing sync order for all filesystems. Do we
    > change do_sync() to completely sync a superblock at a time instead
    > of doing each operation across all superblocks before moving onto
    > the next operation? Is there any particular reason (e.g. performance, locking) for the current
    > method that would prevent changing to a completely-sync-a-superblock
    > iteration algorithm so we can provide a custom ->do_sync method?
    >
    > Are there any other ways that we can get a custom ->do_sync
    > method for XFS? I'd prefer a custom method so we don't have to
    > revalidate every linux filesystem, especially as XFS already has
    > everything it needs to provide its own sync method (used for
    > freezing) and a test suite to validate it is working correctly.....
    >
    > Are there any other options for solving this?
    >
    > Cheers,
    >
    > Dave.


  6. Re: do_sync() and XFSQA test 182 failures....

    On Fri, Oct 31, 2008 at 03:02:02PM +1100, Lachlan McIlroy wrote:
    > Dave Chinner wrote:
    >> Folks,
    >>
    >> I think I've finally found the bug(s) that is causing XFSQA test 182
    >> to fail. Test 182 writes a bunch of files, then runs sync, then
    >> shuts the filesystem down. It then unmounts and remounts the fs and
    >> checks that all the files are the correct length and have the
    >> correct contents. i.e. that sync semantics have been correctly
    >> observed.
    >>
    >> The cause of the failure is that log recovery is replaying inode
    >> allocation and overwriting the last inode writes that contained
    >> unlogged changes (i.e. the inode size update that occurs at I/O
    >> completion).
    >>
    >> The problem is that we've never been able to work out why recovery
    >> is doing this. What has been nagging at the back of my mind for
    >> quite some time is the fact that we do actually write these inodes
    >> to disk and that should allow the tail of the log to move forward
    >> past the inode allocation transaction and hence it should not be
    >> replayed during recovery.
    >>
    >> A solution that has been proposed in the past (by Lachlan) is to log
    >> the inode size updates instead of writing the inode to disk. In
    >> that case, recovery also replays the inode modification transactions
    >> and so we don't lose anything. It is a solution that would fix the
    >> problem. However, always logging inodes instead of writing unlogged
    >> changes has other performance implications that we'd prefer to avoid
    >> (i.e. the number of extra transactions it will cause).

    > Logging the inode every time a file size update occured increased log
    > traffic by about 11% for the load that test 182 generates.


    It increases the number of transactions by 33% (i.e. one to create
    the file, one for the allocation, and an additional one for updating
    the file size). That means a significant increase in the CPU
    overhead involved in writing inodes via pdflush, and we've still got
    to write those inodes to disk at some point.

    Besides, what do we do when we write inodes back from the AIL and we
    find unlogged changes then? We can't issue transactions from the
    AIL, as that would deadlock if the log is full. And we can't not push
    the inodes out, as that will prevent the tail of the log moving
    forward and that will deadlock the filesystem as well...

    On top of that, XFS already logs far, far too much information.
    We need to be decreasing logging overhead, not increasing it by
    a significant fraction....

    > Logging the inode during inode writeout if, and only if, there are
    > unlogged changes (ie i_update_core is set) has a negligible impact on
    > log traffic or performance.


    Every time you write beyond EOF that happens. That's a very
    frequent occurrence, so I'd say that it's not negligible.

    > I thought I understood this problem yet your description is a lot more
    > detailed than I expected. I agree with you that sync is updating
    > everything on disk


    It's not, and that is the problem - it's failing to write the
    in-memory position of the tail of the log at the end of the sync.
    So while all the data and metadata is up to date on disk, the
    log is not. We haven't written everything we should be to
    disk as part of the sync.

    > and if it wasn't for log replay messing things up
    > then everything would be good. So although the inode on disk is up to
    > date the history of changes to that inode in the log is missing the
    > last changes involving file size updates. So when the log is replayed
    > the inode loses the file size update. And it's all because of the
    > inode cluster buffer initialisation log entry that resets the cluster
    > (and all the inodes in it) so even the di_flushiter hack can't save us.


    That's a red herring. My last attempt at solving this problem a
    couple of months ago was based on this premise - it (successfully)
    avoided replaying the inode allocation transaction, but the inode
    still got overwritten. It got overwritten by the allocation
    transaction being replayed in exactly the manner described by the
    log covering code....

    > The obvious solution to me was to complete the history of the inode in
    > the log so if replay wants to start from scratch it will make all the
    > necessary changes to bring the inode to a state that matches what is
    > on disk. Logging unlogged changes during sync will achieve this.
    >
    > Avoiding log replay altogether is even better but that solution is
    > only going to work if we've run sync just before our system crashes.
    > If we haven't run sync before our system crashes then we'll still hit
    > this problem with regressing inodes but if we haven't run sync there
    > are no guarantees, right?


    If the system crashes while busy, there's no guarantee that cached
    write-behind data has been written out, let alone the inode or the
    allocation transaction. Recent transactions will still be in memory,
    and log buffers may have been written out of order so on a crash we
    can lose up to the last 2MB of transactions from recovery. Logging
    inodes instead of writing them back does not change this at all - we
    just lose the inode changes from the log instead of from the on disk
    location.

    XFS has never attempted or claimed to provide data reliability
    guarantees when a crash occurs - it was designed to ensure that
    the filesystem metadata is consistent after recovery, not that
    there was zero data loss.....

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com

  7. Re: do_sync() and XFSQA test 182 failures....

    On Fri, Oct 31, 2008 at 11:12:49AM +1100, Dave Chinner wrote:
    > Right - that's exactly where we should be going with this, I think.
    > I'd suggest two callouts, perhaps: ->sync_data and ->sync_metadata.
    > The freeze code can then still operate in two stages, and we can
    > also use then for separating data and inode writeback in pdflush....
    >
    > FWIW, I mentioned doing this sort of thing here:
    >
    > http://xfs.org/index.php/Improving_i...c_pdflush_Code
    >
    > I think I'll look at redoing do_sync() to provide a custom sync
    > method before trying to fix XFS....


    And you weren't the first to think of this. Reiser4, for example,
    has had a patch forever to turn sync_sb_inodes into a filesystem method,
    and I think something similar is what we want. When talking about
    syncing we basically want a few things:

    - sync out data, either async (from pdflush) or sync
    (from sync, freeze, remount ro or unmount)
    - sync out metadata (from pdflush), either async or sync
    (from sync, freeze, remount ro or unmount)

    and then we want pdflush / sync / etc. to call into it. If we are doing
    this correctly this would also avoid having our own xfssyncd.

    And as we found out it's not just sync that gets it wrong, it's also
    fsync (which isn't part of the above picture as it's per-inode) that
    gets this utterly wrong, as well as all kinds of syncs, not just the
    unmount one. Combine this with the other data integrity issues Nick
    found in write_cache_pages, and I come to the conclusion that this whole
    area needs a profound audit and re-architecture urgently.



  8. Re: do_sync() and XFSQA test 182 failures....

    On Fri, Oct 31, 2008 at 04:31:23PM -0400, Christoph Hellwig wrote:
    > On Fri, Oct 31, 2008 at 11:12:49AM +1100, Dave Chinner wrote:
    > > Right - that's exactly where we should be going with this, I think.
    > > I'd suggest two callouts, perhaps: ->sync_data and ->sync_metadata.
    > > The freeze code can then still operate in two stages, and we can
    > > also use then for separating data and inode writeback in pdflush....
    > >
    > > FWIW, I mentioned doing this sort of thing here:
    > >
    > > http://xfs.org/index.php/Improving_i...c_pdflush_Code
    > >
    > > I think I'll look at redoing do_sync() to provide a custom sync
    > > method before trying to fix XFS....

    >
    > And you weren't the first to think of this. Reiser4, for example,
    > has had a patch forever to turn sync_sb_inodes into a filesystem method,
    > and I think something similar is what we want. When talking about
    > syncing we basically want a few things:
    >
    > - sync out data, either async (from pdflush) or sync
    > (from sync, freeze, remount ro or unmount)
    > - sync out metadata (from pdflush), either async or sync
    > (from sync, freeze, remount ro or unmount)


    Effectively, yes.

    Currently we iterate inodes for data and "metadata" sync, and the
    only other concept is writing superblocks. I think most filesystems
    have more types of metadata than this, so it makes sense for sync to
    work on the abstractions of data and metadata rather than data, inodes
    and superblocks...

    > and then we want pdflush / sync / etc. to call into it. If we are doing
    > this correctly this would also avoid having our own xfssyncd.


    Yes, though we'd need to turn a couple of the functions that
    xfssyncd performs into pdflush operations...

    > And as we found out it's not just sync that gets it wrong, it's also
    > fsync (which isn't part of the above picture as it's per-inode) that
    > gets this utterly wrong, as well as all kinds of syncs, not just the
    > unmount one.


    Async writeback (write_inode()) has the same problem as fsync -
    writing the inode before waiting for data I/O to complete - which
    means we've got to jump through hoops in the filesystem to avoid
    blocking on inodes that can't be immediately flushed, and often we
    end up writing the inode multiple times and having to issue log
    forces when we shouldn't need to. Effectively we have to tell the
    VFS to "try again later" the entire time data is being flushed
    before we can write the inode and it's exceedingly inefficient.....

    > Combine this with the other data integrity issues Nick
    > found in write_cache_pages I come to the conclusion that this whole area
    > needs some profound audit and re-architecture urgently.


    It's looking more and more that way, isn't it?

    Cheers,

    Dave.
    --
    Dave Chinner
    david@fromorbit.com

  9. Re: do_sync() and XFSQA test 182 failures....

    On Sat, Nov 01, 2008 at 08:54:30AM +1100, Dave Chinner wrote:
    > Effectively, yes.
    >
    > Currently we iterate inodes for data and "metadata" sync, and the
    > only other concept is writing superblocks. I think most filesystems
    > have more types of metadata than this, so it makes sense for sync to
    > work on the abstractions of data and metadata rather than data, inodes
    > and superblocks...


    Yes, absolutely. And for those that have inodes as primary / only
    metadata besides superblock we can still provide a generic_sync_inodes
    helper that just takes a callback to apply to every inode. Which we
    probably want anyway, as XFS is the only in-tree filesystem that currently
    has a more efficient way to iterate inodes.

    > > And as we found out it's not just sync that gets it wrong, it's also
    > > fsync (which isn't part of the above picture as it's per-inode) that
    > > gets this utterly wrong, as well as all kinds of syncs, not just the
    > > unmount one.

    >
    > Async writeback (write_inode()) has the same problem as fsync -
    > writing the inode before waiting for data I/O to complete - which
    > means we've got to jump through hoops in the filesystem to avoid
    > blocking on inodes that can't be immediately flushed, and often we
    > end up writing the inode multiple times and having to issue log
    > forces whenw e shouldn't need to. Effectively we have to tell the
    > VFS to "try again later" the entire time data is being flushed
    > before we can write the inode and it's exceedingly inefficient.....


    Yes, that was the couple of sync functions I meant above as the whole
    inode writeback path is extremely convoluted - mostly due to the dirty
    data vs metadata mixup mess.

