Thread: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty

  1. [PATCH]: Fix SMP-reordering race in mark_buffer_dirty

    Hi

    It looks like someone overoptimized mark_buffer_dirty().

    mark_buffer_dirty() is:

    void mark_buffer_dirty(struct buffer_head *bh)
    {
    	WARN_ON_ONCE(!buffer_uptodate(bh));
    	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
    		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
    }

    That buffer_dirty() test is not an atomic operation and carries no
    ordering guarantee, so it may be reordered with other memory accesses.

    So suppose this race

    CPU1:

    write to buffer data
    call mark_buffer_dirty()
    test for !buffer_dirty(bh)

    --- there is no synchronizing operation, so inside the CPU the sequence
    may get reordered to:

    test for !buffer_dirty(bh)
    write to buffer data

    CPU2:
    clear_buffer_dirty(bh);
    submit_bh(WRITE, bh);

    The resulting operations may end up in this order:
    CPU1: test for !buffer_dirty(bh) --- sees that the bit is set
    CPU2: clear_buffer_dirty(bh);
    CPU2: submit_bh(WRITE, bh);
    CPU1: write to buffer data

    So we have a clean buffer with modified data and this modification is
    going to be lost.

    Mikulas


    Signed-off-by: Mikulas Patocka

    --- linux-2.6.25-rc8/fs/buffer.c	2008-04-02 21:08:36.000000000 +0200
    +++ linux-2.6.25-rc8/fs/buffer.c	2008-04-02 21:10:25.000000000 +0200
    @@ -1180,6 +1180,12 @@
      */
     void mark_buffer_dirty(struct buffer_head *bh)
     {
    +	/*
    +	 * Make sure that the test for buffer_dirty(bh) is not reordered with
    +	 * previous modifications to the buffer data.
    +	 * -- mikulas
    +	 */
    +	smp_mb();
     	WARN_ON_ONCE(!buffer_uptodate(bh));
     	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
     		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
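
    The reordering this patch guards against can be demonstrated outside the
    kernel. Below is a minimal user-space litmus sketch of the same
    store/load pattern (hypothetical names, not kernel code; the relaxed
    atomics stand in for plain kernel memory accesses). On an x86 SMP
    machine the lost-update outcome typically shows up, though counts vary:

    /* litmus.c: build with "gcc -O2 -pthread litmus.c" */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ROUNDS 1000000

    static atomic_int data, dirty;
    static int r_dirty, r_data;
    static pthread_barrier_t start, stop;

    /* CPU1: write buffer data, then test the dirty bit */
    static void *writer(void *unused)
    {
    	for (int i = 0; i < ROUNDS; i++) {
    		pthread_barrier_wait(&start);
    		atomic_store_explicit(&data, 1, memory_order_relaxed);
    		/* no barrier here: the load below may be satisfied
    		 * before the store above becomes globally visible */
    		r_dirty = atomic_load_explicit(&dirty, memory_order_relaxed);
    		pthread_barrier_wait(&stop);
    	}
    	return NULL;
    }

    /* CPU2: clear the dirty bit, then "write out" the data */
    static void *flusher(void *unused)
    {
    	for (int i = 0; i < ROUNDS; i++) {
    		pthread_barrier_wait(&start);
    		atomic_store_explicit(&dirty, 0, memory_order_relaxed);
    		r_data = atomic_load_explicit(&data, memory_order_relaxed);
    		pthread_barrier_wait(&stop);
    	}
    	return NULL;
    }

    int main(void)
    {
    	pthread_t t1, t2;
    	long lost = 0;

    	pthread_barrier_init(&start, NULL, 3);
    	pthread_barrier_init(&stop, NULL, 3);
    	pthread_create(&t1, NULL, writer, NULL);
    	pthread_create(&t2, NULL, flusher, NULL);
    	for (int i = 0; i < ROUNDS; i++) {
    		atomic_store(&data, 0);		/* old contents */
    		atomic_store(&dirty, 1);	/* buffer starts dirty */
    		pthread_barrier_wait(&start);
    		pthread_barrier_wait(&stop);
    		/* writer saw "already dirty", yet the flusher wrote out
    		 * stale data: the modification is lost */
    		if (r_dirty == 1 && r_data == 0)
    			lost++;
    	}
    	pthread_join(t1, NULL);
    	pthread_join(t2, NULL);
    	printf("lost updates: %ld of %d rounds\n", lost, ROUNDS);
    	return 0;
    }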

  2. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



    On Wed, 2 Apr 2008, Mikulas Patocka wrote:
    > + /*
    > + * Make sure that the test for buffer_dirty(bh) is not reordered with
    > + * previous modifications to the buffer data.
    > + * -- mikulas
    > + */
    > + smp_mb();
    > WARN_ON_ONCE(!buffer_uptodate(bh));
    > if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))


    At that point, the better patch is to just *remove* the buffer_dirty()
    test, and rely on the stronger ordering requirements of
    test_set_buffer_dirty().

    The whole - and only - point of the buffer_dirty() check was to avoid the
    more expensive test_set_buffer_dirty() call, but it's only more expensive
    because of the barrier semantics. So if you add a barrier, the point goes
    away and you should instead remove the optimization.

    (I also seriously doubt you can actually trigger this in real life, but
    simplifying the code is probably fine regardless).

    Linus

  3. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty

    On Wed, 2 Apr 2008, Linus Torvalds wrote:

    > On Wed, 2 Apr 2008, Mikulas Patocka wrote:
    > > + /*
    > > + * Make sure that the test for buffer_dirty(bh) is not reordered with
    > > + * previous modifications to the buffer data.
    > > + * -- mikulas
    > > + */
    > > + smp_mb();
    > > WARN_ON_ONCE(!buffer_uptodate(bh));
    > > if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))

    >
    > At that point, the better patch is to just *remove* the buffer_dirty()
    > test, and rely on the stronger ordering requirements of
    > test_set_buffer_dirty().
    >
    > The whole - and only - point of the buffer_dirty() check was to avoid the
    > more expensive test_set_buffer_dirty() call, but it's only more expensive
    > because of the barrier semantics. So if you add a barrier, the point goes
    > away and you should instead remove the optimization.
    >
    > (I also seriously doubt you can actually trigger this in real life, but
    > simplifying the code is probably fine regardless).
    >
    > Linus


    I measured it:
    On Core2, mfence is faster (8 ticks) than lock btsl (27 ticks)
    On Pentium-4-prescott, mfence is 124 ticks and lock btsl is 86 ticks.
    On Pentium-4-pre-prescott, mfence wins again (100) and lock btsl is 120.
    On Athlon, mfence is 16 ticks and lock btsl is 19 ticks.

    So you're right, the gain of mfence is so little that you can remove it
    and use only test_set_buffer_dirty.

    I don't know if there are other architectures where smp_mb() would be
    significantly faster than test_and_set_bit().

    Mikulas
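
    For the record, numbers of this kind come from a tight rdtsc loop. A
    minimal sketch of such a microbenchmark, assuming x86-64 and gcc (a
    back-to-back loop measures the serializing instruction in isolation,
    which, as the follow-ups in this thread show, can mislead):

    /* bench.c: build with "gcc -O2 bench.c", x86-64 only */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t tsc(void)
    {
    	uint32_t lo, hi;
    	/* cpuid serializes, so rdtsc cannot drift into the timed region */
    	__asm__ volatile ("xorl %%eax, %%eax\n\tcpuid\n\trdtsc"
    			  : "=a"(lo), "=d"(hi) :: "rbx", "rcx");
    	return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
    	enum { N = 10000000 };
    	int flag = 1;
    	uint64_t t0;

    	t0 = tsc();
    	for (int i = 0; i < N; i++)
    		__asm__ volatile ("mfence" ::: "memory");
    	printf("mfence:    %.1f ticks\n", (double)(tsc() - t0) / N);

    	t0 = tsc();
    	for (int i = 0; i < N; i++)
    		__asm__ volatile ("lock; btsl $0, %0" : "+m"(flag) :: "cc");
    	printf("lock btsl: %.1f ticks\n", (double)(tsc() - t0) / N);
    	return 0;
    }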

  4. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



    On Wed, 2 Apr 2008, Linus Torvalds wrote:
    >
    > Core 2 is the outlier in having a noticeably faster "mfence" than atomic
    > instructions


    Side note: mfence is probably faster than 6 cycles on Core 2. I've seen it
    be basically zero-cost. Try adding a few memory ops around it - the timing
    will probably change. Core 2 has this odd behaviour that it sometimes does
    worse on the *really* trivial things that don't happen in real life.

    Linus

  5. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



    On Wed, 2 Apr 2008, Mikulas Patocka wrote:
    >
    > So you're right, the gain of mfence is so little that you can remove it
    > and use only test_set_buffer_dirty.


    Well, I suspect that part of the issue is that quite often you end up
    with *both* because the buffer wasn't already dirty from before.

    Re-dirtying a dirty buffer is pretty common for things like bitmap blocks
    etc, so it's probably a worthy optimization if it has no cost, and on
    Core2 I suspect your version is worth it, but it's not like it's going to
    be necessarily a 99% kind of case. I suspect quite a lot of the
    mark_buffer_dirty() calls are actually on clean buffers.

    (Of course, a valid argument is that if it was already dirty, we'll skip
    the other expensive parts, so only the "already dirty" case is worth
    optimizing for. Maybe true. There might also be cases where it means one
    less dirty cacheline in memory.)

    > I don't know if there are other architectures where smb_mb() would be
    > significantly faster than test_and_set_bit.


    Probably none, since test_and_set_bit() implies an smp_mb(), and
    generally the bigger cost is in the barrier than in the bit setting
    itself.
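
    On x86 that is because the lock prefix is itself a full barrier; the
    kernel's bit op is essentially the following idiom (a paraphrase for
    illustration, not the verbatim kernel source):

    static inline int test_and_set_bit(int nr, volatile unsigned long *addr)
    {
    	int oldbit;

    	/* lock makes the read-modify-write atomic *and* acts as a full
    	 * memory barrier; bts copies the old bit into CF, and sbb turns
    	 * CF into 0 or -1 */
    	__asm__ volatile ("lock; bts %2, %1\n\tsbb %0, %0"
    			  : "=r"(oldbit), "+m"(*addr)
    			  : "Ir"(nr)
    			  : "memory");
    	return oldbit;
    }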

    Core 2 is the outlier in having a noticeably faster "mfence" than atomic
    instructions (and judging by noises Intel makes, Nehalem will undo that
    outlier).

    Linus

  6. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty

    On Wed, 2 Apr 2008 12:44:05 -0700 (PDT)
    Linus Torvalds wrote:

    >
    >
    > On Wed, 2 Apr 2008, Mikulas Patocka wrote:
    > > + /*
    > > + * Make sure that the test for buffer_dirty(bh) is not reordered with
    > > + * previous modifications to the buffer data.
    > > + * -- mikulas
    > > + */
    > > + smp_mb();
    > > WARN_ON_ONCE(!buffer_uptodate(bh));
    > > if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))

    >
    > At that point, the better patch is to just *remove* the buffer_dirty()
    > test, and rely on the stronger ordering requirements of
    > test_set_buffer_dirty().
    >
    > The whole - and only - point of the buffer_dirty() check was to avoid the
    > more expensive test_set_buffer_dirty() call, but it's only more expensive
    > because of the barrier semantics. So if you add a barrier, the point goes
    > away and you should instead remove the optimization.


    But then the test-and-set of an already-set flag would newly cause the
    cacheline to be dirtied, requiring additional bus usage to write it back?

    The CPU's test-and-set-bit operation could of course optimise that away in
    this case. But does it?



  7. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



    On Wed, 2 Apr 2008, Andrew Morton wrote:
    >
    > But then the test-and-set of an already-set flag would newly cause the
    > cacheline to be dirtied, requiring additional bus usage to write it back?
    >
    > The CPU's test-and-set-bit operation could of course optimise that away in
    > this case. But does it?


    No, afaik no current x86 uarch will optimize away the write on a locked
    instruction if it turns out to be unnecessary.

    Can somebody find a timing reason to have the ugly code?

    Linus

  8. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty

    On Wed, 2 Apr 2008, Linus Torvalds wrote:

    > On Wed, 2 Apr 2008, Mikulas Patocka wrote:
    > >
    > > So you're right, the gain of mfence is so little that you can remove it
    > > and use only test_set_buffer_dirty.

    >
    > Well, I suspect that part of the issue is that quite often you end up
    > with *both* because the buffer wasn't already dirty from before.
    >
    > Re-dirtying a dirty buffer is pretty common for things like bitmap blocks
    > etc, so it's probably a worthy optimization if it has no cost, and on
    > Core2 I suspect your version is worth it, but it's not like it's going to
    > be necessarily a 99% kind of case. I suspect quite a lot of the
    > mark_buffer_dirty() calls are actually on clean buffers.


    The cost of doing the buffer lookup and update is much more than 20
    clocks, so the 20-clock mfence/lock difference can be forgotten.

    > (Of course, a valid argument is that if it was already dirty, we'll skip
    > the other expensive parts, so only the "already dirty" case is worth
    > optimizing for. Maybe true. There might also be cases where it means one
    > less dirty cacheline in memory.)


    That is another good example: when two CPUs are simultaneously updating
    the same buffer (i.e. two inodes in one block or simultaneous writes to a
    bitmap), then test_set_buffer_dirty will do cache ping-pong (that is much
    more expensive than that 20-100 cycles for an interlocked instruction),
    and smp_mb()+test_buffer_dirty will keep cache intact.

    So maybe there's still a reason for smp_mb().

    Mikulas


  9. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty

    On Wed, 2 Apr 2008, Linus Torvalds wrote:

    > On Wed, 2 Apr 2008, Linus Torvalds wrote:
    > >
    > > Core 2 is the outlier in having a noticeably faster "mfence" than atomic
    > > instructions

    >
    > Side note: mfence is probably faster than 6 cycles on Core 2. I've seen it
    > be basically zero-cost. Try adding a few memory ops around it - the timing
    > will probably change. Core 2 has this odd behaviour that it sometimes does
    > worse on the *really* trivial things that don't happen in real life.
    >
    > Linus


    When I added memory reads around mfence, the combined result was worse
    than the sum of reads and mfence --- mfence alone 8 ticks, mfence+4reads
    18 ticks.

    mfence+writes scales linearly, i.e. mfence + 4 writes does 12 ticks.

    mfence can fully overlap with register operations.

    Mikulas

  10. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



    On Thu, 3 Apr 2008, Mikulas Patocka wrote:
    >
    > When I added memory reads around mfence, the combined result was worse
    > than the sum of reads and mfence --- mfence alone 8 ticks, mfence+4reads
    > 18 ticks.


    Sorry, my test-program was buggy.

    > mfence+writes scales linearly, i.e. mfence + 4 writes does 12 ticks.


    I get different results now that I fixed my memop test. I get a loop with
    reads and writes taking 3 cycles without mfence, and 21 cycles with (read
    after, write before).

    Linus

  11. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



    On Wed, 2 Apr 2008, Linus Torvalds wrote:

    >
    >
    > On Wed, 2 Apr 2008, Andrew Morton wrote:
    > >
    > > But then the test-and-set of an already-set flag would newly cause the
    > > cacheline to be dirtied, requiring additional bus usage to write it back?
    > >
    > > The CPU's test-and-set-bit operation could of course optimise that away in
    > > this case. But does it?

    >
    > No, afaik no current x86 uarch will optimize away the write on a locked
    > instuction if it turns out to be unnecessary.


    No, it doesn't. Try this:

    #include <pthread.h>
    #include <string.h>

    void *pth(void *p)
    {
    	int i;
    	for (i = 0; i < 100000000; i++)
    		__asm__ volatile ("lock; btsl $0, %0"
    				  : "+m"(*(int *)p) : : "cc");
    	return NULL;
    }

    int args[2000];

    int main(void)
    {
    	pthread_t t1, t2, t3, t4;
    	memset(args, -1, sizeof args);
    	pthread_create(&t1, NULL, pth, &args[0]);
    	pthread_create(&t2, NULL, pth, &args[16]);
    	pthread_create(&t3, NULL, pth, &args[32]);
    	pthread_create(&t4, NULL, pth, &args[48]);
    	pthread_join(t1, NULL);
    	pthread_join(t2, NULL);
    	pthread_join(t3, NULL);
    	pthread_join(t4, NULL);
    	return 0;
    }

    --- when the &args[] elements share a conflicting cacheline (move the
    indices above closer together), I get 9 times slower execution. I tried
    it on two dual-core Core 2 Xeons.

    Mikulas

    > Can somebody find a timing reason to have the ugly code?
    >
    > Linus
    >


  12. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



    On Wed, 2 Apr 2008, Andrew Morton wrote:
    >
    > But then the test-and-set of an already-set flag would newly cause the
    > cacheline to be dirtied, requiring additional bus usage to write it back?


    Looking around a bit, I don't see any realistic case where this could
    possibly be the case and that is performance-sensitive.

    The VFS-level uses of mark_buffer_dirty() seem to all be coupled with
    other uses that clear or set other bits in the buffer status word, so the
    cacheline will always apparently be dirty.

    The low-level filesystems sometimes do it for things like block bitmap
    changes, and I could imagine that there it actually (a) is no longer in
    the cache and (b) the buffer really was dirty to start with, but in ext3
    for example, you'd end up in journal_dirty_metadata which spinlocks on the
    BH_State bit first etc etc.

    So the cacheline *will* be dirty, and this function doesn't seem like it
    could possibly ever show up in a real profile for any real load anyway, so
    it seems odd to try to optimize it this way.

    Linus

  13. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty

    On Wed, 2 Apr 2008 16:52:14 -0700 (PDT) Linus Torvalds wrote:

    >
    >
    > On Wed, 2 Apr 2008, Andrew Morton wrote:
    > >
    > > But then the test-and-set of an already-set flag would newly cause the
    > > cacheline to be dirtied, requiring additional bus usage to write it back?

    >
    > Looking around a bit, I don't see any realistic case where this could
    > possibly be the case and that is performance-sensitive.


    You sure? A pretty common case would be overwrite of an already-dirty
    page, and from a quick read the only places where we modify bh.b_state
    are the set_buffer_uptodate() and clear_buffer_new() in
    __block_commit_write(), both of which could/should be converted to use
    the same trick as __block_prepare_write(), which already does

    	if (!buffer_uptodate(bh))
    		set_buffer_uptodate(bh);


    What happened here was that back in about, umm, 2001 we discovered one
    or two code paths which, when optimised in this way, led to
    overall-measurable (not just oprofile-measurable) improvements. I don't
    recall which ones they were.

    So we then said oh-goody and sprinkled the same pattern all over the place
    on the off-chance. But I'm sure that over the ages we've let that
    optimisation rot (witness __block_commit_write() above).

    These were simpler times, and we didn't worry our little heads overly much
    about this ordering stuff.

    > The VFS-level uses of mark_buffer_dirty() seem to all be coupled with
    > other uses that clear or set other bits in the buffer status word, so the
    > cacheline will always apparently be dirty.


    As I say, I expect we could fix this if we want to. The key point here is
    that a page overwrite does not do lock_buffer(), so it should be possible
    to do the whole operation without modifying bh.b_state. If we wish to do
    that.

    > The low-level filesystems sometimes do it for things like block bitmap
    > changes, and I could imagine that there it actually (a) is no longer in
    > the cache and (b) the buffer really was dirty to start with, but in ext3
    > for example, you'd end up in journal_dirty_metadata which spinlocks on the
    > BH_State bit first etc etc.


    Yeah, ext3 is probably a lost cause. Anyone who cares about performance
    is using ext2 and a UPS.

    > So the cacheline *will* be dirty, and this function doesn't seem like it
    > could possibly ever show up in a real profile for any real load anyway, so
    > it seems odd to try to optimize it this way.
    >


    I don't think the world would end if we took it out. Particularly as not
    many people use ext2 and we already broke it.

    The trick will be to hunt down all the other places where we did a similar
    thing and check them.

  14. Re: [PATCH]: Fix SMP-reordering race in mark_buffer_dirty



    On Wed, 2 Apr 2008, Andrew Morton wrote:
    >
    > You sure? A pretty common case would be overwrite of an already-dirty page
    > and from a quick read the only place where we modify bh.b_state is the
    > set_buffer_uptodate() and clear_buffer_new() in __block_commit_write(),
    > both of which could/should be converted to use the same trick. Like
    > __block_prepare_write(), which already does
    >
    > if (!buffer_uptodate(bh))
    > set_buffer_uptodate(bh);


    Well, that particular optimization is safe, but it's safe because
    "uptodate" is sticky. Once it gets set, it is never reset.

    But in general, it's simply a bad idea to do

    	if (read)
    		atomic-read-modify-write;

    because it so often has races. This is pretty much exactly the same bug
    as we had not long ago with

    	if (!waitqueue_empty(..))
    		wake_up(..);

    and for very similar reasons - the "read" part is very fast, yes, but it's
    also by definition not actually doing all the careful things that the
    atomic operation (whether a CPU-atomic one, or a software-written atomic
    with a spinlock one) does.

    > What happened here was back in about, umm, 2001 we discovered one or two
    > code paths which when optimised in this way led to overall-measurably (not
    > just oprofile-measurably) improvements. I don't recall which ones they
    > were.
    >
    > So we then said oh-goody and sprinkled the same pattern all over the place
    > on the off-chance. But I'm sure that over the ages we've let that
    > optimisation rot (witness __block_commit_write() above).


    And the problem is that

    	if (!buffer_uptodate(bh))
    		set_buffer_uptodate(bh);

    really isn't "the same" optimization at all as

    	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh)) {
    		..

    and the latter is simply fundamentally different.
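
    To make the difference concrete, a simplified sketch (illustrative only,
    using the real flag names but not quoting actual kernel code):

    	/* Sticky flag: once set it is never cleared, so the unlocked
    	 * pre-test can only skip work another CPU already completed.
    	 * Worst case is a redundant set: always safe. */
    	if (!test_bit(BH_Uptodate, &bh->b_state))
    		set_bit(BH_Uptodate, &bh->b_state);

    	/* Clearable flag: another CPU may clear it between our unlocked
    	 * test and the moment we rely on it, so the pre-test can skip a
    	 * set that is still required: the lost-update race at the top of
    	 * this thread. */
    	if (!test_bit(BH_Dirty, &bh->b_state))
    		set_bit(BH_Dirty, &bh->b_state);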

    > As I say, I expect we could fix this if we want to. The key point here is
    > that a page overwrite does not do lock_buffer(), so it should be possible
    > to do the whole operation without modifying bh.b_state. If we wish to do
    > that.


    Well, if we really want to do this optimization, then I'd rather make it
    really obvious in the code what the smp_mb() is about, but also make
    sure that we don't unnecessarily do *both* the smp_mb() and the actual
    already-serialized bit operation.

    But I'd be even happier if we only did these kinds of things when we have
    real performance-data that they help.

    Linus
    ---
     fs/buffer.c |   15 ++++++++++++++-
     1 files changed, 14 insertions(+), 1 deletions(-)

    diff --git a/fs/buffer.c b/fs/buffer.c
    index 9819632..39ff144 100644
    --- a/fs/buffer.c
    +++ b/fs/buffer.c
    @@ -1181,7 +1181,20 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
     void mark_buffer_dirty(struct buffer_head *bh)
     {
     	WARN_ON_ONCE(!buffer_uptodate(bh));
    -	if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh))
    +
    +	/*
    +	 * Very *carefully* optimize the it-is-already-dirty case.
    +	 *
    +	 * Don't let the final "is it dirty" escape to before we
    +	 * perhaps modified the buffer.
    +	 */
    +	if (buffer_dirty(bh)) {
    +		smp_mb();
    +		if (buffer_dirty(bh))
    +			return;
    +	}
    +
    +	if (!test_set_buffer_dirty(bh))
     		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
     }

