[PATCH] Memory management livelock - Kernel


Thread: [PATCH] Memory management livelock

  1. Re: [PATCH] Memory management livelock

    On Friday 03 October 2008 14:17, Andrew Morton wrote:
    > On Fri, 3 Oct 2008 14:07:55 +1000 Nick Piggin
    > > Possibly a new mutex in the address_space?

    >
    > That's another, umm 24 bytes minimum in the address_space (and inode).
    > That's fairly ouch, which is why Mikulas did that hokey bit-based
    > thing.


    Well yeah, it would be a bit-based mutex in mapping->flags with
    hashed waitqueues. Like Mikulas's.
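
    (For illustration, a minimal sketch of such a bit lock built on the
    existing hashed bit-waitqueue primitives; the helper names here are made
    up, and a full implementation appears as a patch later in this thread:)

        /* action callback for wait_on_bit_lock: just sleep */
        static int sleep_on_bit(void *word)
        {
                schedule();
                return 0;
        }

        static void lock_mapping_bit(struct address_space *mapping, int bit)
        {
                /* atomically sets the bit, sleeping on a hashed waitqueue
                   (bit_waitqueue) whenever it is already set */
                wait_on_bit_lock(&mapping->flags, bit, sleep_on_bit,
                                 TASK_UNINTERRUPTIBLE);
        }

        static void unlock_mapping_bit(struct address_space *mapping, int bit)
        {
                smp_mb__before_clear_bit();
                clear_bit(bit, &mapping->flags);
                smp_mb__after_clear_bit();
                wake_up_bit(&mapping->flags, bit);
        }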


    > > Yeah... I went to break the sync/async cases into two, but it looks like
    > > it may not have been worthwhile. Just another branch might be the best
    > > way to go.

    >
    > Yup. Could add another do-this flag in the writeback_control, perhaps.
    > Or even a function pointer.


    Yeah... possibly we could just _always_ do the PAGECACHE_TAG_FSYNC thing
    if mode != WB_SYNC_NONE. I think if we had the infrastructure there to
    do it all, it should always be something we want to do for data integrity
    writeout.
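
    (For illustration, a sketch of what that tagging pass might look like.
    PAGECACHE_TAG_FSYNC is the proposed extra radix-tree tag, not an existing
    one, and a real version would drop the tree lock periodically instead of
    holding it across the whole pass:)

        static void tag_pages_for_fsync(struct address_space *mapping,
                                        pgoff_t start, pgoff_t end)
        {
                struct page *pages[16];
                unsigned int found, i;

                spin_lock_irq(&mapping->tree_lock);
                do {
                        /* record the pages that are dirty *now* ... */
                        found = radix_tree_gang_lookup_tag(
                                        &mapping->page_tree, (void **)pages,
                                        start, 16, PAGECACHE_TAG_DIRTY);
                        for (i = 0; i < found; i++) {
                                if (pages[i]->index > end)
                                        goto out;
                                /* ... so writeback can visit exactly this
                                   set and ignore later dirtiers */
                                radix_tree_tag_set(&mapping->page_tree,
                                                   pages[i]->index,
                                                   PAGECACHE_TAG_FSYNC);
                                start = pages[i]->index + 1;
                        }
                } while (found == 16 && start <= end);
        out:
                spin_unlock_irq(&mapping->tree_lock);
        }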

  2. Re: [PATCH] Memory management livelock

    > > *What* is, forever? Data integrity syncs should have pages operated on
    > > in-order, until we get to the end of the range. Circular writeback could
    > > go through again, possibly, but no more than once.

    >
    > OK, I have been able to reproduce it somewhat. It is not a livelock,
    > but what is happening is that direct IO read basically does an fsync
    > on the file before performing the IO. The fsync gets stuck behind the
    > dd that is dirtying the pages, and ends up following behind it and
    > doing all its IO for it.
    >
    > The following patch avoids the issue for direct IO, by using the range
    > syncs rather than trying to sync the whole file.
    >
    > The underlying problem I guess is unchanged. Is it really a problem,
    > though? The way I'd love to solve it is actually by adding another bit
    > or two to the pagecache radix tree, that can be used to transiently tag
    > the tree for future operations. That way we could record the dirty and
    > writeback pages up front, and then only bother with operating on them.
    >
    > That's *if* it really is a problem. I don't have much pity for someone
    > doing buffered IO and direct IO to the same pages of the same file


    LVM does (that is where the bug was discovered). Basically, it scans all
    the block devices with direct IO and if someone else does buffered IO on
    any device simultaneously, it locks up.

    That fsync-vs-write livelock is quite improbable (why would some
    application do it?) --- although it could be used as a DoS --- you get an
    unkillable process.

    But there is another possible real-world problem --- sync-vs-write ---
    i.e. admin plugs in two disks and copies data from one to the other.
    Meanwhile, some unrelated server process executes sync(). The server goes
    into coma until the copy finishes.

    Mikulas

  3. Re: [PATCH] Memory management livelock

    On Thu, 2 Oct 2008, Andrew Morton wrote:

    > On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin wrote:
    >
    > > > I expect there's no solution which avoids blocking the writers at some
    > > > stage.

    > >
    > > See my other email. Something roughly like this would do the trick
    > > (hey, it actually boots and runs and does fix the problem too).

    >
    > It needs exclusion to protect all those temp tags. Is do_fsync()'s
    > i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
    > the callers of this stuff are running under that lock.


    The filemap_fdatawrite and filemap_fdatawait calls in fsync() aren't
    really made under i_mutex (see do_fsync).

    So the possible solutions are:

    1. Add to struct page the jiffies values when the page was dirtied and written back
    + no impact on locking and concurrency
    - increases the structure by 8 bytes

    2. Stop the writers when the starvation happens (what I did)
    + doesn't do any locking if the livelock doesn't happen
    - locks writers when the livelock happens (I think it's not really serious
    --- because very few people complained about the livelock, very few people
    will see performance degradation from blocking the writers).

    3. Add another bit to radix tree (what Nick did)
    + doesn't ever block writers
    - unconditionally takes the lock on the fsync path and serializes
    concurrent syncs/fsyncs. Probably low overhead too ... or I don't know:
    is there any realistic situation where several processes execute sync()
    in parallel and the user would see degradation if those syncs were
    serialized?

    Mikulas

  4. Re: [PATCH] Memory management livelock

    On Friday 03 October 2008 21:43, Mikulas Patocka wrote:
    > On Thu, 2 Oct 2008, Andrew Morton wrote:
    > > On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin wrote:
    > > > > I expect there's no solution which avoids blocking the writers at
    > > > > some stage.
    > > >
    > > > See my other email. Something roughly like this would do the trick
    > > > (hey, it actually boots and runs and does fix the problem too).

    > >
    > > It needs exclusion to protect all those temp tags. Is do_fsync()'s
    > > i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
    > > the callers of this stuff are running under that lock.

    >
    > The filemap_fdatawrite and filemap_fdatawait calls in fsync() aren't
    > really made under i_mutex (see do_fsync).
    >
    > So the possible solutions are:
    >
    > 1. Add to struct page the jiffies values when the page was dirtied and written back
    > + no impact on locking and concurrency
    > - increases the structure by 8 bytes


    This one is not practical.


    > 2. Stop the writers when the starvation happens (what I did)
    > + doesn't do any locking if the livelock doesn't happen
    > - locks writers when the livelock happens (I think it's not really serious
    > --- because very few people complained about the livelock, very few people
    > will see performance degradation from blocking the writers).


    Maybe that is because not much software actually does sequential writes
    to a massive file or block device while trying to fsync it as well? I
    don't know. You could still have cases where fsync takes much longer than
    expected but it is still not long enough for a user to report it as a
    "livelock" bug.


    > 3. Add another bit to radix tree (what Nick did)
    > + doesn't ever block writers
    > - unconditionally takes the lock on the fsync path and serializes
    > concurrent syncs/fsyncs. Probably low overhead too ... or I don't know:
    > is there any realistic situation where several processes execute sync()
    > in parallel and the user would see degradation if those syncs were
    > serialized?



  5. Re: [PATCH] Memory management livelock

    On Friday 03 October 2008 21:26, Mikulas Patocka wrote:
    > > > *What* is, forever? Data integrity syncs should have pages operated on
    > > > in-order, until we get to the end of the range. Circular writeback
    > > > could go through again, possibly, but no more than once.

    > >
    > > OK, I have been able to reproduce it somewhat. It is not a livelock,
    > > but what is happening is that direct IO read basically does an fsync
    > > on the file before performing the IO. The fsync gets stuck behind the
    > > dd that is dirtying the pages, and ends up following behind it and
    > > doing all its IO for it.
    > >
    > > The following patch avoids the issue for direct IO, by using the range
    > > syncs rather than trying to sync the whole file.
    > >
    > > The underlying problem I guess is unchanged. Is it really a problem,
    > > though? The way I'd love to solve it is actually by adding another bit
    > > or two to the pagecache radix tree, that can be used to transiently tag
    > > the tree for future operations. That way we could record the dirty and
    > > writeback pages up front, and then only bother with operating on them.
    > >
    > > That's *if* it really is a problem. I don't have much pity for someone
    > > doing buffered IO and direct IO to the same pages of the same file

    >
    > LVM does (that is where the bug was discovered). Basically, it scans all
    > the block devices with direct IO and if someone else does buffered IO on
    > any device simultaneously, it locks up.


    Scans all block devices with direct IO? Hmm, why, I wonder? Should
    really consider using buffered (posix_fadvise to readahead/dropbehind).


    > That fsync-vs-write livelock is quite improbable (why would some
    > application do it?) --- although it could be used as a DoS --- you get
    > an unkillable process.


    I'm not sure.

  6. Re: [PATCH] Memory management livelock

    > > So the possible solutions are:
    > >
    > > 1. Add to struct page the jiffies values when the page was dirtied and written back
    > > + no impact on locking and concurrency
    > > - increases the structure by 8 bytes

    >
    > This one is not practical.
    >
    >
    > > 2. Stop the writers when the starvation happens (what I did)
    > > + doesn't do any locking if the livelock doesn't happen
    > > - locks writers when the livelock happens (I think it's not really serious
    > > --- because very few people complained about the livelock, very few people
    > > will see performance degradation from blocking the writers).

    >
    > Maybe that is because not much software actually does sequential writes
    > to a massive file or block device while trying to fsync it as well? I
    > don't know. You could still have cases where fsync takes much longer
    > than expected but it is still not long enough for a user to report it
    > as a "livelock" bug.


    At most twice the time it would normally take (one pass over the
    writeback queue until it detects the livelock, and another pass until it
    drains all the new pages that were created during the first pass).

    While with solution (3) it would take only one pass over the whole
    writeback queue.

    Mikulas

    > > 3. Add another bit to radix tree (what Nick did)
    > > + doesn't ever block writers
    > > - unconditionally takes the lock on the fsync path and serializes
    > > concurrent syncs/fsyncs. Probably low overhead too ... or I don't
    > > know: is there any realistic situation where several processes
    > > execute sync() in parallel and the user would see degradation if
    > > those syncs were serialized?


  7. Re: [PATCH] Memory management livelock

    > > LVM does (that is where the bug was discovered). Basically, it scans all
    > > the block devices with direct IO and if someone else does buffered IO on
    > > any device simultaneously, it locks up.

    >
    > Scans all block devices with direct IO? Hmm, why, I wonder? Should
    > really consider using buffered (posix_fadvise to readahead/dropbehind).


    LVM must not allocate any memory when doing IO because it suspends the
    block device and memory allocation could trigger writeback on the
    suspended device and deadlock.

    So it preallocates heap and stack, mlockall()s itself and does direct IO.
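
    (Roughly, in userspace --- an illustrative sketch, not LVM's actual code;
    the 4096-byte alignment and read size are assumptions:)

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <unistd.h>

        /* Read one label-sized block from dev, bypassing the page cache
           and without faulting during the read. */
        static int scan_device(const char *dev)
        {
                void *buf;
                int fd, ok;

                /* pin current and future mappings: no page faults later */
                if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                        return -1;

                /* O_DIRECT wants sector-aligned buffers */
                if (posix_memalign(&buf, 4096, 4096) != 0)
                        return -1;

                fd = open(dev, O_RDONLY | O_DIRECT);
                if (fd < 0) {
                        free(buf);
                        return -1;
                }

                ok = pread(fd, buf, 4096, 0) == 4096;
                close(fd);
                free(buf);
                return ok ? 0 : -1;
        }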

    Well, there are still two allocations on the direct IO path --- one in
    __blockdev_direct_IO and the other in dio_bio_alloc. If you have an idea
    how to get rid of them (especially the kzalloc(sizeof(*dio), GFP_KERNEL);
    the bio allocation would be quite easy to avoid), it would be beneficial.

    Mikulas

  8. Re: [PATCH] Memory management livelock

    On Fri, Oct 03, 2008 at 10:31:03PM +1000, Nick Piggin wrote:
    > On Friday 03 October 2008 21:26, Mikulas Patocka wrote:
    > > LVM does (that is where the bug was discovered). Basically, it scans all
    > > the block devices with direct IO and if someone else does buffered IO on
    > > any device simultaneously, it locks up.

    > Scans all block devices with direct IO? Hmm, why, I wonder? Should
    > really consider using buffered (posix_fadvise to readahead/dropbehind).


    Scan in the sense of reading a few sectors from each potential LVM
    device to check whether it contains an LVM label.

    Alasdair
    --
    agk@redhat.com

  9. Re: [PATCH] Memory management livelock

    On Fri, Oct 03, 2008 at 09:50:17AM -0400, Mikulas Patocka wrote:
    > > > LVM does (that is where the bug was discovered). Basically, it scans all
    > > > the block devices with direct IO and if someone else does buffered IO on
    > > > any device simultaneously, it locks up.

    > > Scans all block devices with direct IO? Hmm, why, I wonder? Should
    > > really consider using buffered (posix_fadvise to readahead/dropbehind).

    > LVM must not allocate any memory when doing IO because it suspends the
    > block device and memory allocation could trigger writeback on the
    > suspended device and deadlock.
    > So it preallocates heap and stack, mlockall()s itself and does direct IO.


    True, but unrelated to the scanning, which LVM performs *prior* to
    entering such a state.

    We use direct IO while scanning because it's essential all nodes in a
    cluster see the same updated version of the data after any node updated
    it.

    Alasdair
    --
    agk@redhat.com

  10. application syncing options (was Re: [PATCH] Memory management livelock)

    On Fri, 3 Oct 2008, Nick Piggin wrote:

    >> *What* is, forever? Data integrity syncs should have pages operated on
    >> in-order, until we get to the end of the range. Circular writeback could
    >> go through again, possibly, but no more than once.

    >
    > OK, I have been able to reproduce it somewhat. It is not a livelock,
    > but what is happening is that direct IO read basically does an fsync
    > on the file before performing the IO. The fsync gets stuck behind the
    > dd that is dirtying the pages, and ends up following behind it and
    > doing all its IO for it.
    >
    > The following patch avoids the issue for direct IO, by using the range
    > syncs rather than trying to sync the whole file.
    >
    > The underlying problem I guess is unchanged. Is it really a problem,
    > though? The way I'd love to solve it is actually by adding another bit
    > or two to the pagecache radix tree, that can be used to transiently tag
    > the tree for future operations. That way we could record the dirty and
    > writeback pages up front, and then only bother with operating on them.
    >
    > That's *if* it really is a problem. I don't have much pity for someone
    > doing buffered IO and direct IO to the same pages of the same file


    I've seen lots of discussions here about different options in syncing. In
    this case a fix is to do an fsync of a range. I've also seen discussions
    of how the kernel filesystem code can do ordered writes without having to
    wait for them, with the use of barriers. Is this capability exported to
    userspace? If so, could you point me at documentation for it?

    David Lang

  11. [PATCH 1/3] bit mutexes

    A bit mutex implementation.

    A bit mutex is a space-efficient mutex that consumes exactly one bit in
    the appropriate structure (if mutex debugging is turned off). The bit
    mutex is implemented as a bit in an unsigned long field. Other bits in
    the field may be used for other purposes, if the test/set/clear_bit
    functions are used to manipulate them. There is no wait queue in the
    structure containing the bit mutex; hashed wait queues for waiting on
    bits are used instead.

    An additional structure, struct bit_mutex, is needed; it contains lock
    debugging & dependency information. When the kernel is compiled without
    lock debugging, this structure is empty.

    Bit mutexes are used with the functions
    bit_mutex_init
    bit_mutex_lock
    bit_mutex_unlock
    bit_mutex_is_locked
    These functions take 3 arguments: a pointer to the unsigned long field,
    the bit position, and a pointer to struct bit_mutex.
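
    (A usage sketch against the API above; struct foo and the bit position
    are illustrative, not from the patch:)

        struct foo {
                unsigned long flags;            /* bit 0 carries the lock */
                struct bit_mutex flags_mutex;   /* debug-only state */
        };

        #define FOO_LOCK_BIT 0

        static void foo_init(struct foo *f)
        {
                bit_mutex_init(&f->flags, FOO_LOCK_BIT, &f->flags_mutex);
        }

        static void foo_work(struct foo *f)
        {
                bit_mutex_lock(&f->flags, FOO_LOCK_BIT, &f->flags_mutex);
                /* other bits of f->flags may still be used with
                   test_bit/set_bit/clear_bit while the mutex is held */
                bit_mutex_unlock(&f->flags, FOO_LOCK_BIT, &f->flags_mutex);
        }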

    Signed-off-by: Mikulas Patocka

    ---
    include/linux/bit-mutex.h | 98 ++++++++++++++++++++++++++++++++++++++++++++++
    include/linux/wait.h | 8 +++
    kernel/Makefile | 2
    kernel/bit-mutex-debug.c | 55 +++++++++++++++++++++++++
    kernel/wait.c | 7 +++
    5 files changed, 169 insertions(+), 1 deletion(-)

    Index: linux-2.6.27-rc8-devel/include/linux/bit-mutex.h
    ===================================================================
    --- /dev/null 1970-01-01 00:00:00.000000000 +0000
    +++ linux-2.6.27-rc8-devel/include/linux/bit-mutex.h 2008-10-05 19:58:30.000000000 +0200
    @@ -0,0 +1,98 @@
    +#ifndef __LINUX_BIT_MUTEX_H
    +#define __LINUX_BIT_MUTEX_H
    +
    +#include <linux/wait.h>
    +#include <linux/bitops.h>
    +#include <linux/lockdep.h>
    +#include <linux/thread_info.h>
    +
    +struct bit_mutex {
    +#ifdef CONFIG_DEBUG_MUTEXES
    + unsigned long *field;
    + int bit;
    + struct thread_info *owner;
    + const char *name;
    + void *magic;
    +#endif
    +#ifdef CONFIG_DEBUG_LOCK_ALLOC
    + struct lockdep_map dep_map;
    +#endif
    +};
    +
    +#ifndef CONFIG_DEBUG_MUTEXES
    +
    +static inline void bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx)
    +{
    + clear_bit(bit, field);
    + smp_mb__after_clear_bit();
    +}
    +
    +static inline void bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx)
    +{
    + wait_on_bit_lock(field, bit, wait_action_schedule, TASK_UNINTERRUPTIBLE);
    +}
    +
    +static inline void bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx)
    +{
    + smp_mb__before_clear_bit();
    + clear_bit(bit, field);
    + smp_mb__after_clear_bit();
    + wake_up_bit(field, bit);
    +}
    +
    +static inline int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx)
    +{
    + return test_bit(bit, field);
    +}
    +
    +#define __DEBUG_BIT_MUTEX_INITIALIZER(field, bit, mtx)
    +
    +#else
    +
    +void __bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx, const char *name, struct lock_class_key *key);
    +
    +#define bit_mutex_init(field, bit, mtx) \
    +do { \
    + static struct lock_class_key __key; \
    + __bit_mutex_init(field, bit, mtx, #mtx, &__key); \
    +} while (0)
    +
    +void __bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass);
    +
    +#define bit_mutex_lock(field, bit, mtx) \
    + __bit_mutex_lock(field, bit, mtx, 0)
    +
    +void __bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass);
    +
    +#define bit_mutex_unlock(field, bit, mtx) \
    + __bit_mutex_unlock(field, bit, mtx, 0)
    +
    +int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx);
    +
    +#define __DEBUG_BIT_MUTEX_INITIALIZER(field_, bit_, mtx) \
    + .magic = &(mtx), \
    + .field = (field_), \
    + .bit = (bit_)
    +
    +#endif
    +
    +
    +#ifdef CONFIG_DEBUG_LOCK_ALLOC
    +
    +#define __DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx) \
    + , .dep_map = { .name = #mtx }
    +
    +#else
    +
    +#define __DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx)
    +
    +#endif
    +
    +
    +#define __BIT_MUTEX_INITIALIZER(field, bit, mtx) \
    + { \
    + __DEBUG_BIT_MUTEX_INITIALIZER(field, bit, mtx) \
    + __DEP_MAP_BIT_MUTEX_INITIALIZER(field, bit, mtx)\
    + }
    +
    +#endif
    Index: linux-2.6.27-rc8-devel/include/linux/wait.h
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/include/linux/wait.h 2008-10-04 13:40:46.000000000 +0200
    +++ linux-2.6.27-rc8-devel/include/linux/wait.h 2008-10-04 13:41:21.000000000 +0200
    @@ -513,7 +513,13 @@ static inline int wait_on_bit_lock(void
    return 0;
    return out_of_line_wait_on_bit_lock(word, bit, action, mode);
    }
    -
    +
    +/**
    + * wait_action_schedule - this function can be passed to wait_on_bit or
    + * wait_on_bit_lock and it will just call schedule().
    + */
    +int wait_action_schedule(void *);
    +
    #endif /* __KERNEL__ */

    #endif
    Index: linux-2.6.27-rc8-devel/kernel/wait.c
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/kernel/wait.c 2008-10-04 13:37:24.000000000 +0200
    +++ linux-2.6.27-rc8-devel/kernel/wait.c 2008-10-04 13:38:21.000000000 +0200
    @@ -251,3 +251,10 @@ wait_queue_head_t *bit_waitqueue(void *w
    return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
    }
    EXPORT_SYMBOL(bit_waitqueue);
    +
    +int wait_action_schedule(void *word)
    +{
    + schedule();
    + return 0;
    +}
    +EXPORT_SYMBOL(wait_action_schedule);
    Index: linux-2.6.27-rc8-devel/kernel/Makefile
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/kernel/Makefile 2008-10-05 14:03:24.000000000 +0200
    +++ linux-2.6.27-rc8-devel/kernel/Makefile 2008-10-05 14:11:25.000000000 +0200
    @@ -17,6 +17,7 @@ ifdef CONFIG_FTRACE
    # Do not trace debug files and internal ftrace files
    CFLAGS_REMOVE_lockdep.o = -pg
    CFLAGS_REMOVE_lockdep_proc.o = -pg
    +CFLAGS_REMOVE_bit-mutex-debug.o = -pg
    CFLAGS_REMOVE_mutex-debug.o = -pg
    CFLAGS_REMOVE_rtmutex-debug.o = -pg
    CFLAGS_REMOVE_cgroup-debug.o = -pg
    @@ -29,6 +30,7 @@ obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sy
    obj-$(CONFIG_STACKTRACE) += stacktrace.o
    obj-y += time/
    obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
    +obj-$(CONFIG_DEBUG_MUTEXES) += bit-mutex-debug.o
    obj-$(CONFIG_LOCKDEP) += lockdep.o
    ifeq ($(CONFIG_PROC_FS),y)
    obj-$(CONFIG_LOCKDEP) += lockdep_proc.o
    Index: linux-2.6.27-rc8-devel/kernel/bit-mutex-debug.c
    ===================================================================
    --- /dev/null 1970-01-01 00:00:00.000000000 +0000
    +++ linux-2.6.27-rc8-devel/kernel/bit-mutex-debug.c 2008-10-05 19:12:06.000000000 +0200
    @@ -0,0 +1,55 @@
    +#include <linux/bit-mutex.h>
    +#include <linux/module.h>
    +#include <linux/sched.h>
    +#include <linux/debug_locks.h>
    +
    +void __bit_mutex_init(unsigned long *field, int bit, struct bit_mutex *mtx, const char *name, struct lock_class_key *key)
    +{
    +#ifdef CONFIG_DEBUG_LOCK_ALLOC
    + debug_check_no_locks_freed((void *)mtx, sizeof(*mtx));
    + lockdep_init_map(&mtx->dep_map, name, key, 0);
    +#endif
    + mtx->field = field;
    + mtx->bit = bit;
    + mtx->owner = NULL;
    + mtx->magic = mtx;
    + clear_bit(bit, field);
    + smp_mb__after_clear_bit();
    +}
    +EXPORT_SYMBOL(__bit_mutex_init);
    +
    +void __bit_mutex_lock(unsigned long *field, int bit, struct bit_mutex *mtx, int subclass)
    +{
    + DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
    + DEBUG_LOCKS_WARN_ON(field != mtx->field);
    + DEBUG_LOCKS_WARN_ON(bit != mtx->bit);
    + mutex_acquire(&mtx->dep_map, subclass, 0, _RET_IP_);
    + wait_on_bit_lock(field, bit, wait_action_schedule, TASK_UNINTERRUPTIBLE);
    + lock_acquired(&mtx->dep_map);
    + mtx->owner = current_thread_info();
    +}
    +EXPORT_SYMBOL(__bit_mutex_lock);
    +
    +void __bit_mutex_unlock(unsigned long *field, int bit, struct bit_mutex *mtx, int nested)
    +{
    + DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
    + DEBUG_LOCKS_WARN_ON(mtx->owner != current_thread_info());
    + DEBUG_LOCKS_WARN_ON(mtx->field != field);
    + DEBUG_LOCKS_WARN_ON(mtx->bit != bit);
    + mtx->owner = NULL;
    + mutex_release(&mtx->dep_map, nested, _RET_IP_);
    + smp_mb__before_clear_bit();
    + clear_bit(bit, field);
    + smp_mb__after_clear_bit();
    + wake_up_bit(field, bit);
    +}
    +EXPORT_SYMBOL(__bit_mutex_unlock);
    +
    +int bit_mutex_is_locked(unsigned long *field, int bit, struct bit_mutex *mtx)
    +{
    + DEBUG_LOCKS_WARN_ON(mtx->magic != mtx);
    + DEBUG_LOCKS_WARN_ON(field != mtx->field);
    + DEBUG_LOCKS_WARN_ON(bit != mtx->bit);
    + return test_bit(bit, field);
    +}
    +EXPORT_SYMBOL(bit_mutex_is_locked);

  12. [PATCH 3/3] Fix fsync-vs-write misbehavior

    Fix a violation of sync()/fsync() semantics. The previous code walked up
    to mapping->nrpages * 2 pages. Because pages can be created while
    __filemap_fdatawrite_range is in progress, this could lead to
    misbehavior. Example: there are two pages in the address space, with
    indices 4 and 5. Both are dirty. Someone calls
    __filemap_fdatawrite_range, which sets .nr_to_write = 4. Meanwhile, some
    other process creates dirty pages 0, 1, 2, 3.
    __filemap_fdatawrite_range writes pages 0, 1, 2, 3, finds out that it
    reached the limit, and exits.

    Result: pages that were dirty before __filemap_fdatawrite_range was
    invoked were not written.

    With starvation protection from the previous patch, this
    mapping->nrpages * 2 logic is no longer needed.

    Signed-off-by: Mikulas Patocka

    ---
    mm/filemap.c | 7 ++++++-
    1 file changed, 6 insertions(+), 1 deletion(-)

    Index: linux-2.6.27-rc7-devel/mm/filemap.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 14:47:01.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 15:01:23.000000000 +0200
    @@ -202,6 +202,11 @@ static int sync_page_killable(void *word
    * opposed to a regular memory cleansing writeback. The difference between
    * these two operations is that if a dirty page/buffer is encountered, it must
    * be waited upon, and not just skipped over.
    + *
    + * Because new dirty pages can be created while this is executing, the
    + * mapping->nrpages * 2 condition is unsafe. If we are doing a data
    + * integrity write, we must write all the pages. The AS_STARVATION bit
    + * will eventually prevent the creation of more dirty pages, to avoid
    + * starvation.
    */
    int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
    loff_t end, int sync_mode)
    @@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
    int ret;
    struct writeback_control wbc = {
    .sync_mode = sync_mode,
    - .nr_to_write = mapping->nrpages * 2,
    + .nr_to_write = sync_mode == WB_SYNC_NONE ? mapping->nrpages * 2 : LONG_MAX,
    .range_start = start,
    .range_end = end,
    };

  13. [PATCH 2/3] Fix fsync livelock

    Fix starvation in memory management.

    The bug happens when one process is doing sequential buffered writes to
    a block device (or file) and another process is attempting to execute
    sync(), fsync() or direct-IO on that device (or file). This syncing
    process will wait indefinitely, until the first writing process
    finishes.

    For example, run these two commands:
    dd if=/dev/zero of=/dev/sda1 bs=65536 &
    dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct

    The bug is caused by sequential walking of address space in
    write_cache_pages and wait_on_page_writeback_range: if some other
    process is constantly making dirty and writeback pages while these
    functions run, the functions will wait on every new page, resulting in
    an indefinite wait.

    To fix the starvation, I declared a bit mutex starvation_barrier in
    struct address_space. The bit AS_STARVATION_BARRIER in
    address_space->flags is used for actual locking. When the mutex is
    taken, anyone making dirty pages on that address space should stop. The
    functions that walk address space sequentially first estimate a number
    of pages to process. If they process more than this amount (plus some
    small delta), it means that someone is making dirty pages under them ---
    they take the mutex and hold it until they finish. When the mutex is
    taken, the function balance_dirty_pages will wait and not allow more
    dirty pages to be made.

    The mutex is not really used as a mutex; it does not protect any
    critical section. It is used as a barrier that blocks new dirty pages
    from being created. I use a mutex and not a wait queue to make sure that
    the starvation can't happen the other way around (with a wait queue,
    excessive calls to write_cache_pages and wait_on_page_writeback_range
    could block balance_dirty_pages forever; with a mutex it is guaranteed
    that every process eventually makes progress).

    The essential property of this patch is that if the starvation doesn't
    happen, no additional locks are taken and no atomic operations are
    performed. So the patch shouldn't damage performance.

    The indefinite starvation was observed in write_cache_pages and
    wait_on_page_writeback_range. Another possibility where it could happen
    is invalidate_inode_pages2_range.

    There are two more functions that walk all the pages in address space,
    but I think they don't need this starvation protection:
    truncate_inode_pages_range: it is called with i_mutex locked, so no new
    pages are created under it.
    __invalidate_mapping_pages: it skips locked, dirty and writeback pages,
    so there's no point in protecting the function against starving on them.

    Signed-off-by: Mikulas Patocka

    ---
    fs/inode.c | 1 +
    include/linux/fs.h | 3 +++
    include/linux/pagemap.h | 14 ++++++++++++++
    mm/filemap.c | 5 +++++
    mm/page-writeback.c | 18 +++++++++++++++++-
    mm/swap_state.c | 1 +
    mm/truncate.c | 10 +++++++++-
    7 files changed, 50 insertions(+), 2 deletions(-)

    Index: linux-2.6.27-rc8-devel/fs/inode.c
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/fs/inode.c 2008-10-04 16:47:55.000000000 +0200
    +++ linux-2.6.27-rc8-devel/fs/inode.c 2008-10-04 16:48:04.000000000 +0200
    @@ -216,6 +216,7 @@ void inode_init_once(struct inode *inode
    spin_lock_init(&inode->i_data.private_lock);
    INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
    INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
    + bit_mutex_init(&inode->i_data.flags, AS_STARVATION_BARRIER, &inode->i_data.starvation_barrier);
    i_size_ordered_init(inode);
    #ifdef CONFIG_INOTIFY
    INIT_LIST_HEAD(&inode->inotify_watches);
    Index: linux-2.6.27-rc8-devel/include/linux/fs.h
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/include/linux/fs.h 2008-10-04 16:47:54.000000000 +0200
    +++ linux-2.6.27-rc8-devel/include/linux/fs.h 2008-10-04 16:49:54.000000000 +0200
    @@ -289,6 +289,7 @@ extern int dir_notify_enable;
    #include
    #include
    #include
    +#include <linux/bit-mutex.h>
    #include
    #include

    @@ -539,6 +540,8 @@ struct address_space {
    spinlock_t private_lock; /* for use by the address_space */
    struct list_head private_list; /* ditto */
    struct address_space *assoc_mapping; /* ditto */
    +
    + struct bit_mutex starvation_barrier;/* taken when fsync starves */
    } __attribute__((aligned(sizeof(long))));
    /*
    * On most architectures that alignment is already the case; but
    Index: linux-2.6.27-rc8-devel/include/linux/pagemap.h
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/include/linux/pagemap.h 2008-10-04 16:47:54.000000000 +0200
    +++ linux-2.6.27-rc8-devel/include/linux/pagemap.h 2008-10-04 16:48:04.000000000 +0200
    @@ -21,6 +21,7 @@
    #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
    #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
    #define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
    +#define AS_STARVATION_BARRIER (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */

    static inline void mapping_set_error(struct address_space *mapping, int error)
    {
    @@ -424,4 +425,17 @@ static inline int add_to_page_cache(stru
    return error;
    }

    +#define starvation_protection_head(n) \
    + long pages_to_process = (n); \
    + pages_to_process += (pages_to_process >> 3) + 16;
    +
    +#define starvation_protection_step(mapping) \
    + if (pages_to_process >= 0) \
    + if (!pages_to_process--) \
    + bit_mutex_lock(&(mapping)->flags, AS_STARVATION_BARRIER, &(mapping)->starvation_barrier);
    +
    +#define starvation_protection_end(mapping) \
    + if (pages_to_process < 0) \
    + bit_mutex_unlock(&(mapping)->flags, AS_STARVATION_BARRIER, &(mapping)->starvation_barrier);
    +
    #endif /* _LINUX_PAGEMAP_H */
    Index: linux-2.6.27-rc8-devel/mm/filemap.c
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/mm/filemap.c 2008-10-04 16:47:55.000000000 +0200
    +++ linux-2.6.27-rc8-devel/mm/filemap.c 2008-10-04 16:48:04.000000000 +0200
    @@ -269,6 +269,7 @@ int wait_on_page_writeback_range(struct
    int nr_pages;
    int ret = 0;
    pgoff_t index;
    + starvation_protection_head(bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK));

    if (end < start)
    return 0;
    @@ -288,6 +289,8 @@ int wait_on_page_writeback_range(struct
    if (page->index > end)
    continue;

    + starvation_protection_step(mapping);
    +
    wait_on_page_writeback(page);
    if (PageError(page))
    ret = -EIO;
    @@ -296,6 +299,8 @@ int wait_on_page_writeback_range(struct
    cond_resched();
    }

    + starvation_protection_end(mapping);
    +
    /* Check for outstanding write errors */
    if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
    ret = -ENOSPC;
    Index: linux-2.6.27-rc8-devel/mm/page-writeback.c
    ================================================== =================
    --- linux-2.6.27-rc8-devel.orig/mm/page-writeback.c 2008-10-04 16:47:55.000000000 +0200
    +++ linux-2.6.27-rc8-devel/mm/page-writeback.c 2008-10-04 16:48:04.000000000 +0200
    @@ -435,6 +435,14 @@ static void balance_dirty_pages(struct a

    struct backing_dev_info *bdi = mapping->backing_dev_info;

    + if (unlikely(bit_mutex_is_locked(&mapping->flags, AS_STARVATION_BARRIER,
    + &mapping->starvation_barrier))) {
    + bit_mutex_lock(&mapping->flags, AS_STARVATION_BARRIER,
    + &mapping->starvation_barrier);
    + bit_mutex_unlock(&mapping->flags, AS_STARVATION_BARRIER,
    + &mapping->starvation_barrier);
    + }
    +
    for (;;) {
    struct writeback_control wbc = {
    .bdi = bdi,
    @@ -876,6 +884,7 @@ int write_cache_pages(struct address_spa
    pgoff_t end; /* Inclusive */
    int scanned = 0;
    int range_whole = 0;
    + starvation_protection_head(bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE));

    if (wbc->nonblocking && bdi_write_congested(bdi)) {
    wbc->encountered_congestion = 1;
    @@ -902,7 +911,11 @@ retry:

    scanned = 1;
    for (i = 0; i < nr_pages; i++) {
    - struct page *page = pvec.pages[i];
    + struct page *page;
    +
    + starvation_protection_step(mapping);
    +
    + page = pvec.pages[i];

    /*
    * At this point we hold neither mapping->tree_lock nor
    @@ -963,6 +976,9 @@ retry:

    if (wbc->range_cont)
    wbc->range_start = index << PAGE_CACHE_SHIFT;
    +
    + starvation_protection_end(mapping);
    +
    return ret;
    }
    EXPORT_SYMBOL(write_cache_pages);
    Index: linux-2.6.27-rc8-devel/mm/swap_state.c
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/mm/swap_state.c 2008-10-04 16:47:55.000000000 +0200
    +++ linux-2.6.27-rc8-devel/mm/swap_state.c 2008-10-04 17:14:03.000000000 +0200
    @@ -43,6 +43,7 @@ struct address_space swapper_space = {
    .a_ops = &swap_aops,
    .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
    .backing_dev_info = &swap_backing_dev_info,
    + .starvation_barrier = __BIT_MUTEX_INITIALIZER(&swapper_space.flags, AS_STARVATION_BARRIER, swapper_space.starvation_barrier),
    };

    #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
    Index: linux-2.6.27-rc8-devel/mm/truncate.c
    ===================================================================
    --- linux-2.6.27-rc8-devel.orig/mm/truncate.c 2008-10-04 16:47:55.000000000 +0200
    +++ linux-2.6.27-rc8-devel/mm/truncate.c 2008-10-04 16:48:04.000000000 +0200
    @@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct
    int ret2 = 0;
    int did_range_unmap = 0;
    int wrapped = 0;
    + starvation_protection_head(mapping->nrpages);

    pagevec_init(&pvec, 0);
    next = start;
    @@ -399,9 +400,13 @@ int invalidate_inode_pages2_range(struct
    pagevec_lookup(&pvec, mapping, next,
    min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
    for (i = 0; i < pagevec_count(&pvec); i++) {
    - struct page *page = pvec.pages[i];
    + struct page *page;
    pgoff_t page_index;

    + starvation_protection_step(mapping);
    +
    + page = pvec.pages[i];
    +
    lock_page(page);
    if (page->mapping != mapping) {
    unlock_page(page);
    @@ -449,6 +454,9 @@ int invalidate_inode_pages2_range(struct
    pagevec_release(&pvec);
    cond_resched();
    }
    +
    + starvation_protection_end(mapping);
    +
    return ret;
    }
    EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);


  14. RFC: one-bit mutexes (was: Re: [PATCH 2/3] Memory management livelock)

    Hi

    I removed the repeated code and created new bit mutexes. They are
    space-efficient mutexes that consume only one bit each. See the next 3
    patches.

    If you are concerned about the size of an inode, I can convert other
    mutexes to bit mutexes: i_mutex and inotify_mutex. I could also create
    bit_spinlock (one-bit spinlock that uses test_and_set_bit) and save space
    for address_space->tree_lock, address_space->i_mmap_lock,
    address_space->private_lock, inode->i_lock.

    Look at it and say what you think about the idea of condensing mutexes
    into single bits.
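
    (A sketch of what using such a one-bit spinlock would look like; the
    kernel's existing <linux/bit_spinlock.h> already provides primitives of
    this shape, and struct obj with its bit position is illustrative:)

        #include <linux/bit_spinlock.h>

        #define OBJ_LOCK_BIT 0   /* illustrative bit in obj->flags */

        static void obj_update(struct obj *o)
        {
                bit_spin_lock(OBJ_LOCK_BIT, &o->flags);
                /* ... critical section protected by a single bit ... */
                bit_spin_unlock(OBJ_LOCK_BIT, &o->flags);
        }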

    Mikulas

    >
    > > Subject: [PATCH 2/3] Memory management livelock

    >
    > Please don't send multiple patches with identical titles - think up a
    > good, unique, meaningful title for each patch.
    >
    > On Wed, 24 Sep 2008 14:52:18 -0400 (EDT) Mikulas Patocka wrote:
    >
    > > Avoid starvation when walking address space.
    > >
    > > Signed-off-by: Mikulas Patocka
    > >

    >
    > Please include a full changelog with each iteration of each patch.
    >
    > That changelog should explain the reason for playing games with
    > bitlocks so Linus doesn't have kittens when he sees it.
    >
    > > include/linux/pagemap.h | 1 +
    > > mm/filemap.c | 20 ++++++++++++++++++++
    > > mm/page-writeback.c | 37 ++++++++++++++++++++++++++++++++++++-
    > > mm/truncate.c | 24 +++++++++++++++++++++++-
    > > 4 files changed, 80 insertions(+), 2 deletions(-)
    > >
    > > Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
    > > ===================================================================
    > > --- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h 2008-09-24 02:57:37.000000000 +0200
    > > +++ linux-2.6.27-rc7-devel/include/linux/pagemap.h 2008-09-24 02:59:04.000000000 +0200
    > > @@ -21,6 +21,7 @@
    > > #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
    > > #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
    > > #define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
    > > +#define AS_STARVATION (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */
    > >
    > > static inline void mapping_set_error(struct address_space *mapping, int error)
    > > {
    > > Index: linux-2.6.27-rc7-devel/mm/filemap.c
    > > ===================================================================
    > > --- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 02:59:33.000000000 +0200
    > > +++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 03:13:47.000000000 +0200
    > > @@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct
    > > int nr_pages;
    > > int ret = 0;
    > > pgoff_t index;
    > > + long pages_to_process;
    > >
    > > if (end < start)
    > > return 0;
    > >
    > > + /*
    > > + * Estimate the number of pages to process. If we process significantly
    > > + * more than this, someone is making writeback pages under us.
    > > + * We must pull the anti-starvation plug.
    > > + */
    > > + pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
    > > + pages_to_process += (pages_to_process >> 3) + 16;

    >
    > This sequence appears twice and it would probably be clearer to
    > implement it in a well-commented function.
    >
    > > pagevec_init(&pvec, 0);
    > > index = start;
    > > while ((index <= end) &&
    > > @@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct
    > > if (page->index > end)
    > > continue;
    > >
    > > + if (pages_to_process >= 0)
    > > + if (!pages_to_process--)
    > > + wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);

    >
    > This is copied three times and perhaps also should be factored out.
    >
    > Please note that an effort has been made to make mm/filemap.c look
    > presentable in an 80-col display.
    >
    > > wait_on_page_writeback(page);
    > > if (PageError(page))
    > > ret = -EIO;
    > > @@ -296,6 +309,13 @@ int wait_on_page_writeback_range(struct
    > > cond_resched();
    > > }
    > >
    > > + if (pages_to_process < 0) {
    > > + smp_mb__before_clear_bit();
    > > + clear_bit(AS_STARVATION, &mapping->flags);
    > > + smp_mb__after_clear_bit();
    > > + wake_up_bit(&mapping->flags, AS_STARVATION);
    > > + }

    >
    > This sequence is repeated three or four times and should be pulled out
    > into a well-commented function. That comment should explain the logic
    > behind the use of these barriers, please.
    >
    >
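
    (A sketch of the factoring Andrew asks for above; the helper names are
    illustrative, and the primitives are the ones from the quoted patch:)

        /* Expected number of pages to process, with 1/8 + 16 slack. If we
           process more than this, someone is making pages under us. */
        static long starvation_estimate(struct backing_dev_info *bdi,
                                        enum bdi_stat_item item)
        {
                long n = bdi_stat(bdi, item);
                return n + (n >> 3) + 16;
        }

        /* Once the estimate is exhausted, raise the anti-starvation
           barrier: new dirtiers block in balance_dirty_pages(). */
        static void starvation_check(struct address_space *mapping,
                                     long *pages_to_process)
        {
                if (*pages_to_process >= 0 && (*pages_to_process)-- == 0)
                        wait_on_bit_lock(&mapping->flags, AS_STARVATION,
                                         wait_action_schedule,
                                         TASK_UNINTERRUPTIBLE);
        }

        /* Drop the barrier. The barriers around clear_bit order the bit
           clear against the preceding page waits and the wakeup, so a
           sleeper cannot miss the transition. */
        static void starvation_release(struct address_space *mapping,
                                       long pages_to_process)
        {
                if (pages_to_process < 0) {
                        smp_mb__before_clear_bit();
                        clear_bit(AS_STARVATION, &mapping->flags);
                        smp_mb__after_clear_bit();
                        wake_up_bit(&mapping->flags, AS_STARVATION);
                }
        }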


  15. Re: [PATCH 2/3] Fix fsync livelock

    On Sun, 5 Oct 2008 18:14:50 -0400 (EDT)
    Mikulas Patocka wrote:

    > Fix starvation in memory management.
    >
    > The bug happens when one process is doing sequential buffered writes
    > to a block device (or file) and another process is attempting to
    > execute sync(), fsync() or direct-IO on that device (or file). This
    > syncing process will wait indefinitely, until the first writing
    > process finishes.
    >
    > For example, run these two commands:
    > dd if=/dev/zero of=/dev/sda1 bs=65536 &
    > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
    >
    > The bug is caused by sequential walking of address space in
    > write_cache_pages and wait_on_page_writeback_range: if some other
    > process is constantly making dirty and writeback pages while these
    > functions run, the functions will wait on every new page, resulting in
    > an indefinite wait.


    are you sure?
    isn't the right fix to just walk the file pages only once?

  16. Re: [PATCH 2/3] Fix fsync livelock



    On Sun, 5 Oct 2008, Arjan van de Ven wrote:

    > On Sun, 5 Oct 2008 18:14:50 -0400 (EDT)
    > Mikulas Patocka wrote:
    >
    > > Fix starvation in memory management.
    > >
    > > The bug happens when one process is doing sequential buffered writes
    > > to a block device (or file) and another process is attempting to
    > > execute sync(), fsync() or direct-IO on that device (or file). This
    > > syncing process will wait indefinitely, until the first writing
    > > process finishes.
    > >
    > > For example, run these two commands:
    > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
    > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
    > >
    > > The bug is caused by sequential walking of address space in
    > > write_cache_pages and wait_on_page_writeback_range: if some other
    > > process is constantly making dirty and writeback pages while these
    > > functions run, the functions will wait on every new page, resulting in
    > > an indefinite wait.

    >
    > are you sure?
    > isn't the right fix to just walk the file pages only once?


    It walks the pages only once --- and waits on each of them. But because
    new pages are constantly appearing under it, that "walk only once"
    becomes an infinite loop (well, finite, until the whole disk is written).

    Mikulas

  17. Re: [PATCH 2/3] Fix fsync livelock

    On Sun, 5 Oct 2008 19:02:57 -0400 (EDT)
    Mikulas Patocka wrote:

    > > are you sure?
    > > isn't the right fix to just walk the file pages only once?

    >
    > It walks the pages only once --- and waits on each of them. But
    > because new pages are constantly appearing under it, that "walk only
    > once" becomes an infinite loop (well, finite, until the whole disk is
    > written).


    well. fsync() promises that everything that's dirty at the time of the
    call will hit the disk. That is not something you can compromise.
    The only way out would be to not allow new dirtying during an fsync()...
    which is imo even worse.

    Submit them all in one go, then wait, should not be TOO bad. Unless a
    lot was dirty already, but then you go back to "but it has to go to
    disk".


    --
    Arjan van de Ven Intel Open Source Technology Centre
    For development, discussion and tips for power savings,
    visit http://www.lesswatts.org

  18. Re: [PATCH 2/3] Fix fsync livelock

    On Sun, 5 Oct 2008, Arjan van de Ven wrote:

    > On Sun, 5 Oct 2008 19:02:57 -0400 (EDT)
    > Mikulas Patocka wrote:
    >
    > > > are you sure?
    > > > isn't the right fix to just walk the file pages only once?

    > >
    > > It walks the pages only once --- and waits on each of them. But
    > > because new pages are constantly appearing under it, that "walk only
    > > once" becomes an infinite loop (well, finite, until the whole disk
    > > is written).

    >
    > well. fsync() promises that everything that's dirty at the time of the
    > call will hit the disk. That is not something you can compromise.
    > The only way out would be to not allow new dirtying during an fsync()...
    > which is imo even worse.
    >
    > Submit them all in one go, then wait, should not be TOO bad. Unless a
    > lot was dirty already, but then you go back to "but it has to go to
    > disk".


    The problem here is that you have two processes --- one is writing, the
    other is simultaneously syncing. The syncing process can't distinguish the
    pages that were created before fsync() was invoked (it has to wait on
    them) and the pages that were created while fsync() was running (it
    doesn't have to wait on them) --- so it waits on them all. The result is
    a livelock: it waits indefinitely, because more and more pages are being
    created.

    The patch changes it so that if it waits long enough, it stops the other
    writers creating dirty pages.

    Or, how otherwise would you implement "Submit them all in one go, then
    wait"? The current code is:
    you grab page 0, see it is under writeback, wait on it
    you grab page 1, see it is under writeback, wait on it
    you grab page 2, see it is under writeback, wait on it
    you grab page 3, see it is under writeback, wait on it
    ....
    --- and the other process is just making more and more writeback pages
    while your waiting routine runs. So the waiting is indefinite.

    Mikulas

  19. Re: [PATCH 2/3] Fix fsync livelock

    On Sun, 5 Oct 2008 19:18:22 -0400 (EDT)
    Mikulas Patocka wrote:

    > On Sun, 5 Oct 2008, Arjan van de Ven wrote:
    >
    >
    > The problem here is that you have two processes --- one is writing,
    > the other is simultaneously syncing. The syncing process can't
    > distinguish the pages that were created before fsync() was invoked
    > (it has to wait on them) and the pages that were created while
    > fsync() was running (it doesn't have to wait on them) --- so it waits
    > on them all.


    for pages it already passed that's not true.
    but yes while you walk, you may find new ones. tough luck.
    >
    > Or, how otherwise would you implement "Submit them all in one go,
    > then wait"? The current code is:
    > you grab page 0, see it is under writeback, wait on it
    > you grab page 1, see it is under writeback, wait on it
    > you grab page 2, see it is under writeback, wait on it
    > you grab page 3, see it is under writeback, wait on it


    it shouldn't be doing this. It should be "submit the lot", then
    "wait on the lot that got submitted".

    keeping away new writers is just shifting the latency to an innocent
    party (since the fsync() is just bloody expensive, the person doing it
    should be paying the price, not the rest of the system), and a grave
    mistake.

    If the fsync() implementation isn't smart enough, sure, lets improve
    it. But not by shifting latency around... lets make it more efficient
    at submitting IO.
    If we need to invent something like "chained IO" where if you wait on
    the last of the chain, you wait on the entire chain, so be it.


    --
    Arjan van de Ven Intel Open Source Technology Centre
    For development, discussion and tips for power savings,
    visit http://www.lesswatts.org

  20. Re: [PATCH 2/3] Fix fsync livelock



    On Sun, 5 Oct 2008, Arjan van de Ven wrote:

    > On Sun, 5 Oct 2008 19:18:22 -0400 (EDT)
    > Mikulas Patocka wrote:
    >
    > > On Sun, 5 Oct 2008, Arjan van de Ven wrote:
    > >
    > >
    > > The problem here is that you have two processes --- one is writing,
    > > the other is simultaneously syncing. The syncing process can't
    > > distinguish the pages that were created before fsync() was invoked
    > > (it has to wait on them) and the pages that were created while
    > > fsync() was running (it doesn't have to wait on them) --- so it waits
    > > on them all.

    >
    > for pages it already passed that's not true.
    > but yes while you walk, you may find new ones. tough luck.
    > >
    > > Or, how otherwise would you implement "Submit them all in one go,
    > > then wait"? The current code is:
    > > you grab page 0, see it is under writeback, wait on it
    > > you grab page 1, see it is under writeback, wait on it
    > > you grab page 2, see it is under writeback, wait on it
    > > you grab page 3, see it is under writeback, wait on it

    >
    > it shouldn't be doing this. It should be "submit the lot", then
    > "wait on the lot that got submitted".


    And how do you want to implement that "wait on the lot that got
    submitted". Note that you can't allocate an array or linked list that
    holds pointers to all submitted pages. And if you add anything to the page
    structure or radix tree, you have to serialize concurrent sync/fsync
    calls, otherwise they'd fight over these bits.

    > keeping away new writers is just shifting the latency to an innocent
    > party (since the fsync() is just bloody expensive, the person doing it
    > should be paying the price, not the rest of the system), and a grave
    > mistake.


    I assume that if very few people complained about the livelock till now,
    very few people will see degraded write performance. My patch blocks the
    writes only if the livelock happens, so if the livelock doesn't happen in
    unpatched kernel for most people, the patch won't make it worse.

    > If the fsync() implementation isn't smart enough, sure, lets improve
    > it. But not by shifting latency around... lets make it more efficient
    > at submitting IO.
    > If we need to invent something like "chained IO" where if you wait on
    > the last of the chain, you wait on the entire chain, so be it.


    This looks madly complicated. And ineffective, because if some page was
    submitted before fsync() was invoked, and is under writeback while fsync()
    is called, fsync() still has to wait on it.

    Mikulas
