[PATCH] Memory management livelock - Kernel


Thread: [PATCH] Memory management livelock

  1. [PATCH] Memory management livelock

    Hi

    Here is a patch for MM livelock. The original bug report follows after the
    patch. To reproduce the bug on my computer, I had to change bs=4M to
    bs=65536 in examples in the original report.

    Mikulas

    ---

    Fix starvation in memory management.

    The bug happens when one process is doing sequential buffered writes to
    a block device (or file) and another process is attempting to execute
    sync(), fsync() or direct-IO on that device (or file). This syncing
    process will wait indefinitely, until the first writing process
    finishes.

    For example, run these two commands:
    dd if=/dev/zero of=/dev/sda1 bs=65536 &
    dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct

    The bug is caused by sequential walking of address space in
    write_cache_pages and wait_on_page_writeback_range: if some other
    process is constantly making dirty and writeback pages while these
    functions run, the functions will wait on every new page, resulting in
    indefinite wait.

    To fix the starvation, I declared a mutex starvation_barrier in struct
    address_space. When the mutex is taken, anyone making dirty pages on
    that address space should stop. The functions that walk the address space
    sequentially first estimate the number of pages to process. If they
    process more than this amount (plus some small delta), it means that
    someone is making dirty pages under them --- so they take the mutex and
    hold it until they finish. While the mutex is taken, the function
    balance_dirty_pages will wait and not allow more dirty pages to be made.

    The mutex is not really used as a mutex: it does not protect any
    critical section. It is used as a barrier that blocks new dirty
    pages from being created. I use a mutex and not a wait queue to make sure
    that the starvation can't happen the other way (with a wait queue,
    excessive calls to write_cache_pages and
    wait_on_page_writeback_range would block balance_dirty_pages forever;
    with a mutex it is guaranteed that every process eventually makes
    progress).
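
    In outline, the pattern is as follows (a condensed sketch, not the patch
    itself; estimate_pages() and next_page() are illustrative placeholders):

        /* Walker side (write_cache_pages, wait_on_page_writeback_range): */
        long pages_to_process = estimate_pages(mapping); /* estimate + slack */
        struct page *page;

        while ((page = next_page(mapping)) != NULL) {
                if (pages_to_process >= 0 && !pages_to_process--)
                        mutex_lock(&mapping->starvation_barrier); /* pull the plug */
                process(page);
        }
        if (pages_to_process < 0)       /* the plug was pulled above */
                mutex_unlock(&mapping->starvation_barrier);

        /* Writer side (balance_dirty_pages): */
        if (unlikely(mutex_is_locked(&mapping->starvation_barrier))) {
                /* A sync is starving: queue behind it, then release at once. */
                mutex_lock(&mapping->starvation_barrier);
                mutex_unlock(&mapping->starvation_barrier);
        }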

    The essential property of this patch is that if the starvation doesn't
    happen, no additional locks are taken and no atomic operations are
    performed. So the patch shouldn't damage performance.

    The indefinite starvation was observed in write_cache_pages and
    wait_on_page_writeback_range. Another possibility where it could happen
    is invalidate_inode_pages2_range.

    There are two more functions that walk all the pages in address space,
    but I think they don't need this starvation protection:
    truncate_inode_pages_range: it is called with i_mutex locked, so no new
    pages are created under it.
    __invalidate_mapping_pages: it skips locked, dirty and writeback pages,
    so there's no point in protecting the function against starving on them.

    Signed-off-by: Mikulas Patocka

    ---
    fs/inode.c | 1 +
    include/linux/fs.h | 1 +
    mm/filemap.c | 16 ++++++++++++++++
    mm/page-writeback.c | 30 +++++++++++++++++++++++++++++-
    mm/swap_state.c | 1 +
    mm/truncate.c | 20 +++++++++++++++++++-
    6 files changed, 67 insertions(+), 2 deletions(-)

    Index: linux-2.6.27-rc7-devel/fs/inode.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/fs/inode.c 2008-09-22 23:04:31.000000000 +0200
    +++ linux-2.6.27-rc7-devel/fs/inode.c 2008-09-22 23:04:34.000000000 +0200
    @@ -214,6 +214,7 @@ void inode_init_once(struct inode *inode
    spin_lock_init(&inode->i_data.i_mmap_lock);
    INIT_LIST_HEAD(&inode->i_data.private_list);
    spin_lock_init(&inode->i_data.private_lock);
    + mutex_init(&inode->i_data.starvation_barrier);
    INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
    INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
    i_size_ordered_init(inode);
    Index: linux-2.6.27-rc7-devel/include/linux/fs.h
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/include/linux/fs.h 2008-09-22 23:04:31.000000000 +0200
    +++ linux-2.6.27-rc7-devel/include/linux/fs.h 2008-09-22 23:04:34.000000000 +0200
    @@ -539,6 +539,7 @@ struct address_space {
    spinlock_t private_lock; /* for use by the address_space */
    struct list_head private_list; /* ditto */
    struct address_space *assoc_mapping; /* ditto */
    + struct mutex starvation_barrier; /* taken when fsync starves */
    } __attribute__((aligned(sizeof(long))));
    /*
    * On most architectures that alignment is already the case; but
    Index: linux-2.6.27-rc7-devel/mm/filemap.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-22 23:04:31.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-22 23:04:34.000000000 +0200
    @@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct
    int nr_pages;
    int ret = 0;
    pgoff_t index;
    + long pages_to_process;

    if (end < start)
    return 0;

    + /*
    + * Estimate the number of pages to process. If we process significantly
    + * more than this, someone is making writeback pages under us.
    + * We must pull the anti-starvation plug.
    + */
    + pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
    + pages_to_process += pages_to_process >> 3;
    +
    pagevec_init(&pvec, 0);
    index = start;
    while ((index <= end) &&
    @@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct
    if (page->index > end)
    continue;

    + if (pages_to_process >= 0)
    + if (!pages_to_process--)
    + mutex_lock(&mapping->starvation_barrier);
    +
    wait_on_page_writeback(page);
    if (PageError(page))
    ret = -EIO;
    @@ -296,6 +309,9 @@ int wait_on_page_writeback_range(struct
    cond_resched();
    }

    + if (pages_to_process < 0)
    + mutex_unlock(&mapping->starvation_barrier);
    +
    /* Check for outstanding write errors */
    if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
    ret = -ENOSPC;
    Index: linux-2.6.27-rc7-devel/mm/page-writeback.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/page-writeback.c 2008-09-22 23:04:31.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/page-writeback.c 2008-09-22 23:04:34.000000000 +0200
    @@ -435,6 +435,15 @@ static void balance_dirty_pages(struct a

    struct backing_dev_info *bdi = mapping->backing_dev_info;

    + /*
    + * If there is sync() starving on this address space, block
    + * writers until it finishes.
    + */
    + if (unlikely(mutex_is_locked(&mapping->starvation_barrier))) {
    + mutex_lock(&mapping->starvation_barrier);
    + mutex_unlock(&mapping->starvation_barrier);
    + }
    +
    for (;;) {
    struct writeback_control wbc = {
    .bdi = bdi,
    @@ -876,12 +885,21 @@ int write_cache_pages(struct address_spa
    pgoff_t end; /* Inclusive */
    int scanned = 0;
    int range_whole = 0;
    + long pages_to_process;

    if (wbc->nonblocking && bdi_write_congested(bdi)) {
    wbc->encountered_congestion = 1;
    return 0;
    }

    + /*
    + * Estimate the number of pages to process. If we process significantly
    + * more than this, someone is making dirty pages under us.
    + * Pull the anti-starvation plug to stop him.
    + */
    + pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    + pages_to_process += pages_to_process >> 3;
    +
    pagevec_init(&pvec, 0);
    if (wbc->range_cyclic) {
    index = mapping->writeback_index; /* Start from prev offset */
    @@ -902,7 +920,13 @@ retry:

    scanned = 1;
    for (i = 0; i < nr_pages; i++) {
    - struct page *page = pvec.pages[i];
    + struct page *page;
    +
    + if (pages_to_process >= 0)
    + if (!pages_to_process--)
    + mutex_lock(&mapping->starvation_barrier);
    +
    + page = pvec.pages[i];

    /*
    * At this point we hold neither mapping->tree_lock nor
    @@ -958,6 +982,10 @@ retry:
    index = 0;
    goto retry;
    }
    +
    + if (pages_to_process < 0)
    + mutex_unlock(&mapping->starvation_barrier);
    +
    if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
    mapping->writeback_index = index;

    Index: linux-2.6.27-rc7-devel/mm/swap_state.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/swap_state.c 2008-09-22 23:04:31.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/swap_state.c 2008-09-22 23:04:34.000000000 +0200
    @@ -43,6 +43,7 @@ struct address_space swapper_space = {
    .a_ops = &swap_aops,
    .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
    .backing_dev_info = &swap_backing_dev_info,
    + .starvation_barrier = __MUTEX_INITIALIZER(swapper_space.starvation_barrier),
    };

    #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
    Index: linux-2.6.27-rc7-devel/mm/truncate.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/truncate.c 2008-09-22 23:04:31.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/truncate.c 2008-09-22 23:04:34.000000000 +0200
    @@ -392,6 +392,14 @@ int invalidate_inode_pages2_range(struct
    int ret2 = 0;
    int did_range_unmap = 0;
    int wrapped = 0;
    + long pages_to_process;
    +
    + /*
    + * Estimate number of pages to process. If we process more, someone
    + * is making pages under us.
    + */
    + pages_to_process = mapping->nrpages;
    + pages_to_process += pages_to_process >> 3;

    pagevec_init(&pvec, 0);
    next = start;
    @@ -399,9 +407,15 @@ int invalidate_inode_pages2_range(struct
    pagevec_lookup(&pvec, mapping, next,
    min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
    for (i = 0; i < pagevec_count(&pvec); i++) {
    - struct page *page = pvec.pages[i];
    + struct page *page;
    pgoff_t page_index;

    + if (pages_to_process >= 0)
    + if (!pages_to_process--)
    + mutex_lock(&mapping->starvation_barrier);
    +
    + page = pvec.pages[i];
    +
    lock_page(page);
    if (page->mapping != mapping) {
    unlock_page(page);
    @@ -449,6 +463,10 @@ int invalidate_inode_pages2_range(struct
    pagevec_release(&pvec);
    cond_resched();
    }
    +
    + if (pages_to_process < 0)
    + mutex_unlock(&mapping->starvation_barrier);
    +
    return ret;
    }
    EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);

    > ----- Forwarded message from Chris Webb -----
    >
    > Date: Thu, 11 Sep 2008 10:47:36 +0100
    > From: Chris Webb
    > Subject: Unexpected behaviour with O_DIRECT affecting lvm
    >
    > To reproduce:
    >
    > # dd if=/dev/zero of=/dev/scratch/a bs=4M &
    > [1] 2612
    > # lvm lvrename /dev/scratch/b /dev/scratch/c
    > [hangs]
    >
    > (Similarly for lvremove, lvcreate, etc., and it doesn't matter whether the
    > operation is on the same VG or a different one.)
    >
    > Stracing lvm, I saw
    >
    > stat64("/dev/dm-7", {st_mode=S_IFBLK|0600, st_rdev=makedev(251, 7), ...}) = 0
    > open("/dev/dm-7", O_RDWR|O_DIRECT|O_LARGEFILE|O_NOATIME) = 6
    > fstat64(6, {st_mode=S_IFBLK|0600, st_rdev=makedev(251, 7), ...}) = 0
    > ioctl(6, BLKBSZGET, 0x89bbb30) = 0
    > _llseek(6, 0, [0], SEEK_SET) = 0
    > read(6,
    > [hangs]
    >
    > If I kill the dd at this point, the lvm unblocks and continues as normal:
    >
    > read(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
    > close(6) = 0
    > stat64("/dev/dm-8", {st_mode=S_IFBLK|0600, st_rdev=makedev(251, 8), ...}) = 0
    > stat64("/dev/dm-8", {st_mode=S_IFBLK|0600, st_rdev=makedev(251, 8), ...}) = 0
    > open("/dev/dm-8", O_RDWR|O_DIRECT|O_LARGEFILE|O_NOATIME) = 6
    > [etc]
    >
    > /dev/dm-7 is the dm device corresponding to /dev/disk.1/0.0 which is being
    > written to.
    >
    > I wrote a tiny test program which does a single O_DIRECT read of 4096 bytes
    > from a device, and can confirm that, more generally, an O_DIRECT read from a
    > device being continuously written to always hangs indefinitely. This occurs
    > even if I ionice -c3 the dd, and ionice -c1 -n0 the O_DIRECT reader...
    >
    > # dd if=/dev/zero of=/dev/disk.1/0.0 bs=4M &
    > [1] 2697
    > # ~/tmp/readtest
    > [hangs]
    >
    > # ionice -c3 dd if=/dev/zero of=/dev/disk.1/0.0 bs=4M &
    > [1] 2734
    > # ionice -c1 -n0 ~/tmp/readtest
    > [hangs]
    >
    > There's no problem if the dd is in the other direction, reading continuously
    > from the LV, nor if the test read isn't O_DIRECT. Attempting to kill -9 the
    > O_DIRECT reader doesn't succeed until after the read() returns.
    >
    > Finally, I've tried replacing the LV with a raw disk /dev/sdb, and the same
    > behaviour appears there for all choices of IO scheduler for the disk, so
    > it's neither device mapper specific nor even IO scheduler specific! This
    > means that it isn't in any sense a bug in your code, of course, but given
    > that the phenomenon seems to block any LVM management operation when heavy
    > writes are going on, surely we can't be the first people to hit this whilst
    > using the LVM tool. I'm wondering whether you've had other similar reports
    > or have any other suggestions?
    >
    > For what it's worth, these tests were initially done on our dev machine
    > which is running lvm 2.02.38 and a kernel from the head of Linus' tree
    > (usually no more than a couple of days old with a queue of my local patches
    > that Ed hasn't integrated into the upstream ata-over-ethernet drivers yet),
    > but the same behaviour appears on my totally standard desktop machine with
    > much older 2.6.25 and 2.6.24.4 kernels 'as released', so I think it's pretty
    > long-standing.
    >
    > Best wishes,
    >
    > Chris.
    >
    > #define _GNU_SOURCE
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <sys/types.h>
    > #include <fcntl.h>
    > #include <unistd.h>
    >
    > #define BLOCK 4096
    >
    > int main(int argc, char **argv) {
    > int fd, n;
    > char *buffer;
    >
    > if (argc != 2) {
    > fprintf(stderr, "Usage: %s FILE\n", argv[0]);
    > return 1;
    > }
    >
    > buffer = valloc(BLOCK);
    > if (!buffer) {
    > perror("valloc");
    > return 1;
    > }
    >
    > fd = open(argv[1], O_RDONLY | O_LARGEFILE | O_DIRECT);
    > if (fd < 0) {
    > perror("open");
    > return 1;
    > }
    >
    > if (pread(fd, buffer, BLOCK, 0) < 0)
    > perror("pread");
    >
    > close(fd);
    > return 0;
    > }
    >
    >


  2. Re: [PATCH] Memory management livelock

    On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
    Mikulas Patocka wrote:

    > The bug happens when one process is doing sequential buffered writes to
    > a block device (or file) and another process is attempting to execute
    > sync(), fsync() or direct-IO on that device (or file). This syncing
    > process will wait indefinitely, until the first writing process
    > finishes.
    >
    > For example, run these two commands:
    > dd if=/dev/zero of=/dev/sda1 bs=65536 &
    > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
    >
    > The bug is caused by sequential walking of address space in
    > write_cache_pages and wait_on_page_writeback_range: if some other
    > process is constantly making dirty and writeback pages while these
    > functions run, the functions will wait on every new page, resulting in
    > indefinite wait.


    Shouldn't happen. All the data-syncing functions should have an upper
    bound on the number of pages which they attempt to write. In the
    example above, we end up in here:

    int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
    loff_t end, int sync_mode)
    {
    int ret;
    struct writeback_control wbc = {
    .sync_mode = sync_mode,
    .nr_to_write = mapping->nrpages * 2, <<--
    .range_start = start,
    .range_end = end,
    };

    so generic_file_direct_write()'s filemap_write_and_wait() will attempt
    to write at most 2* the number of pages which are in cache for that inode.

    I'd say that either a) that logic got broken or b) you didn't wait long
    enough, and we might need to do something to make it not wait so long.

    But before we patch anything we should fully understand what is
    happening and why the current anti-livelock code isn't working in this
    case.


  3. Re: [PATCH] Memory management livelock

    > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
    > Mikulas Patocka wrote:
    >
    > > The bug happens when one process is doing sequential buffered writes to
    > > a block device (or file) and another process is attempting to execute
    > > sync(), fsync() or direct-IO on that device (or file). This syncing
    > > process will wait indefinitely, until the first writing process
    > > finishes.
    > >
    > > For example, run these two commands:
    > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
    > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
    > >
    > > The bug is caused by sequential walking of address space in
    > > write_cache_pages and wait_on_page_writeback_range: if some other
    > > process is constantly making dirty and writeback pages while these
    > > functions run, the functions will wait on every new page, resulting in
    > > indefinite wait.

    >
    > Shouldn't happen. All the data-syncing functions should have an upper
    > bound on the number of pages which they attempt to write. In the
    > example above, we end up in here:
    >
    > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t
    > start,
    > loff_t end, int sync_mode)
    > {
    > int ret;
    > struct writeback_control wbc = {
    > .sync_mode = sync_mode,
    > .nr_to_write = mapping->nrpages * 2, <<--
    > .range_start = start,
    > .range_end = end,
    > };
    >
    > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
    > to write at most 2* the number of pages which are in cache for that inode.


    See write_cache_pages:

    if (wbc->sync_mode != WB_SYNC_NONE)
    wait_on_page_writeback(page); (1)
    if (PageWriteback(page) ||
    !clear_page_dirty_for_io(page)) {
    unlock_page(page); (2)
    continue;
    }
    ret = (*writepage)(page, wbc, data);
    if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
    unlock_page(page);
    ret = 0;
    }
    if (ret || (--(wbc->nr_to_write) <= 0))
    done = 1;

    --- so if it goes by points (1) and (2), the counter is not decremented,
    yet the function waits for the page. If there is a constant stream of
    writeback pages being generated, it waits on each of them --- that is,
    forever. I have seen livelock in this function. For you that example with
    two dd's, one buffered write and the other directIO read doesn't work? For
    me it livelocks here.

    wait_on_page_writeback_range is another example where the livelock
    happened, there is no protection at all against starvation.


    BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing
    to me.

    Imagine this case: You have two pages with indices 4 and 5 dirty in a
    file. You call fsync(). It sets nr_to_write to 4.

    Meanwhile, another process makes pages 0, 1, 2, 3 dirty.

    The fsync() process goes to write_cache_pages, writes the first 4 dirty
    pages and exits because it goes over the limit.

    result --- you violate fsync() semantics, pages that were dirty before
    call to fsync() are not written when fsync() exits.

    > I'd say that either a) that logic got broken or b) you didn't wait long
    > enough, and we might need to do something to make it not wait so long.
    >
    > But before we patch anything we should fully understand what is
    > happening and why the current anti-livelock code isn't working in this
    > case.


    Mikulas

  4. Re: [PATCH] Memory management livelock

    On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
    Mikulas Patocka wrote:

    > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
    > > Mikulas Patocka wrote:
    > >
    > > > The bug happens when one process is doing sequential buffered writes to
    > > > a block device (or file) and another process is attempting to execute
    > > > sync(), fsync() or direct-IO on that device (or file). This syncing
    > > > process will wait indefinitely, until the first writing process
    > > > finishes.
    > > >
    > > > For example, run these two commands:
    > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
    > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
    > > >
    > > > The bug is caused by sequential walking of address space in
    > > > write_cache_pages and wait_on_page_writeback_range: if some other
    > > > process is constantly making dirty and writeback pages while these
    > > > functions run, the functions will wait on every new page, resulting in
    > > > indefinite wait.

    > >
    > > Shouldn't happen. All the data-syncing functions should have an upper
    > > bound on the number of pages which they attempt to write. In the
    > > example above, we end up in here:
    > >
    > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t
    > > start,
    > > loff_t end, int sync_mode)
    > > {
    > > int ret;
    > > struct writeback_control wbc = {
    > > .sync_mode = sync_mode,
    > > .nr_to_write = mapping->nrpages * 2, <<--
    > > .range_start = start,
    > > .range_end = end,
    > > };
    > >
    > > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
    > > to write at most 2* the number of pages which are in cache for that inode.

    >
    > See write_cache_pages:
    >
    > if (wbc->sync_mode != WB_SYNC_NONE)
    > wait_on_page_writeback(page); (1)
    > if (PageWriteback(page) ||
    > !clear_page_dirty_for_io(page)) {
    > unlock_page(page); (2)
    > continue;
    > }
    > ret = (*writepage)(page, wbc, data);
    > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
    > unlock_page(page);
    > ret = 0;
    > }
    > if (ret || (--(wbc->nr_to_write) <= 0))
    > done = 1;
    >
    > --- so if it goes by points (1) and (2), the counter is not decremented,
    > yet the function waits for the page. If there is a constant stream of
    > writeback pages being generated, it waits on each of them --- that is,
    > forever. I have seen livelock in this function. For you that example with
    > two dd's, one buffered write and the other directIO read doesn't work? For
    > me it livelocks here.
    >
    > wait_on_page_writeback_range is another example where the livelock
    > happened, there is no protection at all against starvation.


    um, OK. So someone else is initiating IO for this inode and this
    thread *never* gets to initiate any writeback. That's a bit of a
    surprise.

    How do we fix that? Maybe decrement nr_to_write for these pages as
    well?
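
    Roughly this, perhaps (an untested sketch against the loop quoted above;
    it charges the skipped pages against the quota too):

        if (wbc->sync_mode != WB_SYNC_NONE)
                wait_on_page_writeback(page);
        if (PageWriteback(page) ||
            !clear_page_dirty_for_io(page)) {
                unlock_page(page);
                /* Charge skipped pages against nr_to_write as well. */
                if (--(wbc->nr_to_write) <= 0)
                        done = 1;
                continue;
        }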

    >
    > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing
    > to me.
    >
    > Imagine this case: You have two pages with indices 4 and 5 dirty in a
    > file. You call fsync(). It sets nr_to_write to 4.
    >
    > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
    >
    > The fsync() process goes to write_cache_pages, writes the first 4 dirty
    > pages and exits because it goes over the limit.
    >
    > result --- you violate fsync() semantics, pages that were dirty before
    > call to fsync() are not written when fsync() exits.


    yup, that's pretty much unfixable, really, unless new locks are added
    which block threads which are writing to unrelated sections of the
    file, and that could hurt some workloads quite a lot, I expect.

    Hopefully high performance applications are instantiating the file
    up-front and are using sync_file_range() to prevent these sorts of
    things from happening. But they probably aren't.



  5. Re: [PATCH] Memory management livelock



    On Tue, 23 Sep 2008, Andrew Morton wrote:

    > On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
    > Mikulas Patocka wrote:
    >
    > > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
    > > > Mikulas Patocka wrote:
    > > >
    > > > > The bug happens when one process is doing sequential buffered writes to
    > > > > a block device (or file) and another process is attempting to execute
    > > > > sync(), fsync() or direct-IO on that device (or file). This syncing
    > > > > process will wait indefinitely, until the first writing process
    > > > > finishes.
    > > > >
    > > > > For example, run these two commands:
    > > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
    > > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
    > > > >
    > > > > The bug is caused by sequential walking of address space in
    > > > > write_cache_pages and wait_on_page_writeback_range: if some other
    > > > > process is constantly making dirty and writeback pages while these
    > > > > functions run, the functions will wait on every new page, resulting in
    > > > > indefinite wait.
    > > >
    > > > Shouldn't happen. All the data-syncing functions should have an upper
    > > > bound on the number of pages which they attempt to write. In the
    > > > example above, we end up in here:
    > > >
    > > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t
    > > > start,
    > > > loff_t end, int sync_mode)
    > > > {
    > > > int ret;
    > > > struct writeback_control wbc = {
    > > > .sync_mode = sync_mode,
    > > > .nr_to_write = mapping->nrpages * 2, <<--
    > > > .range_start = start,
    > > > .range_end = end,
    > > > };
    > > >
    > > > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
    > > > to write at most 2* the number of pages which are in cache for that inode.

    > >
    > > See write_cache_pages:
    > >
    > > if (wbc->sync_mode != WB_SYNC_NONE)
    > > wait_on_page_writeback(page); (1)
    > > if (PageWriteback(page) ||
    > > !clear_page_dirty_for_io(page)) {
    > > unlock_page(page); (2)
    > > continue;
    > > }
    > > ret = (*writepage)(page, wbc, data);
    > > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
    > > unlock_page(page);
    > > ret = 0;
    > > }
    > > if (ret || (--(wbc->nr_to_write) <= 0))
    > > done = 1;
    > >
    > > --- so if it goes by points (1) and (2), the counter is not decremented,
    > > yet the function waits for the page. If there is a constant stream of
    > > writeback pages being generated, it waits on each of them --- that is,
    > > forever. I have seen livelock in this function. For you that example with
    > > two dd's, one buffered write and the other directIO read doesn't work? For
    > > me it livelocks here.
    > >
    > > wait_on_page_writeback_range is another example where the livelock
    > > happened, there is no protection at all against starvation.

    >
    > um, OK. So someone else is initiating IO for this inode and this
    > thread *never* gets to initiate any writeback. That's a bit of a
    > surprise.
    >
    > How do we fix that? Maybe decrement nr_to_write for these pages as
    > well?


    And what do you want to do with wait_on_page_writeback_range? When I
    solved that livelock in write_cache_pages(), I got another livelock in
    wait_on_page_writeback_range.

    > > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing
    > > to me.
    > >
    > > Imagine this case: You have two pages with indices 4 and 5 dirty in a
    > > file. You call fsync(). It sets nr_to_write to 4.
    > >
    > > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
    > >
    > > The fsync() process goes to write_cache_pages, writes the first 4 dirty
    > > pages and exits because it goes over the limit.
    > >
    > > result --- you violate fsync() semantics, pages that were dirty before
    > > call to fsync() are not written when fsync() exits.

    >
    > yup, that's pretty much unfixable, really, unless new locks are added
    > which block threads which are writing to unrelated sections of the
    > file, and that could hurt some workloads quite a lot, I expect.


    It is fixable with the patch I sent --- it doesn't take any locks unless
    the starvation happens. Then, you don't have to use .nr_to_write for
    fsync anymore.

    Another solution could be to record in page structure jiffies when the
    page entered dirty state and writeback state. The start writeback/wait on
    writeback functions could then trivially ignore pages that were
    dirtied/writebacked while the function was in progress.
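
    Schematically (a sketch only --- page->dirtied_when is a hypothetical
    field; struct page has no spare room for it today):

        /* Hypothetical: set page->dirtied_when = jiffies when dirtying. */
        unsigned long sync_start = jiffies;

        /* In the sync walker, skip pages dirtied after the sync began: */
        if (time_after(page->dirtied_when, sync_start))
                continue;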

    > Hopefully high performance applications are instantiating the file
    > up-front and are using sync_file_range() to prevent these sorts of
    > things from happening. But they probably aren't.


    --- for databases it is pretty much possible that one thread is writing
    already journaled data (so it doesn't care when the data are really
    written) and another thread is calling fsync() on the same inode
    simultaneously --- so fsync() could mistakenly write the data generated by
    the first thread and ignore the data generated by the second thread, that
    it should really write.

    Mikulas

  6. Re: [PATCH] Memory management livelock

    On Tue, 23 Sep 2008 19:11:51 -0400 (EDT)
    Mikulas Patocka wrote:

    >
    >
    > > > wait_on_page_writeback_range is another example where the livelock
    > > > happened, there is no protection at all against starvation.

    > >
    > > um, OK. So someone else is initiating IO for this inode and this
    > > thread *never* gets to initiate any writeback. That's a bit of a
    > > surprise.
    > >
    > > How do we fix that? Maybe decrement nr_to_write for these pages as
    > > well?

    >
    > And what do you want to do with wait_on_page_writeback_range?


    Don't know. I was asking you.

    > When I
    > solved that livelock in write_cache_pages(), I got another livelock in
    > wait_on_page_writeback_range.
    >
    > > > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous thing
    > > > to me.
    > > >
    > > > Imagine this case: You have two pages with indices 4 and 5 dirty in a
    > > > file. You call fsync(). It sets nr_to_write to 4.
    > > >
    > > > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
    > > >
    > > > The fsync() process goes to write_cache_pages, writes the first 4 dirty
    > > > pages and exits because it goes over the limit.
    > > >
    > > > result --- you violate fsync() semantics, pages that were dirty before
    > > > call to fsync() are not written when fsync() exits.

    > >
    > > yup, that's pretty much unfixable, really, unless new locks are added
    > > which block threads which are writing to unrelated sections of the
    > > file, and that could hurt some workloads quite a lot, I expect.

    >
    > It is fixable with the patch I sent --- it doesn't take any locks unless
    > the starvation happens. Then, you don't have to use .nr_to_write for
    > fsync anymore.


    I agree that the patch is low-impact and relatively straightforward.
    The main problem is making the address_space larger - there can be (and
    often are) millions and millions of these things in memory. Making it
    larger is a big deal. We should work hard to seek an alternative and
    afaict that isn't happening here.

    We already have existing code and design which attempts to avoid
    livelock without adding stuff to the address_space. Can it be modified
    so as to patch up this quite obscure and rarely-occurring problem?

    > Another solution could be to record in page structure jiffies when the
    > page entered dirty state and writeback state. The start writeback/wait on
    > writeback functions could then trivially ignore pages that were
    > dirtied/writebacked while the function was in progress.
    >
    > > Hopefully high performance applications are instantiating the file
    > > up-front and are using sync_file_range() to prevent these sorts of
    > > things from happening. But they probably aren't.

    >
    > --- for databases it is pretty much possible that one thread is writing
    > already journaled data (so it doesn't care when the data are really
    > written) and another thread is calling fsync() on the same inode
    > simultaneously --- so fsync() could mistakenly write the data generated by
    > the first thread and ignore the data generated by the second thread, that
    > it should really write.
    >
    > Mikulas


  7. Re: [PATCH] Memory management livelock

    > > > yup, that's pretty much unfixable, really, unless new locks are added
    > > > which block threads which are writing to unrelated sections of the
    > > > file, and that could hurt some workloads quite a lot, I expect.

    > >
    > > It is fixable with the patch I sent --- it doesn't take any locks unless
    > > the starvation happens. Then, you don't have to use .nr_to_write for
    > > fsync anymore.

    >
    > I agree that the patch is low-impact and relatively straightforward.
    > The main problem is making the address_space larger - there can be (and
    > often are) millions and millions of these things in memory. Making it
    > larger is a big deal. We should work hard to seek an alternative and
    > afaict that isn't happening here.
    >
    > We already have existing code and design which attempts to avoid
    > livelock without adding stuff to the address_space. Can it be modified
    > so as to patch up this quite obscure and rarely-occurring problem?


    I reworked my patch to use a bit in address_space->flags and hashed wait
    queues, so it doesn't take any extra memory. I'm sending it in three
    parts.
    1 - make generic function wait_action_schedule
    2 - fix the livelock, the logic is exactly the same as in my previous
    patch, wait_on_bit_lock is used instead of mutexes
    3 - remove that nr_pages * 2 limit, because it causes misbehavior and
    possible data loss.

    Mikulas

  8. [PATCH 1/3] Memory management livelock

    A generic function, wait_action_schedule, that allows wait_on_bit_lock to be
    used just like a mutex.

    Signed-off-by: Mikulas Patocka

    ---
    include/linux/wait.h | 8 +++++++-
    kernel/wait.c | 7 +++++++
    2 files changed, 14 insertions(+), 1 deletion(-)

    Index: linux-2.6.27-rc7-devel/include/linux/wait.h
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/include/linux/wait.h 2008-09-24 03:20:59.000000000 +0200
    +++ linux-2.6.27-rc7-devel/include/linux/wait.h 2008-09-24 03:26:34.000000000 +0200
    @@ -513,7 +513,13 @@ static inline int wait_on_bit_lock(void
    return 0;
    return out_of_line_wait_on_bit_lock(word, bit, action, mode);
    }
    -
    +
    +/**
    + * wait_action_schedule - this function can be passed to wait_on_bit or
    + * wait_on_bit_lock and it will just call schedule().
    + */
    +int wait_action_schedule(void *);
    +
    #endif /* __KERNEL__ */

    #endif
    Index: linux-2.6.27-rc7-devel/kernel/wait.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/kernel/wait.c 2008-09-24 03:22:58.000000000 +0200
    +++ linux-2.6.27-rc7-devel/kernel/wait.c 2008-09-24 03:24:10.000000000 +0200
    @@ -251,3 +251,10 @@ wait_queue_head_t *bit_waitqueue(void *w
    return &zone->wait_table[hash_long(val, zone->wait_table_bits)];
    }
    EXPORT_SYMBOL(bit_waitqueue);
    +
    +int wait_action_schedule(void *word)
    +{
    + schedule();
    + return 0;
    +}
    +EXPORT_SYMBOL(wait_action_schedule);


  9. [PATCH 2/3] Memory management livelock

    Avoid starvation when walking address space.

    Signed-off-by: Mikulas Patocka

    ---
    include/linux/pagemap.h | 1 +
    mm/filemap.c | 20 ++++++++++++++++++++
    mm/page-writeback.c | 37 ++++++++++++++++++++++++++++++++++++-
    mm/truncate.c | 24 +++++++++++++++++++++++-
    4 files changed, 80 insertions(+), 2 deletions(-)

    Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h 2008-09-24 02:57:37.000000000 +0200
    +++ linux-2.6.27-rc7-devel/include/linux/pagemap.h 2008-09-24 02:59:04.000000000 +0200
    @@ -21,6 +21,7 @@
    #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
    #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
    #define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
    +#define AS_STARVATION (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */

    static inline void mapping_set_error(struct address_space *mapping, int error)
    {
    Index: linux-2.6.27-rc7-devel/mm/filemap.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 02:59:33.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 03:13:47.000000000 +0200
    @@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct
    int nr_pages;
    int ret = 0;
    pgoff_t index;
    + long pages_to_process;

    if (end < start)
    return 0;

    + /*
    + * Estimate the number of pages to process. If we process significantly
    + * more than this, someone is making writeback pages under us.
    + * We must pull the anti-starvation plug.
    + */
    + pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
    + pages_to_process += (pages_to_process >> 3) + 16;
    +
    pagevec_init(&pvec, 0);
    index = start;
    while ((index <= end) &&
    @@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct
    if (page->index > end)
    continue;

    + if (pages_to_process >= 0)
    + if (!pages_to_process--)
    + wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
    +
    wait_on_page_writeback(page);
    if (PageError(page))
    ret = -EIO;
    @@ -296,6 +309,13 @@ int wait_on_page_writeback_range(struct
    cond_resched();
    }

    + if (pages_to_process < 0) {
    + smp_mb__before_clear_bit();
    + clear_bit(AS_STARVATION, &mapping->flags);
    + smp_mb__after_clear_bit();
    + wake_up_bit(&mapping->flags, AS_STARVATION);
    + }
    +
    /* Check for outstanding write errors */
    if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
    ret = -ENOSPC;
    Index: linux-2.6.27-rc7-devel/mm/page-writeback.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/page-writeback.c 2008-09-24 03:10:34.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/page-writeback.c 2008-09-24 03:20:24.000000000 +0200
    @@ -435,6 +435,18 @@ static void balance_dirty_pages(struct a

    struct backing_dev_info *bdi = mapping->backing_dev_info;

    + /*
    + * If there is sync() starving on this address space, block
    + * writers until it finishes.
    + */
    + if (unlikely(test_bit(AS_STARVATION, &mapping->flags))) {
    + wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
    + smp_mb__before_clear_bit();
    + clear_bit(AS_STARVATION, &mapping->flags);
    + smp_mb__after_clear_bit();
    + wake_up_bit(&mapping->flags, AS_STARVATION);
    + }
    +
    for (;;) {
    struct writeback_control wbc = {
    .bdi = bdi,
    @@ -876,12 +888,21 @@ int write_cache_pages(struct address_spa
    pgoff_t end; /* Inclusive */
    int scanned = 0;
    int range_whole = 0;
    + long pages_to_process;

    if (wbc->nonblocking && bdi_write_congested(bdi)) {
    wbc->encountered_congestion = 1;
    return 0;
    }

    + /*
    + * Estimate the number of pages to process. If we process significantly
    + * more than this, someone is making dirty pages under us.
    + * Pull the anti-starvation plug to stop him.
    + */
    + pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    + pages_to_process += (pages_to_process >> 3) + 16;
    +
    pagevec_init(&pvec, 0);
    if (wbc->range_cyclic) {
    index = mapping->writeback_index; /* Start from prev offset */
    @@ -902,7 +923,13 @@ retry:

    scanned = 1;
    for (i = 0; i < nr_pages; i++) {
    - struct page *page = pvec.pages[i];
    + struct page *page;
    +
    + if (pages_to_process >= 0)
    + if (!pages_to_process--)
    + wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
    +
    + page = pvec.pages[i];

    /*
    * At this point we hold neither mapping->tree_lock nor
    @@ -949,6 +976,14 @@ retry:
    pagevec_release(&pvec);
    cond_resched();
    }
    +
    + if (pages_to_process < 0) {
    + smp_mb__before_clear_bit();
    + clear_bit(AS_STARVATION, &mapping->flags);
    + smp_mb__after_clear_bit();
    + wake_up_bit(&mapping->flags, AS_STARVATION);
    + }
    +
    if (!scanned && !done) {
    /*
    * We hit the last page and there is more work to be done: wrap
    Index: linux-2.6.27-rc7-devel/mm/truncate.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/truncate.c 2008-09-24 03:16:15.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/truncate.c 2008-09-24 03:18:00.000000000 +0200
    @@ -392,6 +392,14 @@ int invalidate_inode_pages2_range(struct
    int ret2 = 0;
    int did_range_unmap = 0;
    int wrapped = 0;
    + long pages_to_process;
    +
    + /*
    + * Estimate number of pages to process. If we process more, someone
    + * is making pages under us.
    + */
    + pages_to_process = mapping->nrpages;
    + pages_to_process += (pages_to_process >> 3) + 16;

    pagevec_init(&pvec, 0);
    next = start;
    @@ -399,9 +407,15 @@ int invalidate_inode_pages2_range(struct
    pagevec_lookup(&pvec, mapping, next,
    min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
    for (i = 0; i < pagevec_count(&pvec); i++) {
    - struct page *page = pvec.pages[i];
    + struct page *page;
    pgoff_t page_index;

    + if (pages_to_process >= 0)
    + if (!pages_to_process--)
    + wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);
    +
    + page = pvec.pages[i];
    +
    lock_page(page);
    if (page->mapping != mapping) {
    unlock_page(page);
    @@ -449,6 +463,14 @@ int invalidate_inode_pages2_range(struct
    pagevec_release(&pvec);
    cond_resched();
    }
    +
    + if (pages_to_process < 0) {
    + smp_mb__before_clear_bit();
    + clear_bit(AS_STARVATION, &mapping->flags);
    + smp_mb__after_clear_bit();
    + wake_up_bit(&mapping->flags, AS_STARVATION);
    + }
    +
    return ret;
    }
    EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);


  10. [PATCH 3/3] Memory management livelock

    Fix violation of sync()/fsync() semantics. Previous code walked up to
    mapping->nrpages * 2 pages. Because pages could be created while
    __filemap_fdatawrite_range was in progress, it could lead to misbehavior.
    Example: there are two pages in address space with indices 4, 5. Both are dirty.
    Someone calls __filemap_fdatawrite_range, it sets .nr_to_write = 4.
    Meanwhile, some other process creates dirty pages 0, 1, 2, 3.
    __filemap_fdatawrite_range writes pages 0, 1, 2, 3, finds out that it reached
    the limit and exits.
    Result: pages that were dirty before __filemap_fdatawrite_range was invoked were
    not written.

    With starvation protection from the previous patch, this mapping->nrpages * 2
    logic is no longer needed.

    Signed-off-by: Mikulas Patocka

    ---
    mm/filemap.c | 7 ++++++-
    1 file changed, 6 insertions(+), 1 deletion(-)

    Index: linux-2.6.27-rc7-devel/mm/filemap.c
    ===================================================================
    --- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 14:47:01.000000000 +0200
    +++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 15:01:23.000000000 +0200
    @@ -202,6 +202,11 @@ static int sync_page_killable(void *word
    * opposed to a regular memory cleansing writeback. The difference between
    * these two operations is that if a dirty page/buffer is encountered, it must
    * be waited upon, and not just skipped over.
    + *
    + * Because new dirty pages can be created while this is executing, the
    + * mapping->nrpages * 2 condition is unsafe. If we are doing a data integrity
    + * write, we must write all the pages. The AS_STARVATION bit will eventually
    + * prevent the creation of more dirty pages, to avoid starvation.
    */
    int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
    loff_t end, int sync_mode)
    @@ -209,7 +214,7 @@ int __filemap_fdatawrite_range(struct ad
    int ret;
    struct writeback_control wbc = {
    .sync_mode = sync_mode,
    - .nr_to_write = mapping->nrpages * 2,
    + .nr_to_write = sync_mode == WB_SYNC_NONE ? mapping->nrpages * 2 : LONG_MAX,
    .range_start = start,
    .range_end = end,
    };


  11. Re: [PATCH 2/3] Memory management livelock


    > Subject: [PATCH 2/3] Memory management livelock


    Please don't send multiple patches with identical titles - think up a
    good, unique, meaningful title for each patch.

    On Wed, 24 Sep 2008 14:52:18 -0400 (EDT) Mikulas Patocka wrote:

    > Avoid starvation when walking address space.
    >
    > Signed-off-by: Mikulas Patocka
    >


    Please include a full changelog with each iteration of each patch.

    That changelog should explain the reason for playing games with
    bitlocks so Linus doesn't have kittens when he sees it.

    > include/linux/pagemap.h | 1 +
    > mm/filemap.c | 20 ++++++++++++++++++++
    > mm/page-writeback.c | 37 ++++++++++++++++++++++++++++++++++++-
    > mm/truncate.c | 24 +++++++++++++++++++++++-
    > 4 files changed, 80 insertions(+), 2 deletions(-)
    >
    > Index: linux-2.6.27-rc7-devel/include/linux/pagemap.h
    > ===================================================================
    > --- linux-2.6.27-rc7-devel.orig/include/linux/pagemap.h 2008-09-24 02:57:37.000000000 +0200
    > +++ linux-2.6.27-rc7-devel/include/linux/pagemap.h 2008-09-24 02:59:04.000000000 +0200
    > @@ -21,6 +21,7 @@
    > #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
    > #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
    > #define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
    > +#define AS_STARVATION (__GFP_BITS_SHIFT + 3) /* an anti-starvation barrier */
    >
    > static inline void mapping_set_error(struct address_space *mapping, int error)
    > {
    > Index: linux-2.6.27-rc7-devel/mm/filemap.c
    > ===================================================================
    > --- linux-2.6.27-rc7-devel.orig/mm/filemap.c 2008-09-24 02:59:33.000000000 +0200
    > +++ linux-2.6.27-rc7-devel/mm/filemap.c 2008-09-24 03:13:47.000000000 +0200
    > @@ -269,10 +269,19 @@ int wait_on_page_writeback_range(struct
    > int nr_pages;
    > int ret = 0;
    > pgoff_t index;
    > + long pages_to_process;
    >
    > if (end < start)
    > return 0;
    >
    > + /*
    > + * Estimate the number of pages to process. If we process significantly
    > + * more than this, someone is making writeback pages under us.
    > + * We must pull the anti-starvation plug.
    > + */
    > + pages_to_process = bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
    > + pages_to_process += (pages_to_process >> 3) + 16;


    This sequence appears twice and it would probably be clearer to
    implement it in a well-commented function.

    > pagevec_init(&pvec, 0);
    > index = start;
    > while ((index <= end) &&
    > @@ -288,6 +297,10 @@ int wait_on_page_writeback_range(struct
    > if (page->index > end)
    > continue;
    >
    > + if (pages_to_process >= 0)
    > + if (!pages_to_process--)
    > + wait_on_bit_lock(&mapping->flags, AS_STARVATION, wait_action_schedule, TASK_UNINTERRUPTIBLE);


    This is copied three times and perhaps also should be factored out.

    Please note that an effort has been made to make mm/filemap.c look
    presentable in an 80-col display.

    > wait_on_page_writeback(page);
    > if (PageError(page))
    > ret = -EIO;
    > @@ -296,6 +309,13 @@ int wait_on_page_writeback_range(struct
    > cond_resched();
    > }
    >
    > + if (pages_to_process < 0) {
    > + smp_mb__before_clear_bit();
    > + clear_bit(AS_STARVATION, &mapping->flags);
    > + smp_mb__after_clear_bit();
    > + wake_up_bit(&mapping->flags, AS_STARVATION);
    > + }


    This sequence is repeated three or four times and should be pulled out
    into a well-commented function. That comment should explain the logic
    behind the use of these barriers, please.
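
    Something like this, presumably (a sketch of the requested factoring;
    the helper names are invented here):

        static long starvation_protection_estimate(long pages)
        {
                /* Allow 1/8 slack plus a small constant before plugging. */
                return pages + (pages >> 3) + 16;
        }

        static void starvation_protection_plug(struct address_space *mapping)
        {
                /* Walked past the estimate: block behind the barrier bit. */
                wait_on_bit_lock(&mapping->flags, AS_STARVATION,
                                 wait_action_schedule, TASK_UNINTERRUPTIBLE);
        }

        static void starvation_protection_unplug(struct address_space *mapping)
        {
                /*
                 * The barriers order the bit clear against the wait-queue
                 * checks in wake_up_bit(), so no sleeper misses the wakeup.
                 */
                smp_mb__before_clear_bit();
                clear_bit(AS_STARVATION, &mapping->flags);
                smp_mb__after_clear_bit();
                wake_up_bit(&mapping->flags, AS_STARVATION);
        }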



  12. Re: [PATCH] Memory management livelock

    On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
    > On Tue, 23 Sep 2008 18:34:20 -0400 (EDT)
    >
    > Mikulas Patocka wrote:
    > > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT)
    > > >
    > > > Mikulas Patocka wrote:
    > > > > The bug happens when one process is doing sequential buffered writes
    > > > > to a block device (or file) and another process is attempting to
    > > > > execute sync(), fsync() or direct-IO on that device (or file). This
    > > > > syncing process will wait indefinitely, until the first writing
    > > > > process finishes.
    > > > >
    > > > > For example, run these two commands:
    > > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
    > > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
    > > > >
    > > > > The bug is caused by sequential walking of address space in
    > > > > write_cache_pages and wait_on_page_writeback_range: if some other
    > > > > process is constantly making dirty and writeback pages while these
    > > > > functions run, the functions will wait on every new page, resulting
    > > > > in indefinite wait.


    I think the problem has been misidentified, or else I have misread the
    code. See below. I hope I'm right, because I think the patches are pretty
    heavy on complexity in these already complex paths...

    It would help if you explicitly identify the exact livelock. Ie. give a
    sequence of behaviour that leads to our progress rate falling to zero.


    > > > Shouldn't happen. All the data-syncing functions should have an upper
    > > > bound on the number of pages which they attempt to write. In the
    > > > example above, we end up in here:
    > > >
    > > > int __filemap_fdatawrite_range(struct address_space *mapping, loff_t
    > > > start,
    > > > loff_t end, int sync_mode)
    > > > {
    > > > int ret;
    > > > struct writeback_control wbc = {
    > > > .sync_mode = sync_mode,
    > > > .nr_to_write = mapping->nrpages * 2, <<--
    > > > .range_start = start,
    > > > .range_end = end,
    > > > };
    > > >
    > > > so generic_file_direct_write()'s filemap_write_and_wait() will attempt
    > > > to write at most 2* the number of pages which are in cache for that
    > > > inode.

    > >
    > > See write_cache_pages:
    > >
    > > if (wbc->sync_mode != WB_SYNC_NONE)
    > > wait_on_page_writeback(page); (1)
    > > if (PageWriteback(page) ||
    > > !clear_page_dirty_for_io(page)) {
    > > unlock_page(page); (2)
    > > continue;
    > > }
    > > ret = (*writepage)(page, wbc, data);
    > > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
    > > unlock_page(page);
    > > ret = 0;
    > > }
    > > if (ret || (--(wbc->nr_to_write) <= 0))
    > > done = 1;
    > >
    > > --- so if it goes by points (1) and (2), the counter is not decremented,
    > > yet the function waits for the page. If there is a constant stream of
    > > writeback pages being generated, it waits on each of them --- that is,
    > > forever.


    *What* is, forever? Data integrity syncs should have pages operated on
    in-order, until we get to the end of the range. Circular writeback could
    go through again, possibly, but no more than once.


    > > I have seen livelock in this function. For you that example with
    > > two dd's, one buffered write and the other directIO read doesn't work?
    > > For me it livelocks here.
    > >
    > > wait_on_page_writeback_range is another example where the livelock
    > > happened, there is no protection at all against starvation.

    >
    > um, OK. So someone else is initiating IO for this inode and this
    > thread *never* gets to initiate any writeback. That's a bit of a
    > surprise.
    >
    > How do we fix that? Maybe decrement nt_to_write for these pages as
    > well?


    What's the actual problem, though? nr_to_write should not be used for
    data integrity operations, and it should not be critical for other
    writeout. Upper layers should be able to deal with it rather than
    have us lying to them.


    > > BTW. that .nr_to_write = mapping->nrpages * 2 looks like a dangerous
    > > thing to me.
    > >
    > > Imagine this case: You have two pages with indices 4 and 5 dirty in a
    > > file. You call fsync(). It sets nr_to_write to 4.
    > >
    > > Meanwhile, another process makes pages 0, 1, 2, 3 dirty.
    > >
    > > The fsync() process goes to write_cache_pages, writes the first 4 dirty
    > > pages and exits because it goes over the limit.
    > >
    > > result --- you violate fsync() semantics, pages that were dirty before
    > > call to fsync() are not written when fsync() exits.


    Wow, that's really nasty. Sad we still have known data integrity problems
    in such core functions.


    > yup, that's pretty much unfixable, really, unless new locks are added
    > which block threads which are writing to unrelated sections of the
    > file, and that could hurt some workloads quite a lot, I expect.


    Why is it unfixable? Just ignore nr_to_write, and write out everything
    properly, I would have thought.

    Some things may go a tad slower, but those are going to be the things
    that are using fsync, in which cases they are going to hurt much more
    from the loss of data integrity than a slowdown.

    Unfortunately because we have played fast and loose for so long, they
    expect this behaviour, were tested and optimised with it, and systems
    designed and deployed with it, and will notice performance regressions
    if we start trying to do things properly. This is one of my main
    arguments for doing things correctly up-front, even if it means a
    massive slowdown in some real or imagined workload: at least then we
    will hear about complaints and be able to try to improve them rather
    than setting ourselves up for failure later.
    /rant

    Anyway, in this case, I don't think there would be really big problems.
    Also, I think there is a reasonable optimisation that might improve it
    (2nd last point, in attached patch).

    OK, so after glancing at the code... wow, it seems like there are a lot
    of bugs in there.


  13. Re: [PATCH] Memory management livelock

    On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin wrote:

    > > yup, that's pretty much unfixable, really, unless new locks are added
    > > which block threads which are writing to unrelated sections of the
    > > file, and that could hurt some workloads quite a lot, I expect.

    >
    > Why is it unfixable? Just ignore nr_to_write, and write out everything
    > properly, I would have thought.


    That can cause fsync to wait arbitrarily long if some other process is
    writing the file. This happens.

  14. Re: [PATCH] Memory management livelock

    On Friday 03 October 2008 12:40, Andrew Morton wrote:
    > On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin wrote:
    > > > yup, that's pretty much unfixable, really, unless new locks are added
    > > > which block threads which are writing to unrelated sections of the
    > > > file, and that could hurt some workloads quite a lot, I expect.

    > >
    > > Why is it unfixable? Just ignore nr_to_write, and write out everything
    > > properly, I would have thought.

    >
    > That can cause fsync to wait arbitrarily long if some other process is
    > writing the file.


    It can be fixed without touching non-fsync paths (see my next email for
    the way to fix it without touching fastpaths).


    > This happens.


    What does such a thing? It would have been nicer to ask them to not do
    that then, or get them to use range syncs or something. Now that's much
    harder because we've accepted the crappy workaround for so long.

    It's far far worse to just ignore data integrity of fsync because of the
    problem. Should at least have returned an error from fsync in that case,
    or make it interruptible or something.

  15. Re: [PATCH] Memory management livelock

    On Friday 03 October 2008 12:32, Nick Piggin wrote:
    > On Wednesday 24 September 2008 08:49, Andrew Morton wrote:
    > > On Tue, 23 Sep 2008 18:34:20 -0400 (EDT) Mikulas Patocka wrote:
    > > > > On Mon, 22 Sep 2008 17:10:04 -0400 (EDT) Mikulas Patocka wrote:
    > > > > > The bug happens when one process is doing sequential buffered
    > > > > > writes to a block device (or file) and another process is
    > > > > > attempting to execute sync(), fsync() or direct-IO on that device
    > > > > > (or file). This syncing process will wait indefinitely, until the
    > > > > > first writing process finishes.
    > > > > >
    > > > > > For example, run these two commands:
    > > > > > dd if=/dev/zero of=/dev/sda1 bs=65536 &
    > > > > > dd if=/dev/sda1 of=/dev/null bs=4096 count=1 iflag=direct
    > > > > >
    > > > > > The bug is caused by sequential walking of address space in
    > > > > > write_cache_pages and wait_on_page_writeback_range: if some other
    > > > > > process is constantly making dirty and writeback pages while these
    > > > > > functions run, the functions will wait on every new page, resulting
    > > > > > in indefinite wait.

    >
    > I think the problem has been misidentified, or else I have misread the
    > code. See below. I hope I'm right, because I think the patches are pretty
    > heavy on complexity in these already complex paths...
    >
    > It would help if you explicitly identify the exact livelock. I.e., give a
    > sequence of behaviour that leads to our progress rate falling to zero.
    >
    > > > > Shouldn't happen. All the data-syncing functions should have an upper
    > > > > bound on the number of pages which they attempt to write. In the
    > > > > example above, we end up in here:
    > > > >
    > > > > int __filemap_fdatawrite_range(struct address_space *mapping,
    > > > >                                loff_t start, loff_t end, int sync_mode)
    > > > > {
    > > > >         int ret;
    > > > >         struct writeback_control wbc = {
    > > > >                 .sync_mode   = sync_mode,
    > > > >                 .nr_to_write = mapping->nrpages * 2,    <<--
    > > > >                 .range_start = start,
    > > > >                 .range_end   = end,
    > > > >         };
    > > > >
    > > > > so generic_file_direct_write()'s filemap_write_and_wait() will
    > > > > attempt to write at most 2* the number of pages which are in cache
    > > > > for that inode.
    > > >
    > > > See write_cache_pages:
    > > >
    > > > if (wbc->sync_mode != WB_SYNC_NONE)
    > > >         wait_on_page_writeback(page);                   (1)
    > > > if (PageWriteback(page) ||
    > > >     !clear_page_dirty_for_io(page)) {
    > > >         unlock_page(page);                              (2)
    > > >         continue;
    > > > }
    > > > ret = (*writepage)(page, wbc, data);
    > > > if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
    > > >         unlock_page(page);
    > > >         ret = 0;
    > > > }
    > > > if (ret || (--(wbc->nr_to_write) <= 0))
    > > >         done = 1;
    > > >
    > > > --- so if it goes by points (1) and (2), the counter is not
    > > > decremented, yet the function waits for the page. If there is a constant
    > > > stream of writeback pages being generated, it waits on each of them ---
    > > > that is, forever.

    >
    > *What* is, forever? Data integrity syncs should have pages operated on
    > in-order, until we get to the end of the range. Circular writeback could
    > go through again, possibly, but no more than once.


    OK, I have been able to reproduce it somewhat. It is not a livelock,
    but what is happening is that direct IO read basically does an fsync
    on the file before performing the IO. The fsync gets stuck behind the
    dd that is dirtying the pages, and ends up following behind it and
    doing all its IO for it.

    The following patch avoids the issue for direct IO, by using the range
    syncs rather than trying to sync the whole file.

    The underlying problem I guess is unchanged. Is it really a problem,
    though? The way I'd love to solve it is actually by adding another bit
    or two to the pagecache radix tree, that can be used to transiently tag
    the tree for future operations. That way we could record the dirty and
    writeback pages up front, and then only bother with operating on them.

    That's *if* it really is a problem. I don't have much pity for someone
    doing buffered IO and direct IO to the same pages of the same file.
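
    For reference, the shape of that range-sync fix for direct IO might look
    roughly like the sketch below, assuming the stock
    filemap_write_and_wait_range() helper (the actual patch attached to this
    message is not reproduced here):

    /* Sketch: before a direct-IO transfer of @count bytes at @pos, flush
     * and wait on just that byte range, so the sync cannot end up
     * trailing a writer that keeps dirtying other parts of the file. */
    static int dio_sync_range(struct address_space *mapping,
                              loff_t pos, size_t count)
    {
            if (!mapping->nrpages || !count)
                    return 0;
            return filemap_write_and_wait_range(mapping, pos,
                                                pos + count - 1);
    }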


  16. Re: [PATCH] Memory management livelock

    On Fri, 3 Oct 2008 12:59:17 +1000 Nick Piggin wrote:

    > On Friday 03 October 2008 12:40, Andrew Morton wrote:
    > > On Fri, 3 Oct 2008 12:32:23 +1000 Nick Piggin wrote:
    > > > > yup, that's pretty much unfixable, really, unless new locks are added
    > > > > which block threads which are writing to unrelated sections of the
    > > > > file, and that could hurt some workloads quite a lot, I expect.
    > > >
    > > > Why is it unfixable? Just ignore nr_to_write, and write out everything
    > > > properly, I would have thought.

    > >
    > > That can cause fsync to wait arbitrarily long if some other process is
    > > writing the file.

    >
    > It can be fixed without touching non-fsync paths (see my next email for
    > the way to fix it without touching fastpaths).
    >


    yup.

    >
    > > This happens.

    >
    > What does such a thing?


    I forget. People do all sorts of weird stuff.

    > It would have been nicer to ask them to not do
    > that then, or get them to use range syncs or something. Now that's much
    > harder because we've accepted the crappy workaround for so long.
    >
    > It's far far worse to just ignore data integrity of fsync because of the
    > problem. Should at least have returned an error from fsync in that case,
    > or make it interruptible or something.


    If a file has one dirty page at offset 1000000000000000 then someone
    does an fsync() and someone else gets in first and starts madly writing
    pages at offset 0, we want to write that page at 1000000000000000.
    Somehow.

    I expect there's no solution which avoids blocking the writers at some
    stage.

  17. Re: [PATCH] Memory management livelock

    On Friday 03 October 2008 13:14, Andrew Morton wrote:
    > On Fri, 3 Oct 2008 12:59:17 +1000 Nick Piggin wrote:
    > > On Friday 03 October 2008 12:40, Andrew Morton wrote:


    > > > That can cause fsync to wait arbitrarily long if some other process is
    > > > writing the file.

    > >
    > > It can be fixed without touching non-fsync paths (see my next email for
    > > the way to fix it without touching fastpaths).

    >
    > yup.
    >
    > > > This happens.

    > >
    > > What does such a thing?

    >
    > I forget. People do all sorts of weird stuff.


    Damn people...

    I guess they also do non-weird stuff like expecting fsync to really fsync.


    > > It would have been nicer to ask them to not do
    > > that then, or get them to use range syncs or something. Now that's much
    > > harder because we've accepted the crappy workaround for so long.
    > >
    > > It's far far worse to just ignore data integrity of fsync because of the
    > > problem. Should at least have returned an error from fsync in that case,
    > > or make it interruptible or something.

    >
    > If a file has one dirty page at offset 1000000000000000 then someone
    > does an fsync() and someone else gets in first and starts madly writing
    > pages at offset 0, we want to write that page at 1000000000000000.
    > Somehow.
    >
    > I expect there's no solution which avoids blocking the writers at some
    > stage.


    See my other email. Something roughly like this would do the trick
    (hey, it actually boots and runs and does fix the problem too).

    It's ugly because we don't have quite the right radix tree operations
    yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
    lookups). But the theory is to up-front tag the pages that we need to
    get to disk.

    Completely no impact or slowdown to any writers (although it does add
    8 bytes of tags to the radix tree node... but doesn't increase memory
    footprint as such due to slab).
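
    In outline, the tag-then-operate scheme being described might look like
    the sketch below. Every *_fsync name here is hypothetical (the thread
    itself notes the needed radix-tree operations don't exist yet); the
    point is only that the set of pages to write and wait on is fixed up
    front, so pages dirtied later by other threads are never visited:

    /* Sketch of up-front tagging for a data-integrity sync.
     * All three helpers are hypothetical. */
    int fsync_range_sketch(struct address_space *mapping,
                           pgoff_t start, pgoff_t end)
    {
            /* 1. Under tree_lock, copy the DIRTY and WRITEBACK tags
             *    in [start, end] to a transient FSYNC tag. */
            tag_pages_for_fsync(mapping, start, end);

            /* 2. Write out only FSYNC-tagged dirty pages. */
            write_fsync_tagged_pages(mapping, start, end);

            /* 3. Wait on only FSYNC-tagged pages, clearing the tag as
             *    each is visited. Later dirtiers never gain the tag,
             *    so this terminates. */
            return wait_on_fsync_tagged_pages(mapping, start, end);
    }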


  18. Re: [PATCH] Memory management livelock

    On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin wrote:

    > > I expect there's no solution which avoids blocking the writers at some
    > > stage.

    >
    > See my other email. Something roughly like this would do the trick
    > (hey, it actually boots and runs and does fix the problem too).


    It needs exclusion to protect all those temp tags. Is do_fsync()'s
    i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
    the callers of this stuff are running under that lock.

    > It's ugly because we don't have quite the right radix tree operations
    > yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
    > lookups). But the theory is to up-front tag the pages that we need to
    > get to disk.


    Perhaps some callback-calling radix tree walker.

    > Completely no impact or slowdown to any writers (although it does add
    > 8 bytes of tags to the radix tree node... but doesn't increase memory
    > footprint as such due to slab).


    Can we reduce the amount of copy-n-pasting here?

  19. Re: [PATCH] Memory management livelock

    On Friday 03 October 2008 13:56, Andrew Morton wrote:
    > On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin wrote:
    > > > I expect there's no solution which avoids blocking the writers at some
    > > > stage.

    > >
    > > See my other email. Something roughly like this would do the trick
    > > (hey, it actually boots and runs and does fix the problem too).

    >
    > It needs exclusion to protect all those temp tags. Is do_fsync()'s
    > i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
    > the callers of this stuff are running under that lock.


    Yeah... it does need a lock, which I brushed under the carpet :P
    I was going to just say use i_mutex, but then we really would start
    impacting on other fastpaths (eg writers).

    Possibly a new mutex in the address_space? That way we can say
    "anybody who holds this mutex is allowed to use the tag for anything"
    and it doesn't have to be fsync specific (whether that would be of
    any use to anything else, I don't know).
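
    Tying that to the tagging scheme, the locking might look like this
    (a sketch: tag_lock is an invented field, fsync_range_sketch() is the
    invented helper from the sketch two messages up, and the reply below
    points out the size cost of any new mutex in the address_space):

    /* Hypothetical: whoever holds mapping->tag_lock owns the transient
     * tag and may use it for any range operation, fsync or otherwise. */
    static int fsync_tagged_range(struct address_space *mapping,
                                  pgoff_t start, pgoff_t end)
    {
            int err;

            mutex_lock(&mapping->tag_lock);
            err = fsync_range_sketch(mapping, start, end);
            mutex_unlock(&mapping->tag_lock);
            return err;
    }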


    > > It's ugly because we don't have quite the right radix tree operations
    > > yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
    > > lookups). But the theory is to up-front tag the pages that we need to
    > > get to disk.

    >
    > Perhaps some callback-calling radix tree walker.


    Possibly, yes. That would make it fairly general. I'll have a look...


    > > Completely no impact or slowdown to any writers (although it does add
    > > 8 bytes of tags to the radix tree node... but doesn't increase memory
    > > footprint as such due to slab).

    >
    > Can we reduce the amount of copy-n-pasting here?


    Yeah... I went to break the sync/async cases into two, but it looks like
    it may not have been worthwhile. Just another branch might be the best
    way to go.

    As far as the c&p in setting the FSYNC tag, yes that should all go away
    if the radix-tree is up to scratch. Basically:

    radix_tree_tag_set_if_tagged(start, end, ifWRITEBACK|DIRTY, setFSYNC);

    should be able to replace the whole thing, and we'd hold the tree_lock, so
    we would not have to take the page lock etc. Basically it would be much
    nicer... even somewhere close to a viable solution.
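
    The operation being wished for might have a shape like this (purely
    hypothetical; no such radix-tree primitive exists at this point, which
    is the point being made):

    /* Hypothetical: for every page in [first, last] carrying any tag in
     * @iftags, also set @settag. Runs under mapping->tree_lock, so no
     * page locks are needed. Returns how many slots were newly tagged. */
    unsigned long radix_tree_tag_set_if_tagged(struct radix_tree_root *root,
                                               unsigned long first,
                                               unsigned long last,
                                               unsigned int iftags,
                                               unsigned int settag);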

  20. Re: [PATCH] Memory management livelock

    On Fri, 3 Oct 2008 14:07:55 +1000 Nick Piggin wrote:

    > On Friday 03 October 2008 13:56, Andrew Morton wrote:
    > > On Fri, 3 Oct 2008 13:47:21 +1000 Nick Piggin wrote:
    > > > > I expect there's no solution which avoids blocking the writers at some
    > > > > stage.
    > > >
    > > > See my other email. Something roughly like this would do the trick
    > > > (hey, it actually boots and runs and does fix the problem too).

    > >
    > > It needs exclusion to protect all those temp tags. Is do_fsync()'s
    > > i_mutex sufficient? It's quite unobvious (and unmaintainable?) that all
    > > the callers of this stuff are running under that lock.

    >
    > Yeah... it does need a lock, which I brushed under the carpet :P
    > I was going to just say use i_mutex, but then we really would start
    > impacting on other fastpaths (eg writers).
    >
    > Possibly a new mutex in the address_space?


    That's another, umm 24 bytes minimum in the address_space (and inode).
    That's fairly ouch, which is why Mikulas did that hokey bit-based
    thing.

    > That way we can say
    > "anybody who holds this mutex is allowed to use the tag for anything"
    > and it doesn't have to be fsync specific (whether that would be of
    > any use to anything else, I don't know).
    >
    >
    > > > It's ugly because we don't have quite the right radix tree operations
    > > > yet (eg. lookup multiple tags, set tag X if tag Y was set, proper range
    > > > lookups). But the theory is to up-front tag the pages that we need to
    > > > get to disk.

    > >
    > > Perhaps some callback-calling radix tree walker.

    >
    > Possibly, yes. That would make it fairly general. I'll have a look...
    >
    >
    > > > Completely no impact or slowdown to any writers (although it does add
    > > > 8 bytes of tags to the radix tree node... but doesn't increase memory
    > > > footprint as such due to slab).

    > >
    > > Can we reduce the amount of copy-n-pasting here?

    >
    > Yeah... I went to break the sync/async cases into two, but it looks like
    > it may not have been worthwhile. Just another branch might be the best
    > way to go.


    Yup. Could add another do-this flag in the writeback_control, perhaps.
    Or even a function pointer.
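
    A sketch of the flag-based variant (wbc->for_fsync and
    PAGECACHE_TAG_FSYNC are invented here; radix_tree_tag_get() is a real
    primitive):

    /* Hypothetical: inside write_cache_pages(), skip pages that are not
     * part of the pre-tagged fsync snapshot, rather than duplicating
     * the whole walker for the sync case. */
    if (wbc->for_fsync &&
        !radix_tree_tag_get(&mapping->page_tree, page->index,
                            PAGECACHE_TAG_FSYNC)) {
            unlock_page(page);
            continue;
    }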

    > As far as the c&p in setting the FSYNC tag, yes that should all go away
    > if the radix-tree is up to scratch. Basically:
    >
    > radix_tree_tag_set_if_tagged(start, end, ifWRITEBACK|DIRTY, setFSYNC);
    >
    > should be able to replace the whole thing, and we'd hold the tree_lock, so
    > we would not have to take the page lock etc. Basically it would be much
    > nicer... even somewhere close to a viable solution.

