[PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment - Kernel

This is a discussion on [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment - Kernel ; Hi. When we read some part of a file through pagecache, if there is a pagecache of corresponding index but this page is not uptodate, read IO is issued and this page will be uptodate. I think this is good ...

+ Reply to Thread
Results 1 to 15 of 15

Thread: [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment

  1. [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment

    Hi.

    When we read some part of a file through pagecache, if there is a pagecache
    of corresponding index but this page is not uptodate, read IO is issued and
    this page will be uptodate.
    I think this is good for pagesize == blocksize environment but there is room
    for improvement on pagesize != blocksize environment. Because in this case
    a page can have multiple buffers and even if a page is not uptodate, some buffers
    can be uptodate. So I suggest that when all buffers which correspond to a part
    of a file that we want to read are uptodate, use this pagecache and copy data
    from this pagecache to user buffer even if a page is not uptodate. This can
    reduce read IO and improve system throughput.

    I did a performance test using the sysbench.

    #sysbench --num-threads=4 --max-requests=120000 --test=fileio --file-num=1 --file-block-size=1K --file-total-size=100M --file-test-mode=rndrw --file-fsync-freq=0 --file-rw-ratio=0.5 run

    The result was:

    -- 2.6.26-rc3
    Operations performed: 40002 Read, 79998 Write, 1 Other = 120001 Total
    Read 39.064Mb Written 78.123Mb Total transferred 117.19Mb (375Kb/sec)
    375.00 Requests/sec executed

    Test execution summary:
    total time: 320.0027s
    total number of events: 120000
    total time taken by event execution: 1231.5564
    per-request statistics:
    min: 0.0000s
    avg: 0.0103s
    max: 2.7605s
    approx. 95 percentile: 0.0381s


    -- 2.6.26-rc3-patched
    Operations performed: 40002 Read, 79998 Write, 1 Other = 120001 Total
    Read 39.064Mb Written 78.123Mb Total transferred 117.19Mb (409.78Kb/sec)
    409.78 Requests/sec executed

    Test execution summary:
    total time: 292.8406s
    total number of events: 120000
    total time taken by event execution: 1106.3995
    per-request statistics:
    min: 0.0000s
    avg: 0.0092s
    max: 3.7366s
    approx. 95 percentile: 0.0327s


    arch:i386
    filesystem:ext3
    blocksize:1024 bytes
    Memory: 1GB

    Random read/write throughput was somewhat improved with following patch.
    Thanks.

    Signed-off-by :Hisashi Hifumi

    diff -Nrup linux-2.6.26-rc3.org/fs/buffer.c linux-2.6.26-rc3/fs/buffer.c
    --- linux-2.6.26-rc3.org/fs/buffer.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/buffer.c 2008-05-19 14:29:25.000000000 +0900
    @@ -2084,6 +2084,48 @@ int generic_write_end(struct file *file,
    EXPORT_SYMBOL(generic_write_end);

    /*
    + * check_buffers_uptodate checks whether buffers within a page are
    + * uptodate or not.
    + *
    + * Returns true if all buffers which correspond to a file portion
    + * we want to read are uptodate.
    + */
    +int check_buffers_uptodate(unsigned long from,
    + read_descriptor_t *desc, struct page *page)
    +{
    + struct inode *inode = page->mapping->host;
    + unsigned long block_start, block_end, blocksize;
    + unsigned long to;
    + struct buffer_head *bh, *head;
    + int ret = 1;
    +
    + blocksize = 1 << inode->i_blkbits;
    + to = from + desc->count;
    + if (to > PAGE_CACHE_SIZE)
    + to = PAGE_CACHE_SIZE;
    + if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
    + return 0;
    +
    + head = page_buffers(page);
    +
    + for (bh = head, block_start = 0; bh != head || !block_start;
    + block_start = block_end, bh = bh->b_this_page) {
    + block_end = block_start + blocksize;
    + if (block_end <= from || block_start >= to)
    + continue;
    + else {
    + if (!buffer_uptodate(bh)) {
    + ret = 0;
    + break;
    + }
    + if (block_end >= to)
    + break;
    + }
    + }
    + return ret;
    +}
    +
    +/*
    * Generic "read page" function for block devices that have the normal
    * get_block functionality. This is most of the block device filesystems.
    * Reads the page asynchronously --- the unlock_buffer() and
    diff -Nrup linux-2.6.26-rc3.org/include/linux/buffer_head.h linux-2.6.26-rc3/include/linux/buffer_head.h
    --- linux-2.6.26-rc3.org/include/linux/buffer_head.h 2008-05-19 11:35:11.000000000 +0900
    +++ linux-2.6.26-rc3/include/linux/buffer_head.h 2008-05-19 12:13:46.000000000 +0900
    @@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
    int block_write_full_page(struct page *page, get_block_t *get_block,
    struct writeback_control *wbc);
    int block_read_full_page(struct page*, get_block_t*);
    +int check_buffers_uptodate(unsigned long from,
    + read_descriptor_t *desc, struct page *page);
    int block_write_begin(struct file *, struct address_space *,
    loff_t, unsigned, unsigned,
    struct page **, void **, get_block_t*);
    diff -Nrup linux-2.6.26-rc3.org/mm/filemap.c linux-2.6.26-rc3/mm/filemap.c
    --- linux-2.6.26-rc3.org/mm/filemap.c 2008-05-19 11:35:11.000000000 +0900
    +++ linux-2.6.26-rc3/mm/filemap.c 2008-05-19 14:29:23.000000000 +0900
    @@ -932,8 +932,16 @@ find_page:
    ra, filp, page,
    index, last_index - index);
    }
    - if (!PageUptodate(page))
    - goto page_not_up_to_date;
    + if (!PageUptodate(page)) {
    + if (inode->i_blkbits == PAGE_CACHE_SHIFT)
    + goto page_not_up_to_date;
    + if (TestSetPageLocked(page))
    + goto page_not_up_to_date;
    + if (!page_has_buffers(page) ||
    + !check_buffers_uptodate(offset, desc, page))
    + goto page_not_up_to_date_locked;
    + unlock_page(page);
    + }
    page_ok:
    /*
    * i_size must be checked after we know the page is Uptodate.
    @@ -1003,6 +1011,7 @@ page_not_up_to_date:
    if (lock_page_killable(page))
    goto readpage_eio;

    +page_not_up_to_date_locked:
    /* Did it get truncated before we got the lock? */
    if (!page->mapping) {
    unlock_page(page);

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment

    On Wed, 21 May 2008 15:52:04 +0900 Hisashi Hifumi wrote:

    > Hi.
    >
    > When we read some part of a file through pagecache, if there is a pagecache
    > of corresponding index but this page is not uptodate, read IO is issued and
    > this page will be uptodate.
    > I think this is good for pagesize == blocksize environment but there is room
    > for improvement on pagesize != blocksize environment. Because in this case
    > a page can have multiple buffers and even if a page is not uptodate, some buffers
    > can be uptodate. So I suggest that when all buffers which correspond to a part
    > of a file that we want to read are uptodate, use this pagecache and copy data
    > from this pagecache to user buffer even if a page is not uptodate. This can
    > reduce read IO and improve system throughput.


    I suppose that makes sense.

    > I did a performance test using the sysbench.


    That's not a terribly good benchmark, IMO. It's too complex.

    To work out the best-case for a change like this I'd suggest a
    microbenchmark which does something such as seeking all around a file
    doing single-byte reads.

    Then one should think up a benchmark which demonstrates the worst-case,
    such as reading one-byte-quantities from a file at offsets 0, 0x2000,
    0x4000, 0x6000, ... and then read more one-byte-quantities at offsets
    0x1000, 0x3000, 0x5000, etc. That would be a pretty cruel comparison,
    but as one tosses in more such artificial worklaods, one is in a better
    position to work out whether the change is an aggregate benefit.

    The results from a great big lumped-together benchmark such as sysbench
    aren't a lot of use to us in predicting how effective this change will
    be across all the workloads which the kernel implements.

    > @@ -932,8 +932,16 @@ find_page:
    > ra, filp, page,
    > index, last_index - index);
    > }
    > - if (!PageUptodate(page))
    > - goto page_not_up_to_date;
    > + if (!PageUptodate(page)) {
    > + if (inode->i_blkbits == PAGE_CACHE_SHIFT)
    > + goto page_not_up_to_date;
    > + if (TestSetPageLocked(page))
    > + goto page_not_up_to_date;
    > + if (!page_has_buffers(page) ||
    > + !check_buffers_uptodate(offset, desc, page))


    We shouldn't do this.

    > + goto page_not_up_to_date_locked;
    > + unlock_page(page);
    > + }


    See, the code which you have here is assuming that if PagePrivate is
    set, then the thing which is at page.private is a ring of buffer_heads.

    But this code (do_generic_file_read) doesn't know that! Take a look at
    afs, nfs, perhaps other filesystems, grep for set_page_private().

    Only the address_space implementation (ie: the filesystem) knows
    whether page.private holds buffer_heads and only the
    address_space_operations functions are allowed to call into library
    functions which treat page.private as a buffer_head ring.

    Now, your code _may_ not crash, because perhaps there is no filesystem
    which puts something else into page.private which also uses
    do_generic_file_read(). But it's still wrong.

    I guess a suitable fix might be to implement the above using a new
    address_space_operations callback:

    if (PagePrivate(page) && aops->is_partially_uptodate) {
    if (aops->is_partially_uptodate(page, desc, offset))


    then implement a generic_file_is_partially_uptodate() in fs/buffer.c
    and wire that up in the filesystems.

    Note that things like network filesystems can then implement this also.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [PATCH] VFS: Pagecache usage optimization on pagesize != blocksize environment

    On Wed, 21 May 2008 00:19:30 -0700 Andrew Morton wrote:

    >
    > I guess a suitable fix might be to implement the above using a new
    > address_space_operations callback:
    >
    > if (PagePrivate(page) && aops->is_partially_uptodate) {


    This is a bit presumptuous. A filesytem could in theory be able to
    work out if a section of a page is uptodate without necessarily
    maintaining per-page metadata at page.private.

    So one really should just do

    if (aops->is_partially_uptodate)

    however it's hard to conceive how a filesystem could sanely do this
    _without_ putting data at page.private, so the PagePrivate() test could
    perhaps be retained as a microoptimisation, as long as it is suitably
    commented.

    otoh, non-zero aops->is_partially_uptodate might well be less common
    than non-zero PagePrivate(), so perhaps these should be tested in the
    other order. Not sure.

    > if (aops->is_partially_uptodate(page, desc, offset))
    >
    >
    > then implement a generic_file_is_partially_uptodate() in fs/buffer.c


    Make that block_is_partially_uptodate().

    > and wire that up in the filesystems.
    >
    > Note that things like network filesystems can then implement this also.


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment

    Thank you for your comment.

    At 16:19 08/05/21, Andrew Morton wrote:
    >That's not a terribly good benchmark, IMO. It's too complex.
    >
    >To work out the best-case for a change like this I'd suggest a
    >microbenchmark which does something such as seeking all around a file
    >doing single-byte reads.
    >


    I will try microbenchmark later.

    >See, the code which you have here is assuming that if PagePrivate is
    >set, then the thing which is at page.private is a ring of buffer_heads.
    >
    >But this code (do_generic_file_read) doesn't know that! Take a look at
    >afs, nfs, perhaps other filesystems, grep for set_page_private().
    >
    >Only the address_space implementation (ie: the filesystem) knows
    >whether page.private holds buffer_heads and only the
    >address_space_operations functions are allowed to call into library
    >functions which treat page.private as a buffer_head ring.
    >
    >Now, your code _may_ not crash, because perhaps there is no filesystem
    >which puts something else into page.private which also uses
    >do_generic_file_read(). But it's still wrong.
    >
    >I guess a suitable fix might be to implement the above using a new
    >address_space_operations callback:
    >
    > if (PagePrivate(page) && aops->is_partially_uptodate) {
    > if (aops->is_partially_uptodate(page, desc, offset))
    >
    >
    >then implement a generic_file_is_partially_uptodate() in fs/buffer.c
    >and wire that up in the filesystems.


    I modified my patch based on your comment.

    I added is_partially_uptodate method to address_space_operations, and
    I registered block_is_partially_uptodate in fs/buffer.c to ext2,ext3,ext4's aops.
    Thanks.

    Signed-off-by :Hisashi Hifumi

    diff -Nrup linux-2.6.26-rc3.org/fs/buffer.c linux-2.6.26-rc3/fs/buffer.c
    --- linux-2.6.26-rc3.org/fs/buffer.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/buffer.c 2008-05-22 13:23:29.000000000 +0900
    @@ -2084,6 +2084,51 @@ int generic_write_end(struct file *file,
    EXPORT_SYMBOL(generic_write_end);

    /*
    + * block_is_partially_uptodate checks whether buffers within a page are
    + * uptodate or not.
    + *
    + * Returns true if all buffers which correspond to a file portion
    + * we want to read are uptodate.
    + */
    +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    + unsigned long from)
    +{
    + struct inode *inode = page->mapping->host;
    + unsigned long block_start, block_end, blocksize;
    + unsigned long to;
    + struct buffer_head *bh, *head;
    + int ret = 1;
    +
    + if (!page_has_buffers(page))
    + return 0;
    +
    + blocksize = 1 << inode->i_blkbits;
    + to = from + desc->count;
    + if (to > PAGE_CACHE_SIZE)
    + to = PAGE_CACHE_SIZE;
    + if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
    + return 0;
    +
    + head = page_buffers(page);
    +
    + for (bh = head, block_start = 0; bh != head || !block_start;
    + block_start = block_end, bh = bh->b_this_page) {
    + block_end = block_start + blocksize;
    + if (block_end <= from || block_start >= to)
    + continue;
    + else {
    + if (!buffer_uptodate(bh)) {
    + ret = 0;
    + break;
    + }
    + if (block_end >= to)
    + break;
    + }
    + }
    + return ret;
    +}
    +
    +/*
    * Generic "read page" function for block devices that have the normal
    * get_block functionality. This is most of the block device filesystems.
    * Reads the page asynchronously --- the unlock_buffer() and
    diff -Nrup linux-2.6.26-rc3.org/fs/ext2/inode.c linux-2.6.26-rc3/fs/ext2/inode.c
    --- linux-2.6.26-rc3.org/fs/ext2/inode.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/ext2/inode.c 2008-05-22 12:39:46.000000000 +0900
    @@ -791,6 +791,7 @@ const struct address_space_operations ex
    .direct_IO = ext2_direct_IO,
    .writepages = ext2_writepages,
    .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    const struct address_space_operations ext2_aops_xip = {
    diff -Nrup linux-2.6.26-rc3.org/fs/ext3/inode.c linux-2.6.26-rc3/fs/ext3/inode.c
    --- linux-2.6.26-rc3.org/fs/ext3/inode.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/ext3/inode.c 2008-05-22 13:10:40.000000000 +0900
    @@ -1767,44 +1767,47 @@ static int ext3_journalled_set_page_dirt
    }

    static const struct address_space_operations ext3_ordered_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_ordered_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_ordered_write_end,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    - .direct_IO = ext3_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_ordered_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_ordered_write_end,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .direct_IO = ext3_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext3_writeback_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_writeback_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_writeback_write_end,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    - .direct_IO = ext3_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_writeback_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_writeback_write_end,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .direct_IO = ext3_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext3_journalled_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_journalled_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_journalled_write_end,
    - .set_page_dirty = ext3_journalled_set_page_dirty,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_journalled_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_journalled_write_end,
    + .set_page_dirty = ext3_journalled_set_page_dirty,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    void ext3_set_aops(struct inode *inode)
    diff -Nrup linux-2.6.26-rc3.org/fs/ext4/inode.c linux-2.6.26-rc3/fs/ext4/inode.c
    --- linux-2.6.26-rc3.org/fs/ext4/inode.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/ext4/inode.c 2008-05-22 13:13:17.000000000 +0900
    @@ -1817,44 +1817,47 @@ static int ext4_journalled_set_page_dirt
    }

    static const struct address_space_operations ext4_ordered_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_ordered_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_ordered_write_end,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    - .direct_IO = ext4_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_ordered_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_ordered_write_end,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .direct_IO = ext4_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext4_writeback_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_writeback_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_writeback_write_end,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    - .direct_IO = ext4_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_writeback_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_writeback_write_end,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .direct_IO = ext4_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext4_journalled_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_journalled_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_journalled_write_end,
    - .set_page_dirty = ext4_journalled_set_page_dirty,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_journalled_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_journalled_write_end,
    + .set_page_dirty = ext4_journalled_set_page_dirty,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    void ext4_set_aops(struct inode *inode)
    diff -Nrup linux-2.6.26-rc3.org/include/linux/buffer_head.h linux-2.6.26-rc3/include/linux/buffer_head.h
    --- linux-2.6.26-rc3.org/include/linux/buffer_head.h 2008-05-19 11:35:11.000000000 +0900
    +++ linux-2.6.26-rc3/include/linux/buffer_head.h 2008-05-22 11:22:26.000000000 +0900
    @@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
    int block_write_full_page(struct page *page, get_block_t *get_block,
    struct writeback_control *wbc);
    int block_read_full_page(struct page*, get_block_t*);
    +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    + unsigned long from);
    int block_write_begin(struct file *, struct address_space *,
    loff_t, unsigned, unsigned,
    struct page **, void **, get_block_t*);
    diff -Nrup linux-2.6.26-rc3.org/include/linux/fs.h linux-2.6.26-rc3/include/linux/fs.h
    --- linux-2.6.26-rc3.org/include/linux/fs.h 2008-05-19 11:35:11.000000000 +0900
    +++ linux-2.6.26-rc3/include/linux/fs.h 2008-05-22 15:13:39.000000000 +0900
    @@ -439,6 +439,27 @@ static inline size_t iov_iter_count(stru
    return i->count;
    }

    +/*
    + * "descriptor" for what we're up to with a read.
    + * This allows us to use the same read code yet
    + * have multiple different users of the data that
    + * we read from a file.
    + *
    + * The simplest case just copies the data to user
    + * mode.
    + */
    +typedef struct {
    + size_t written;
    + size_t count;
    + union {
    + char __user *buf;
    + void *data;
    + } arg;
    + int error;
    +} read_descriptor_t;
    +
    +typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
    + unsigned long, unsigned long);

    struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    @@ -480,6 +501,8 @@ struct address_space_operations {
    int (*migratepage) (struct address_space *,
    struct page *, struct page *);
    int (*launder_page) (struct page *);
    + int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
    + unsigned long);
    };

    /*
    @@ -1186,27 +1209,6 @@ struct block_device_operations {
    struct module *owner;
    };

    -/*
    - * "descriptor" for what we're up to with a read.
    - * This allows us to use the same read code yet
    - * have multiple different users of the data that
    - * we read from a file.
    - *
    - * The simplest case just copies the data to user
    - * mode.
    - */
    -typedef struct {
    - size_t written;
    - size_t count;
    - union {
    - char __user * buf;
    - void *data;
    - } arg;
    - int error;
    -} read_descriptor_t;
    -
    -typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
    -
    /* These macros are for out of kernel modules to test that
    * the kernel supports the unlocked_ioctl and compat_ioctl
    * fields in struct file_operations. */
    diff -Nrup linux-2.6.26-rc3.org/mm/filemap.c linux-2.6.26-rc3/mm/filemap.c
    --- linux-2.6.26-rc3.org/mm/filemap.c 2008-05-19 11:35:11.000000000 +0900
    +++ linux-2.6.26-rc3/mm/filemap.c 2008-05-22 12:34:58.000000000 +0900
    @@ -932,8 +932,17 @@ find_page:
    ra, filp, page,
    index, last_index - index);
    }
    - if (!PageUptodate(page))
    - goto page_not_up_to_date;
    + if (!PageUptodate(page)) {
    + if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
    + !mapping->a_ops->is_partially_uptodate)
    + goto page_not_up_to_date;
    + if (TestSetPageLocked(page))
    + goto page_not_up_to_date;
    + if (!mapping->a_ops->is_partially_uptodate(page,
    + desc, offset))
    + goto page_not_up_to_date_locked;
    + unlock_page(page);
    + }
    page_ok:
    /*
    * i_size must be checked after we know the page is Uptodate.
    @@ -1003,6 +1012,7 @@ page_not_up_to_date:
    if (lock_page_killable(page))
    goto readpage_eio;

    +page_not_up_to_date_locked:
    /* Did it get truncated before we got the lock? */
    if (!page->mapping) {
    unlock_page(page);

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment

    On Thu, 22 May 2008 16:31:15 +0900 Hisashi Hifumi wrote:

    > I modified my patch based on your comment.


    Looks pretty close to me.

    > I added is_partially_uptodate method to address_space_operations, and
    > I registered block_is_partially_uptodate in fs/buffer.c to ext2,ext3,ext4's aops.
    > Thanks.
    >
    > Signed-off-by :Hisashi Hifumi
    >
    > diff -Nrup linux-2.6.26-rc3.org/fs/buffer.c linux-2.6.26-rc3/fs/buffer.c
    > --- linux-2.6.26-rc3.org/fs/buffer.c 2008-05-19 11:35:10.000000000 +0900
    > +++ linux-2.6.26-rc3/fs/buffer.c 2008-05-22 13:23:29.000000000 +0900
    > @@ -2084,6 +2084,51 @@ int generic_write_end(struct file *file,
    > EXPORT_SYMBOL(generic_write_end);
    >
    > /*
    > + * block_is_partially_uptodate checks whether buffers within a page are
    > + * uptodate or not.
    > + *
    > + * Returns true if all buffers which correspond to a file portion
    > + * we want to read are uptodate.
    > + */
    > +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    > + unsigned long from)
    > +{
    > + struct inode *inode = page->mapping->host;
    > + unsigned long block_start, block_end, blocksize;
    > + unsigned long to;


    These functions can get quite confusing, and the appropriate use of
    types can help.

    For offsets within a page I'd suggest `unsigned'. I expect that the
    32-bit quantities are more efficient on at least some 64-bit machines.

    For page indices use pgoff_t.

    For byte-offset-within-a-file use loff_t.

    For byte-range-within-a-file use size_t.

    Be careful about overflows and truncation when shifting, comparing,
    assigning, etc. Be sure that the code is still correct outside the
    4Gbyte mark on 32-bit and outside the 4Gpage mark on 64-bit.

    It doesn't hurt at all to put each variable definition on its own line
    with a little comment off to the right. It helps to mention what the
    variable's units are too - bytes/pages/sectors/etc.

    > + struct buffer_head *bh, *head;
    > + int ret = 1;
    > +
    > + if (!page_has_buffers(page))
    > + return 0;
    > +
    > + blocksize = 1 << inode->i_blkbits;
    > + to = from + desc->count;
    > + if (to > PAGE_CACHE_SIZE)
    > + to = PAGE_CACHE_SIZE;
    > + if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
    > + return 0;
    > +
    > + head = page_buffers(page);
    > +
    > + for (bh = head, block_start = 0; bh != head || !block_start;
    > + block_start = block_end, bh = bh->b_this_page) {
    > + block_end = block_start + blocksize;
    > + if (block_end <= from || block_start >= to)
    > + continue;


    Oh dear, you've copied one of my least favourite parts of the VFS
    Just look at it!

    Do you think it can be simplified or clarified?

    > + else {
    > + if (!buffer_uptodate(bh)) {
    > + ret = 0;
    > + break;
    > + }
    > + if (block_end >= to)
    > + break;
    > + }
    > + }
    > + return ret;
    > +}


    We'll need the EXPORT_SYMBOL() here.

    > --- linux-2.6.26-rc3.org/fs/ext2/inode.c 2008-05-19 11:35:10.000000000 +0900
    > +++ linux-2.6.26-rc3/fs/ext2/inode.c 2008-05-22 12:39:46.000000000 +0900
    > @@ -791,6 +791,7 @@ const struct address_space_operations ex
    > .direct_IO = ext2_direct_IO,
    > .writepages = ext2_writepages,
    > .migratepage = buffer_migrate_page,
    > + .is_partially_uptodate = block_is_partially_uptodate,


    Sometime we should update Documentation/Locking to reflect the new
    address_space_operation.


    One other thing we should think about here is the `nobh' mode which the
    extX filesystems support (although I have a feeling that Nick might
    have broken this ) We also have data=ordered, data=writeback and
    data=journal to think about. This optimisation might not be
    appropriate at all to data=journal mode, but I haven't looked into
    that.


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment

    Hi.

    >> +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    >> + unsigned long from)
    >> +{
    >> + struct inode *inode = page->mapping->host;
    >> + unsigned long block_start, block_end, blocksize;
    >> + unsigned long to;

    >
    >These functions can get quite confusing, and the appropriate use of
    >types can help.
    >
    >For offsets within a page I'd suggest `unsigned'. I expect that the
    >32-bit quantities are more efficient on at least some 64-bit machines.
    >
    >For page indices use pgoff_t.
    >
    >For byte-offset-within-a-file use loff_t.
    >
    >For byte-range-within-a-file use size_t.
    >
    >Be careful about overflows and truncation when shifting, comparing,
    >assigning, etc. Be sure that the code is still correct outside the
    >4Gbyte mark on 32-bit and outside the 4Gpage mark on 64-bit.
    >
    >It doesn't hurt at all to put each variable definition on its own line
    >with a little comment off to the right. It helps to mention what the
    >variable's units are too - bytes/pages/sectors/etc.
    >


    >> + head = page_buffers(page);
    >> +
    >> + for (bh = head, block_start = 0; bh != head || !block_start;
    >> + block_start = block_end, bh = bh->b_this_page) {
    >> + block_end = block_start + blocksize;
    >> + if (block_end <= from || block_start >= to)
    >> + continue;

    >
    >Oh dear, you've copied one of my least favourite parts of the VFS
    >Just look at it!
    >
    >Do you think it can be simplified or clarified?


    I cleaned up block_is_partially_uptodate.

    Signed-off-by :Hisashi Hifumi

    diff -Nrup linux-2.6.26-rc3.org/fs/buffer.c linux-2.6.26-rc3/fs/buffer.c
    --- linux-2.6.26-rc3.org/fs/buffer.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/buffer.c 2008-05-22 19:48:35.000000000 +0900
    @@ -2084,6 +2084,52 @@ int generic_write_end(struct file *file,
    EXPORT_SYMBOL(generic_write_end);

    /*
    + * block_is_partially_uptodate checks whether buffers within a page are
    + * uptodate or not.
    + *
    + * Returns true if all buffers which correspond to a file portion
    + * we want to read are uptodate.
    + */
    +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    + unsigned long from)
    +{
    + struct inode *inode = page->mapping->host;
    + unsigned block_start, block_end, blocksize;
    + unsigned to;
    + struct buffer_head *bh, *head;
    + int ret = 1;
    +
    + if (!page_has_buffers(page))
    + return 0;
    +
    + blocksize = 1 << inode->i_blkbits;
    + to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
    + to = from + to;
    + if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
    + return 0;
    +
    + head = page_buffers(page);
    + bh = head;
    + block_start = 0;
    + do {
    + block_end = block_start + blocksize;
    + if (block_end > from && block_start < to) {
    + if (!buffer_uptodate(bh)) {
    + ret = 0;
    + break;
    + }
    + if (block_end >= to)
    + break;
    + }
    + block_start = block_end;
    + bh = bh->b_this_page;
    + } while (bh != head);
    +
    + return ret;
    +}
    +EXPORT_SYMBOL(block_is_partially_uptodate);
    +
    +/*
    * Generic "read page" function for block devices that have the normal
    * get_block functionality. This is most of the block device filesystems.
    * Reads the page asynchronously --- the unlock_buffer() and
    diff -Nrup linux-2.6.26-rc3.org/fs/ext2/inode.c linux-2.6.26-rc3/fs/ext2/inode.c
    --- linux-2.6.26-rc3.org/fs/ext2/inode.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/ext2/inode.c 2008-05-22 12:39:46.000000000 +0900
    @@ -791,6 +791,7 @@ const struct address_space_operations ex
    .direct_IO = ext2_direct_IO,
    .writepages = ext2_writepages,
    .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    const struct address_space_operations ext2_aops_xip = {
    diff -Nrup linux-2.6.26-rc3.org/fs/ext3/inode.c linux-2.6.26-rc3/fs/ext3/inode.c
    --- linux-2.6.26-rc3.org/fs/ext3/inode.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/ext3/inode.c 2008-05-22 13:10:40.000000000 +0900
    @@ -1767,44 +1767,47 @@ static int ext3_journalled_set_page_dirt
    }

    static const struct address_space_operations ext3_ordered_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_ordered_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_ordered_write_end,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    - .direct_IO = ext3_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_ordered_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_ordered_write_end,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .direct_IO = ext3_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext3_writeback_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_writeback_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_writeback_write_end,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    - .direct_IO = ext3_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_writeback_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_writeback_write_end,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .direct_IO = ext3_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext3_journalled_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_journalled_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_journalled_write_end,
    - .set_page_dirty = ext3_journalled_set_page_dirty,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_journalled_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_journalled_write_end,
    + .set_page_dirty = ext3_journalled_set_page_dirty,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    void ext3_set_aops(struct inode *inode)
    diff -Nrup linux-2.6.26-rc3.org/fs/ext4/inode.c linux-2.6.26-rc3/fs/ext4/inode.c
    --- linux-2.6.26-rc3.org/fs/ext4/inode.c 2008-05-19 11:35:10.000000000 +0900
    +++ linux-2.6.26-rc3/fs/ext4/inode.c 2008-05-22 13:13:17.000000000 +0900
    @@ -1817,44 +1817,47 @@ static int ext4_journalled_set_page_dirt
    }

    static const struct address_space_operations ext4_ordered_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_ordered_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_ordered_write_end,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    - .direct_IO = ext4_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_ordered_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_ordered_write_end,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .direct_IO = ext4_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext4_writeback_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_writeback_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_writeback_write_end,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    - .direct_IO = ext4_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_writeback_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_writeback_write_end,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .direct_IO = ext4_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext4_journalled_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_journalled_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_journalled_write_end,
    - .set_page_dirty = ext4_journalled_set_page_dirty,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_journalled_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_journalled_write_end,
    + .set_page_dirty = ext4_journalled_set_page_dirty,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    void ext4_set_aops(struct inode *inode)
    diff -Nrup linux-2.6.26-rc3.org/include/linux/buffer_head.h linux-2.6.26-rc3/include/linux/buffer_head.h
    --- linux-2.6.26-rc3.org/include/linux/buffer_head.h 2008-05-19 11:35:11.000000000 +0900
    +++ linux-2.6.26-rc3/include/linux/buffer_head.h 2008-05-22 19:46:05.000000000 +0900
    @@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
    int block_write_full_page(struct page *page, get_block_t *get_block,
    struct writeback_control *wbc);
    int block_read_full_page(struct page*, get_block_t*);
    +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    + unsigned long from);
    int block_write_begin(struct file *, struct address_space *,
    loff_t, unsigned, unsigned,
    struct page **, void **, get_block_t*);
    diff -Nrup linux-2.6.26-rc3.org/include/linux/fs.h linux-2.6.26-rc3/include/linux/fs.h
    --- linux-2.6.26-rc3.org/include/linux/fs.h 2008-05-19 11:35:11.000000000 +0900
    +++ linux-2.6.26-rc3/include/linux/fs.h 2008-05-22 15:13:39.000000000 +0900
    @@ -439,6 +439,27 @@ static inline size_t iov_iter_count(stru
    return i->count;
    }

    +/*
    + * "descriptor" for what we're up to with a read.
    + * This allows us to use the same read code yet
    + * have multiple different users of the data that
    + * we read from a file.
    + *
    + * The simplest case just copies the data to user
    + * mode.
    + */
    +typedef struct {
    + size_t written;
    + size_t count;
    + union {
    + char __user *buf;
    + void *data;
    + } arg;
    + int error;
    +} read_descriptor_t;
    +
    +typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
    + unsigned long, unsigned long);

    struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    @@ -480,6 +501,8 @@ struct address_space_operations {
    int (*migratepage) (struct address_space *,
    struct page *, struct page *);
    int (*launder_page) (struct page *);
    + int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
    + unsigned long);
    };

    /*
    @@ -1186,27 +1209,6 @@ struct block_device_operations {
    struct module *owner;
    };

    -/*
    - * "descriptor" for what we're up to with a read.
    - * This allows us to use the same read code yet
    - * have multiple different users of the data that
    - * we read from a file.
    - *
    - * The simplest case just copies the data to user
    - * mode.
    - */
    -typedef struct {
    - size_t written;
    - size_t count;
    - union {
    - char __user * buf;
    - void *data;
    - } arg;
    - int error;
    -} read_descriptor_t;
    -
    -typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
    -
    /* These macros are for out of kernel modules to test that
    * the kernel supports the unlocked_ioctl and compat_ioctl
    * fields in struct file_operations. */
    diff -Nrup linux-2.6.26-rc3.org/mm/filemap.c linux-2.6.26-rc3/mm/filemap.c
    --- linux-2.6.26-rc3.org/mm/filemap.c 2008-05-19 11:35:11.000000000 +0900
    +++ linux-2.6.26-rc3/mm/filemap.c 2008-05-22 12:34:58.000000000 +0900
    @@ -932,8 +932,17 @@ find_page:
    ra, filp, page,
    index, last_index - index);
    }
    - if (!PageUptodate(page))
    - goto page_not_up_to_date;
    + if (!PageUptodate(page)) {
    + if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
    + !mapping->a_ops->is_partially_uptodate)
    + goto page_not_up_to_date;
    + if (TestSetPageLocked(page))
    + goto page_not_up_to_date;
    + if (!mapping->a_ops->is_partially_uptodate(page,
    + desc, offset))
    + goto page_not_up_to_date_locked;
    + unlock_page(page);
    + }
    page_ok:
    /*
    * i_size must be checked after we know the page is Uptodate.
    @@ -1003,6 +1012,7 @@ page_not_up_to_date:
    if (lock_page_killable(page))
    goto readpage_eio;

    +page_not_up_to_date_locked:
    /* Did it get truncated before we got the lock? */
    if (!page->mapping) {
    unlock_page(page);

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment

    > On Thu, 22 May 2008 16:31:15 +0900 Hisashi Hifumi wrote:
    >
    > One other thing we should think about here is the `nobh' mode which the
    > extX filesystems support (although I have a feeling that Nick might
    > have broken this ) We also have data=ordered, data=writeback and
    > data=journal to think about. This optimisation might not be
    > appropriate at all to data=journal mode, but I haven't looked into
    > that.

    Why do you think so? We mess with buffer_dirty bits in data=journal
    mode but as far as I understand the patch, it only does not read the
    page if the part we need is marked as uptodate in buffers. And this
    should be safe.
    But I'm slightly confused that the patch helps because I've always
    thought that mpage_readpage() (which is what we end up calling from
    do_generic_mapping_read()) always reads the whole page. Thus either all
    buffers in the page or none of them are uptodate... So what do I miss
    here?

    Honza
    --
    Jan Kara
    SuSE CR Labs
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment


    > But I'm slightly confused that the patch helps because I've always
    >thought that mpage_readpage() (which is what we end up calling from
    >do_generic_mapping_read()) always reads the whole page. Thus either all
    >buffers in the page or none of them are uptodate... So what do I miss
    >here?
    >
    > Honza


    On ext3/4, a file is written through buffer/block. So if a page has multiple
    buffers, buffers can be uptodate partially especially under random write workloads.
    See __block_prepare_write and __block_commit_write.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [PATCH] VFS: Pagecache usage optimization on pagesize !=blocksize environment

    On Mon 26-05-08 16:20:10, Hisashi Hifumi wrote:
    >
    > > But I'm slightly confused that the patch helps because I've always
    > >thought that mpage_readpage() (which is what we end up calling from
    > >do_generic_mapping_read()) always reads the whole page. Thus either all
    > >buffers in the page or none of them are uptodate... So what do I miss
    > >here?
    > >
    > > Honza

    >
    > On ext3/4, a file is written through buffer/block. So if a page has multiple
    > buffers, buffers can be uptodate partially especially under random write workloads.
    > See __block_prepare_write and __block_commit_write.

    Ah, I see. Thanks for explanation.

    Honza
    --
    Jan Kara
    SUSE Labs, CR
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment

    Hi Andrew.

    >> +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    >> + unsigned long from)
    >> +{
    >> + struct inode *inode = page->mapping->host;
    >> + unsigned long block_start, block_end, blocksize;
    >> + unsigned long to;

    >
    >These functions can get quite confusing, and the appropriate use of
    >types can help.
    >
    >For offsets within a page I'd suggest `unsigned'. I expect that the
    >32-bit quantities are more efficient on at least some 64-bit machines.
    >
    >For page indices use pgoff_t.
    >
    >For byte-offset-within-a-file use loff_t.
    >
    >For byte-range-within-a-file use size_t.
    >
    >Be careful about overflows and truncation when shifting, comparing,
    >assigning, etc. Be sure that the code is still correct outside the
    >4Gbyte mark on 32-bit and outside the 4Gpage mark on 64-bit.
    >
    >It doesn't hurt at all to put each variable definition on its own line
    >with a little comment off to the right. It helps to mention what the
    >variable's units are too - bytes/pages/sectors/etc.



    >> + head = page_buffers(page);
    >> +
    >> + for (bh = head, block_start = 0; bh != head || !block_start;
    >> + block_start = block_end, bh = bh->b_this_page) {
    >> + block_end = block_start + blocksize;
    >> + if (block_end <= from || block_start >= to)
    >> + continue;

    >
    >Oh dear, you've copied one of my least favourite parts of the VFS
    >Just look at it!
    >
    >Do you think it can be simplified or clarified?



    I cleaned up and simplified block_is_partially_uptodate.
    Following patch is for linux-2.6.26-rc4.
    Please review my patch.
    Thanks.

    Signed-off-by :Hisashi Hifumi

    diff -Nrup linux-2.6.26-rc4.org/fs/buffer.c linux-2.6.26-rc4/fs/buffer.c
    --- linux-2.6.26-rc4.org/fs/buffer.c 2008-05-27 14:36:21.000000000 +0900
    +++ linux-2.6.26-rc4/fs/buffer.c 2008-05-27 14:38:20.000000000 +0900
    @@ -2084,6 +2084,52 @@ int generic_write_end(struct file *file,
    EXPORT_SYMBOL(generic_write_end);

    /*
    + * block_is_partially_uptodate checks whether buffers within a page are
    + * uptodate or not.
    + *
    + * Returns true if all buffers which correspond to a file portion
    + * we want to read are uptodate.
    + */
    +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    + unsigned long from)
    +{
    + struct inode *inode = page->mapping->host;
    + unsigned block_start, block_end, blocksize;
    + unsigned to;
    + struct buffer_head *bh, *head;
    + int ret = 1;
    +
    + if (!page_has_buffers(page))
    + return 0;
    +
    + blocksize = 1 << inode->i_blkbits;
    + to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
    + to = from + to;
    + if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
    + return 0;
    +
    + head = page_buffers(page);
    + bh = head;
    + block_start = 0;
    + do {
    + block_end = block_start + blocksize;
    + if (block_end > from && block_start < to) {
    + if (!buffer_uptodate(bh)) {
    + ret = 0;
    + break;
    + }
    + if (block_end >= to)
    + break;
    + }
    + block_start = block_end;
    + bh = bh->b_this_page;
    + } while (bh != head);
    +
    + return ret;
    +}
    +EXPORT_SYMBOL(block_is_partially_uptodate);
    +
    +/*
    * Generic "read page" function for block devices that have the normal
    * get_block functionality. This is most of the block device filesystems.
    * Reads the page asynchronously --- the unlock_buffer() and
    diff -Nrup linux-2.6.26-rc4.org/fs/ext2/inode.c linux-2.6.26-rc4/fs/ext2/inode.c
    --- linux-2.6.26-rc4.org/fs/ext2/inode.c 2008-05-27 14:36:21.000000000 +0900
    +++ linux-2.6.26-rc4/fs/ext2/inode.c 2008-05-27 14:38:20.000000000 +0900
    @@ -791,6 +791,7 @@ const struct address_space_operations ex
    .direct_IO = ext2_direct_IO,
    .writepages = ext2_writepages,
    .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    const struct address_space_operations ext2_aops_xip = {
    diff -Nrup linux-2.6.26-rc4.org/fs/ext3/inode.c linux-2.6.26-rc4/fs/ext3/inode.c
    --- linux-2.6.26-rc4.org/fs/ext3/inode.c 2008-05-27 14:36:21.000000000 +0900
    +++ linux-2.6.26-rc4/fs/ext3/inode.c 2008-05-27 14:38:20.000000000 +0900
    @@ -1767,44 +1767,47 @@ static int ext3_journalled_set_page_dirt
    }

    static const struct address_space_operations ext3_ordered_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_ordered_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_ordered_write_end,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    - .direct_IO = ext3_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_ordered_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_ordered_write_end,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .direct_IO = ext3_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext3_writeback_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_writeback_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_writeback_write_end,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    - .direct_IO = ext3_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_writeback_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_writeback_write_end,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .direct_IO = ext3_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext3_journalled_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_journalled_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_journalled_write_end,
    - .set_page_dirty = ext3_journalled_set_page_dirty,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_journalled_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_journalled_write_end,
    + .set_page_dirty = ext3_journalled_set_page_dirty,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    void ext3_set_aops(struct inode *inode)
    diff -Nrup linux-2.6.26-rc4.org/fs/ext4/inode.c linux-2.6.26-rc4/fs/ext4/inode.c
    --- linux-2.6.26-rc4.org/fs/ext4/inode.c 2008-05-27 14:36:21.000000000 +0900
    +++ linux-2.6.26-rc4/fs/ext4/inode.c 2008-05-27 14:38:20.000000000 +0900
    @@ -1817,44 +1817,47 @@ static int ext4_journalled_set_page_dirt
    }

    static const struct address_space_operations ext4_ordered_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_ordered_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_ordered_write_end,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    - .direct_IO = ext4_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_ordered_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_ordered_write_end,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .direct_IO = ext4_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext4_writeback_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_writeback_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_writeback_write_end,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    - .direct_IO = ext4_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_writeback_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_writeback_write_end,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .direct_IO = ext4_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext4_journalled_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_journalled_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_journalled_write_end,
    - .set_page_dirty = ext4_journalled_set_page_dirty,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_journalled_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_journalled_write_end,
    + .set_page_dirty = ext4_journalled_set_page_dirty,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    void ext4_set_aops(struct inode *inode)
    diff -Nrup linux-2.6.26-rc4.org/include/linux/buffer_head.h linux-2.6.26-rc4/include/linux/buffer_head.h
    --- linux-2.6.26-rc4.org/include/linux/buffer_head.h 2008-05-27 14:36:22.000000000 +0900
    +++ linux-2.6.26-rc4/include/linux/buffer_head.h 2008-05-27 14:38:20.000000000 +0900
    @@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
    int block_write_full_page(struct page *page, get_block_t *get_block,
    struct writeback_control *wbc);
    int block_read_full_page(struct page*, get_block_t*);
    +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    + unsigned long from);
    int block_write_begin(struct file *, struct address_space *,
    loff_t, unsigned, unsigned,
    struct page **, void **, get_block_t*);
    diff -Nrup linux-2.6.26-rc4.org/include/linux/fs.h linux-2.6.26-rc4/include/linux/fs.h
    --- linux-2.6.26-rc4.org/include/linux/fs.h 2008-05-27 14:36:22.000000000 +0900
    +++ linux-2.6.26-rc4/include/linux/fs.h 2008-05-27 14:38:20.000000000 +0900
    @@ -439,6 +439,27 @@ static inline size_t iov_iter_count(stru
    return i->count;
    }

    +/*
    + * "descriptor" for what we're up to with a read.
    + * This allows us to use the same read code yet
    + * have multiple different users of the data that
    + * we read from a file.
    + *
    + * The simplest case just copies the data to user
    + * mode.
    + */
    +typedef struct {
    + size_t written;
    + size_t count;
    + union {
    + char __user *buf;
    + void *data;
    + } arg;
    + int error;
    +} read_descriptor_t;
    +
    +typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
    + unsigned long, unsigned long);

    struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    @@ -480,6 +501,8 @@ struct address_space_operations {
    int (*migratepage) (struct address_space *,
    struct page *, struct page *);
    int (*launder_page) (struct page *);
    + int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
    + unsigned long);
    };

    /*
    @@ -1186,27 +1209,6 @@ struct block_device_operations {
    struct module *owner;
    };

    -/*
    - * "descriptor" for what we're up to with a read.
    - * This allows us to use the same read code yet
    - * have multiple different users of the data that
    - * we read from a file.
    - *
    - * The simplest case just copies the data to user
    - * mode.
    - */
    -typedef struct {
    - size_t written;
    - size_t count;
    - union {
    - char __user * buf;
    - void *data;
    - } arg;
    - int error;
    -} read_descriptor_t;
    -
    -typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
    -
    /* These macros are for out of kernel modules to test that
    * the kernel supports the unlocked_ioctl and compat_ioctl
    * fields in struct file_operations. */
    diff -Nrup linux-2.6.26-rc4.org/mm/filemap.c linux-2.6.26-rc4/mm/filemap.c
    --- linux-2.6.26-rc4.org/mm/filemap.c 2008-05-27 14:36:23.000000000 +0900
    +++ linux-2.6.26-rc4/mm/filemap.c 2008-05-27 14:38:20.000000000 +0900
    @@ -932,8 +932,17 @@ find_page:
    ra, filp, page,
    index, last_index - index);
    }
    - if (!PageUptodate(page))
    - goto page_not_up_to_date;
    + if (!PageUptodate(page)) {
    + if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
    + !mapping->a_ops->is_partially_uptodate)
    + goto page_not_up_to_date;
    + if (TestSetPageLocked(page))
    + goto page_not_up_to_date;
    + if (!mapping->a_ops->is_partially_uptodate(page,
    + desc, offset))
    + goto page_not_up_to_date_locked;
    + unlock_page(page);
    + }
    page_ok:
    /*
    * i_size must be checked after we know the page is Uptodate.
    @@ -1003,6 +1012,7 @@ page_not_up_to_date:
    if (lock_page_killable(page))
    goto readpage_eio;

    +page_not_up_to_date_locked:
    /* Did it get truncated before we got the lock? */
    if (!page->mapping) {
    unlock_page(page);

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [PATCH] VFS: Pagecache usage optimization on pagesize!=blocksize environment

    On Tue, 27 May 2008 17:38:34 +0900 Hisashi Hifumi wrote:

    > I cleaned up and simplified block_is_partially_uptodate.
    > Following patch is for linux-2.6.26-rc4.
    > Please review my patch.


    We seem to have lost the changelog which means that the patch can't go
    much further and it makes it considerably harder for people to review
    the patch.

    Please maintain the changelog alongside the patch and resend it (possibly after
    suitable updates) along with each iteration of the patch.

    Thanks.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  12. Re: [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment

    Hi

    At 17:51 08/05/27, Andrew Morton wrote:
    >On Tue, 27 May 2008 17:38:34 +0900 Hisashi Hifumi
    > wrote:
    >
    >> I cleaned up and simplified block_is_partially_uptodate.
    >> Following patch is for linux-2.6.26-rc4.
    >> Please review my patch.

    >
    >We seem to have lost the changelog which means that the patch can't go
    >much further and it makes it considerably harder for people to review
    >the patch.
    >
    >Please maintain the changelog alongside the patch and resend it (possibly after
    >suitable updates) along with each iteration of the patch.
    >
    >Thanks.


    Sorry, I attach changelog.

    When we read some part of a file through pagecache, if there is a pagecache
    of corresponding index but this page is not uptodate, read IO is issued and
    this page will be uptodate.
    I think this is good for pagesize == blocksize environment but there is room
    for improvement on pagesize != blocksize environment. Because in this case
    a page can have multiple buffers and even if a page is not uptodate, some buffers
    can be uptodate. So I suggest that when all buffers which correspond to a part
    of a file that we want to read are uptodate, use this pagecache and copy data
    from this pagecache to user buffer even if a page is not uptodate. This can
    reduce read IO and improve system throughput.

    v2: add new address_space_operations member is_partially_uptodate, and
    block_is_partially_uptodate was registered to ext2/3/4's aops.
    modify do_generic_file_read to use this aops callback.
    v3: use unsigned instead of unsigned long in block_is_partially_uptodate.
    cleaned up and simplified page buffer iteration code in block_is_partially_uptodate.

    Signed-off-by :Hisashi Hifumi

    diff -Nrup linux-2.6.26-rc4.org/fs/buffer.c linux-2.6.26-rc4/fs/buffer.c
    --- linux-2.6.26-rc4.org/fs/buffer.c 2008-05-27 14:36:21.000000000 +0900
    +++ linux-2.6.26-rc4/fs/buffer.c 2008-05-27 14:38:20.000000000 +0900
    @@ -2084,6 +2084,52 @@ int generic_write_end(struct file *file,
    EXPORT_SYMBOL(generic_write_end);

    /*
    + * block_is_partially_uptodate checks whether buffers within a page are
    + * uptodate or not.
    + *
    + * Returns true if all buffers which correspond to a file portion
    + * we want to read are uptodate.
    + */
    +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    + unsigned long from)
    +{
    + struct inode *inode = page->mapping->host;
    + unsigned block_start, block_end, blocksize;
    + unsigned to;
    + struct buffer_head *bh, *head;
    + int ret = 1;
    +
    + if (!page_has_buffers(page))
    + return 0;
    +
    + blocksize = 1 << inode->i_blkbits;
    + to = min_t(unsigned, PAGE_CACHE_SIZE - from, desc->count);
    + to = from + to;
    + if (from < blocksize && to > PAGE_CACHE_SIZE - blocksize)
    + return 0;
    +
    + head = page_buffers(page);
    + bh = head;
    + block_start = 0;
    + do {
    + block_end = block_start + blocksize;
    + if (block_end > from && block_start < to) {
    + if (!buffer_uptodate(bh)) {
    + ret = 0;
    + break;
    + }
    + if (block_end >= to)
    + break;
    + }
    + block_start = block_end;
    + bh = bh->b_this_page;
    + } while (bh != head);
    +
    + return ret;
    +}
    +EXPORT_SYMBOL(block_is_partially_uptodate);
    +
    +/*
    * Generic "read page" function for block devices that have the normal
    * get_block functionality. This is most of the block device filesystems.
    * Reads the page asynchronously --- the unlock_buffer() and
    diff -Nrup linux-2.6.26-rc4.org/fs/ext2/inode.c linux-2.6.26-rc4/fs/ext2/inode.c
    --- linux-2.6.26-rc4.org/fs/ext2/inode.c 2008-05-27 14:36:21.000000000 +0900
    +++ linux-2.6.26-rc4/fs/ext2/inode.c 2008-05-27 14:38:20.000000000 +0900
    @@ -791,6 +791,7 @@ const struct address_space_operations ex
    .direct_IO = ext2_direct_IO,
    .writepages = ext2_writepages,
    .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    const struct address_space_operations ext2_aops_xip = {
    diff -Nrup linux-2.6.26-rc4.org/fs/ext3/inode.c linux-2.6.26-rc4/fs/ext3/inode.c
    --- linux-2.6.26-rc4.org/fs/ext3/inode.c 2008-05-27 14:36:21.000000000 +0900
    +++ linux-2.6.26-rc4/fs/ext3/inode.c 2008-05-27 14:38:20.000000000 +0900
    @@ -1767,44 +1767,47 @@ static int ext3_journalled_set_page_dirt
    }

    static const struct address_space_operations ext3_ordered_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_ordered_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_ordered_write_end,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    - .direct_IO = ext3_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_ordered_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_ordered_write_end,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .direct_IO = ext3_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext3_writeback_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_writeback_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_writeback_write_end,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    - .direct_IO = ext3_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_writeback_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_writeback_write_end,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .direct_IO = ext3_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext3_journalled_aops = {
    - .readpage = ext3_readpage,
    - .readpages = ext3_readpages,
    - .writepage = ext3_journalled_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext3_write_begin,
    - .write_end = ext3_journalled_write_end,
    - .set_page_dirty = ext3_journalled_set_page_dirty,
    - .bmap = ext3_bmap,
    - .invalidatepage = ext3_invalidatepage,
    - .releasepage = ext3_releasepage,
    + .readpage = ext3_readpage,
    + .readpages = ext3_readpages,
    + .writepage = ext3_journalled_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext3_write_begin,
    + .write_end = ext3_journalled_write_end,
    + .set_page_dirty = ext3_journalled_set_page_dirty,
    + .bmap = ext3_bmap,
    + .invalidatepage = ext3_invalidatepage,
    + .releasepage = ext3_releasepage,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    void ext3_set_aops(struct inode *inode)
    diff -Nrup linux-2.6.26-rc4.org/fs/ext4/inode.c linux-2.6.26-rc4/fs/ext4/inode.c
    --- linux-2.6.26-rc4.org/fs/ext4/inode.c 2008-05-27 14:36:21.000000000 +0900
    +++ linux-2.6.26-rc4/fs/ext4/inode.c 2008-05-27 14:38:20.000000000 +0900
    @@ -1817,44 +1817,47 @@ static int ext4_journalled_set_page_dirt
    }

    static const struct address_space_operations ext4_ordered_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_ordered_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_ordered_write_end,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    - .direct_IO = ext4_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_ordered_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_ordered_write_end,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .direct_IO = ext4_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext4_writeback_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_writeback_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_writeback_write_end,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    - .direct_IO = ext4_direct_IO,
    - .migratepage = buffer_migrate_page,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_writeback_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_writeback_write_end,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .direct_IO = ext4_direct_IO,
    + .migratepage = buffer_migrate_page,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    static const struct address_space_operations ext4_journalled_aops = {
    - .readpage = ext4_readpage,
    - .readpages = ext4_readpages,
    - .writepage = ext4_journalled_writepage,
    - .sync_page = block_sync_page,
    - .write_begin = ext4_write_begin,
    - .write_end = ext4_journalled_write_end,
    - .set_page_dirty = ext4_journalled_set_page_dirty,
    - .bmap = ext4_bmap,
    - .invalidatepage = ext4_invalidatepage,
    - .releasepage = ext4_releasepage,
    + .readpage = ext4_readpage,
    + .readpages = ext4_readpages,
    + .writepage = ext4_journalled_writepage,
    + .sync_page = block_sync_page,
    + .write_begin = ext4_write_begin,
    + .write_end = ext4_journalled_write_end,
    + .set_page_dirty = ext4_journalled_set_page_dirty,
    + .bmap = ext4_bmap,
    + .invalidatepage = ext4_invalidatepage,
    + .releasepage = ext4_releasepage,
    + .is_partially_uptodate = block_is_partially_uptodate,
    };

    void ext4_set_aops(struct inode *inode)
    diff -Nrup linux-2.6.26-rc4.org/include/linux/buffer_head.h linux-2.6.26-rc4/include/linux/buffer_head.h
    --- linux-2.6.26-rc4.org/include/linux/buffer_head.h 2008-05-27 14:36:22.000000000 +0900
    +++ linux-2.6.26-rc4/include/linux/buffer_head.h 2008-05-27 14:38:20.000000000 +0900
    @@ -205,6 +205,8 @@ void block_invalidatepage(struct page *p
    int block_write_full_page(struct page *page, get_block_t *get_block,
    struct writeback_control *wbc);
    int block_read_full_page(struct page*, get_block_t*);
    +int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
    + unsigned long from);
    int block_write_begin(struct file *, struct address_space *,
    loff_t, unsigned, unsigned,
    struct page **, void **, get_block_t*);
    diff -Nrup linux-2.6.26-rc4.org/include/linux/fs.h linux-2.6.26-rc4/include/linux/fs.h
    --- linux-2.6.26-rc4.org/include/linux/fs.h 2008-05-27 14:36:22.000000000 +0900
    +++ linux-2.6.26-rc4/include/linux/fs.h 2008-05-27 14:38:20.000000000 +0900
    @@ -439,6 +439,27 @@ static inline size_t iov_iter_count(stru
    return i->count;
    }

    +/*
    + * "descriptor" for what we're up to with a read.
    + * This allows us to use the same read code yet
    + * have multiple different users of the data that
    + * we read from a file.
    + *
    + * The simplest case just copies the data to user
    + * mode.
    + */
    +typedef struct {
    + size_t written;
    + size_t count;
    + union {
    + char __user *buf;
    + void *data;
    + } arg;
    + int error;
    +} read_descriptor_t;
    +
    +typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
    + unsigned long, unsigned long);

    struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    @@ -480,6 +501,8 @@ struct address_space_operations {
    int (*migratepage) (struct address_space *,
    struct page *, struct page *);
    int (*launder_page) (struct page *);
    + int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
    + unsigned long);
    };

    /*
    @@ -1186,27 +1209,6 @@ struct block_device_operations {
    struct module *owner;
    };

    -/*
    - * "descriptor" for what we're up to with a read.
    - * This allows us to use the same read code yet
    - * have multiple different users of the data that
    - * we read from a file.
    - *
    - * The simplest case just copies the data to user
    - * mode.
    - */
    -typedef struct {
    - size_t written;
    - size_t count;
    - union {
    - char __user * buf;
    - void *data;
    - } arg;
    - int error;
    -} read_descriptor_t;
    -
    -typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long);
    -
    /* These macros are for out of kernel modules to test that
    * the kernel supports the unlocked_ioctl and compat_ioctl
    * fields in struct file_operations. */
    diff -Nrup linux-2.6.26-rc4.org/mm/filemap.c linux-2.6.26-rc4/mm/filemap.c
    --- linux-2.6.26-rc4.org/mm/filemap.c 2008-05-27 14:36:23.000000000 +0900
    +++ linux-2.6.26-rc4/mm/filemap.c 2008-05-27 14:38:20.000000000 +0900
    @@ -932,8 +932,17 @@ find_page:
    ra, filp, page,
    index, last_index - index);
    }
    - if (!PageUptodate(page))
    - goto page_not_up_to_date;
    + if (!PageUptodate(page)) {
    + if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
    + !mapping->a_ops->is_partially_uptodate)
    + goto page_not_up_to_date;
    + if (TestSetPageLocked(page))
    + goto page_not_up_to_date;
    + if (!mapping->a_ops->is_partially_uptodate(page,
    + desc, offset))
    + goto page_not_up_to_date_locked;
    + unlock_page(page);
    + }
    page_ok:
    /*
    * i_size must be checked after we know the page is Uptodate.
    @@ -1003,6 +1012,7 @@ page_not_up_to_date:
    if (lock_page_killable(page))
    goto readpage_eio;

    +page_not_up_to_date_locked:
    /* Did it get truncated before we got the lock? */
    if (!page->mapping) {
    unlock_page(page);

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  13. Re: [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment

    On Tue, 27 May 2008 18:34:02 +0900
    Hisashi Hifumi wrote:

    > When we read some part of a file through pagecache, if there is a pagecache
    > of corresponding index but this page is not uptodate, read IO is issued and
    > this page will be uptodate.
    > I think this is good for pagesize == blocksize environment but there is room
    > for improvement on pagesize != blocksize environment. Because in this case
    > a page can have multiple buffers and even if a page is not uptodate, some buffers
    > can be uptodate. So I suggest that when all buffers which correspond to a part
    > of a file that we want to read are uptodate, use this pagecache and copy data
    > from this pagecache to user buffer even if a page is not uptodate. This can
    > reduce read IO and improve system throughput.
    >
    > v2: add new address_space_operations member is_partially_uptodate, and
    > block_is_partially_uptodate was registered to ext2/3/4's aops.
    > modify do_generic_file_read to use this aops callback.
    > v3: use unsigned instead of unsigned long in block_is_partially_uptodate.
    > cleaned up and simplified page buffer iteration code in block_is_partially_uptodate.
    >
    > Signed-off-by :Hisashi Hifumi


    OK, thanks, let's give it a try.

    It would be excellent if we had some benchmark numbers which justify
    this change.

    It also would be good if we can think up some workload which might be
    slowed down by the change, and then demonstrate that this is not a
    significant problem.

    After all, it _is_ a performance patch...
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  14. Re: [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksizeenvironment


    At 08:23 08/05/29, Andrew Morton wrote:
    > On Tue, 27 May 2008 18:34:02 +0900
    >Hisashi Hifumi wrote:
    >
    >> When we read some part of a file through pagecache, if there is a pagecache
    >> of corresponding index but this page is not uptodate, read IO is issued and
    >> this page will be uptodate.
    >> I think this is good for pagesize == blocksize environment but there is room
    >> for improvement on pagesize != blocksize environment. Because in this case
    >> a page can have multiple buffers and even if a page is not uptodate, some

    >buffers
    >> can be uptodate. So I suggest that when all buffers which correspond to a part
    >> of a file that we want to read are uptodate, use this pagecache and copy data
    >> from this pagecache to user buffer even if a page is not uptodate. This can
    >> reduce read IO and improve system throughput.
    >>
    >> v2: add new address_space_operations member is_partially_uptodate, and
    >> block_is_partially_uptodate was registered to ext2/3/4's aops.
    >> modify do_generic_file_read to use this aops callback.
    >> v3: use unsigned instead of unsigned long in block_is_partially_uptodate.
    >> cleaned up and simplified page buffer iteration code in

    >block_is_partially_uptodate.
    >>
    >> Signed-off-by :Hisashi Hifumi

    >
    >OK, thanks, let's give it a try.
    >
    >It would be excellent if we had some benchmark numbers which justify
    >this change.
    >
    >It also would be good if we can think up some workload which might be
    >slowed down by the change, and then demonstrate that this is not a
    >significant problem.
    >
    >After all, it _is_ a performance patch...


    I wrote a benchmark program and got result number with this program.
    This benchmark do:
    1, mount and open a test file.
    2, create a 512MB file.
    3, close a file and umount.
    4, mount and again open a test file.
    5, pwrite randomly 300000 times on a test file. offset is aligned by IO size(1024bytes).
    6, measure time of preading randomly 100000 times on a test file.

    The result was:
    2.6.26-rc5
    329 sec

    2.6.25-rc5-patched
    223 sec

    arch:i386
    filesystem:ext3
    blocksize:1024 bytes
    Memory: 1GB

    The read IO throughput was improved.

    The benchmark program is as follows:

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define LEN 1024
    #define LOOP 1024*512 /* 512MB */

    main(void)
    {
    unsigned long i, offset, filesize;
    int fd;
    char buf[LEN];
    time_t t1, t2;

    if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
    perror("cannot mount\n");
    exit(1);
    }
    memset(buf, 0, LEN);
    fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
    if (fd < 0) {
    perror("cannot open file\n");
    exit(1);
    }
    for (i = 0; i < LOOP; i++)
    write(fd, buf, LEN);
    close(fd);
    if (umount("/root/test1/") < 0) {
    perror("cannot umount\n");
    exit(1);
    }
    if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
    perror("cannot mount\n");
    exit(1);
    }
    fd = open("/root/test1/testfile", O_RDWR);
    if (fd < 0) {
    perror("cannot open file\n");
    exit(1);
    }

    filesize = LEN * LOOP;
    for (i = 0; i < 300000; i++){
    offset = (random() % filesize) & (~(LEN - 1));
    pwrite(fd, buf, LEN, offset);
    }
    printf("start test\n");
    time(&t1);
    for (i = 0; i < 100000; i++){
    offset = (random() % filesize) & (~(LEN - 1));
    pread(fd, buf, LEN, offset);
    }
    time(&t2);
    printf("%ld sec\n", t2-t1);
    close(fd);
    if (umount("/root/test1/") < 0) {
    perror("cannot umount\n");
    exit(1);
    }
    }


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  15. Re: [PATCH] VFS: Pagecache usage optimization onpagesize!=blocksize environment

    On Tue, 27 May 2008 18:34:02 +0900 Hisashi Hifumi wrote:

    > When we read some part of a file through pagecache, if there is a pagecache
    > of corresponding index but this page is not uptodate, read IO is issued and
    > this page will be uptodate.
    > I think this is good for pagesize == blocksize environment but there is room
    > for improvement on pagesize != blocksize environment. Because in this case
    > a page can have multiple buffers and even if a page is not uptodate, some buffers
    > can be uptodate. So I suggest that when all buffers which correspond to a part
    > of a file that we want to read are uptodate, use this pagecache and copy data
    > from this pagecache to user buffer even if a page is not uptodate. This can
    > reduce read IO and improve system throughput.
    >
    > v2: add new address_space_operations member is_partially_uptodate, and
    > block_is_partially_uptodate was registered to ext2/3/4's aops.
    > modify do_generic_file_read to use this aops callback.
    > v3: use unsigned instead of unsigned long in block_is_partially_uptodate.
    > cleaned up and simplified page buffer iteration code in block_is_partially_uptodate.
    >
    > Signed-off-by :Hisashi Hifumi


    ext4 has now gained a new set of address_space_operations:
    ext4_da_aops. I put a

    .is_partially_uptodate = block_is_partially_uptodate,

    into there as well, but I have no clue whether it will work OK.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread