[PATCH 00/32] Swap over NFS - v19 - Kernel


Thread: [PATCH 00/32] Swap over NFS - v19

  1. [PATCH 26/32] mm: add support for non block device backed swap files

    New address_space_operations methods are added:
    int swapon(struct file *);
    int swapoff(struct file *);
    int swap_out(struct file *, struct page *, struct writeback_control *);
    int swap_in(struct file *, struct page *);

    When, during sys_swapon(), the ->swapon() method is found and returns no
    error, the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops
    and make use of ->swap_{out,in}() to write/read swapcache pages.

    The ->swapon() method will be used to communicate to the file that the VM
    relies on it, and the address_space should take adequate measures (like
    reserving memory for mempools or the like). The ->swapoff() method will be
    called on sys_swapoff() when ->swapon() was found and returned no error.

    This new interface can be used to obviate the need for ->bmap in the swapfile
    code. A filesystem would need to load (and maybe even allocate) the full block
    map for a file into memory and pin it there on ->swapon() so that
    ->swap_{out,in}() have instant access to it. It can be released on ->swapoff().

    The reason to provide ->swap_{out,in}() instead of using {write,read}page() is to
    1) make a distinction between swapcache and pagecache pages, and
    2) provide a struct file * for credential context (normally not needed in
    the context of writepage, as the page content is normally dirtied using one
    of the following interfaces:
    write_{begin,end}()
    {prepare,commit}_write()
    page_mkwrite()
    which do have the file context).
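    The SWP_FILE dispatch this commit message describes can be sketched in
    plain userspace C. Everything below is a simplified model with hypothetical
    stand-in types and names (model_swap_writepage, demo_swap_out, etc.), not
    the kernel's structures; it only illustrates how swap I/O is proxied to the
    backing file's a_ops when the swap area is file-backed:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Simplified stand-ins for the kernel objects involved. */
    #define SWP_USED    (1 << 0)
    #define SWP_WRITEOK (1 << 1)
    #define SWP_FILE    (1 << 2)    /* file swap area */

    struct page { int id; };
    struct file;

    struct address_space_operations {
        int (*swap_out)(struct file *file, struct page *page);
    };

    struct file {
        const struct address_space_operations *a_ops;
    };

    struct swap_info_struct {
        unsigned int flags;
        struct file *swap_file;
    };

    /* What a filesystem (e.g. NFS) would supply as its ->swap_out() method. */
    static int demo_swap_out(struct file *file, struct page *page)
    {
        (void)file;
        printf("a_ops->swap_out(page %d)\n", page->id);
        return 0;
    }

    /* Mirrors the shape of swap_writepage(): when SWP_FILE is set, proxy to
     * the backing file's address_space_operations; otherwise fall back to
     * the block-device bio path. */
    static int model_swap_writepage(struct swap_info_struct *sis, struct page *page)
    {
        if (sis->flags & SWP_FILE)
            return sis->swap_file->a_ops->swap_out(sis->swap_file, page);
        printf("bio write via block path (page %d)\n", page->id);
        return 0;
    }

    static const struct address_space_operations demo_aops = { .swap_out = demo_swap_out };
    static struct file demo_file = { .a_ops = &demo_aops };
    static struct swap_info_struct file_sis = {
        .flags = SWP_USED | SWP_WRITEOK | SWP_FILE,
        .swap_file = &demo_file,
    };
    static struct swap_info_struct blk_sis = { .flags = SWP_USED | SWP_WRITEOK };
    static struct page pg = { .id = 1 };

    int main(void)
    {
        assert(model_swap_writepage(&file_sis, &pg) == 0);  /* file-backed path */
        assert(model_swap_writepage(&blk_sis, &pg) == 0);   /* block path */
        return 0;
    }
    ```

    The same flag test gates swap_readpage(), swap_sync_page() and
    swap_set_page_dirty() in the actual patch below.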

    [miklos@szeredi.hu: cleanups]
    Signed-off-by: Peter Zijlstra
    ---
    Documentation/filesystems/Locking | 22 ++++++++++++++++
    Documentation/filesystems/vfs.txt | 18 +++++++++++++
    include/linux/buffer_head.h | 2 -
    include/linux/fs.h | 9 ++++++
    include/linux/swap.h | 4 ++
    mm/page_io.c | 52 ++++++++++++++++++++++++++++++++++++++
    mm/swap_state.c | 4 +-
    mm/swapfile.c | 32 +++++++++++++++++++++--
    8 files changed, 137 insertions(+), 6 deletions(-)

    Index: linux-2.6/include/linux/swap.h
    ===================================================================
    --- linux-2.6.orig/include/linux/swap.h
    +++ linux-2.6/include/linux/swap.h
    @@ -121,6 +121,7 @@ enum {
    SWP_USED = (1 << 0), /* is slot in swap_info[] used? */
    SWP_WRITEOK = (1 << 1), /* ok to write to this swap? */
    SWP_ACTIVE = (SWP_USED | SWP_WRITEOK),
    + SWP_FILE = (1 << 2), /* file swap area */
    /* add others here before... */
    SWP_SCANNING = (1 << 8), /* refcount in scan_swap_map */
    };
    @@ -274,6 +275,8 @@ extern void swap_unplug_io_fn(struct bac
    /* linux/mm/page_io.c */
    extern int swap_readpage(struct file *, struct page *);
    extern int swap_writepage(struct page *page, struct writeback_control *wbc);
    +extern void swap_sync_page(struct page *page);
    +extern int swap_set_page_dirty(struct page *page);
    extern void end_swap_bio_read(struct bio *bio, int err);

    /* linux/mm/swap_state.c */
    @@ -306,6 +309,7 @@ extern unsigned int count_swap_pages(int
    extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
    extern sector_t swapdev_block(int, pgoff_t);
    extern struct swap_info_struct *get_swap_info_struct(unsigned);
    +extern struct swap_info_struct *page_swap_info(struct page *);
    extern int can_share_swap_page(struct page *);
    extern int remove_exclusive_swap_page(struct page *);
    extern int remove_exclusive_swap_page_ref(struct page *);
    Index: linux-2.6/mm/page_io.c
    ===================================================================
    --- linux-2.6.orig/mm/page_io.c
    +++ linux-2.6/mm/page_io.c
    @@ -17,6 +17,7 @@
    #include
    #include
    #include
    +#include
    #include

    static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
    @@ -97,11 +98,23 @@ int swap_writepage(struct page *page, st
    {
    struct bio *bio;
    int ret = 0, rw = WRITE;
    + struct swap_info_struct *sis = page_swap_info(page);

    if (remove_exclusive_swap_page(page)) {
    unlock_page(page);
    goto out;
    }
    +
    + if (sis->flags & SWP_FILE) {
    + struct file *swap_file = sis->swap_file;
    + struct address_space *mapping = swap_file->f_mapping;
    +
    + ret = mapping->a_ops->swap_out(swap_file, page, wbc);
    + if (!ret)
    + count_vm_event(PSWPOUT);
    + return ret;
    + }
    +
    bio = get_swap_bio(GFP_NOIO, page_private(page), page,
    end_swap_bio_write);
    if (bio == NULL) {
    @@ -120,13 +133,52 @@ out:
    return ret;
    }

    +void swap_sync_page(struct page *page)
    +{
    + struct swap_info_struct *sis = page_swap_info(page);
    +
    + if (sis->flags & SWP_FILE) {
    + struct address_space *mapping = sis->swap_file->f_mapping;
    +
    + if (mapping->a_ops->sync_page)
    + mapping->a_ops->sync_page(page);
    + } else {
    + block_sync_page(page);
    + }
    +}
    +
    +int swap_set_page_dirty(struct page *page)
    +{
    + struct swap_info_struct *sis = page_swap_info(page);
    +
    + if (sis->flags & SWP_FILE) {
    + struct address_space *mapping = sis->swap_file->f_mapping;
    +
    + return mapping->a_ops->set_page_dirty(page);
    + } else {
    + return __set_page_dirty_nobuffers(page);
    + }
    +}
    +
    int swap_readpage(struct file *file, struct page *page)
    {
    struct bio *bio;
    int ret = 0;
    + struct swap_info_struct *sis = page_swap_info(page);

    BUG_ON(!PageLocked(page));
    BUG_ON(PageUptodate(page));
    +
    + if (sis->flags & SWP_FILE) {
    + struct file *swap_file = sis->swap_file;
    + struct address_space *mapping = swap_file->f_mapping;
    +
    + ret = mapping->a_ops->swap_in(swap_file, page);
    + if (!ret)
    + count_vm_event(PSWPIN);
    + return ret;
    + }
    +
    bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
    end_swap_bio_read);
    if (bio == NULL) {
    Index: linux-2.6/mm/swap_state.c
    ===================================================================
    --- linux-2.6.orig/mm/swap_state.c
    +++ linux-2.6/mm/swap_state.c
    @@ -27,8 +27,8 @@
    */
    static const struct address_space_operations swap_aops = {
    .writepage = swap_writepage,
    - .sync_page = block_sync_page,
    - .set_page_dirty = __set_page_dirty_nobuffers,
    + .sync_page = swap_sync_page,
    + .set_page_dirty = swap_set_page_dirty,
    .migratepage = migrate_page,
    };

    Index: linux-2.6/mm/swapfile.c
    ===================================================================
    --- linux-2.6.orig/mm/swapfile.c
    +++ linux-2.6/mm/swapfile.c
    @@ -1032,6 +1032,14 @@ static void destroy_swap_extents(struct
    list_del(&se->list);
    kfree(se);
    }
    +
    + if (sis->flags & SWP_FILE) {
    + struct file *swap_file = sis->swap_file;
    + struct address_space *mapping = swap_file->f_mapping;
    +
    + sis->flags &= ~SWP_FILE;
    + mapping->a_ops->swapoff(swap_file);
    + }
    }

    /*
    @@ -1106,7 +1114,9 @@ add_swap_extent(struct swap_info_struct
    */
    static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
    {
    - struct inode *inode;
    + struct file *swap_file = sis->swap_file;
    + struct address_space *mapping = swap_file->f_mapping;
    + struct inode *inode = mapping->host;
    unsigned blocks_per_page;
    unsigned long page_no;
    unsigned blkbits;
    @@ -1117,13 +1127,22 @@ static int setup_swap_extents(struct swa
    int nr_extents = 0;
    int ret;

    - inode = sis->swap_file->f_mapping->host;
    if (S_ISBLK(inode->i_mode)) {
    ret = add_swap_extent(sis, 0, sis->max, 0);
    *span = sis->pages;
    goto done;
    }

    + if (mapping->a_ops->swapon) {
    + ret = mapping->a_ops->swapon(swap_file);
    + if (!ret) {
    + sis->flags |= SWP_FILE;
    + ret = add_swap_extent(sis, 0, sis->max, 0);
    + *span = sis->pages;
    + }
    + goto done;
    + }
    +
    blkbits = inode->i_blkbits;
    blocks_per_page = PAGE_SIZE >> blkbits;

    @@ -1696,7 +1715,7 @@ asmlinkage long sys_swapon(const char __
    else
    p->prio = --least_priority;
    p->swap_map = swap_map;
    - p->flags = SWP_ACTIVE;
    + p->flags |= SWP_WRITEOK;
    nr_swap_pages += nr_good_pages;
    total_swap_pages += nr_good_pages;

    @@ -1817,6 +1836,13 @@ get_swap_info_struct(unsigned type)
    return &swap_info[type];
    }

    +struct swap_info_struct *page_swap_info(struct page *page)
    +{
    + swp_entry_t swap = { .val = page_private(page) };
    + BUG_ON(!PageSwapCache(page));
    + return &swap_info[swp_type(swap)];
    +}
    +
    /*
    * swap_lock prevents swap_map being freed. Don't grab an extra
    * reference on the swaphandle, it doesn't matter if it becomes unused.
    Index: linux-2.6/include/linux/fs.h
    ===================================================================
    --- linux-2.6.orig/include/linux/fs.h
    +++ linux-2.6/include/linux/fs.h
    @@ -507,6 +507,15 @@ struct address_space_operations {
    int (*launder_page) (struct page *);
    int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
    unsigned long);
    +
    + /*
    + * swapfile support
    + */
    + int (*swapon)(struct file *file);
    + int (*swapoff)(struct file *file);
    + int (*swap_out)(struct file *file, struct page *page,
    + struct writeback_control *wbc);
    + int (*swap_in)(struct file *file, struct page *page);
    };

    /*
    Index: linux-2.6/Documentation/filesystems/Locking
    ===================================================================
    --- linux-2.6.orig/Documentation/filesystems/Locking
    +++ linux-2.6/Documentation/filesystems/Locking
    @@ -169,6 +169,10 @@ prototypes:
    int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
    loff_t offset, unsigned long nr_segs);
    int (*launder_page) (struct page *);
    + int (*swapon) (struct file *);
    + int (*swapoff) (struct file *);
    + int (*swap_out) (struct file *, struct page *, struct writeback_control *);
    + int (*swap_in) (struct file *, struct page *);

    locking rules:
    All except set_page_dirty may block
    @@ -190,6 +194,10 @@ invalidatepage: no yes
    releasepage: no yes
    direct_IO: no
    launder_page: no yes
    +swapon no
    +swapoff no
    +swap_out no yes, unlocks
    +swap_in no yes, unlocks

    ->prepare_write(), ->commit_write(), ->sync_page() and ->readpage()
    may be called from the request handler (/dev/loop).
    @@ -289,6 +297,20 @@ cleaned, or an error value if not. Note
    getting mapped back in and redirtied, it needs to be kept locked
    across the entire operation.

    + ->swapon() will be called with a non-zero argument on files backing
    +(non block device backed) swapfiles. A return value of zero indicates success,
    +in which case this file can be used for backing swapspace. The swapspace
    +operations will be proxied to the address space operations.
    +
    + ->swapoff() will be called in the sys_swapoff() path when ->swapon()
    +returned success.
    +
    + ->swap_out() when swapon() returned success, this method is used to
    +write the swap page.
    +
    + ->swap_in() when swapon() returned success, this method is used to
    +read the swap page.
    +
    Note: currently almost all instances of address_space methods are
    using BKL for internal serialization and that's one of the worst sources
    of contention. Normally they are calling library functions (in fs/buffer.c)
    Index: linux-2.6/include/linux/buffer_head.h
    ===================================================================
    --- linux-2.6.orig/include/linux/buffer_head.h
    +++ linux-2.6/include/linux/buffer_head.h
    @@ -336,7 +336,7 @@ static inline void invalidate_inode_buff
    static inline int remove_inode_buffers(struct inode *inode) { return 1; }
    static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
    static inline void invalidate_bdev(struct block_device *bdev) {}
    -
    +static inline void block_sync_page(struct page *page) { }

    #endif /* CONFIG_BLOCK */
    #endif /* _LINUX_BUFFER_HEAD_H */
    Index: linux-2.6/Documentation/filesystems/vfs.txt
    ===================================================================
    --- linux-2.6.orig/Documentation/filesystems/vfs.txt
    +++ linux-2.6/Documentation/filesystems/vfs.txt
    @@ -539,6 +539,11 @@ struct address_space_operations {
    /* migrate the contents of a page to the specified target */
    int (*migratepage) (struct page *, struct page *);
    int (*launder_page) (struct page *);
    + int (*swapon)(struct file *);
    + int (*swapoff)(struct file *);
    + int (*swap_out)(struct file *file, struct page *page,
    + struct writeback_control *wbc);
    + int (*swap_in)(struct file *file, struct page *page);
    };

    writepage: called by the VM to write a dirty page to backing store.
    @@ -724,6 +729,19 @@ struct address_space_operations {
    prevent redirtying the page, it is kept locked during the whole
    operation.

    + swapon: Called when swapon is used on a file. A
    + return value of zero indicates success, in which case this
    + file can be used to back swapspace. The swapspace operations
    + will be proxied to this address space's ->swap_{out,in} methods.
    +
    + swapoff: Called during swapoff on files where swapon was successful.
    +
    + swap_out: Called to write a swapcache page to a backing store, similar to
    + writepage.
    +
    + swap_in: Called to read a swapcache page from a backing store, similar to
    + readpage.
    +
    The File Object
    ===============


    --

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [PATCH 31/32] nfs: enable swap on NFS

    Implement all the new swapfile a_ops for NFS. This will set the NFS socket to
    SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset
    SOCK_MEMALLOC before engaging the protocol ->connect() method.

    PF_MEMALLOC should allow the allocation of struct socket and related objects
    and the early (re)setting of SOCK_MEMALLOC should allow us to receive the
    packets required for the TCP connection buildup.

    (swapping continues over a server reset during heavy network traffic)
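    The PF_MEMALLOC save/set/restore pattern used by the connect workers in the
    patch below can be modeled in userspace. This is a hypothetical sketch:
    struct task and the helper names are stand-ins for task_struct,
    tsk_restore_flags() and the xs_*_connect_worker*() functions, illustrating
    only the flag bookkeeping, not the socket work:

    ```c
    #include <assert.h>

    #define PF_MEMALLOC 0x00000800  /* stand-in for the kernel's flag value */

    struct task { unsigned long flags; };

    /* Model of tsk_restore_flags(): clear @bits, then put back whatever
     * @orig had in those bits.  A PF_MEMALLOC the task already carried
     * survives; one we set ourselves for the connect is dropped. */
    static void model_tsk_restore_flags(struct task *t, unsigned long orig,
                                        unsigned long bits)
    {
        t->flags &= ~bits;
        t->flags |= orig & bits;
    }

    /* Shape of the connect workers: remember the entry flags, run the
     * (re)connect with PF_MEMALLOC if this transport backs swap, restore. */
    static unsigned long model_connect_worker(struct task *t, int swapper)
    {
        unsigned long pflags = t->flags;

        if (swapper)
            t->flags |= PF_MEMALLOC;  /* reconnect may allocate from reserves */

        /* ... socket teardown and (re)connect would run here ... */

        model_tsk_restore_flags(t, pflags, PF_MEMALLOC);
        return t->flags;
    }

    int main(void)
    {
        struct task t = { .flags = 0 };
        assert(model_connect_worker(&t, 1) == 0);            /* set, then dropped */

        t.flags = PF_MEMALLOC;                               /* caller already had it */
        assert(model_connect_worker(&t, 0) == PF_MEMALLOC);  /* preserved */
        return 0;
    }
    ```

    Restoring only the PF_MEMALLOC bit, rather than assigning pflags back
    wholesale, avoids clobbering any other flag changes made during the
    connect.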

    Signed-off-by: Peter Zijlstra
    ---
    fs/Kconfig | 17 ++++++++++
    fs/nfs/file.c | 18 ++++++++++
    fs/nfs/write.c | 22 +++++++++++++
    include/linux/nfs_fs.h | 2 +
    include/linux/sunrpc/xprt.h | 5 ++-
    net/sunrpc/sched.c | 9 ++++-
    net/sunrpc/xprtsock.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
    7 files changed, 143 insertions(+), 3 deletions(-)

    Index: linux-2.6/fs/nfs/file.c
    ===================================================================
    --- linux-2.6.orig/fs/nfs/file.c
    +++ linux-2.6/fs/nfs/file.c
    @@ -434,6 +434,18 @@ static int nfs_launder_page(struct page
    return nfs_wb_page(inode, page);
    }

    +#ifdef CONFIG_NFS_SWAP
    +static int nfs_swapon(struct file *file)
    +{
    + return xs_swapper(NFS_CLIENT(file->f_mapping->host)->cl_xprt, 1);
    +}
    +
    +static int nfs_swapoff(struct file *file)
    +{
    + return xs_swapper(NFS_CLIENT(file->f_mapping->host)->cl_xprt, 0);
    +}
    +#endif
    +
    const struct address_space_operations nfs_file_aops = {
    .readpage = nfs_readpage,
    .readpages = nfs_readpages,
    @@ -446,6 +458,12 @@ const struct address_space_operations nf
    .releasepage = nfs_release_page,
    .direct_IO = nfs_direct_IO,
    .launder_page = nfs_launder_page,
    +#ifdef CONFIG_NFS_SWAP
    + .swapon = nfs_swapon,
    + .swapoff = nfs_swapoff,
    + .swap_out = nfs_swap_out,
    + .swap_in = nfs_readpage,
    +#endif
    };

    static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct page *page)
    Index: linux-2.6/fs/nfs/write.c
    ===================================================================
    --- linux-2.6.orig/fs/nfs/write.c
    +++ linux-2.6/fs/nfs/write.c
    @@ -333,6 +333,28 @@ int nfs_writepage(struct page *page, str
    return ret;
    }

    +static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
    + unsigned int offset, unsigned int count);
    +
    +int nfs_swap_out(struct file *file, struct page *page,
    + struct writeback_control *wbc)
    +{
    + struct nfs_open_context *ctx = nfs_file_open_context(file);
    + int status;
    +
    + status = nfs_writepage_setup(ctx, page, 0, nfs_page_length(page));
    + if (status < 0) {
    + nfs_set_pageerror(page);
    + goto out;
    + }
    +
    + status = nfs_writepage_locked(page, wbc);
    +
    +out:
    + unlock_page(page);
    + return status;
    +}
    +
    static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
    {
    int ret;
    Index: linux-2.6/include/linux/nfs_fs.h
    ===================================================================
    --- linux-2.6.orig/include/linux/nfs_fs.h
    +++ linux-2.6/include/linux/nfs_fs.h
    @@ -462,6 +462,8 @@ extern int nfs_flush_incompatible(struc
    extern int nfs_updatepage(struct file *, struct page *, unsigned int, unsigned int);
    extern int nfs_writeback_done(struct rpc_task *, struct nfs_write_data *);
    extern void nfs_writedata_release(void *);
    +extern int nfs_swap_out(struct file *file, struct page *page,
    + struct writeback_control *wbc);

    /*
    * Try to write back everything synchronously (but check the
    Index: linux-2.6/fs/Kconfig
    ===================================================================
    --- linux-2.6.orig/fs/Kconfig
    +++ linux-2.6/fs/Kconfig
    @@ -1205,6 +1205,18 @@ config ROOT_NFS

    Most people say N here.

    +config NFS_SWAP
    + bool "Provide swap over NFS support"
    + default n
    + depends on NFS_FS
    + select SUNRPC_SWAP
    + help
    + This option enables swapon to work on files located on NFS mounts.
    +
    + For more details, see Documentation/network-swap.txt
    +
    + If unsure, say N.
    +
    config NFSD
    tristate "NFS server support"
    depends on INET
    @@ -1348,6 +1360,11 @@ config SUNRPC_REGISTER_V4
    RPC services using only rpcbind version 2). Distributions
    using the legacy Linux portmapper daemon must say N here.

    +config SUNRPC_SWAP
    + def_bool n
    + depends on SUNRPC
    + select NETVM
    +
    config RPCSEC_GSS_KRB5
    tristate "Secure RPC: Kerberos V mechanism (EXPERIMENTAL)"
    depends on SUNRPC && EXPERIMENTAL
    Index: linux-2.6/include/linux/sunrpc/xprt.h
    ===================================================================
    --- linux-2.6.orig/include/linux/sunrpc/xprt.h
    +++ linux-2.6/include/linux/sunrpc/xprt.h
    @@ -147,7 +147,9 @@ struct rpc_xprt {
    unsigned int max_reqs; /* total slots */
    unsigned long state; /* transport state */
    unsigned char shutdown : 1, /* being shut down */
    - resvport : 1; /* use a reserved port */
    + resvport : 1, /* use a reserved port */
    + swapper : 1; /* we're swapping over this
    + transport */
    unsigned int bind_index; /* bind function index */

    /*
    @@ -249,6 +251,7 @@ void xprt_release_rqst_cong(struct rpc
    void xprt_disconnect_done(struct rpc_xprt *xprt);
    void xprt_force_disconnect(struct rpc_xprt *xprt);
    void xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie);
    +int xs_swapper(struct rpc_xprt *xprt, int enable);

    /*
    * Reserved bit positions in xprt->state
    Index: linux-2.6/net/sunrpc/sched.c
    ===================================================================
    --- linux-2.6.orig/net/sunrpc/sched.c
    +++ linux-2.6/net/sunrpc/sched.c
    @@ -729,7 +729,10 @@ struct rpc_buffer {
    void *rpc_malloc(struct rpc_task *task, size_t size)
    {
    struct rpc_buffer *buf;
    - gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT;
    + gfp_t gfp = GFP_NOWAIT;
    +
    + if (RPC_IS_SWAPPER(task))
    + gfp |= __GFP_MEMALLOC;

    size += sizeof(struct rpc_buffer);
    if (size <= RPC_BUFFER_MAXSIZE)
    @@ -800,6 +803,8 @@ static void rpc_init_task(struct rpc_tas
    kref_get(&task->tk_client->cl_kref);
    if (task->tk_client->cl_softrtry)
    task->tk_flags |= RPC_TASK_SOFT;
    + if (task->tk_client->cl_xprt->swapper)
    + task->tk_flags |= RPC_TASK_SWAPPER;
    }

    if (task->tk_ops->rpc_call_prepare != NULL)
    @@ -825,7 +830,7 @@ static void rpc_init_task(struct rpc_tas
    static struct rpc_task *
    rpc_alloc_task(void)
    {
    - return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
    + return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
    }

    /*
    Index: linux-2.6/net/sunrpc/xprtsock.c
    ===================================================================
    --- linux-2.6.orig/net/sunrpc/xprtsock.c
    +++ linux-2.6/net/sunrpc/xprtsock.c
    @@ -1445,6 +1445,55 @@ static inline void xs_reclassify_socket6
    }
    #endif

    +#ifdef CONFIG_SUNRPC_SWAP
    +static void xs_set_memalloc(struct rpc_xprt *xprt)
    +{
    + struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
    +
    + if (xprt->swapper)
    + sk_set_memalloc(transport->inet);
    +}
    +
    +#define RPC_BUF_RESERVE_PAGES \
    + kmalloc_estimate_objs(sizeof(struct rpc_rqst), GFP_KERNEL, RPC_MAX_SLOT_TABLE)
    +#define RPC_RESERVE_PAGES (RPC_BUF_RESERVE_PAGES + TX_RESERVE_PAGES)
    +
    +/**
    + * xs_swapper - Tag this transport as being used for swap.
    + * @xprt: transport to tag
    + * @enable: enable/disable
    + *
    + */
    +int xs_swapper(struct rpc_xprt *xprt, int enable)
    +{
    + struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
    + int err = 0;
    +
    + if (enable) {
    + /*
    + * keep one extra sock reference so the reserve won't dip
    + * when the socket gets reconnected.
    + */
    + err = sk_adjust_memalloc(1, RPC_RESERVE_PAGES);
    + if (!err) {
    + xprt->swapper = 1;
    + xs_set_memalloc(xprt);
    + }
    + } else if (xprt->swapper) {
    + xprt->swapper = 0;
    + sk_clear_memalloc(transport->inet);
    + sk_adjust_memalloc(-1, -RPC_RESERVE_PAGES);
    + }
    +
    + return err;
    +}
    +EXPORT_SYMBOL_GPL(xs_swapper);
    +#else
    +static void xs_set_memalloc(struct rpc_xprt *xprt)
    +{
    +}
    +#endif
    +
    static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
    {
    struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
    @@ -1469,6 +1518,8 @@ static void xs_udp_finish_connecting(str
    transport->sock = sock;
    transport->inet = sk;

    + xs_set_memalloc(xprt);
    +
    write_unlock_bh(&sk->sk_callback_lock);
    }
    xs_udp_do_set_buffer_size(xprt);
    @@ -1486,11 +1537,15 @@ static void xs_udp_connect_worker4(struc
    container_of(work, struct sock_xprt, connect_worker.work);
    struct rpc_xprt *xprt = &transport->xprt;
    struct socket *sock = transport->sock;
    + unsigned long pflags = current->flags;
    int err, status = -EIO;

    if (xprt->shutdown || !xprt_bound(xprt))
    goto out;

    + if (xprt->swapper)
    + current->flags |= PF_MEMALLOC;
    +
    /* Start by resetting any existing state */
    xs_close(xprt);

    @@ -1513,6 +1568,7 @@ static void xs_udp_connect_worker4(struc
    out:
    xprt_wake_pending_tasks(xprt, status);
    xprt_clear_connecting(xprt);
    + tsk_restore_flags(current, pflags, PF_MEMALLOC);
    }

    /**
    @@ -1527,11 +1583,15 @@ static void xs_udp_connect_worker6(struc
    container_of(work, struct sock_xprt, connect_worker.work);
    struct rpc_xprt *xprt = &transport->xprt;
    struct socket *sock = transport->sock;
    + unsigned long pflags = current->flags;
    int err, status = -EIO;

    if (xprt->shutdown || !xprt_bound(xprt))
    goto out;

    + if (xprt->swapper)
    + current->flags |= PF_MEMALLOC;
    +
    /* Start by resetting any existing state */
    xs_close(xprt);

    @@ -1554,6 +1614,7 @@ static void xs_udp_connect_worker6(struc
    out:
    xprt_wake_pending_tasks(xprt, status);
    xprt_clear_connecting(xprt);
    + tsk_restore_flags(current, pflags, PF_MEMALLOC);
    }

    /*
    @@ -1613,6 +1674,8 @@ static int xs_tcp_finish_connecting(stru
    write_unlock_bh(&sk->sk_callback_lock);
    }

    + xs_set_memalloc(xprt);
    +
    /* Tell the socket layer to start connecting... */
    xprt->stat.connect_count++;
    xprt->stat.connect_start = jiffies;
    @@ -1631,11 +1694,15 @@ static void xs_tcp_connect_worker4(struc
    container_of(work, struct sock_xprt, connect_worker.work);
    struct rpc_xprt *xprt = &transport->xprt;
    struct socket *sock = transport->sock;
    + unsigned long pflags = current->flags;
    int err, status = -EIO;

    if (xprt->shutdown || !xprt_bound(xprt))
    goto out;

    + if (xprt->swapper)
    + current->flags |= PF_MEMALLOC;
    +
    if (!sock) {
    /* start from scratch */
    if ((err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
    @@ -1677,6 +1744,7 @@ out:
    xprt_wake_pending_tasks(xprt, status);
    out_clear:
    xprt_clear_connecting(xprt);
    + tsk_restore_flags(current, pflags, PF_MEMALLOC);
    }

    /**
    @@ -1691,11 +1759,15 @@ static void xs_tcp_connect_worker6(struc
    container_of(work, struct sock_xprt, connect_worker.work);
    struct rpc_xprt *xprt = &transport->xprt;
    struct socket *sock = transport->sock;
    + unsigned long pflags = current->flags;
    int err, status = -EIO;

    if (xprt->shutdown || !xprt_bound(xprt))
    goto out;

    + if (xprt->swapper)
    + current->flags |= PF_MEMALLOC;
    +
    if (!sock) {
    /* start from scratch */
    if ((err = sock_create_kern(PF_INET6, SOCK_STREAM, IPPROTO_TCP, &sock)) < 0) {
    @@ -1736,6 +1808,7 @@ out:
    xprt_wake_pending_tasks(xprt, status);
    out_clear:
    xprt_clear_connecting(xprt);
    + tsk_restore_flags(current, pflags, PF_MEMALLOC);
    }

    /**

    --


  3. [PATCH 29/32] nfs: teach the NFS client how to treat PG_swapcache pages

    Replace all relevant occurrences of page->index and page->mapping in the
    NFS client with the new page_file_index() and page_file_mapping() functions.
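    The behavior these helpers need can be sketched in userspace C. All names
    and types here are simplified hypothetical stand-ins (the real helpers
    consult PageSwapCache() and derive the index via swp_offset()); the sketch
    only shows why swapcache pages must not use page->mapping and page->index
    directly:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Simplified stand-ins for the kernel objects involved. */
    struct address_space { int id; };

    static struct address_space pagecache_mapping = { .id = 0 };
    static struct address_space swap_file_mapping = { .id = 1 };

    struct page {
        int swapcache;                  /* models PageSwapCache(page) */
        struct address_space *mapping;  /* swapper_space for swapcache pages */
        unsigned long index;            /* pagecache index */
        unsigned long private;          /* holds the swap entry when swapped */
    };

    /* Swapcache pages belong to the swap file's mapping, not to
     * page->mapping (which points at swapper_space). */
    static struct address_space *model_page_file_mapping(struct page *page)
    {
        if (page->swapcache)
            return &swap_file_mapping;
        return page->mapping;
    }

    /* For swapcache pages the offset into the swap file comes from the
     * swap entry in page_private(), not from page->index. */
    static unsigned long model_page_file_index(struct page *page)
    {
        if (page->swapcache)
            return page->private;
        return page->index;
    }

    int main(void)
    {
        struct page cache_pg = { .swapcache = 0, .mapping = &pagecache_mapping,
                                 .index = 7, .private = 0 };
        struct page swap_pg  = { .swapcache = 1, .mapping = NULL,
                                 .index = 0, .private = 42 };

        assert(model_page_file_mapping(&cache_pg) == &pagecache_mapping);
        assert(model_page_file_index(&cache_pg) == 7);
        assert(model_page_file_mapping(&swap_pg) == &swap_file_mapping);
        assert(model_page_file_index(&swap_pg) == 42);
        return 0;
    }
    ```

    For ordinary pagecache pages the helpers degenerate to page->mapping and
    page->index, which is why the substitution below is safe for the
    non-swap path.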

    Signed-off-by: Peter Zijlstra
    ---
    fs/nfs/file.c | 6 +++---
    fs/nfs/internal.h | 7 ++++---
    fs/nfs/pagelist.c | 6 +++---
    fs/nfs/read.c | 6 +++---
    fs/nfs/write.c | 53 +++++++++++++++++++++++++++--------------------------
    5 files changed, 40 insertions(+), 38 deletions(-)

    Index: linux-2.6/fs/nfs/file.c
    ===================================================================
    --- linux-2.6.orig/fs/nfs/file.c
    +++ linux-2.6/fs/nfs/file.c
    @@ -413,7 +413,7 @@ static void nfs_invalidate_page(struct p
    if (offset != 0)
    return;
    /* Cancel any unstarted writes on this page */
    - nfs_wb_page_cancel(page->mapping->host, page);
    + nfs_wb_page_cancel(page_file_mapping(page)->host, page);
    }

    static int nfs_release_page(struct page *page, gfp_t gfp)
    @@ -426,7 +426,7 @@ static int nfs_release_page(struct page

    static int nfs_launder_page(struct page *page)
    {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;

    dfprintk(PAGECACHE, "NFS: launder_page(%ld, %llu)\n",
    inode->i_ino, (long long)page_offset(page));
    @@ -462,7 +462,7 @@ static int nfs_vm_page_mkwrite(struct vm
    (long long)page_offset(page));

    lock_page(page);
    - mapping = page->mapping;
    + mapping = page_file_mapping(page);
    if (mapping != dentry->d_inode->i_mapping)
    goto out_unlock;

    Index: linux-2.6/fs/nfs/pagelist.c
    ===================================================================
    --- linux-2.6.orig/fs/nfs/pagelist.c
    +++ linux-2.6/fs/nfs/pagelist.c
    @@ -76,11 +76,11 @@ nfs_create_request(struct nfs_open_conte
    * update_nfs_request below if the region is not locked. */
    req->wb_page = page;
    atomic_set(&req->wb_complete, 0);
    - req->wb_index = page->index;
    + req->wb_index = page_file_index(page);
    page_cache_get(page);
    BUG_ON(PagePrivate(page));
    BUG_ON(!PageLocked(page));
    - BUG_ON(page->mapping->host != inode);
    + BUG_ON(page_file_mapping(page)->host != inode);
    req->wb_offset = offset;
    req->wb_pgbase = offset;
    req->wb_bytes = count;
    @@ -376,7 +376,7 @@ void nfs_pageio_cond_complete(struct nfs
    * nfs_scan_list - Scan a list for matching requests
    * @nfsi: NFS inode
    * @dst: Destination list
    - * @idx_start: lower bound of page->index to scan
    + * @idx_start: lower bound of page_file_index(page) to scan
    * @npages: idx_start + npages sets the upper bound to scan.
    * @tag: tag to scan for
    *
    Index: linux-2.6/fs/nfs/read.c
    ===================================================================
    --- linux-2.6.orig/fs/nfs/read.c
    +++ linux-2.6/fs/nfs/read.c
    @@ -474,11 +474,11 @@ static const struct rpc_call_ops nfs_rea
    int nfs_readpage(struct file *file, struct page *page)
    {
    struct nfs_open_context *ctx;
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    int error;

    dprintk("NFS: nfs_readpage (%p %ld@%lu)\n",
    - page, PAGE_CACHE_SIZE, page->index);
    + page, PAGE_CACHE_SIZE, page_file_index(page));
    nfs_inc_stats(inode, NFSIOS_VFSREADPAGE);
    nfs_add_stats(inode, NFSIOS_READPAGES, 1);

    @@ -525,7 +525,7 @@ static int
    readpage_async_filler(void *data, struct page *page)
    {
    struct nfs_readdesc *desc = (struct nfs_readdesc *)data;
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    struct nfs_page *new;
    unsigned int len;
    int error;
    Index: linux-2.6/fs/nfs/write.c
    ===================================================================
    --- linux-2.6.orig/fs/nfs/write.c
    +++ linux-2.6/fs/nfs/write.c
    @@ -115,7 +115,7 @@ static struct nfs_page *nfs_page_find_re

    static struct nfs_page *nfs_page_find_request(struct page *page)
    {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    struct nfs_page *req = NULL;

    spin_lock(&inode->i_lock);
    @@ -127,16 +127,16 @@ static struct nfs_page *nfs_page_find_re
    /* Adjust the file length if we're writing beyond the end */
    static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count)
    {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    loff_t end, i_size;
    pgoff_t end_index;

    spin_lock(&inode->i_lock);
    i_size = i_size_read(inode);
    end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
    - if (i_size > 0 && page->index < end_index)
    + if (i_size > 0 && page_file_index(page) < end_index)
    goto out;
    - end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count);
    + end = page_file_offset(page) + ((loff_t)offset+count);
    if (i_size >= end)
    goto out;
    i_size_write(inode, end);
    @@ -149,7 +149,7 @@ out:
    static void nfs_set_pageerror(struct page *page)
    {
    SetPageError(page);
    - nfs_zap_mapping(page->mapping->host, page->mapping);
    + nfs_zap_mapping(page_file_mapping(page)->host, page_file_mapping(page));
    }

    /* We can set the PG_uptodate flag if we see that a write request
    @@ -190,7 +190,7 @@ static int nfs_set_page_writeback(struct
    int ret = test_set_page_writeback(page);

    if (!ret) {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    struct nfs_server *nfss = NFS_SERVER(inode);

    if (atomic_long_inc_return(&nfss->writeback) >
    @@ -202,7 +202,7 @@ static int nfs_set_page_writeback(struct

    static void nfs_end_page_writeback(struct page *page)
    {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    struct nfs_server *nfss = NFS_SERVER(inode);

    end_page_writeback(page);
    @@ -217,7 +217,7 @@ static void nfs_end_page_writeback(struc
    static int nfs_page_async_flush(struct nfs_pageio_descriptor *pgio,
    struct page *page)
    {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    struct nfs_page *req;
    int ret;

    @@ -260,12 +260,12 @@ static int nfs_page_async_flush(struct n

    static int nfs_do_writepage(struct page *page, struct writeback_control *wbc, struct nfs_pageio_descriptor *pgio)
    {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;

    nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE);
    nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1);

    - nfs_pageio_cond_complete(pgio, page->index);
    + nfs_pageio_cond_complete(pgio, page_file_index(page));
    return nfs_page_async_flush(pgio, page);
    }

    @@ -277,7 +277,7 @@ static int nfs_writepage_locked(struct p
    struct nfs_pageio_descriptor pgio;
    int err;

    - nfs_pageio_init_write(&pgio, page->mapping->host, wb_priority(wbc));
    + nfs_pageio_init_write(&pgio, page_file_mapping(page)->host, wb_priority(wbc));
    err = nfs_do_writepage(page, wbc, &pgio);
    nfs_pageio_complete(&pgio);
    if (err < 0)
    @@ -406,7 +406,8 @@ nfs_mark_request_commit(struct nfs_page
    NFS_PAGE_TAG_COMMIT);
    spin_unlock(&inode->i_lock);
    inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
    - inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
    + inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
    + BDI_RECLAIMABLE);
    __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
    }

    @@ -417,7 +418,7 @@ nfs_clear_request_commit(struct nfs_page

    if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
    dec_zone_page_state(page, NR_UNSTABLE_NFS);
    - dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
    + dec_bdi_stat(page_file_mapping(page)->backing_dev_info, BDI_RECLAIMABLE);
    return 1;
    }
    return 0;
    @@ -523,7 +524,7 @@ static void nfs_cancel_commit_list(struc
    * nfs_scan_commit - Scan an inode for commit requests
    * @inode: NFS inode to scan
    * @dst: destination list
    - * @idx_start: lower bound of page->index to scan.
    + * @idx_start: lower bound of page_file_index(page) to scan.
    * @npages: idx_start + npages sets the upper bound to scan.
    *
    * Moves requests from the inode's 'commit' request list.
    @@ -634,7 +635,7 @@ out_err:
    static struct nfs_page * nfs_setup_write_request(struct nfs_open_context* ctx,
    struct page *page, unsigned int offset, unsigned int bytes)
    {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    struct nfs_page *req;
    int error;

    @@ -689,7 +690,7 @@ int nfs_flush_incompatible(struct file *
    nfs_release_request(req);
    if (!do_flush)
    return 0;
    - status = nfs_wb_page(page->mapping->host, page);
    + status = nfs_wb_page(page_file_mapping(page)->host, page);
    } while (status == 0);
    return status;
    }
    @@ -715,7 +716,7 @@ int nfs_updatepage(struct file *file, st
    unsigned int offset, unsigned int count)
    {
    struct nfs_open_context *ctx = nfs_file_open_context(file);
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;
    int status = 0;

    nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE);
    @@ -723,7 +724,7 @@ int nfs_updatepage(struct file *file, st
    dprintk("NFS: nfs_updatepage(%s/%s %d@%lld)\n",
    file->f_path.dentry->d_parent->d_name.name,
    file->f_path.dentry->d_name.name, count,
    - (long long)(page_offset(page) + offset));
    + (long long)(page_file_offset(page) + offset));

    /* If we're not using byte range locks, and we know the page
    * is up to date, it may be more efficient to extend the write
    @@ -998,7 +999,7 @@ static void nfs_writeback_release_partia
    }

    if (nfs_write_need_commit(data)) {
    - struct inode *inode = page->mapping->host;
    + struct inode *inode = page_file_mapping(page)->host;

    spin_lock(&inode->i_lock);
    if (test_bit(PG_NEED_RESCHED, &req->wb_flags)) {
    @@ -1259,7 +1260,7 @@ nfs_commit_list(struct inode *inode, str
    nfs_list_remove_request(req);
    nfs_mark_request_commit(req);
    dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
    - dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
    + dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
    BDI_RECLAIMABLE);
    nfs_clear_page_tag_locked(req);
    }
    @@ -1450,10 +1451,10 @@ int nfs_wb_nocommit(struct inode *inode)
    int nfs_wb_page_cancel(struct inode *inode, struct page *page)
    {
    struct nfs_page *req;
    - loff_t range_start = page_offset(page);
    + loff_t range_start = page_file_offset(page);
    loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
    struct writeback_control wbc = {
    - .bdi = page->mapping->backing_dev_info,
    + .bdi = page_file_mapping(page)->backing_dev_info,
    .sync_mode = WB_SYNC_ALL,
    .nr_to_write = LONG_MAX,
    .range_start = range_start,
    @@ -1486,7 +1487,7 @@ int nfs_wb_page_cancel(struct inode *ino
    }
    if (!PagePrivate(page))
    return 0;
    - ret = nfs_sync_mapping_wait(page->mapping, &wbc, FLUSH_INVALIDATE);
    + ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, FLUSH_INVALIDATE);
    out:
    return ret;
    }
    @@ -1494,10 +1495,10 @@ out:
    static int nfs_wb_page_priority(struct inode *inode, struct page *page,
    int how)
    {
    - loff_t range_start = page_offset(page);
    + loff_t range_start = page_file_offset(page);
    loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
    struct writeback_control wbc = {
    - .bdi = page->mapping->backing_dev_info,
    + .bdi = page_file_mapping(page)->backing_dev_info,
    .sync_mode = WB_SYNC_ALL,
    .nr_to_write = LONG_MAX,
    .range_start = range_start,
    @@ -1512,7 +1513,7 @@ static int nfs_wb_page_priority(struct i
    goto out_error;
    } else if (!PagePrivate(page))
    break;
    - ret = nfs_sync_mapping_wait(page->mapping, &wbc, how);
    + ret = nfs_sync_mapping_wait(page_file_mapping(page), &wbc, how);
    if (ret < 0)
    goto out_error;
    } while (PagePrivate(page));
    Index: linux-2.6/fs/nfs/internal.h
    ===================================================================
    --- linux-2.6.orig/fs/nfs/internal.h
    +++ linux-2.6/fs/nfs/internal.h
    @@ -253,13 +253,14 @@ void nfs_super_set_maxbytes(struct super
    static inline
    unsigned int nfs_page_length(struct page *page)
    {
    - loff_t i_size = i_size_read(page->mapping->host);
    + loff_t i_size = i_size_read(page_file_mapping(page)->host);

    if (i_size > 0) {
    + pgoff_t page_index = page_file_index(page);
    pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
    - if (page->index < end_index)
    + if (page_index < end_index)
    return PAGE_CACHE_SIZE;
    - if (page->index == end_index)
    + if (page_index == end_index)
    return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
    }
    return 0;

    --

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. [PATCH 04/32] net: ipv6: initialize ip6_route sysctl vars in ip6_route_net_init()

    This makes ip6_route_net_init() do all of the route init code.
    There used to be a race between ip6_route_net_init() and ip6_net_init(),
    and anyone relying on the combined result was left out in the cold.

    Signed-off-by: Peter Zijlstra
    ---
    net/ipv6/af_inet6.c | 8 --------
    net/ipv6/route.c | 9 +++++++++
    2 files changed, 9 insertions(+), 8 deletions(-)

    Index: linux-2.6/net/ipv6/af_inet6.c
    ===================================================================
    --- linux-2.6.orig/net/ipv6/af_inet6.c
    +++ linux-2.6/net/ipv6/af_inet6.c
    @@ -839,14 +839,6 @@ static int inet6_net_init(struct net *ne
    int err = 0;

    net->ipv6.sysctl.bindv6only = 0;
    - net->ipv6.sysctl.flush_delay = 0;
    - net->ipv6.sysctl.ip6_rt_max_size = 4096;
    - net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
    - net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
    - net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
    - net->ipv6.sysctl.ip6_rt_gc_elasticity = 9;
    - net->ipv6.sysctl.ip6_rt_mtu_expires = 10*60*HZ;
    - net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
    net->ipv6.sysctl.icmpv6_time = 1*HZ;

    #ifdef CONFIG_PROC_FS
    Index: linux-2.6/net/ipv6/route.c
    ===================================================================
    --- linux-2.6.orig/net/ipv6/route.c
    +++ linux-2.6/net/ipv6/route.c
    @@ -2627,6 +2627,15 @@ static int ip6_route_net_init(struct net
    net->ipv6.ip6_blk_hole_entry->u.dst.ops = net->ipv6.ip6_dst_ops;
    #endif

    + net->ipv6.sysctl.flush_delay = 0;
    + net->ipv6.sysctl.ip6_rt_max_size = 4096;
    + net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
    + net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
    + net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;
    + net->ipv6.sysctl.ip6_rt_gc_elasticity = 9;
    + net->ipv6.sysctl.ip6_rt_mtu_expires = 10*60*HZ;
    + net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
    +
    #ifdef CONFIG_PROC_FS
    proc_net_fops_create(net, "ipv6_route", 0, &ipv6_route_proc_fops);
    proc_net_fops_create(net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);

    --


  5. [PATCH 10/32] mm: allow PF_MEMALLOC from softirq context

    This is needed to allow network softirq packet processing to make use of
    PF_MEMALLOC.

    Currently softirq context cannot use PF_MEMALLOC because it is not associated
    with a task, and therefore has no task flags to fiddle with - thus the gfp
    to alloc flags mapping ignores the task flags when in interrupt (hard or soft)
    context.

    Allowing softirqs to make use of PF_MEMALLOC therefore requires some trickery.
    We basically borrow the task flags from whatever process happens to be
    preempted by the softirq.

    So we modify the gfp to alloc flags mapping to not exclude task flags in
    softirq context, and modify the softirq code to save, clear and restore the
    PF_MEMALLOC flag.

    The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't
    leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag cannot
    leak back into the preempted process.

    Signed-off-by: Peter Zijlstra
    ---
    include/linux/sched.h | 7 +++++++
    kernel/softirq.c | 3 +++
    mm/page_alloc.c | 7 ++++---
    3 files changed, 14 insertions(+), 3 deletions(-)

    Index: linux-2.6/mm/page_alloc.c
    ===================================================================
    --- linux-2.6.orig/mm/page_alloc.c
    +++ linux-2.6/mm/page_alloc.c
    @@ -1449,9 +1449,10 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
    alloc_flags |= ALLOC_HARDER;

    if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
    - if (!in_interrupt() &&
    - ((p->flags & PF_MEMALLOC) ||
    - unlikely(test_thread_flag(TIF_MEMDIE))))
    + if (!in_irq() && (p->flags & PF_MEMALLOC))
    + alloc_flags |= ALLOC_NO_WATERMARKS;
    + else if (!in_interrupt() &&
    + unlikely(test_thread_flag(TIF_MEMDIE)))
    alloc_flags |= ALLOC_NO_WATERMARKS;
    }

    Index: linux-2.6/kernel/softirq.c
    ===================================================================
    --- linux-2.6.orig/kernel/softirq.c
    +++ linux-2.6/kernel/softirq.c
    @@ -213,6 +213,8 @@ asmlinkage void __do_softirq(void)
    __u32 pending;
    int max_restart = MAX_SOFTIRQ_RESTART;
    int cpu;
    + unsigned long pflags = current->flags;
    + current->flags &= ~PF_MEMALLOC;

    pending = local_softirq_pending();
    account_system_vtime(current);
    @@ -251,6 +253,7 @@ restart:

    account_system_vtime(current);
    _local_bh_enable();
    + tsk_restore_flags(current, pflags, PF_MEMALLOC);
    }

    #ifndef __ARCH_HAS_DO_SOFTIRQ
    Index: linux-2.6/include/linux/sched.h
    ===================================================================
    --- linux-2.6.orig/include/linux/sched.h
    +++ linux-2.6/include/linux/sched.h
    @@ -1533,6 +1533,13 @@ static inline void put_task_struct(struc
    #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
    #define used_math() tsk_used_math(current)

    +static inline void tsk_restore_flags(struct task_struct *p,
    + unsigned long pflags, unsigned long mask)
    +{
    + p->flags &= ~mask;
    + p->flags |= pflags & mask;
    +}
    +
    #ifdef CONFIG_SMP
    extern int set_cpus_allowed_ptr(struct task_struct *p,
    const cpumask_t *new_mask);

    --


  6. [PATCH 13/32] mm: __GFP_MEMALLOC

    __GFP_MEMALLOC will allow the allocation to disregard the watermarks,
    much like PF_MEMALLOC.

    It allows one to pass along the memalloc state in object-related allocation
    flags as opposed to task-related flags, such as sk->sk_allocation.

    Signed-off-by: Peter Zijlstra
    ---
    include/linux/gfp.h | 3 ++-
    mm/page_alloc.c | 4 +++-
    2 files changed, 5 insertions(+), 2 deletions(-)

    Index: linux-2.6/include/linux/gfp.h
    ===================================================================
    --- linux-2.6.orig/include/linux/gfp.h
    +++ linux-2.6/include/linux/gfp.h
    @@ -43,6 +43,7 @@ struct vm_area_struct;
    #define __GFP_REPEAT ((__force gfp_t)0x400u) /* See above */
    #define __GFP_NOFAIL ((__force gfp_t)0x800u) /* See above */
    #define __GFP_NORETRY ((__force gfp_t)0x1000u)/* See above */
    +#define __GFP_MEMALLOC ((__force gfp_t)0x2000u)/* Use emergency reserves */
    #define __GFP_COMP ((__force gfp_t)0x4000u)/* Add compound page metadata */
    #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
    #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
    @@ -88,7 +89,7 @@ struct vm_area_struct;
    /* Control page allocator reclaim behavior */
    #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
    __GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
    - __GFP_NORETRY|__GFP_NOMEMALLOC)
    + __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)

    /* Control allocation constraints */
    #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
    Index: linux-2.6/mm/page_alloc.c
    ===================================================================
    --- linux-2.6.orig/mm/page_alloc.c
    +++ linux-2.6/mm/page_alloc.c
    @@ -1452,7 +1452,9 @@ int gfp_to_alloc_flags(gfp_t gfp_mask)
    alloc_flags |= ALLOC_HARDER;

    if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
    - if (!in_irq() && (p->flags & PF_MEMALLOC))
    + if (gfp_mask & __GFP_MEMALLOC)
    + alloc_flags |= ALLOC_NO_WATERMARKS;
    + else if (!in_irq() && (p->flags & PF_MEMALLOC))
    alloc_flags |= ALLOC_NO_WATERMARKS;
    else if (!in_interrupt() &&
    unlikely(test_thread_flag(TIF_MEMDIE)))

    --


  7. Re: [PATCH 00/32] Swap over NFS - v19

    On Thu, 02 Oct 2008 15:05:04 +0200 Peter Zijlstra wrote:

    > Let's get this ball rolling...


    I don't think we're really able to get any MM balls rolling until we
    get all the split-LRU stuff landed. Is anyone testing it? Is it good?


  8. Re: [PATCH 00/32] Swap over NFS - v19

    On Thu, 2008-10-02 at 12:47 -0700, Andrew Morton wrote:
    > On Thu, 02 Oct 2008 15:05:04 +0200 Peter Zijlstra wrote:
    >
    > > Let's get this ball rolling...

    >
    > I don't think we're really able to get any MM balls rolling until we
    > get all the split-LRU stuff landed. Is anyone testing it? Is it good?


    Andrew:

    Up until the mailing list traffic and patches slowed down, I was testing
    it continuously with a heavy stress load that would bring the system to
    its knees before the splitlru and unevictable changes. When it would
    run for days without error [96 hours was my max run] and no further
    patches came, I concentrated on other things.

    Rik and Kosaki-san have run some performance oriented tests, reported
    here a while back. Maybe they have more info.

    Lee


  9. Re: [PATCH 00/32] Swap over NFS - v19

    On Thursday 02 October 2008 23:05, Peter Zijlstra wrote:
    > Patches are against: v2.6.27-rc5-mm1
    >
    > This release features more comments and (hopefully) better Changelogs.
    > Also the netns stuff got sorted and ipv6 will now build and not oops
    > on boot ;-)
    >
    > The first 4 patches are cleanups and can go in if the respective
    > maintainers agree.
    >
    > The code is lightly tested but seems to work on my default config.
    >
    > Let's get this ball rolling...


    I know it's not too helpful for me to say this, but I am spending
    time looking at this stuff. I have commented on it in the past,
    but I want to get a good handle on the code before I chime in again.

  10. Re: [PATCH 00/32] Swap over NFS - v19

    On Friday 03 October 2008 05:47, Andrew Morton wrote:
    > On Thu, 02 Oct 2008 15:05:04 +0200 Peter Zijlstra

    wrote:
    > > Let's get this ball rolling...

    >
    > I don't think we're really able to get any MM balls rolling until we
    > get all the split-LRU stuff landed. Is anyone testing it? Is it good?


    Peter's patches are very orthogonal to that work and shouldn't
    actually change those kinds of reclaim heuristics at all.

  11. Re: [PATCH 08/32] mm: slb: add knowledge of reserve pages

    Because I'm a dork and forgot to refresh...

    ---
    Index: linux-2.6/mm/slob.c
    ===================================================================
    --- linux-2.6.orig/mm/slob.c
    +++ linux-2.6/mm/slob.c
    @@ -239,7 +239,7 @@ static int slob_last(slob_t *s)

    static void *slob_new_page(gfp_t gfp, int order, int node)
    {
    - void *page;
    + struct page *page;

    #ifdef CONFIG_NUMA
    if (node != -1)
    @@ -318,7 +318,7 @@ static void *slob_alloc(size_t size, gfp
    slob_t *b = NULL;
    unsigned long flags;

    - if (unlikely(slub_reserve)) {
    + if (unlikely(slob_reserve)) {
    if (!(gfp_to_alloc_flags(gfp) & ALLOC_NO_WATERMARKS))
    goto grow;
    }



  12. Re: [PATCH 00/32] Swap over NFS - v19

    Em Thu, 02 Oct 2008 15:05:04 +0200
    Peter Zijlstra escreveu:

    | Patches are against: v2.6.27-rc5-mm1
    |
    | This release features more comments and (hopefully) better Changelogs.
    | Also the netns stuff got sorted and ipv6 will now build and not oops
    | on boot ;-)
    |
    | The first 4 patches are cleanups and can go in if the respective maintainers
    | agree.
    |
    | The code is lightly tested but seems to work on my default config.
    |
    | Let's get this ball rolling...

    What's the best way to test this? Create a swap file on an NFS mount
    point and stress it?

    --
    Luiz Fernando N. Capitulino

  13. Re: [PATCH 00/32] Swap over NFS - v19

    On Thu, 2 Oct 2008 12:47:48 -0700
    Andrew Morton wrote:
    > On Thu, 02 Oct 2008 15:05:04 +0200 Peter Zijlstra wrote:
    >
    > > Let's get this ball rolling...

    >
    > I don't think we're really able to get any MM balls rolling until we
    > get all the split-LRU stuff landed. Is anyone testing it? Is it good?


    I've done some testing on it on my two test systems and have not
    found performance regressions against the mainline VM.

    As for stability, I think we have done enough testing to conclude
    that it is stable by now.

    --
    All rights reversed.

  14. Re: [PATCH 00/32] Swap over NFS - v19

    On Fri, 2008-10-03 at 14:17 -0300, Luiz Fernando N. Capitulino wrote:
    > Em Thu, 02 Oct 2008 15:05:04 +0200
    > Peter Zijlstra escreveu:
    >
    > | Patches are against: v2.6.27-rc5-mm1
    > |
    > | This release features more comments and (hopefully) better Changelogs.
    > | Also the netns stuff got sorted and ipv6 will now build and not oops
    > | on boot ;-)
    > |
    > | The first 4 patches are cleanups and can go in if the respective maintainers
    > | agree.
    > |
    > | The code is lightly tested but seems to work on my default config.
    > |
    > | Let's get this ball rolling...
    >
    > What's the best way to test this? Create a swap file on an NFS mount
    > point and stress it?


    What I do is boot with mem=256M, then swapoff -a;
    swapon /net/host/$path/file.swp;

    the file.swp I created using dd and mkswap on the remote host.

    I then run 2 cyclic loops on anonymous memory sized 96mb, and 2
    cyclic loops on file-backed memory on the same NFS mount
    (eg /net/host/$path/file[12]), also sized 96mb.

    That gives a memory footprint of 4*96=384mb and will thus rely on paging
    quite heavily.

    While this is ongoing you can have a little daemon that listens and
    accepts connections and reads from them.

    On a 3rd machine, start, say, 1000 connections to this daemon that
    continuously write stuff to it.

    Then on your NFS host do something like: /etc/init.d/nfs stop

    go for lunch

    and when you're back do: /etc/init.d/nfs start

    and see if all comes back up again ;-)


  15. Re: [PATCH 00/32] Swap over NFS - v19

    Hi

    > Andrew Morton wrote:
    > > On Thu, 02 Oct 2008 15:05:04 +0200 Peter Zijlstra wrote:
    > >
    > > > Let's get this ball rolling...

    > >
    > > I don't think we're really able to get any MM balls rolling until we
    > > get all the split-LRU stuff landed. Is anyone testing it? Is it good?

    >
    > I've done some testing on it on my two test systems and have not
    > found performance regressions against the mainline VM.
    >
    > As for stability, I think we have done enough testing to conclude
    > that it is stable by now.


    My experience also hasn't found any regression,
    and in my experience the split-lru patches increase performance stability.

    What is performance stability?
    For example, HPC parallel computation uses many processes that communicate
    with each other.
    Then, the system performance is decided by the slowest process.

    So, not only peak and average performance are important, but also
    worst-case performance.

    In particular, split-lru outperforms mainline in mixed anon and file workloads.


    For example, I ran the himeno benchmark.
    (This is one of the most famous HPC benchmarks in Japan; it does
    matrix calculations on large memory (= uses anon only).)

    machine
    -------------
    CPU IA64 x8
    MEM 8G

    benchmark setting
    ----------------
    # of parallel: 4
    use mem: 1.7G x4 (used nearly total mem)


    First:
    results when no other process is running (Unit: MFLOPS)

                        each process             result
                     1    2    3    4      worst   average
    ---------------------------------------------------------
    2.6.27-rc8:     217  213  217  154       154     200
    mmotm 02 Oct:   217  214  217  217       214     216

    OK, these are almost the same.


    Next:
    results when another I/O process is running (Unit: MFLOPS)
    (*) an infinite loop of the dd command was used

                        each process             result
                     1    2    3    4      worst   average
    ---------------------------------------------------------
    2.6.27-rc8:      34  205   69  196        34     126
    mmotm 02 Oct:   162  179  146  178       146     166


    Wow, the worst case shows a significant difference.
    (This result is reproducible.)

    This is because reclaim processing in the mainline VM is too slow;
    a process that falls into direct reclaim has its performance decreased badly.


    This characteristic is useful not only for HPC but also for the desktop,
    because if the X server (or another critical process) falls into direct
    reclaim, it can easily hurt the end-user experience.


    Yup,
    I know many people want other benchmark results too.
    I'll try to measure other benchmarks next week.




  16. Re: [PATCH 00/32] Swap over NFS - v19

    Peter Zijlstra wrote:
    > Patches are against: v2.6.27-rc5-mm1
    >
    > This release features more comments and (hopefully) better Changelogs.
    > Also the netns stuff got sorted and ipv6 will now build


    Except for this one I think ;-)

    net/netfilter/core.c: In function ‘nf_hook_slow’:
    net/netfilter/core.c:191: error: ‘pskb’ undeclared (first use in this
    function)

    > and not oops on boot ;-)


    The culprit is emergency-nf_queue.patch. The following change fixes the
    build error for me.

    Index: linux-2.6.26/net/netfilter/core.c
    ===================================================================
    --- linux-2.6.26.orig/net/netfilter/core.c
    +++ linux-2.6.26/net/netfilter/core.c
    @@ -184,9 +184,12 @@ next_hook:
    ret = 1;
    goto unlock;
    } else if (verdict == NF_DROP) {
    +drop:
    kfree_skb(skb);
    ret = -EPERM;
    } else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
    + if (skb_emergency(skb))
    + goto drop;
    if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
    verdict >> NF_VERDICT_BITS))
    goto next_hook;


    Thanks,

    --
    Suresh Jayaraman

  17. split-lru performance measurement part2

    Hi

    > Yup,
    > I know many people want other benchmark results too.
    > I'll try to measure other benchmarks next week.


    I ran another benchmark today.
    I chose dbench because it is one of the most famous and realistic I/O-like benchmark workloads.


    % dbench client.txt 4000

    mainline: Throughput 13.4231 MB/sec 4000 clients 4000 procs max_latency=1421988.159 ms
    mmotm(*): Throughput 7.0354 MB/sec 4000 clients 4000 procs max_latency=2369213.380 ms

    (*) mmotm 2/Oct + Hugh's recently slub fix


    Wow!
    mmotm is much slower than mainline (about half the performance).

    Therefore, I measured it on a "mainline + split-lru(only)" build.


    mainline + split-lru(only): Throughput 14.4062 MB/sec 4000 clients 4000 procs max_latency=1152231.896 ms


    OK!
    split-lru outperforms mainline from the viewpoint of both throughput and latency.



    However, I don't understand why this regression happened.
    Do you have any suggestions?





  18. Re: split-lru performance measurement part2

    On Tue, 7 Oct 2008 23:26:54 +0900 (JST)
    KOSAKI Motohiro wrote:

    > Hi
    >
    > > Yup,
    > > I know many people want other benchmark results too.
    > > I'll try to measure other benchmarks next week.

    >
    > I ran another benchmark today.
    > I chose dbench because it is one of the most famous and realistic I/O-like benchmark workloads.
    >
    >
    > % dbench client.txt 4000
    >
    > mainline: Throughput 13.4231 MB/sec 4000 clients 4000 procs max_latency=1421988.159 ms
    > mmotm(*): Throughput 7.0354 MB/sec 4000 clients 4000 procs max_latency=2369213.380 ms
    >
    > (*) mmotm 2/Oct + Hugh's recently slub fix
    >
    >
    > Wow!
    > mmotm is much slower than mainline (about half the performance).
    >
    > Therefore, I measured it on a "mainline + split-lru(only)" build.
    >
    >
    > mainline + split-lru(only): Throughput 14.4062 MB/sec 4000 clients 4000 procs max_latency=1152231.896 ms
    >
    >
    > OK!
    > split-lru outperforms mainline from the viewpoint of both throughput and latency.
    >
    >
    >
    > However, I don't understand why this regression happened.


    erk.

    dbench is pretty chaotic and it could be that a good change causes
    dbench to get worse. That's happened plenty of times in the past.


    > Do you have any suggestion?



    One of these:

    vmscan-give-referenced-active-and-unmapped-pages-a-second-trip-around-the-lru.patch
    vm-dont-run-touch_buffer-during-buffercache-lookups.patch

    perhaps?

  19. Re: [PATCH 04/32] net: ipv6: initialize ip6_route sysctl vars in ip6_route_net_init()

    From: Peter Zijlstra
    Date: Thu, 02 Oct 2008 15:05:08 +0200

    > This makes ip6_route_net_init() do all of the route init code.
    > There used to be a race between ip6_route_net_init() and ip6_net_init(),
    > and anyone relying on the combined result was left out in the cold.
    >
    > Signed-off-by: Peter Zijlstra


    Looks good, applied to net-next-2.6, thanks.

  20. Re: [PATCH 03/32] net: ipv6: clean up ip6_route_net_init() error handling

    From: Peter Zijlstra
    Date: Thu, 02 Oct 2008 15:05:07 +0200

    > ip6_route_net_init() error handling looked less than solid, fix 'er up.
    >
    > Signed-off-by: Peter Zijlstra


    Looks good, applied to net-next-2.6
