[RFC v8][PATCH 0/12] Kernel based checkpoint/restart - Kernel


  1. [RFC v8][PATCH 0/12] Kernel based checkpoint/restart

    Basic checkpoint-restart [C/R]: v8 adds support for "external" checkpoint
    and improves documentation. Older announcements below.

    The git tree tracking v8 (branch 'ckpt-v8'), and older versions, is at:
    git://gorgona.ncl.cs.columbia.edu/pub/git/linux-cr-dev.git

    (or for the latest version -
    git://gorgona.ncl.cs.columbia.edu/pub/git/linux-cr.git)

    We'd like to see these make their way into -mm.
    As Dave Hansen put it:

    --
    Why do we want it? It allows containers to be moved between physical
    machines' kernels in the same way that VMWare can move VMs between
    physical machines' hypervisors. There are currently at least two
    out-of-tree implementations of this in the commercial world (IBM's
    Metacluster and Parallels' OpenVZ/Virtuozzo) and several in the academic
    world like Zap.

    Why do we need it in mainline now? Because we already have plenty of
    out-of-tree ones, and want to know what an in-tree one will be like.
    What *I* want right now is the extra review and scrutiny that comes with
    a mainline submission to make sure we're not going in a direction
    contrary to the community.

    This only supports pretty simple apps. But, I trust Ingo when he says:
    >> > > Generally, if something works for simple apps already (in a robust,
    >> > > compatible and supportable way) and users find it "very cool", then
    >> > > support for more complex apps is not far in the future. but if you
    >> > > want to support more complex apps straight away, it takes forever and
    >> > > gets ugly.


    We're *certainly* going to be changing the ABI (which is the format of
    the checkpoint). I'd like to follow the model that we used for
    ext4-dev, which is to make it very clear that this is a development-only
    feature for now. Perhaps we do that by making the interface only
    available through debugfs or something similar for now. Or, reserving
    the syscall numbers but require some runtime switch to be thrown before
    they can be used. I'm open to suggestions here.
    --

    Oren.

    --
    Todo:
    - Add support for x86-64 and improve ABI
    - Refine or change syscall interface
    - Extend to handle (multiple) tasks in a container
    - Handle multiple namespaces in a container (e.g. save the filesystem
    namespaces state with the file descriptors)
    - Security (without CAP_SYS_ADMIN, file restore may fail)

    Changelog:

    [2008-Oct-29] v8:
    - Support "external" checkpoint
    - Include Dave Hansen's 'deny-checkpoint' patch
    - Split docs in Documentation/checkpoint/..., and improve contents

    [2008-Oct-17] v7:
    - Fix save/restore state of FPU
    - Fix argument given to kunmap_atomic() in memory dump/restore

    [2008-Oct-07] v6:
    - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
    - Add assumptions and what's-missing to documentation
    - Misc fixes and cleanups

    [2008-Sep-11] v5:
    - Config is now 'def_bool n' by default
    - Improve memory dump/restore code (following Dave Hansen's comments)
    - Change dump format (and code) to allow chunks of
    instead of one long list of each
    - Fix use of follow_page() to avoid faulting in non-present pages
    - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
    - Remove preempt_disable() when restoring debug registers
    - Rename headers files s/ckpt/checkpoint/
    - Fix misc bugs in files dump/restore
    - Fixes and cleanups on some error paths
    - Fix misc coding style

    [2008-Sep-09] v4:
    - Various fixes and clean-ups
    - Fix calculation of hash table size
    - Fix header structure alignment
    - Use standard list_... for cr_pgarr

    [2008-Aug-29] v3:
    - Various fixes and clean-ups
    - Use standard hlist_... for hash table
    - Better use of standard kmalloc/kfree

    [2008-Aug-20] v2:
    - Added Dump and restore of open files (regular and directories)
    - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
    - Added documentation
    - Improved ABI, 64bit padding for image data
    - Improved locking when saving/restoring memory
    - Added UTS information to header (release, version, machine)
    - Cleanup extraction of filename from a file pointer
    - Refactor to allow easier reviewing
    - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
    - Other cleanup and response to comments for v1

    [2008-Jul-29] v1:
    - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.

    --
    At the containers mini-conference before OLS, the consensus among
    all the stakeholders was that doing checkpoint/restart in the kernel
    as much as possible was the best approach. With this approach, the
    kernel will export a relatively opaque 'blob' of data to userspace
    which can then be handed to the new kernel at restore time.

    This is different than what had been proposed before, which was
    that a userspace application would be responsible for collecting
    all of this data. We were also planning on adding lots of new,
    little kernel interfaces for all of the things that needed
    checkpointing. This unites those into a single, grand interface.

    The 'blob' will contain copies of select portions of kernel
    structures such as vmas and mm_structs. It will also contain
    copies of the actual memory that the process uses. Any changes
    in this blob's format between kernel revisions can be handled by
    an in-userspace conversion program.

    This is a similar approach to virtually all of the commercial
    checkpoint/restart products out there, as well as the research
    project Zap.

    These patches basically serialize internal kernel state and write
    it out to a file descriptor. The checkpoint and restore are done
    with two new system calls: sys_checkpoint and sys_restart.

    In this incarnation, they can only checkpoint and restore a
    single task. The task's address space may consist of only private,
    simple vma's - anonymous or file-mapped. The open files may consist
    of only simple files and directories.
    --

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [RFC v8][PATCH 10/12] Restore open file descriptors

    Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
    and lookup objref in the hash table; if not found (first occurrence), read
    in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
    Otherwise attach the file pointer from the hash as an FD.

    This patch only handles basic FDs - regular files, directories and also
    symbolic links.

    Changelog[v6]:
    - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

    Signed-off-by: Oren Laadan
    Acked-by: Serge Hallyn
    Signed-off-by: Dave Hansen
    ---
    checkpoint/Makefile | 2 +-
    checkpoint/restart.c | 4 +
    checkpoint/rstr_file.c | 246 ++++++++++++++++++++++++++++++++++++++++++++
    include/linux/checkpoint.h | 1 +
    4 files changed, 252 insertions(+), 1 deletions(-)
    create mode 100644 checkpoint/rstr_file.c

    diff --git a/checkpoint/Makefile b/checkpoint/Makefile
    index 7496695..88bbc10 100644
    --- a/checkpoint/Makefile
    +++ b/checkpoint/Makefile
    @@ -3,4 +3,4 @@
    #

    obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
    - ckpt_mem.o rstr_mem.o ckpt_file.o
    + ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
    diff --git a/checkpoint/restart.c b/checkpoint/restart.c
    index f4d87ba..9ff9f66 100644
    --- a/checkpoint/restart.c
    +++ b/checkpoint/restart.c
    @@ -219,6 +219,10 @@ static int cr_read_task(struct cr_ctx *ctx)
    cr_debug("memory: ret %d\n", ret);
    if (ret < 0)
    goto out;
    + ret = cr_read_files(ctx);
    + cr_debug("files: ret %d\n", ret);
    + if (ret < 0)
    + goto out;
    ret = cr_read_thread(ctx);
    cr_debug("thread: ret %d\n", ret);
    if (ret < 0)
    diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
    new file mode 100644
    index 0000000..08bb049
    --- /dev/null
    +++ b/checkpoint/rstr_file.c
    @@ -0,0 +1,246 @@
    +/*
    + * Checkpoint file descriptors
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +#include "checkpoint_file.h"
    +
    +static int cr_close_all_fds(struct files_struct *files)
    +{
    + int *fdtable;
    + int nfds;
    +
    + nfds = cr_scan_fds(files, &fdtable);
    + if (nfds < 0)
    + return nfds;
    + while (nfds--)
    + sys_close(fdtable[nfds]);
    + kfree(fdtable);
    + return 0;
    +}
    +
    +/**
    + * cr_attach_file - attach a lonely file ptr to a file descriptor
    + * @file: lonely file pointer
    + */
    +static int cr_attach_file(struct file *file)
    +{
    + int fd = get_unused_fd_flags(0);
    +
    + if (fd >= 0) {
    + fsnotify_open(file->f_path.dentry);
    + fd_install(fd, file);
    + }
    + return fd;
    +}
    +
    +/**
    + * cr_attach_get_file - attach (and get) lonely file ptr to a file descriptor
    + * @file: lonely file pointer
    + */
    +static int cr_attach_get_file(struct file *file)
    +{
    + int fd = get_unused_fd_flags(0);
    +
    + if (fd >= 0) {
    + fsnotify_open(file->f_path.dentry);
    + fd_install(fd, file);
    + get_file(file);
    + }
    + return fd;
    +}
    +
    +#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
    +
    +/* cr_read_fd_data - restore the state of a given file pointer */
    +static int
    +cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int parent)
    +{
    + struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct file *file;
    + int rparent, ret;
    + int fd = 0; /* pacify gcc warning */
    +
    + rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
    + cr_debug("rparent %d parent %d flags %#x mode %#x how %d\n",
    + rparent, parent, hh->f_flags, hh->f_mode, hh->fd_type);
    + if (rparent < 0) {
    + ret = rparent;
    + goto out;
    + }
    +
    + ret = -EINVAL;
    +
    + if (rparent != parent)
    + goto out;
    +
    + /* FIX: more sanity checks on f_flags, f_mode etc */
    +
    + switch (hh->fd_type) {
    + case CR_FD_FILE:
    + case CR_FD_DIR:
    + case CR_FD_LINK:
    + file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
    + break;
    + default:
    + goto out;
    + }
    +
    + if (IS_ERR(file)) {
    + ret = PTR_ERR(file);
    + goto out;
    + }
    +
    + /* FIX: need to restore uid, gid, owner etc */
    +
    + fd = cr_attach_file(file); /* no need to cleanup 'file' below */
    + if (fd < 0) {
    + filp_close(file, NULL);
    + ret = fd;
    + goto out;
    + }
    +
    + /* register new tuple in hash table */
    + ret = cr_obj_add_ref(ctx, (void *) file, parent, CR_OBJ_FILE, 0);
    + if (ret < 0)
    + goto out;
    + ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
    + if (ret < 0)
    + goto out;
    + ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
    + if (ret == -ESPIPE) /* ignore error on non-seekable files */
    + ret = 0;
    +
    + ret = 0;
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret < 0 ? ret : fd;
    +}
    +
    +/**
    + * cr_read_fd_ent - restore the state of a given file descriptor
    + * @ctx: checkpoint context
    + * @files: files_struct pointer
    + * @parent: parent objref
    + *
    + * Restores the state of a file descriptor; looks up the objref (in the
    + * header) in the hash table, and if found picks the matching file and
    + * uses it; otherwise calls cr_read_fd_data to restore the file too.
    + */
    +static int
    +cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int parent)
    +{
    + struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct file *file;
    + int newfd, rparent, ret;
    +
    + rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
    + cr_debug("rparent %d parent %d ref %d fd %d c.o.e %d\n",
    + rparent, parent, hh->objref, hh->fd, hh->close_on_exec);
    + if (rparent < 0) {
    + ret = rparent;
    + goto out;
    + }
    +
    + ret = -EINVAL;
    +
    + if (rparent != parent)
    + goto out;
    + if (hh->objref <= 0)
    + goto out;
    +
    + file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
    + if (IS_ERR(file)) {
    + ret = PTR_ERR(file);
    + goto out;
    + }
    +
    + if (file) {
    + /* reuse file pointer found in the hash table */
    + newfd = cr_attach_get_file(file);
    + } else {
    + /* create new file pointer (and register in hash table) */
    + newfd = cr_read_fd_data(ctx, files, hh->objref);
    + }
    +
    + if (newfd < 0) {
    + ret = newfd;
    + goto out;
    + }
    +
    + cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
    +
    + /* if newfd isn't desired fd then reposition it */
    + if (newfd != hh->fd) {
    + ret = sys_dup2(newfd, hh->fd);
    + if (ret < 0)
    + goto out;
    + sys_close(newfd);
    + }
    +
    + if (hh->close_on_exec)
    + set_close_on_exec(hh->fd, 1);
    +
    + ret = 0;
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    +
    +int cr_read_files(struct cr_ctx *ctx)
    +{
    + struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct files_struct *files = current->files;
    + int i, parent, ret;
    +
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
    + if (parent < 0) {
    + ret = parent;
    + goto out;
    + }
    +
    + ret = -EINVAL;
    +#if 0 /* activate when containers are used */
    + if (parent != task_pid_vnr(current))
    + goto out;
    +#endif
    + cr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
    + if (hh->objref < 0 || hh->nfds < 0)
    + goto out;
    +
    + if (hh->nfds > sysctl_nr_open) {
    + ret = -EMFILE;
    + goto out;
    + }
    +
    + /* point of no return -- close all file descriptors */
    + ret = cr_close_all_fds(files);
    + if (ret < 0)
    + goto out;
    +
    + for (i = 0; i < hh->nfds; i++) {
    + ret = cr_read_fd_ent(ctx, files, hh->objref);
    + if (ret < 0)
    + goto out;
    + }
    +
    + ret = 0;
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
    index 0856b3b..6c1e87f 100644
    --- a/include/linux/checkpoint.h
    +++ b/include/linux/checkpoint.h
    @@ -85,6 +85,7 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);

    extern int do_restart(struct cr_ctx *ctx);
    extern int cr_read_mm(struct cr_ctx *ctx);
    +extern int cr_read_files(struct cr_ctx *ctx);

    #define cr_debug(fmt, args...) \
    pr_debug("[CR:%s] " fmt, __func__, ## args)
    --
    1.5.4.3


  3. [RFC v8][PATCH 03/12] Make file_pos_read/write() public

    These two helpers will be needed once we use vfs_read() and vfs_write()
    in the next patch.

    Signed-off-by: Oren Laadan
    ---
    fs/read_write.c | 10 ----------
    include/linux/fs.h | 10 ++++++++++
    2 files changed, 10 insertions(+), 10 deletions(-)

    diff --git a/fs/read_write.c b/fs/read_write.c
    index 9ba495d..5d5c192 100644
    --- a/fs/read_write.c
    +++ b/fs/read_write.c
    @@ -324,16 +324,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_

    EXPORT_SYMBOL(vfs_write);

    -static inline loff_t file_pos_read(struct file *file)
    -{
    - return file->f_pos;
    -}
    -
    -static inline void file_pos_write(struct file *file, loff_t pos)
    -{
    - file->f_pos = pos;
    -}
    -
    asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)
    {
    struct file *file;
    diff --git a/include/linux/fs.h b/include/linux/fs.h
    index 580b513..5537435 100644
    --- a/include/linux/fs.h
    +++ b/include/linux/fs.h
    @@ -1296,6 +1296,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
    struct iovec *fast_pointer,
    struct iovec **ret_pointer);

    +static inline loff_t file_pos_read(struct file *file)
    +{
    + return file->f_pos;
    +}
    +
    +static inline void file_pos_write(struct file *file, loff_t pos)
    +{
    + file->f_pos = pos;
    +}
    +
    extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
    extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
    extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
    --
    1.5.4.3


  4. [RFC v8][PATCH 05/12] x86 support for checkpoint/restart

    (Following Dave Hansen's refactoring of the original post)

    Add logic to save and restore architecture specific state, including
    thread-specific state, CPU registers and FPU state.

    Currently only x86-32 is supported. Compiling on x86-64 will trigger
    an explicit error.

    Changelog[v7]:
    - Fix save/restore state of FPU

    Changelog[v5]:
    - Remove preempt_disable() when restoring debug registers

    Changelog[v4]:
    - Fix header structure alignment

    Changelog[v2]:
    - Pad header structures to 64 bits to ensure compatibility

    Signed-off-by: Oren Laadan
    Acked-by: Serge Hallyn
    Signed-off-by: Dave Hansen
    ---
    arch/x86/mm/Makefile | 2 +
    arch/x86/mm/checkpoint.c | 200 ++++++++++++++++++++++++++++++++++++++
    arch/x86/mm/restart.c | 194 ++++++++++++++++++++++++++++++++++++
    checkpoint/checkpoint.c | 13 +++-
    checkpoint/checkpoint_arch.h | 7 ++
    checkpoint/restart.c | 13 +++-
    include/asm-x86/checkpoint_hdr.h | 72 ++++++++++++++
    include/linux/checkpoint_hdr.h | 1 +
    8 files changed, 500 insertions(+), 2 deletions(-)
    create mode 100644 arch/x86/mm/checkpoint.c
    create mode 100644 arch/x86/mm/restart.c
    create mode 100644 checkpoint/checkpoint_arch.h
    create mode 100644 include/asm-x86/checkpoint_hdr.h

    diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
    index dfb932d..58fe072 100644
    --- a/arch/x86/mm/Makefile
    +++ b/arch/x86/mm/Makefile
    @@ -22,3 +22,5 @@ endif
    obj-$(CONFIG_ACPI_NUMA) += srat_$(BITS).o

    obj-$(CONFIG_MEMTEST) += memtest.o
    +
    +obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
    diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
    new file mode 100644
    index 0000000..bfa7180
    --- /dev/null
    +++ b/arch/x86/mm/checkpoint.c
    @@ -0,0 +1,200 @@
    +/*
    + * Checkpoint/restart - architecture specific support for x86
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +#include
    +
    +#include
    +#include
    +
    +/* dump the thread_struct of a given task */
    +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
    +{
    + struct cr_hdr h;
    + struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct thread_struct *thread;
    + struct desc_struct *desc;
    + int ntls = 0;
    + int n, ret;
    +
    + h.type = CR_HDR_THREAD;
    + h.len = sizeof(*hh);
    + h.parent = task_pid_vnr(t);
    +
    + thread = &t->thread;
    +
    + /* calculate no. of TLS entries that follow */
    + desc = thread->tls_array;
    + for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
    + if (desc->a || desc->b)
    + ntls++;
    + }
    +
    + hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
    + hh->sizeof_tls_array = sizeof(thread->tls_array);
    + hh->ntls = ntls;
    +
    + ret = cr_write_obj(ctx, &h, hh);
    + cr_hbuf_put(ctx, sizeof(*hh));
    + if (ret < 0)
    + return ret;
    +
    + /* for simplicity dump the entire array, cherry-pick upon restart */
    + ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
    +
    + cr_debug("ntls %d\n", ntls);
    +
    + /* IGNORE RESTART BLOCKS FOR NOW ... */
    +
    + return ret;
    +}
    +
    +#ifdef CONFIG_X86_64
    +
    +#error "CONFIG_X86_64 unsupported yet."
    +
    +#else /* !CONFIG_X86_64 */
    +
    +void cr_write_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
    +{
    + struct thread_struct *thread = &t->thread;
    + struct pt_regs *regs = task_pt_regs(t);
    +
    + hh->bp = regs->bp;
    + hh->bx = regs->bx;
    + hh->ax = regs->ax;
    + hh->cx = regs->cx;
    + hh->dx = regs->dx;
    + hh->si = regs->si;
    + hh->di = regs->di;
    + hh->orig_ax = regs->orig_ax;
    + hh->ip = regs->ip;
    + hh->cs = regs->cs;
    + hh->flags = regs->flags;
    + hh->sp = regs->sp;
    + hh->ss = regs->ss;
    +
    + hh->ds = regs->ds;
    + hh->es = regs->es;
    +
    + /*
    + * for checkpoint in process context (from within a container)
    + * the GS and FS registers should be saved from the hardware;
    + * otherwise they are already saved on the thread structure
    + */
    + if (t == current) {
    + savesegment(gs, hh->gs);
    + savesegment(fs, hh->fs);
    + } else {
    + hh->gs = thread->gs;
    + hh->fs = thread->fs;
    + }
    +
    + /*
    + * for checkpoint in process context (from within a container),
    + * the actual syscall is taking place at this very moment; so
    + * we (optimistically) substitute the future return value (0) of
    + * this syscall into ax, so that upon restart it will
    + * succeed (or it will endlessly retry checkpoint...)
    + */
    + if (t == current) {
    + BUG_ON(hh->orig_ax < 0);
    + hh->ax = 0;
    + }
    +}
    +
    +void cr_write_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
    +{
    + struct thread_struct *thread = &t->thread;
    +
    + /* debug regs */
    +
    + preempt_disable();
    +
    + /*
    + * for checkpoint in process context (from within a container),
    + * get the actual registers; otherwise get the saved values.
    + */
    +
    + if (t == current) {
    + get_debugreg(hh->debugreg0, 0);
    + get_debugreg(hh->debugreg1, 1);
    + get_debugreg(hh->debugreg2, 2);
    + get_debugreg(hh->debugreg3, 3);
    + get_debugreg(hh->debugreg6, 6);
    + get_debugreg(hh->debugreg7, 7);
    + } else {
    + hh->debugreg0 = thread->debugreg0;
    + hh->debugreg1 = thread->debugreg1;
    + hh->debugreg2 = thread->debugreg2;
    + hh->debugreg3 = thread->debugreg3;
    + hh->debugreg6 = thread->debugreg6;
    + hh->debugreg7 = thread->debugreg7;
    + }
    +
    + hh->debugreg4 = 0;
    + hh->debugreg5 = 0;
    +
    + hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
    +
    + preempt_enable();
    +}
    +
    +void cr_write_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
    +{
    + struct thread_struct *thread = &t->thread;
    + struct thread_info *thread_info = task_thread_info(t);
    +
    + /* i387 + MMX + SSE logic */
    +
    + preempt_disable(); /* needed if (t == current) */
    +
    + hh->used_math = tsk_used_math(t) ? 1 : 0;
    + if (hh->used_math) {
    + /*
    + * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
    + * has been cleared when task was context-switched out...
    + * except if we are in process context, in which case we do
    + */
    + if (t == current) {
    + if (thread_info->status & TS_USEDFPU)
    + unlazy_fpu(current);
    + }
    +
    + hh->has_fxsr = cpu_has_fxsr;
    + memcpy(&hh->xstate, thread->xstate, sizeof(*thread->xstate));
    + }
    +
    + preempt_enable(); /* needed if (t == current) */
    +}
    +
    +#endif /* CONFIG_X86_64 */
    +
    +/* dump the cpu state and registers of a given task */
    +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
    +{
    + struct cr_hdr h;
    + struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + int ret;
    +
    + h.type = CR_HDR_CPU;
    + h.len = sizeof(*hh);
    + h.parent = task_pid_vnr(t);
    +
    + cr_write_cpu_regs(hh, t);
    + cr_write_cpu_debug(hh, t);
    + cr_write_cpu_fpu(hh, t);
    +
    + cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
    +
    + ret = cr_write_obj(ctx, &h, hh);
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
    new file mode 100644
    index 0000000..2bff5eb
    --- /dev/null
    +++ b/arch/x86/mm/restart.c
    @@ -0,0 +1,194 @@
    +/*
    + * Checkpoint/restart - architecture specific support for x86
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +#include
    +
    +#include
    +#include
    +
    +/* read the thread_struct into the current task */
    +int cr_read_thread(struct cr_ctx *ctx)
    +{
    + struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct task_struct *t = current;
    + struct thread_struct *thread = &t->thread;
    + int parent, ret;
    +
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
    + if (parent < 0) {
    + ret = parent;
    + goto out;
    + }
    +
    + ret = -EINVAL;
    +
    +#if 0 /* activate when containers are used */
    + if (parent != task_pid_vnr(t))
    + goto out;
    +#endif
    + cr_debug("ntls %d\n", hh->ntls);
    +
    + if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
    + hh->sizeof_tls_array != sizeof(thread->tls_array) ||
    + hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
    + goto out;
    +
    + if (hh->ntls > 0) {
    + struct desc_struct *desc;
    + int size, cpu;
    +
    + /*
    + * restore TLS by hand: why convert to struct user_desc if
    + * sys_set_thread_area() will convert it back ?
    + */
    +
    + size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
    + desc = kmalloc(size, GFP_KERNEL);
    + if (!desc)
    + return -ENOMEM;
    +
    + ret = cr_kread(ctx, desc, size);
    + if (ret >= 0) {
    + /*
    + * FIX: add sanity checks (e.g. that values make
    + * sense, that we don't overwrite old values, etc.)
    + */
    + cpu = get_cpu();
    + memcpy(thread->tls_array, desc, size);
    + load_TLS(thread, cpu);
    + put_cpu();
    + }
    + kfree(desc);
    + }
    +
    + ret = 0;
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    +
    +#ifdef CONFIG_X86_64
    +
    +#error "CONFIG_X86_64 unsupported yet."
    +
    +#else /* !CONFIG_X86_64 */
    +
    +int cr_read_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
    +{
    + struct thread_struct *thread = &t->thread;
    + struct pt_regs *regs = task_pt_regs(t);
    +
    + regs->bx = hh->bx;
    + regs->cx = hh->cx;
    + regs->dx = hh->dx;
    + regs->si = hh->si;
    + regs->di = hh->di;
    + regs->bp = hh->bp;
    + regs->ax = hh->ax;
    + regs->ds = hh->ds;
    + regs->es = hh->es;
    + regs->orig_ax = hh->orig_ax;
    + regs->ip = hh->ip;
    + regs->cs = hh->cs;
    + regs->flags = hh->flags;
    + regs->sp = hh->sp;
    + regs->ss = hh->ss;
    +
    + thread->gs = hh->gs;
    + thread->fs = hh->fs;
    + loadsegment(gs, hh->gs);
    + loadsegment(fs, hh->fs);
    +
    + return 0;
    +}
    +
    +int cr_read_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
    +{
    + /* debug regs */
    +
    + if (hh->uses_debug) {
    + set_debugreg(hh->debugreg0, 0);
    + set_debugreg(hh->debugreg1, 1);
    + /* ignore 4, 5 */
    + set_debugreg(hh->debugreg2, 2);
    + set_debugreg(hh->debugreg3, 3);
    + set_debugreg(hh->debugreg6, 6);
    + set_debugreg(hh->debugreg7, 7);
    + }
    +
    + return 0;
    +}
    +
    +int cr_read_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
    +{
    + struct thread_struct *thread = &t->thread;
    + int ret;
    +
    + /* i387 + MMX + SSE */
    +
    + preempt_disable();
    +
    + __clear_fpu(t); /* in case we used FPU in user mode */
    +
    + if (!hh->used_math)
    + clear_used_math();
    + else {
    + if (hh->has_fxsr != cpu_has_fxsr) {
    + force_sig(SIGFPE, t);
    + return -EINVAL;
    + }
    + /* init_fpu() also calls set_used_math() */
    + ret = init_fpu(current);
    + if (ret < 0)
    + return ret;
    + memcpy(thread->xstate, &hh->xstate, sizeof(*thread->xstate));
    + }
    +
    + preempt_enable();
    + return 0;
    +}
    +
    +#endif /* CONFIG_X86_64 */
    +
    +/* read the cpu state and registers for the current task */
    +int cr_read_cpu(struct cr_ctx *ctx)
    +{
    + struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct task_struct *t = current;
    + int parent, ret;
    +
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
    + if (parent < 0) {
    + ret = parent;
    + goto out;
    + }
    +
    + ret = -EINVAL;
    +
    +#if 0 /* activate when containers are used */
    + if (parent != task_pid_vnr(t))
    + goto out;
    +#endif
    + /* FIX: sanity check for sensitive registers (eg. eflags) */
    +
    + ret = cr_read_cpu_regs(hh, t);
    + if (ret < 0)
    + goto out;
    + ret = cr_read_cpu_debug(hh, t);
    + if (ret < 0)
    + goto out;
    + ret = cr_read_cpu_fpu(hh, t);
    +
    + cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
    index 317b0e8..ba18e44 100644
    --- a/checkpoint/checkpoint.c
    +++ b/checkpoint/checkpoint.c
    @@ -20,6 +20,8 @@
    #include
    #include

    +#include "checkpoint_arch.h"
    +
    /**
    * cr_write_obj - write a record described by a cr_hdr
    * @ctx: checkpoint context
    @@ -145,8 +147,17 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
    }

    ret = cr_write_task_struct(ctx, t);
    - cr_debug("ret %d\n", ret);
    + cr_debug("task_struct: ret %d\n", ret);
    + if (ret < 0)
    + goto out;
    + ret = cr_write_thread(ctx, t);
    + cr_debug("thread: ret %d\n", ret);
    + if (ret < 0)
    + goto out;
    + ret = cr_write_cpu(ctx, t);
    + cr_debug("cpu: ret %d\n", ret);

    + out:
    return ret;
    }

    diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
    new file mode 100644
    index 0000000..bf2d21e
    --- /dev/null
    +++ b/checkpoint/checkpoint_arch.h
    @@ -0,0 +1,7 @@
    +#include
    +
    +extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
    +extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
    +
    +extern int cr_read_thread(struct cr_ctx *ctx);
    +extern int cr_read_cpu(struct cr_ctx *ctx);
    diff --git a/checkpoint/restart.c b/checkpoint/restart.c
    index 69befa7..766e381 100644
    --- a/checkpoint/restart.c
    +++ b/checkpoint/restart.c
    @@ -15,6 +15,8 @@
    #include
    #include

    +#include "checkpoint_arch.h"
    +
    /**
    * cr_read_obj - read a whole record (cr_hdr followed by payload)
    * @ctx: checkpoint context
    @@ -172,8 +174,17 @@ static int cr_read_task(struct cr_ctx *ctx)
    int ret;

    ret = cr_read_task_struct(ctx);
    - cr_debug("ret %d\n", ret);
    + cr_debug("task_struct: ret %d\n", ret);
    + if (ret < 0)
    + goto out;
    + ret = cr_read_thread(ctx);
    + cr_debug("thread: ret %d\n", ret);
    + if (ret < 0)
    + goto out;
    + ret = cr_read_cpu(ctx);
    + cr_debug("cpu: ret %d\n", ret);

    + out:
    return ret;
    }

    diff --git a/include/asm-x86/checkpoint_hdr.h b/include/asm-x86/checkpoint_hdr.h
    new file mode 100644
    index 0000000..44a903c
    --- /dev/null
    +++ b/include/asm-x86/checkpoint_hdr.h
    @@ -0,0 +1,72 @@
    +#ifndef __ASM_X86_CKPT_HDR_H
    +#define __ASM_X86_CKPT_HDR_H
    +/*
    + * Checkpoint/restart - architecture specific headers x86
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +
    +struct cr_hdr_thread {
    + /* NEED: restart blocks */
    +
    + __s16 gdt_entry_tls_entries;
    + __s16 sizeof_tls_array;
    + __s16 ntls; /* number of TLS entries to follow */
    +} __attribute__((aligned(8)));
    +
    +struct cr_hdr_cpu {
    + /* see struct pt_regs (x86-64) */
    + __u64 r15;
    + __u64 r14;
    + __u64 r13;
    + __u64 r12;
    + __u64 bp;
    + __u64 bx;
    + __u64 r11;
    + __u64 r10;
    + __u64 r9;
    + __u64 r8;
    + __u64 ax;
    + __u64 cx;
    + __u64 dx;
    + __u64 si;
    + __u64 di;
    + __u64 orig_ax;
    + __u64 ip;
    + __u64 cs;
    + __u64 flags;
    + __u64 sp;
    + __u64 ss;
    +
    + /* segment registers */
    + __u64 ds;
    + __u64 es;
    + __u64 fs;
    + __u64 gs;
    +
    + /* debug registers */
    + __u64 debugreg0;
    + __u64 debugreg1;
    + __u64 debugreg2;
    + __u64 debugreg3;
    + __u64 debugreg4;
    + __u64 debugreg5;
    + __u64 debugreg6;
    + __u64 debugreg7;
    +
    + __u16 uses_debug;
    + __u16 used_math;
    + __u16 has_fxsr;
    + __u16 _padding;
    +
    + union thread_xstate xstate; /* i387 */
    +
    +} __attribute__((aligned(8)));
    +
    +#endif /* __ASM_X86_CKPT_HDR__H */
    diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
    index 79e4df2..03ec72e 100644
    --- a/include/linux/checkpoint_hdr.h
    +++ b/include/linux/checkpoint_hdr.h
    @@ -12,6 +12,7 @@

    #include
    #include
    +#include

    /*
    * To maintain compatibility between 32-bit and 64-bit architecture flavors,
    --
    1.5.4.3

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. [RFC v8][PATCH 07/12] Restore memory address space

    Restoring the memory address space begins with nuking the existing one
    of the current process, and then reading the VMA state and contents.
    Call do_mmap_pgoff() for each VMA and then read in the data.
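
To make the "map each VMA, then read the data" step concrete, here is a minimal user-space sketch of the flag translation that has to happen before the mapping call. It only illustrates the idea behind cr_calc_map_prot_bits() and cr_calc_map_flags_bits() in this patch. The SK_VM_* constants and the sk_* function names are hypothetical stand-ins, not the kernel's definitions, and only a subset of the bits handled in the patch is covered.

```c
#include <assert.h>
#include <sys/mman.h>

/* Illustrative stand-ins for the kernel's VM_* bits (assumptions, not kernel headers). */
#define SK_VM_READ      0x0001UL
#define SK_VM_WRITE     0x0002UL
#define SK_VM_EXEC      0x0004UL
#define SK_VM_MAYSHARE  0x0080UL

/* Translate saved VMA permission bits into an mmap() prot argument. */
unsigned long sk_prot_bits(unsigned long saved_vm_flags)
{
    unsigned long prot = PROT_NONE;

    if (saved_vm_flags & SK_VM_READ)
        prot |= PROT_READ;
    if (saved_vm_flags & SK_VM_WRITE)
        prot |= PROT_WRITE;
    if (saved_vm_flags & SK_VM_EXEC)
        prot |= PROT_EXEC;
    return prot;
}

/* Translate saved VMA bits into mmap() flags; MAP_FIXED pins the saved address. */
unsigned long sk_map_flags(unsigned long saved_vm_flags)
{
    unsigned long flags = MAP_FIXED;

    if (saved_vm_flags & SK_VM_MAYSHARE)
        flags |= MAP_SHARED;
    else
        flags |= MAP_PRIVATE;
    return flags;
}
```

MAP_FIXED is what lets the restarted task get each region back at the address recorded in the image, which is also why the old address space has to be torn down first.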

    Changelog[v7]:
    - Fix argument given to kunmap_atomic() in memory dump/restore

    Changelog[v6]:
    - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)

    Changelog[v5]:
    - Improve memory restore code (following Dave Hansen's comments)
    - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
    - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
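
The chunked dump format mentioned in the v5 changelog can be sketched in user space as follows. This is a simplified model under stated assumptions, not the patch's code: each chunk is a page count, then that many addresses, then that many page bodies, and a count of zero ends the stream. The sk_* names, the 16-byte "page", the chunk cap, and the use of a page index in place of a real virtual address are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE 16   /* tiny "page" so the demo stays small */
#define SK_MAX_CHUNK 8    /* hypothetical cap on addresses per chunk */

struct sk_stream {
    const unsigned char *buf;
    size_t len, pos;
};

unsigned char sk_demo_mem[4 * SK_PAGE_SIZE];  /* stands in for the task's memory */

static int sk_read(struct sk_stream *s, void *dst, size_t n)
{
    if (n > s->len - s->pos)
        return -1;                  /* truncated image */
    memcpy(dst, s->buf + s->pos, n);
    s->pos += n;
    return 0;
}

/* Read chunks until the zero-count terminator; returns pages restored or -1. */
long sk_read_chunks(struct sk_stream *s, unsigned char *mem, size_t mem_pages)
{
    uint64_t nr, vaddrs[SK_MAX_CHUNK];
    long total = 0;
    size_t i;

    for (;;) {
        if (sk_read(s, &nr, sizeof(nr)) < 0)
            return -1;
        if (nr == 0)
            return total;           /* terminator chunk */
        if (nr > SK_MAX_CHUNK)
            return -1;
        /* first the list of addresses ... */
        if (sk_read(s, vaddrs, (size_t)nr * sizeof(vaddrs[0])) < 0)
            return -1;
        /* ... then the page bodies, in the same order */
        for (i = 0; i < nr; i++) {
            if (vaddrs[i] >= mem_pages)
                return -1;
            if (sk_read(s, mem + vaddrs[i] * SK_PAGE_SIZE, SK_PAGE_SIZE) < 0)
                return -1;
        }
        total += (long)nr;
    }
}

static size_t sk_put_u64(unsigned char *img, size_t off, uint64_t v)
{
    memcpy(img + off, &v, sizeof(v));
    return off + sizeof(v);
}

/* Build one chunk holding pages 3 and 1 plus the terminator, then parse it. */
long sk_chunk_demo(void)
{
    unsigned char img[128];
    struct sk_stream s;
    size_t off = 0;

    off = sk_put_u64(img, off, 2);          /* two pages follow */
    off = sk_put_u64(img, off, 3);          /* address of first page */
    off = sk_put_u64(img, off, 1);          /* address of second page */
    memset(img + off, 0xAA, SK_PAGE_SIZE);  /* body of page 3 */
    off += SK_PAGE_SIZE;
    memset(img + off, 0xBB, SK_PAGE_SIZE);  /* body of page 1 */
    off += SK_PAGE_SIZE;
    off = sk_put_u64(img, off, 0);          /* terminator */

    memset(sk_demo_mem, 0, sizeof(sk_demo_mem));
    s.buf = img;
    s.len = off;
    s.pos = 0;
    return sk_read_chunks(&s, sk_demo_mem, 4);
}
```

Chunking keeps the scratch space for addresses bounded, since the reader never needs to hold more than one chunk's worth of addresses at a time.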

    Changelog[v4]:
    - Use standard list_... for cr_pgarr


    Signed-off-by: Oren Laadan
    Acked-by: Serge Hallyn
    Signed-off-by: Dave Hansen
    ---
    arch/x86/mm/restart.c | 64 ++++++-
    checkpoint/Makefile | 2 +-
    checkpoint/checkpoint_arch.h | 2 +
    checkpoint/checkpoint_mem.h | 5 +
    checkpoint/restart.c | 42 ++++
    checkpoint/rstr_mem.c | 384 ++++++++++++++++++++++++++++++++++++++
    include/asm-x86/checkpoint_hdr.h | 4 +
    include/linux/checkpoint.h | 3 +
    8 files changed, 503 insertions(+), 3 deletions(-)
    create mode 100644 checkpoint/rstr_mem.c

    diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
    index bc2a502..aeae29f 100644
    --- a/arch/x86/mm/restart.c
    +++ b/arch/x86/mm/restart.c
    @@ -53,8 +53,10 @@ int cr_read_thread(struct cr_ctx *ctx)

    size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
    desc = kmalloc(size, GFP_KERNEL);
    - if (!desc)
    - return -ENOMEM;
    + if (!desc) {
    + ret = -ENOMEM;
    + goto out;
    + }

    ret = cr_kread(ctx, desc, size);
    if (ret >= 0) {
    @@ -193,3 +195,61 @@ int cr_read_cpu(struct cr_ctx *ctx)
    cr_hbuf_put(ctx, sizeof(*hh));
    return ret;
    }
    +
    +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
    +{
    + struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + int n, rparent, ret = -EINVAL;
    +
    + rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
    + cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
    + if (rparent < 0) {
    + ret = rparent;
    + goto out;
    + }
    + if (rparent != parent)
    + goto out;
    +
    + if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
    + goto out;
    +
    + /*
    + * to utilize the syscall modify_ldt() we first convert the data
    + * in the checkpoint image from 'struct desc_struct' to 'struct
    + * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
    + */
    +
    + for (n = 0; n < hh->nldt; n++) {
    + struct user_desc info;
    + struct desc_struct desc;
    + mm_segment_t old_fs;
    +
    + ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
    + if (ret < 0)
    + goto out;
    +
    + info.entry_number = n;
    + info.base_addr = desc.base0 | (desc.base1 << 16);
    + info.limit = desc.limit0;
    + info.seg_32bit = desc.d;
    + info.contents = desc.type >> 2;
    + info.read_exec_only = (desc.type >> 1) ^ 1;
    + info.limit_in_pages = desc.g;
    + info.seg_not_present = desc.p ^ 1;
    + info.useable = desc.avl;
    +
    + old_fs = get_fs();
    + set_fs(get_ds());
    + ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
    + sizeof(info));
    + set_fs(old_fs);
    +
    + if (ret < 0)
    + goto out;
    + }
    +
    + ret = 0;
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    diff --git a/checkpoint/Makefile b/checkpoint/Makefile
    index 3a0df6d..ac35033 100644
    --- a/checkpoint/Makefile
    +++ b/checkpoint/Makefile
    @@ -3,4 +3,4 @@
    #

    obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
    - ckpt_mem.o
    + ckpt_mem.o rstr_mem.o
    diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
    index 7da4ad0..018a72e 100644
    --- a/checkpoint/checkpoint_arch.h
    +++ b/checkpoint/checkpoint_arch.h
    @@ -7,3 +7,5 @@ extern int cr_write_mm_context(struct cr_ctx *ctx,

    extern int cr_read_thread(struct cr_ctx *ctx);
    extern int cr_read_cpu(struct cr_ctx *ctx);
    +extern int cr_read_mm_context(struct cr_ctx *ctx,
    + struct mm_struct *mm, int parent);
    diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
    index 85546f4..85a5cf3 100644
    --- a/checkpoint/checkpoint_mem.h
    +++ b/checkpoint/checkpoint_mem.h
    @@ -38,4 +38,9 @@ static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
    return (pgarr->nr_used == CR_PGARR_TOTAL);
    }

    +static inline int cr_pgarr_nr_free(struct cr_pgarr *pgarr)
    +{
    + return CR_PGARR_TOTAL - pgarr->nr_used;
    +}
    +
    #endif /* _CHECKPOINT_CKPT_MEM_H_ */
    diff --git a/checkpoint/restart.c b/checkpoint/restart.c
    index 766e381..f4d87ba 100644
    --- a/checkpoint/restart.c
    +++ b/checkpoint/restart.c
    @@ -78,6 +78,44 @@ int cr_read_string(struct cr_ctx *ctx, void *str, int len)
    return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
    }

    +/**
    + * cr_read_fname - read a file name
    + * @ctx: checkpoint context
    + * @fname: buffer
    + * @flen: buffer length
    + */
    +int cr_read_fname(struct cr_ctx *ctx, void *fname, int flen)
    +{
    + return cr_read_obj_type(ctx, fname, flen, CR_HDR_FNAME);
    +}
    +
    +/**
    + * cr_read_open_fname - read a file name and open a file
    + * @ctx: checkpoint context
    + * @flags: file flags
    + * @mode: file mode
    + */
    +struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
    +{
    + struct file *file;
    + char *fname;
    + int ret;
    +
    + fname = kmalloc(PATH_MAX, GFP_KERNEL);
    + if (!fname)
    + return ERR_PTR(-ENOMEM);
    +
    + ret = cr_read_fname(ctx, fname, PATH_MAX);
    + cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
    + if (ret >= 0)
    + file = filp_open(fname, flags, mode);
    + else
    + file = ERR_PTR(ret);
    +
    + kfree(fname);
    + return file;
    +}
    +
    /* read the checkpoint header */
    static int cr_read_head(struct cr_ctx *ctx)
    {
    @@ -177,6 +215,10 @@ static int cr_read_task(struct cr_ctx *ctx)
    cr_debug("task_struct: ret %d\n", ret);
    if (ret < 0)
    goto out;
    + ret = cr_read_mm(ctx);
    + cr_debug("memory: ret %d\n", ret);
    + if (ret < 0)
    + goto out;
    ret = cr_read_thread(ctx);
    cr_debug("thread: ret %d\n", ret);
    if (ret < 0)
    diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
    new file mode 100644
    index 0000000..062e56e
    --- /dev/null
    +++ b/checkpoint/rstr_mem.c
    @@ -0,0 +1,384 @@
    +/*
    + * Restart memory contents
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +#include "checkpoint_arch.h"
    +#include "checkpoint_mem.h"
    +
    +/*
    + * Unlike checkpoint, restart is executed in the context of each restarting
    + * process: vma regions are restored via a call to mmap(), and the data is
    + * read into the address space of the current process.
    + */
    +
    +
    +/**
    + * cr_read_pages_vaddrs - read addresses of pages to page-array chain
    + * @ctx - restart context
    + * @nr_pages - number of addresses to read
    + */
    +static int cr_read_pages_vaddrs(struct cr_ctx *ctx, unsigned long nr_pages)
    +{
    + struct cr_pgarr *pgarr;
    + unsigned long *vaddrp;
    + int nr, ret;
    +
    + while (nr_pages) {
    + pgarr = cr_pgarr_current(ctx);
    + if (!pgarr)
    + return -ENOMEM;
    + nr = cr_pgarr_nr_free(pgarr);
    + if (nr > nr_pages)
    + nr = nr_pages;
    + vaddrp = &pgarr->vaddrs[pgarr->nr_used];
    + ret = cr_kread(ctx, vaddrp, nr * sizeof(unsigned long));
    + if (ret < 0)
    + return ret;
    + pgarr->nr_used += nr;
    + nr_pages -= nr;
    + }
    + return 0;
    +}
    +
    +static int cr_page_read(struct cr_ctx *ctx, struct page *page, char *buf)
    +{
    + void *ptr;
    + int ret;
    +
    + ret = cr_kread(ctx, buf, PAGE_SIZE);
    + if (ret < 0)
    + return ret;
    +
    + ptr = kmap_atomic(page, KM_USER1);
    + memcpy(ptr, buf, PAGE_SIZE);
    + kunmap_atomic(ptr, KM_USER1);
    +
    + return 0;
    +}
    +
    +/**
    + * cr_read_pages_contents - read in data of pages in page-array chain
    + * @ctx - restart context
    + */
    +static int cr_read_pages_contents(struct cr_ctx *ctx)
    +{
    + struct mm_struct *mm = current->mm;
    + struct cr_pgarr *pgarr;
    + unsigned long *vaddrs;
    + char *buf;
    + int i, ret = 0;
    +
    + buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
    + if (!buf)
    + return -ENOMEM;
    +
    + down_read(&mm->mmap_sem);
    + list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
    + vaddrs = pgarr->vaddrs;
    + for (i = 0; i < pgarr->nr_used; i++) {
    + struct page *page;
    +
    + ret = get_user_pages(current, mm, vaddrs[i],
    + 1, 1, 1, &page, NULL);
    + if (ret < 0)
    + goto out;
    +
    + ret = cr_page_read(ctx, page, buf);
    + page_cache_release(page);
    +
    + if (ret < 0)
    + goto out;
    + }
    + }
    +
    + out:
    + up_read(&mm->mmap_sem);
    + kfree(buf);
    + return ret;
    +}
    +
    +/**
    + * cr_read_private_vma_contents - restore contents of a VMA with private memory
    + * @ctx - restart context
    + *
    + * Reads a header that specifies how many pages will follow, then reads
    + * a list of virtual addresses into ctx->pgarr_list page-array chain,
    + * followed by the actual contents of the corresponding pages. Iterates
    + * these steps until reaching a header specifying "0" pages, which marks
    + * the end of the contents.
    + */
    +static int cr_read_private_vma_contents(struct cr_ctx *ctx)
    +{
    + struct cr_hdr_pgarr *hh;
    + unsigned long nr_pages;
    + int parent, ret = 0;
    +
    + while (1) {
    + hh = cr_hbuf_get(ctx, sizeof(*hh));
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_PGARR);
    + if (parent != 0) {
    + if (parent < 0)
    + ret = parent;
    + else
    + ret = -EINVAL;
    + cr_hbuf_put(ctx, sizeof(*hh));
    + break;
    + }
    +
    + cr_debug("nr_pages %ld\n", (unsigned long) hh->nr_pages);
    +
    + nr_pages = hh->nr_pages;
    + cr_hbuf_put(ctx, sizeof(*hh));
    +
    + if (!nr_pages)
    + break;
    +
    + ret = cr_read_pages_vaddrs(ctx, nr_pages);
    + if (ret < 0)
    + break;
    + ret = cr_read_pages_contents(ctx);
    + if (ret < 0)
    + break;
    + cr_pgarr_reset_all(ctx);
    + }
    +
    + return ret;
    +}
    +
    +/**
    + * cr_calc_map_prot_bits - convert vm_flags to mmap protection
    + * orig_vm_flags: source vm_flags
    + */
    +static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
    +{
    + unsigned long vm_prot = 0;
    +
    + if (orig_vm_flags & VM_READ)
    + vm_prot |= PROT_READ;
    + if (orig_vm_flags & VM_WRITE)
    + vm_prot |= PROT_WRITE;
    + if (orig_vm_flags & VM_EXEC)
    + vm_prot |= PROT_EXEC;
    + if (orig_vm_flags & PROT_SEM) /* only (?) with IPC-SHM */
    + vm_prot |= PROT_SEM;
    +
    + return vm_prot;
    +}
    +
    +/**
    + * cr_calc_map_flags_bits - convert vm_flags to mmap flags
    + * orig_vm_flags: source vm_flags
    + */
    +static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
    +{
    + unsigned long vm_flags = 0;
    +
    + vm_flags = MAP_FIXED;
    + if (orig_vm_flags & VM_GROWSDOWN)
    + vm_flags |= MAP_GROWSDOWN;
    + if (orig_vm_flags & VM_DENYWRITE)
    + vm_flags |= MAP_DENYWRITE;
    + if (orig_vm_flags & VM_EXECUTABLE)
    + vm_flags |= MAP_EXECUTABLE;
    + if (orig_vm_flags & VM_MAYSHARE)
    + vm_flags |= MAP_SHARED;
    + else
    + vm_flags |= MAP_PRIVATE;
    +
    + return vm_flags;
    +}
    +
    +static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
    +{
    + struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
    + unsigned long addr;
    + struct file *file = NULL;
    + int parent, ret = -EINVAL;
    +
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
    + if (parent < 0) {
    + ret = parent;
    + goto err;
    + } else if (parent != 0)
    + goto err;
    +
    + cr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
    + (unsigned long) hh->vm_end, (int) hh->vma_type);
    +
    + if (hh->vm_end < hh->vm_start)
    + goto err;
    +
    + vm_start = hh->vm_start;
    + vm_pgoff = hh->vm_pgoff;
    + vm_size = hh->vm_end - hh->vm_start;
    + vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
    + vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
    +
    + switch (hh->vma_type) {
    +
    + case CR_VMA_ANON: /* anonymous private mapping */
    + if (hh->vm_flags & VM_SHARED)
    + goto err;
    + /*
    + * vm_pgoff for anonymous mapping is the "global" page
    + * offset (namely from addr 0x0), so we force a zero
    + */
    + vm_pgoff = 0;
    + break;
    +
    + case CR_VMA_FILE: /* private mapping from a file */
    + if (hh->vm_flags & VM_SHARED)
    + goto err;
    + /*
    + * for private mapping using 'read-only' is sufficient
    + */
    + file = cr_read_open_fname(ctx, O_RDONLY, 0);
    + if (IS_ERR(file)) {
    + ret = PTR_ERR(file);
    + goto err;
    + }
    + break;
    +
    + default:
    + goto err;
    +
    + }
    +
    + cr_hbuf_put(ctx, sizeof(*hh));
    +
    + down_write(&mm->mmap_sem);
    + addr = do_mmap_pgoff(file, vm_start, vm_size,
    + vm_prot, vm_flags, vm_pgoff);
    + up_write(&mm->mmap_sem);
    + cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
    + vm_size, vm_prot, vm_flags, vm_pgoff, addr);
    +
    + /* the file (if opened) is now referenced by the vma */
    + if (file)
    + filp_close(file, NULL);
    +
    + if (IS_ERR((void *) addr))
    + return PTR_ERR((void *) addr);
    +
    + /*
    + * CR_VMA_ANON: read in memory as is
    + * CR_VMA_FILE: read in memory as is
    + * (more to follow ...)
    + */
    +
    + switch (hh->vma_type) {
    + case CR_VMA_ANON:
    + case CR_VMA_FILE:
    + /* standard case: read the data into the memory */
    + ret = cr_read_private_vma_contents(ctx);
    + break;
    + }
    +
    + if (ret < 0)
    + return ret;
    +
    + cr_debug("vma retval %d\n", ret);
    + return 0;
    +
    + err:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    +
    +static int cr_destroy_mm(struct mm_struct *mm)
    +{
    + struct vm_area_struct *vmnext = mm->mmap;
    + struct vm_area_struct *vma;
    + int ret;
    +
    + while (vmnext) {
    + vma = vmnext;
    + vmnext = vmnext->vm_next;
    + ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
    + if (ret < 0) {
    + pr_debug("C/R: restart failed do_munmap (%d)\n", ret);
    + return ret;
    + }
    + }
    + return 0;
    +}
    +
    +int cr_read_mm(struct cr_ctx *ctx)
    +{
    + struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct mm_struct *mm;
    + int nr, parent, ret;
    +
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
    + if (parent < 0) {
    + ret = parent;
    + goto out;
    + }
    +
    + ret = -EINVAL;
    +#if 0 /* activate when containers are used */
    + if (parent != task_pid_vnr(current))
    + goto out;
    +#endif
    + cr_debug("map_count %d\n", hh->map_count);
    +
    + /* XXX need more sanity checks */
    + if (hh->start_code > hh->end_code ||
    + hh->start_data > hh->end_data || hh->map_count < 0)
    + goto out;
    +
    + mm = current->mm;
    +
    + /* point of no return -- destruct current mm */
    + down_write(&mm->mmap_sem);
    + ret = cr_destroy_mm(mm);
    + if (ret < 0) {
    + up_write(&mm->mmap_sem);
    + goto out;
    + }
    + mm->start_code = hh->start_code;
    + mm->end_code = hh->end_code;
    + mm->start_data = hh->start_data;
    + mm->end_data = hh->end_data;
    + mm->start_brk = hh->start_brk;
    + mm->brk = hh->brk;
    + mm->start_stack = hh->start_stack;
    + mm->arg_start = hh->arg_start;
    + mm->arg_end = hh->arg_end;
    + mm->env_start = hh->env_start;
    + mm->env_end = hh->env_end;
    + up_write(&mm->mmap_sem);
    +
    + /* FIX: need also mm->flags */
    +
    + for (nr = hh->map_count; nr; nr--) {
    + ret = cr_read_vma(ctx, mm);
    + if (ret < 0)
    + goto out;
    + }
    +
    + ret = cr_read_mm_context(ctx, mm, hh->objref);
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    diff --git a/include/asm-x86/checkpoint_hdr.h b/include/asm-x86/checkpoint_hdr.h
    index 6bc61ac..f8eee6a 100644
    --- a/include/asm-x86/checkpoint_hdr.h
    +++ b/include/asm-x86/checkpoint_hdr.h
    @@ -74,4 +74,8 @@ struct cr_hdr_mm_context {
    __s16 nldt;
    } __attribute__((aligned(8)));

    +
    +/* misc prototypes from kernel (not defined elsewhere) */
    +asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
    +
    #endif /* __ASM_X86_CKPT_HDR__H */
    diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
    index 70ef22f..3563bce 100644
    --- a/include/linux/checkpoint.h
    +++ b/include/linux/checkpoint.h
    @@ -55,6 +55,9 @@ extern int cr_write_fname(struct cr_ctx *ctx,
    extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
    extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
    extern int cr_read_string(struct cr_ctx *ctx, void *str, int len);
    +extern int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
    +extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
    + int flags, int mode);

    extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
    extern int cr_read_mm(struct cr_ctx *ctx);
    --
    1.5.4.3
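
The comment in cr_read_mm_context() above talks about reversing fill_ldt() to get from a saved descriptor back to a 'struct user_desc'. Below is a small user-space sketch of that round trip. The sk_* structs are hypothetical stand-ins that only mirror the field names, and sk_pack() is modeled on fill_ldt() as an assumption rather than copied from kernel headers. Unlike the loop in the patch, this sketch also folds in base2 and the upper limit bits. Whether that difference matters for the entries being restored is not settled in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for 'struct user_desc' (field names only). */
struct sk_user_desc {
    uint32_t entry_number;
    uint32_t base_addr;
    uint32_t limit;
    unsigned int seg_32bit:1;
    unsigned int contents:2;
    unsigned int read_exec_only:1;
    unsigned int limit_in_pages:1;
    unsigned int seg_not_present:1;
    unsigned int useable:1;
};

/* Hypothetical stand-in for 'struct desc_struct' (field names only). */
struct sk_desc {
    uint16_t limit0;
    uint16_t base0;
    unsigned int base1:8, type:4, s:1, dpl:2, p:1;
    unsigned int limit:4, avl:1, l:1, d:1, g:1, base2:8;
};

/* Forward direction, modeled on fill_ldt() (an assumption). */
void sk_pack(struct sk_desc *d, const struct sk_user_desc *u)
{
    d->limit0 = u->limit & 0xffff;
    d->base0  = u->base_addr & 0xffff;
    d->base1  = (u->base_addr >> 16) & 0xff;
    d->base2  = (u->base_addr >> 24) & 0xff;
    d->limit  = (u->limit >> 16) & 0xf;
    d->type   = ((u->read_exec_only ^ 1u) << 1) | (u->contents << 2);
    d->s      = 1;
    d->dpl    = 3;
    d->l      = 0;
    d->p      = u->seg_not_present ^ 1u;
    d->avl    = u->useable;
    d->d      = u->seg_32bit;
    d->g      = u->limit_in_pages;
}

/* Reverse direction: rebuild the user-visible description of entry n. */
void sk_unpack(struct sk_user_desc *u, const struct sk_desc *d, uint32_t n)
{
    u->entry_number    = n;
    u->base_addr       = (uint32_t)d->base0 |
                         ((uint32_t)d->base1 << 16) |
                         ((uint32_t)d->base2 << 24);
    u->limit           = (uint32_t)d->limit0 | ((uint32_t)d->limit << 16);
    u->seg_32bit       = d->d;
    u->contents        = (d->type >> 2) & 3u;
    u->read_exec_only  = ((d->type >> 1) & 1u) ^ 1u;
    u->limit_in_pages  = d->g;
    u->seg_not_present = d->p ^ 1u;
    u->useable         = d->avl;
}

/* Returns 1 when pack followed by unpack reproduces every field. */
int sk_ldt_roundtrip_ok(void)
{
    struct sk_user_desc in = {0}, out = {0};
    struct sk_desc d = {0};

    in.entry_number = 5;
    in.base_addr = 0x12345678u;
    in.limit = 0xabcdeu;
    in.seg_32bit = 1;
    in.contents = 2;
    in.read_exec_only = 1;
    in.limit_in_pages = 1;
    in.seg_not_present = 0;
    in.useable = 1;

    sk_pack(&d, &in);
    sk_unpack(&out, &d, in.entry_number);

    return out.entry_number == in.entry_number &&
           out.base_addr == in.base_addr &&
           out.limit == in.limit &&
           out.seg_32bit == in.seg_32bit &&
           out.contents == in.contents &&
           out.read_exec_only == in.read_exec_only &&
           out.limit_in_pages == in.limit_in_pages &&
           out.seg_not_present == in.seg_not_present &&
           out.useable == in.useable;
}
```

Masking the single-bit fields before the XOR keeps the write/conforming bits of 'type' from leaking into read_exec_only.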


  6. [RFC v8][PATCH 04/12] General infrastructure for checkpoint restart

    Add those interfaces, as well as helpers needed to easily manage the
    file format. The code is roughly broken out as follows:

    checkpoint/sys.c - user/kernel data transfer, as well as setup of the
    CR context (a per-checkpoint data structure for housekeeping)
    checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
    checkpoint/restart.c - input wrappers and basic restart handling

    For now, we can only checkpoint the 'current' task ("self" checkpoint),
    and the 'pid' argument to the syscall is ignored.

    Patches to add the per-architecture support as well as the actual
    work to do the memory checkpoint follow in subsequent patches.
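
For readers who want a feel for the record helpers before reading the diff, here is a user-space sketch of the header-plus-payload scheme that cr_write_obj() and cr_read_obj_type() implement in this patch. It writes into a memory buffer instead of a file. The sk_* names, the field widths in struct sk_hdr, the SK_HDR_STRING value, and the choice of error codes in the buffer helpers are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_HDR_STRING 3   /* hypothetical record type value */

/* Assumed layout: the patch's cr_hdr carries type, len and parent. */
struct sk_hdr {
    int16_t type;
    int16_t len;
    uint32_t parent;
};

struct sk_ctx {
    unsigned char buf[256];
    size_t wpos, rpos;
};

static int sk_kwrite(struct sk_ctx *c, const void *p, size_t n)
{
    if (n > sizeof(c->buf) - c->wpos)
        return -ENOSPC;
    memcpy(c->buf + c->wpos, p, n);
    c->wpos += n;
    return 0;
}

static int sk_kread(struct sk_ctx *c, void *p, size_t n)
{
    if (n > c->wpos - c->rpos)
        return -EIO;
    memcpy(p, c->buf + c->rpos, n);
    c->rpos += n;
    return 0;
}

/* Write the descriptor first, then the payload it describes. */
int sk_write_obj(struct sk_ctx *c, const struct sk_hdr *h, const void *payload)
{
    int ret = sk_kwrite(c, h, sizeof(*h));

    if (ret < 0)
        return ret;
    return sk_kwrite(c, payload, (size_t)h->len);
}

/* Read one record; return its parent reference, or -EINVAL on a mismatch. */
int sk_read_obj_type(struct sk_ctx *c, void *buf, int n, int type)
{
    struct sk_hdr h;
    int ret = sk_kread(c, &h, sizeof(h));

    if (ret < 0)
        return ret;
    if (h.len < 0 || h.len > n)
        return -EINVAL;             /* payload would overflow the buffer */
    ret = sk_kread(c, buf, (size_t)h.len);
    if (ret < 0)
        return ret;
    return h.type == type ? (int)h.parent : -EINVAL;
}

/* Write one string record, then read it back expecting 'expect_type'. */
int sk_record_demo(int expect_type)
{
    struct sk_ctx c = { .wpos = 0, .rpos = 0 };
    struct sk_hdr h = { .type = SK_HDR_STRING, .len = 6, .parent = 0 };
    char out[16] = "";

    if (sk_write_obj(&c, &h, "hello") < 0)   /* 6 bytes including the NUL */
        return -1;
    return sk_read_obj_type(&c, out, (int)sizeof(out), expect_type);
}
```

Checking the length against the caller's buffer before reading the payload is what keeps a malformed image from overrunning the destination.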

    Changelog[v6]:
    - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (although it's not really needed)
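
The get/put balancing in the v6 changelog is easier to picture with a toy allocator. The sketch below assumes that cr_hbuf_get() hands out space from a per-context scratch buffer in stack order and that cr_hbuf_put() gives it back. The real helpers are not shown in this part of the thread, so the sk_* names, the buffer size and the behaviour are all guesses made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SK_HBUF_TOTAL 128   /* hypothetical scratch buffer size */

struct sk_hctx {
    unsigned char hbuf[SK_HBUF_TOTAL];
    size_t hpos;            /* bytes currently handed out */
};

/* Hand out n bytes from the top of the scratch buffer, or NULL if full. */
void *sk_hbuf_get(struct sk_hctx *c, size_t n)
{
    void *p;

    if (n > SK_HBUF_TOTAL - c->hpos)
        return NULL;
    p = c->hbuf + c->hpos;
    c->hpos += n;
    return p;
}

/* Give back the most recent n bytes; callers release in reverse order. */
void sk_hbuf_put(struct sk_hctx *c, size_t n)
{
    if (n <= c->hpos)
        c->hpos -= n;
}

/* Nested get/put pairs; returns the bytes still in use (0 when balanced). */
size_t sk_hbuf_demo(void)
{
    struct sk_hctx c = { .hpos = 0 };
    void *outer = sk_hbuf_get(&c, 32);
    void *inner = sk_hbuf_get(&c, 16);

    if (!outer || !inner)
        return (size_t)-1;
    sk_hbuf_put(&c, 16);    /* inner first */
    sk_hbuf_put(&c, 32);    /* then outer */
    return c.hpos;
}
```

With a scheme like this, a missing put on an error path would slowly eat the scratch buffer, which is one reason to keep every get paired with a put even where it looks redundant.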

    Changelog[v5]:
    - Rename headers files s/ckpt/checkpoint/

    Changelog[v2]:
    - Added utsname->{release,version,machine} to checkpoint header
    - Pad header structures to 64 bits to ensure compatibility

    Signed-off-by: Oren Laadan
    Acked-by: Serge Hallyn
    Signed-off-by: Dave Hansen
    ---
    Makefile | 2 +-
    checkpoint/Makefile | 2 +-
    checkpoint/checkpoint.c | 174 +++++++++++++++++++++++++++++++
    checkpoint/restart.c | 197 ++++++++++++++++++++++++++++++++++++
    checkpoint/sys.c | 219 +++++++++++++++++++++++++++++++++++++++-
    include/linux/checkpoint.h | 56 ++++++++++
    include/linux/checkpoint_hdr.h | 75 ++++++++++++++
    include/linux/magic.h | 3 +
    8 files changed, 722 insertions(+), 6 deletions(-)
    create mode 100644 checkpoint/checkpoint.c
    create mode 100644 checkpoint/restart.c
    create mode 100644 include/linux/checkpoint.h
    create mode 100644 include/linux/checkpoint_hdr.h

    diff --git a/Makefile b/Makefile
    index ce9eceb..cb99128 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -619,7 +619,7 @@ export mod_strip_cmd


    ifeq ($(KBUILD_EXTMOD),)
    -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
    +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/

    vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
    $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
    diff --git a/checkpoint/Makefile b/checkpoint/Makefile
    index 07d018b..d2df68c 100644
    --- a/checkpoint/Makefile
    +++ b/checkpoint/Makefile
    @@ -2,4 +2,4 @@
    # Makefile for linux checkpoint/restart.
    #

    -obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
    +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
    diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
    new file mode 100644
    index 0000000..317b0e8
    --- /dev/null
    +++ b/checkpoint/checkpoint.c
    @@ -0,0 +1,174 @@
    +/*
    + * Checkpoint logic and helpers
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +/**
    + * cr_write_obj - write a record described by a cr_hdr
    + * @ctx: checkpoint context
    + * @h: record descriptor
    + * @buf: record buffer
    + */
    +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
    +{
    + int ret;
    +
    + ret = cr_kwrite(ctx, h, sizeof(*h));
    + if (ret < 0)
    + return ret;
    + return cr_kwrite(ctx, buf, h->len);
    +}
    +
    +/**
    + * cr_write_string - write a string
    + * @ctx: checkpoint context
    + * @str: string pointer
    + * @len: string length
    + */
    +int cr_write_string(struct cr_ctx *ctx, char *str, int len)
    +{
    + struct cr_hdr h;
    +
    + h.type = CR_HDR_STRING;
    + h.len = len;
    + h.parent = 0;
    +
    + return cr_write_obj(ctx, &h, str);
    +}
    +
    +/* write the checkpoint header */
    +static int cr_write_head(struct cr_ctx *ctx)
    +{
    + struct cr_hdr h;
    + struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct new_utsname *uts;
    + struct timeval ktv;
    + int ret;
    +
    + h.type = CR_HDR_HEAD;
    + h.len = sizeof(*hh);
    + h.parent = 0;
    +
    + do_gettimeofday(&ktv);
    +
    + hh->magic = CHECKPOINT_MAGIC_HEAD;
    + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
    + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
    + hh->patch = (LINUX_VERSION_CODE) & 0xff;
    +
    + hh->rev = CR_VERSION;
    +
    + hh->flags = ctx->flags;
    + hh->time = ktv.tv_sec;
    +
    + uts = utsname();
    + memcpy(hh->release, uts->release, __NEW_UTS_LEN);
    + memcpy(hh->version, uts->version, __NEW_UTS_LEN);
    + memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
    +
    + ret = cr_write_obj(ctx, &h, hh);
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    +
    +/* write the checkpoint trailer */
    +static int cr_write_tail(struct cr_ctx *ctx)
    +{
    + struct cr_hdr h;
    + struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + int ret;
    +
    + h.type = CR_HDR_TAIL;
    + h.len = sizeof(*hh);
    + h.parent = 0;
    +
    + hh->magic = CHECKPOINT_MAGIC_TAIL;
    +
    + ret = cr_write_obj(ctx, &h, hh);
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    +
    +/* dump the task_struct of a given task */
    +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
    +{
    + struct cr_hdr h;
    + struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + int ret;
    +
    + h.type = CR_HDR_TASK;
    + h.len = sizeof(*hh);
    + h.parent = 0;
    +
    + hh->state = t->state;
    + hh->exit_state = t->exit_state;
    + hh->exit_code = t->exit_code;
    + hh->exit_signal = t->exit_signal;
    +
    + hh->task_comm_len = TASK_COMM_LEN;
    +
    + /* FIXME: save remaining relevant task_struct fields */
    +
    + ret = cr_write_obj(ctx, &h, hh);
    + cr_hbuf_put(ctx, sizeof(*hh));
    + if (ret < 0)
    + return ret;
    +
    + return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
    +}
    +
    +/* dump the entire state of a given task */
    +static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
    +{
    + int ret ;
    +
    + if (t->state == TASK_DEAD) {
    + pr_warning("C/R: task may not be in state TASK_DEAD\n");
    + return -EAGAIN;
    + }
    +
    + ret = cr_write_task_struct(ctx, t);
    + cr_debug("ret %d\n", ret);
    +
    + return ret;
    +}
    +
    +int do_checkpoint(struct cr_ctx *ctx)
    +{
    + int ret;
    +
    + /* FIX: need to test whether container is checkpointable */
    +
    + ret = cr_write_head(ctx);
    + if (ret < 0)
    + goto out;
    + ret = cr_write_task(ctx, current);
    + if (ret < 0)
    + goto out;
    + ret = cr_write_tail(ctx);
    + if (ret < 0)
    + goto out;
    +
    + /* on success, return (unique) checkpoint identifier */
    + ret = ctx->crid;
    +
    + out:
    + return ret;
    +}
    diff --git a/checkpoint/restart.c b/checkpoint/restart.c
    new file mode 100644
    index 0000000..69befa7
    --- /dev/null
    +++ b/checkpoint/restart.c
    @@ -0,0 +1,197 @@
    +/*
    + * Restart logic and helpers
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +/**
    + * cr_read_obj - read a whole record (cr_hdr followed by payload)
    + * @ctx: checkpoint context
    + * @h: record descriptor
    + * @buf: record buffer
    + * @n: available buffer size
    + *
    + * Returns size of payload
    + */
    +int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
    +{
    + int ret;
    +
    + ret = cr_kread(ctx, h, sizeof(*h));
    + if (ret < 0)
    + return ret;
    +
    + cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
    +
    + if (h->len < 0 || h->len > n)
    + return -EINVAL;
    +
    + return cr_kread(ctx, buf, h->len);
    +}
    +
    +/**
    + * cr_read_obj_type - read a whole record of expected type
    + * @ctx: checkpoint context
    + * @buf: record buffer
    + * @n: available buffer size
    + * @type: expected record type
    + *
    + * Returns object reference of the parent object
    + */
    +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
    +{
    + struct cr_hdr h;
    + int ret;
    +
    + ret = cr_read_obj(ctx, &h, buf, n);
    + if (ret < 0)
    + return ret;
    +
    + ret = -EINVAL;
    + if (h.type == type)
    + ret = h.parent;
    +
    + return ret;
    +}
    +
    +/**
    + * cr_read_string - read a string
    + * @ctx: checkpoint context
    + * @str: string buffer
    + * @len: buffer length
    + */
    +int cr_read_string(struct cr_ctx *ctx, void *str, int len)
    +{
    + return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
    +}
    +
    +/* read the checkpoint header */
    +static int cr_read_head(struct cr_ctx *ctx)
    +{
    + struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + int parent, ret = -EINVAL;
    +
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
    + if (parent < 0) {
    + ret = parent;
    + goto out;
    + } else if (parent != 0)
    + goto out;
    +
    + if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
    + hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
    + hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
    + hh->patch != ((LINUX_VERSION_CODE) & 0xff))
    + goto out;
    +
    + if (hh->flags & ~CR_CTX_CKPT)
    + goto out;
    +
    + ctx->oflags = hh->flags;
    +
    + /* FIX: verify compatibility of release, version and machine */
    +
    + ret = 0;
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    +
    +/* read the checkpoint trailer */
    +static int cr_read_tail(struct cr_ctx *ctx)
    +{
    + struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + int parent, ret = -EINVAL;
    +
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
    + if (parent < 0) {
    + ret = parent;
    + goto out;
    + } else if (parent != 0)
    + goto out;
    +
    + if (hh->magic != CHECKPOINT_MAGIC_TAIL)
    + goto out;
    +
    + ret = 0;
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    +
    +/* read the task_struct into the current task */
    +static int cr_read_task_struct(struct cr_ctx *ctx)
    +{
    + struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
    + struct task_struct *t = current;
    + char *buf;
    + int parent, ret = -EINVAL;
    +
    + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
    + if (parent < 0) {
    + ret = parent;
    + goto out;
    + } else if (parent != 0)
    + goto out;
    +
    + /* upper limit for task_comm_len to prevent DoS */
    + if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
    + goto out;
    +
    + buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
    + if (!buf)
    + goto out;
    + ret = cr_read_string(ctx, buf, hh->task_comm_len);
    + if (!ret) {
    + /* if t->comm is too long, silently truncate */
    + memset(t->comm, 0, TASK_COMM_LEN);
    + memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
    + }
    + kfree(buf);
    +
    + /* FIXME: restore remaining relevant task_struct fields */
    + out:
    + cr_hbuf_put(ctx, sizeof(*hh));
    + return ret;
    +}
    +
    +/* read the entire state of the current task */
    +static int cr_read_task(struct cr_ctx *ctx)
    +{
    + int ret;
    +
    + ret = cr_read_task_struct(ctx);
    + cr_debug("ret %d\n", ret);
    +
    + return ret;
    +}
    +
    +int do_restart(struct cr_ctx *ctx)
    +{
    + int ret;
    +
    + ret = cr_read_head(ctx);
    + if (ret < 0)
    + goto out;
    + ret = cr_read_task(ctx);
    + if (ret < 0)
    + goto out;
    + ret = cr_read_tail(ctx);
    + if (ret < 0)
    + goto out;
    +
    + /* on success, adjust the return value if needed [TODO] */
    + out:
    + return ret;
    +}
    diff --git a/checkpoint/sys.c b/checkpoint/sys.c
    index 375129c..3ce84ba 100644
    --- a/checkpoint/sys.c
    +++ b/checkpoint/sys.c
    @@ -10,6 +10,187 @@

    #include
    #include
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +/*
    + * helpers to write/read to/from the image file descriptor
    + *
    + * cr_uwrite() - write a user-space buffer to the checkpoint image
    + * cr_kwrite() - write a kernel-space buffer to the checkpoint image
    + * cr_uread() - read from the checkpoint image to a user-space buffer
    + * cr_kread() - read from the checkpoint image to a kernel-space buffer
    + */
    +
    +int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
    +{
    + struct file *file = ctx->file;
    + ssize_t nwrite;
    + int nleft;
    +
    + for (nleft = count; nleft; nleft -= nwrite) {
    + loff_t pos = file_pos_read(file);
    + nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
    + file_pos_write(file, pos);
    + if (nwrite <= 0) {
    + if (nwrite == -EAGAIN)
    + nwrite = 0;
    + else
    + return nwrite;
    + }
    + buf += nwrite;
    + }
    +
    + ctx->total += count;
    + return 0;
    +}
    +
    +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
    +{
    + mm_segment_t oldfs;
    + int ret;
    +
    + oldfs = get_fs();
    + set_fs(KERNEL_DS);
    + ret = cr_uwrite(ctx, buf, count);
    + set_fs(oldfs);
    +
    + return ret;
    +}
    +
    +int cr_uread(struct cr_ctx *ctx, void *buf, int count)
    +{
    + struct file *file = ctx->file;
    + ssize_t nread;
    + int nleft;
    +
    + for (nleft = count; nleft; nleft -= nread) {
    + loff_t pos = file_pos_read(file);
    + nread = vfs_read(file, (char __user *) buf, nleft, &pos);
    + file_pos_write(file, pos);
    + if (nread <= 0) {
    + if (nread == -EAGAIN)
    + nread = 0;
    + else
    + return nread;
    + }
    + buf += nread;
    + }
    +
    + ctx->total += count;
    + return 0;
    +}
    +
    +int cr_kread(struct cr_ctx *ctx, void *buf, int count)
    +{
    + mm_segment_t oldfs;
    + int ret;
    +
    + oldfs = get_fs();
    + set_fs(KERNEL_DS);
    + ret = cr_uread(ctx, buf, count);
    + set_fs(oldfs);
    +
    + return ret;
    +}
    +
    +/*
    + * During checkpoint and restart the code writes out/reads in data
    + * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
    + * Because operations can be nested, use cr_hbuf_get() to reserve space
    + * in the buffer, then cr_hbuf_put() when you no longer need that space.
    + */
    +
    +/*
    + * ctx->hbuf is used to hold headers and data of known (or bound),
    + * static sizes. In some cases, multiple headers may be allocated in
    + * a nested manner. The size should accommodate all headers, nested
    + * or not, on all archs.
    + */
    +#define CR_HBUF_TOTAL (8 * 4096)
    +
    +/**
    + * cr_hbuf_get - reserve space on the hbuf
    + * @ctx: checkpoint context
    + * @n: number of bytes to reserve
    + *
    + * Returns pointer to reserved space
    + */
    +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
    +{
    + void *ptr;
    +
    + /*
    + * Since requests depend on logic and static header sizes (not on
    + * user data), space should always suffice, unless someone either
    + * made a structure bigger or call path deeper than expected.
    + */
    + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
    + ptr = ctx->hbuf + ctx->hpos;
    + ctx->hpos += n;
    + return ptr;
    +}
    +
    +/**
    + * cr_hbuf_put - unreserve space on the hbuf
    + * @ctx: checkpoint context
    + * @n: number of bytes to unreserve
    + */
    +void cr_hbuf_put(struct cr_ctx *ctx, int n)
    +{
    + BUG_ON(ctx->hpos < n);
    + ctx->hpos -= n;
    +}
    +
    +/*
    + * helpers to manage C/R contexts; one is allocated per checkpoint and/or
    + * restart operation and persists until the operation is completed.
    + */
    +
    +/* unique checkpoint identifier (FIXME: should be per-container) */
    +static atomic_t cr_ctx_count = ATOMIC_INIT(0);
    +
    +static void cr_ctx_free(struct cr_ctx *ctx)
    +{
    + if (ctx->file)
    + fput(ctx->file);
    + kfree(ctx->hbuf);
    + kfree(ctx);
    +}
    +
    +static struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
    +{
    + struct cr_ctx *ctx;
    + int err;
    +
    + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
    + if (!ctx)
    + return ERR_PTR(-ENOMEM);
    +
    + ctx->pid = pid;
    + ctx->flags = flags;
    +
    + err = -EBADF;
    + ctx->file = fget(fd);
    + if (!ctx->file)
    + goto err;
    +
    + err = -ENOMEM;
    + ctx->hbuf = kmalloc(CR_HBUF_TOTAL, GFP_KERNEL);
    + if (!ctx->hbuf)
    + goto err;
    +
    + ctx->crid = atomic_inc_return(&cr_ctx_count);
    +
    + return ctx;
    +
    + err:
    + cr_ctx_free(ctx);
    + return ERR_PTR(err);
    +}

    /**
    * sys_checkpoint - checkpoint a container
    @@ -22,9 +203,26 @@
    */
    asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
    {
    - pr_debug("sys_checkpoint not implemented yet\n");
    - return -ENOSYS;
    + struct cr_ctx *ctx;
    + int ret;
    +
    + /* no flags for now */
    + if (flags)
    + return -EINVAL;
    +
    + ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
    + if (IS_ERR(ctx))
    + return PTR_ERR(ctx);
    +
    + ret = do_checkpoint(ctx);
    +
    + if (!ret)
    + ret = ctx->crid;
    +
    + cr_ctx_free(ctx);
    + return ret;
    }
    +
    /**
    * sys_restart - restart a container
    * @crid: checkpoint image identifier
    @@ -36,6 +234,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
    */
    asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
    {
    - pr_debug("sys_restart not implemented yet\n");
    - return -ENOSYS;
    + struct cr_ctx *ctx;
    + int ret;
    +
    + /* no flags for now */
    + if (flags)
    + return -EINVAL;
    +
    + ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
    + if (IS_ERR(ctx))
    + return PTR_ERR(ctx);
    +
    + ret = do_restart(ctx);
    +
    + cr_ctx_free(ctx);
    + return ret;
    }
    diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
    new file mode 100644
    index 0000000..0f47160
    --- /dev/null
    +++ b/include/linux/checkpoint.h
    @@ -0,0 +1,56 @@
    +#ifndef _CHECKPOINT_CKPT_H_
    +#define _CHECKPOINT_CKPT_H_
    +/*
    + * Generic container checkpoint-restart
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#define CR_VERSION 1
    +
    +struct cr_ctx {
    + pid_t pid; /* container identifier */
    + int crid; /* unique checkpoint id */
    +
    + unsigned long flags;
    + unsigned long oflags; /* restart: old flags */
    +
    + struct file *file;
    + int total; /* total read/written */
    +
    + void *hbuf; /* temporary buffer for headers */
    + int hpos; /* position in headers buffer */
    +};
    +
    +/* cr_ctx: flags */
    +#define CR_CTX_CKPT 0x1
    +#define CR_CTX_RSTR 0x2
    +
    +extern int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
    +extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
    +extern int cr_uread(struct cr_ctx *ctx, void *buf, int count);
    +extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
    +
    +extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
    +extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
    +
    +struct cr_hdr;
    +
    +extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
    +extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
    +
    +extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
    +extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
    +extern int cr_read_string(struct cr_ctx *ctx, void *str, int len);
    +
    +extern int do_checkpoint(struct cr_ctx *ctx);
    +extern int do_restart(struct cr_ctx *ctx);
    +
    +#define cr_debug(fmt, args...) \
    + pr_debug("[CR:%s] " fmt, __func__, ## args)
    +
    +#endif /* _CHECKPOINT_CKPT_H_ */
    diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
    new file mode 100644
    index 0000000..79e4df2
    --- /dev/null
    +++ b/include/linux/checkpoint_hdr.h
    @@ -0,0 +1,75 @@
    +#ifndef _CHECKPOINT_CKPT_HDR_H_
    +#define _CHECKPOINT_CKPT_HDR_H_
    +/*
    + * Generic container checkpoint-restart
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +#include
    +
    +/*
    + * To maintain compatibility between 32-bit and 64-bit architecture flavors,
    + * keep data 64-bit aligned: use padding for structure members, and use
    + * __attribute__ ((aligned (8))) for the entire structure.
    + */
    +
    +/* records: generic header */
    +
    +struct cr_hdr {
    + __s16 type;
    + __s16 len;
    + __u32 parent;
    +};
    +
    +/* header types */
    +enum {
    + CR_HDR_HEAD = 1,
    + CR_HDR_STRING,
    +
    + CR_HDR_TASK = 101,
    + CR_HDR_THREAD,
    + CR_HDR_CPU,
    +
    + CR_HDR_MM = 201,
    + CR_HDR_VMA,
    + CR_HDR_MM_CONTEXT,
    +
    + CR_HDR_TAIL = 5001
    +};
    +
    +struct cr_hdr_head {
    + __u64 magic;
    +
    + __u16 major;
    + __u16 minor;
    + __u16 patch;
    + __u16 rev;
    +
    + __u64 time; /* when checkpoint taken */
    + __u64 flags; /* checkpoint options */
    +
    + char release[__NEW_UTS_LEN];
    + char version[__NEW_UTS_LEN];
    + char machine[__NEW_UTS_LEN];
    +} __attribute__((aligned(8)));
    +
    +struct cr_hdr_tail {
    + __u64 magic;
    +} __attribute__((aligned(8)));
    +
    +struct cr_hdr_task {
    + __u32 state;
    + __u32 exit_state;
    + __u32 exit_code;
    + __u32 exit_signal;
    +
    + __s32 task_comm_len;
    +} __attribute__((aligned(8)));
    +
    +#endif /* _CHECKPOINT_CKPT_HDR_H_ */
    diff --git a/include/linux/magic.h b/include/linux/magic.h
    index 1fa0c2c..c2b811c 100644
    --- a/include/linux/magic.h
    +++ b/include/linux/magic.h
    @@ -42,4 +42,7 @@
    #define FUTEXFS_SUPER_MAGIC 0xBAD1DEA
    #define INOTIFYFS_SUPER_MAGIC 0x2BAD1DEA

    +#define CHECKPOINT_MAGIC_HEAD 0x00feed0cc0a2d200LL
    +#define CHECKPOINT_MAGIC_TAIL 0x002d2a0cc0deef00LL
    +
    #endif /* __LINUX_MAGIC_H__ */
    --
    1.5.4.3
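
As a reading aid for the restart path in this patch, here is a minimal userspace sketch of the pattern that cr_read_head() and cr_read_tail() appear to follow: reserve scratch space, read a record header, bound-check the payload length, read the payload, verify the type and the magic, then release the scratch space. All names below (rec_hdr, image, scratch, read_record, read_tail, DEMO_*) are hypothetical stand-ins, the image is an in-memory buffer rather than a file, and error codes are collapsed to -1. It is a sketch under those assumptions, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_TYPE_TAIL 5001                  /* same value as CR_HDR_TAIL */
#define DEMO_MAGIC     0x002d2a0cc0deef00ULL /* same value as the tail magic */

struct rec_hdr {		/* simplified mirror of struct cr_hdr */
	int16_t type;
	int16_t len;
	uint32_t parent;
};

struct image {			/* hypothetical in-memory "image file" */
	const unsigned char *data;
	size_t size, pos;
};

struct scratch {		/* plays the role of ctx->hbuf / ctx->hpos */
	unsigned char buf[256];
	int pos;
};

/* Reserve n bytes from the scratch stack (compare cr_hbuf_get). */
static void *scratch_get(struct scratch *s, int n)
{
	void *p;

	assert(s->pos + n <= (int)sizeof(s->buf));
	p = s->buf + s->pos;
	s->pos += n;
	return p;
}

/* Release the most recently reserved n bytes (compare cr_hbuf_put). */
static void scratch_put(struct scratch *s, int n)
{
	assert(s->pos >= n);
	s->pos -= n;
}

/* Copy exactly n bytes out of the image; -1 if fewer remain. */
static int image_read(struct image *img, void *dst, size_t n)
{
	if (img->size - img->pos < n)
		return -1;
	memcpy(dst, img->data + img->pos, n);
	img->pos += n;
	return 0;
}

/*
 * Read one record of the expected type into buf (compare
 * cr_read_obj_type). Returns the parent reference, or -1 on error.
 */
static int read_record(struct image *img, void *buf, int n, int type)
{
	struct rec_hdr h;

	if (image_read(img, &h, sizeof(h)) < 0)
		return -1;
	if (h.len < 0 || h.len > n)	/* payload must fit the caller's buffer */
		return -1;
	if (image_read(img, buf, (size_t)h.len) < 0)
		return -1;
	return h.type == type ? (int)h.parent : -1;
}

/* Read and validate a trailer record; scratch use is always balanced. */
static int read_tail(struct image *img, struct scratch *s)
{
	unsigned char *p = scratch_get(s, sizeof(uint64_t));
	uint64_t magic;
	int ret = -1;

	if (read_record(img, p, sizeof(uint64_t), DEMO_TYPE_TAIL) == 0) {
		memcpy(&magic, p, sizeof(magic));	/* avoid unaligned access */
		if (magic == DEMO_MAGIC)
			ret = 0;
	}
	scratch_put(s, sizeof(uint64_t));
	return ret;
}
```

The single exit through scratch_put() mirrors the "out:" label in the patch, so the reservation is released on both the success and the failure path.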


  7. [RFC v8][PATCH 12/12] Track in-kernel when we expect checkpoint/restart to work

    From: Dave Hansen

    Suggested by Ingo.

    Getting checkpoint/restart working is going to be a long effort.
    For a long time there will be many things that we know just don't
    work. That doesn't mean that it will be useless; it just means that
    there are some complicated features that we are going to have to
    fix incrementally.

    This patch introduces a new mechanism to help the checkpoint/restart
    developers. A new function pair, task_deny_checkpointing() and
    process_deny_checkpointing(), is created. When called, these tell the
    kernel that we *know* that the process has performed some activity
    that will keep it from being properly checkpointed.

    The 'flag' is an atomic_t for now so that we can have some level
    of atomicity and make sure to only warn once.

    For now, this is a one-way trip. Once a process is no longer
    'may_checkpoint' capable, neither it nor its children ever will be.
    This can, of course, be fixed up in the future. We might want to
    reset the flag when a new pid namespace is created, for instance.
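
To illustrate the one-way, warn-once behaviour described above, here is a small userspace sketch using C11 atomics. The names (demo_task, demo_deny, DEMO_DENY and so on) are hypothetical, the warnings counter exists only so the behaviour can be observed, and the kernel's atomic_dec_and_test() is approximated with atomic_fetch_sub(). It is meant as a model of the idea, not as the patch's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

struct demo_task {			/* hypothetical stand-in for task_struct */
	atomic_int may_checkpoint;	/* 1 = still believed checkpointable */
	int warnings;			/* how many times a report was printed */
};

static void demo_task_init(struct demo_task *t)
{
	atomic_init(&t->may_checkpoint, 1);
	t->warnings = 0;
}

/* A child inherits the parent's current value, as in dup_task_struct(). */
static void demo_task_fork(struct demo_task *child, struct demo_task *parent)
{
	atomic_init(&child->may_checkpoint,
		    atomic_load(&parent->may_checkpoint));
	child->warnings = 0;
}

/* Mark the task as not checkpointable; report only on the 1 -> 0 step. */
static void demo_deny(struct demo_task *t, const char *file, int line)
{
	/* atomic_fetch_sub returns the previous value */
	if (atomic_fetch_sub(&t->may_checkpoint, 1) != 1)
		return;
	t->warnings++;
	fprintf(stderr, "action cannot be checkpointed at %s:%d\n",
		file, line);
}
#define DEMO_DENY(t) demo_deny((t), __FILE__, __LINE__)

static int demo_may_checkpoint(struct demo_task *t)
{
	return atomic_load(&t->may_checkpoint) > 0;
}
```

In this model a child forked before the deny call stays checkpointable, while a child forked afterwards inherits the cleared flag and never triggers a second report.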

    Signed-off-by: Dave Hansen
    Signed-off-by: Oren Laadan
    ---
    include/linux/checkpoint.h | 33 ++++++++++++++++++++++++++++++++-
    include/linux/sched.h | 3 +++
    kernel/fork.c | 10 ++++++++++
    3 files changed, 45 insertions(+), 1 deletions(-)

    diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
    index e9d554e..70cfceb 100644
    --- a/include/linux/checkpoint.h
    +++ b/include/linux/checkpoint.h
    @@ -10,8 +10,11 @@
    * distribution for more details.
    */

    -#include
    #include
    +#include
    +#include
    +
    +#ifdef CONFIG_CHECKPOINT_RESTART

    #define CR_VERSION 2

    @@ -93,4 +96,32 @@ extern int cr_read_files(struct cr_ctx *ctx);
    #define cr_debug(fmt, args...) \
    pr_debug("[CR:%s] " fmt, __func__, ## args)

    +static inline void __task_deny_checkpointing(struct task_struct *task,
    + char *file, int line)
    +{
    + if (!atomic_dec_and_test(&task->may_checkpoint))
    + return;
    + printk(KERN_INFO "process performed an action that can not be "
    + "checkpointed at: %s:%d\n", file, line);
    + WARN_ON(1);
    +}
    +#define process_deny_checkpointing(p) \
    + __task_deny_checkpointing(p, __FILE__, __LINE__)
    +
    +/*
    + * For now, we're not going to have a distinction between
    + * tasks and processes for the purpose of c/r. But, allow
    + * these two calls anyway to make new users at least think
    + * about it.
    + */
    +#define task_deny_checkpointing(p) \
    + __task_deny_checkpointing(p, __FILE__, __LINE__)
    +
    +#else
    +
    +static inline void task_deny_checkpointing(struct task_struct *task) {}
    +static inline void process_deny_checkpointing(struct task_struct *task) {}
    +
    +#endif
    +
    #endif /* _CHECKPOINT_CKPT_H_ */
    diff --git a/include/linux/sched.h b/include/linux/sched.h
    index 3d9120c..8c50e3b 100644
    --- a/include/linux/sched.h
    +++ b/include/linux/sched.h
    @@ -1301,6 +1301,9 @@ struct task_struct {
    int latency_record_count;
    struct latency_record latency_record[LT_SAVECOUNT];
    #endif
    +#ifdef CONFIG_CHECKPOINT_RESTART
    + atomic_t may_checkpoint;
    +#endif
    };

    /*
    diff --git a/kernel/fork.c b/kernel/fork.c
    index 7ce2ebe..d6cf7e4 100644
    --- a/kernel/fork.c
    +++ b/kernel/fork.c
    @@ -194,6 +194,13 @@ void __init fork_init(unsigned long mempages)
    init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
    init_task.signal->rlim[RLIMIT_SIGPENDING] =
    init_task.signal->rlim[RLIMIT_NPROC];
    +
    +#ifdef CONFIG_CHECKPOINT_RESTART
    + /*
    + * This probably won't stay set for long...
    + */
    + atomic_set(&init_task.may_checkpoint, 1);
    +#endif
    }

    int __attribute__((weak)) arch_dup_task_struct(struct task_struct *dst,
    @@ -244,6 +251,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
    tsk->btrace_seq = 0;
    #endif
    tsk->splice_pipe = NULL;
    +#ifdef CONFIG_CHECKPOINT_RESTART
    + atomic_set(&tsk->may_checkpoint, atomic_read(&orig->may_checkpoint));
    +#endif
    return tsk;

    out:
    --
    1.5.4.3


  8. [RFC v8][PATCH 01/12] Create syscalls: sys_checkpoint, sys_restart

    Create trivial sys_checkpoint and sys_restart system calls. They will
    make it possible to checkpoint and restart an entire container, to and
    from a checkpoint image file descriptor.

    The syscalls take a file descriptor (for the image file) and flags as
    arguments. For sys_checkpoint the first argument identifies the target
    container; for sys_restart it will identify the checkpoint image.

    The checkpoint image is written to (and read from) the file descriptor
    directly from the kernel. This way the data is generated and pushed
    out naturally as resources and tasks are scanned to save their state.
    This is the approach taken by, e.g., Zap and OpenVZ.
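
For readers less familiar with this style of I/O, the following is a userspace analogue of "push the whole buffer through the descriptor": a loop that keeps calling write(2) or read(2) until every byte has been transferred, retrying on transient errors. The helper names write_all() and read_all() are made up for this sketch. Treating a premature end-of-file as an error (EIO) is a choice made here and is not taken from the patches. On a non-blocking descriptor the EAGAIN retry would spin, so a real program would wait with poll(2) instead.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write exactly count bytes to fd; 0 on success, -1 on error (errno set). */
static int write_all(int fd, const void *buf, size_t count)
{
	const char *p = buf;

	while (count > 0) {
		ssize_t n = write(fd, p, count);

		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;	/* transient: try again */
			return -1;
		}
		p += n;
		count -= (size_t)n;
	}
	return 0;
}

/* Read exactly count bytes from fd; premature EOF is reported as EIO. */
static int read_all(int fd, void *buf, size_t count)
{
	char *p = buf;

	while (count > 0) {
		ssize_t n = read(fd, p, count);

		if (n < 0) {
			if (errno == EAGAIN || errno == EINTR)
				continue;	/* transient: try again */
			return -1;
		}
		if (n == 0) {			/* EOF before count bytes */
			errno = EIO;
			return -1;
		}
		p += n;
		count -= (size_t)n;
	}
	return 0;
}
```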

    By using a return value and not a file descriptor, we can distinguish
    between a return from checkpoint, a return from restart (in case of a
    checkpoint that includes self, i.e. a task checkpointing its own
    container, or itself), and an error condition, in a manner analogous
    to a fork() call.
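
The fork()-like convention can be sketched as a small classifier over the raw return value. It assumes the convention stated in the sys_checkpoint comment in this patch (positive identifier on success, 0 when returning from restart, negative on error); the enum and function names are hypothetical. A libc-style wrapper would normally turn the negative value into -1 plus errno, which this sketch does not model.

```c
#include <assert.h>

enum ckpt_outcome {
	CKPT_ERROR,	/* negative: the call failed */
	CKPT_RESTARTED,	/* zero: execution is resuming after a restart */
	CKPT_TAKEN	/* positive: checkpoint written, value is its id */
};

/* Interpret the raw return value of the checkpoint call. */
static enum ckpt_outcome classify_checkpoint_return(long ret)
{
	if (ret < 0)
		return CKPT_ERROR;
	if (ret == 0)
		return CKPT_RESTARTED;
	return CKPT_TAKEN;
}
```

A caller would branch on the result much as it does after fork(): one path continues after a successful checkpoint, another runs only in the restarted instance.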

    We don't use copyin()/copyout() because that would require holding the
    entire image in user space, and it does not make sense for restart.
    Also, we don't use a pipe, a pseudo-fs file or the like, because they
    work by generating data on demand as the user pulls it (unless the
    entire image is buffered in the kernel) and would require more complex
    logic. They would also significantly complicate a checkpoint that
    includes self.

    Changelog[v5]:
    - Config is 'def_bool n' by default

    Signed-off-by: Oren Laadan
    Acked-by: Serge Hallyn
    Signed-off-by: Dave Hansen
    ---
    arch/x86/kernel/syscall_table_32.S | 2 +
    checkpoint/Kconfig | 11 +++++++++
    checkpoint/Makefile | 5 ++++
    checkpoint/sys.c | 41 ++++++++++++++++++++++++++++++++++++
    include/asm-x86/unistd_32.h | 2 +
    include/linux/syscalls.h | 2 +
    init/Kconfig | 2 +
    kernel/sys_ni.c | 4 +++
    8 files changed, 69 insertions(+), 0 deletions(-)
    create mode 100644 checkpoint/Kconfig
    create mode 100644 checkpoint/Makefile
    create mode 100644 checkpoint/sys.c

    diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
    index d44395f..5543136 100644
    --- a/arch/x86/kernel/syscall_table_32.S
    +++ b/arch/x86/kernel/syscall_table_32.S
    @@ -332,3 +332,5 @@ ENTRY(sys_call_table)
    .long sys_dup3 /* 330 */
    .long sys_pipe2
    .long sys_inotify_init1
    + .long sys_checkpoint
    + .long sys_restart
    diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
    new file mode 100644
    index 0000000..ffaa635
    --- /dev/null
    +++ b/checkpoint/Kconfig
    @@ -0,0 +1,11 @@
    +config CHECKPOINT_RESTART
    + prompt "Enable checkpoint/restart (EXPERIMENTAL)"
    + def_bool n
    + depends on X86_32 && EXPERIMENTAL
    + help
    + Application checkpoint/restart is the ability to save the
    + state of a running application so that it can later resume
    + its execution from the time at which it was checkpointed.
    +
    + Turning this option on will enable checkpoint and restart
    + functionality in the kernel.
    diff --git a/checkpoint/Makefile b/checkpoint/Makefile
    new file mode 100644
    index 0000000..07d018b
    --- /dev/null
    +++ b/checkpoint/Makefile
    @@ -0,0 +1,5 @@
    +#
    +# Makefile for linux checkpoint/restart.
    +#
    +
    +obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
    diff --git a/checkpoint/sys.c b/checkpoint/sys.c
    new file mode 100644
    index 0000000..375129c
    --- /dev/null
    +++ b/checkpoint/sys.c
    @@ -0,0 +1,41 @@
    +/*
    + * Generic container checkpoint-restart
    + *
    + * Copyright (C) 2008 Oren Laadan
    + *
    + * This file is subject to the terms and conditions of the GNU General Public
    + * License. See the file COPYING in the main directory of the Linux
    + * distribution for more details.
    + */
    +
    +#include
    +#include
    +
    +/**
    + * sys_checkpoint - checkpoint a container
    + * @pid: pid of the container init(1) process
    + * @fd: file to which to dump the checkpoint image
    + * @flags: checkpoint operation flags
    + *
    + * Returns positive identifier on success, 0 when returning from restart
    + * or negative value on error
    + */
    +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
    +{
    + pr_debug("sys_checkpoint not implemented yet\n");
    + return -ENOSYS;
    +}
    +/**
    + * sys_restart - restart a container
    + * @crid: checkpoint image identifier
    + * @fd: file from which to read the checkpoint image
    + * @flags: restart operation flags
    + *
    + * Returns negative value on error, or otherwise returns in the realm
    + * of the original checkpoint
    + */
    +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
    +{
    + pr_debug("sys_restart not implemented yet\n");
    + return -ENOSYS;
    +}
    diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
    index d739467..88bdec4 100644
    --- a/include/asm-x86/unistd_32.h
    +++ b/include/asm-x86/unistd_32.h
    @@ -338,6 +338,8 @@
    #define __NR_dup3 330
    #define __NR_pipe2 331
    #define __NR_inotify_init1 332
    +#define __NR_checkpoint 333
    +#define __NR_restart 334

    #ifdef __KERNEL__

    diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
    index d6ff145..edc218b 100644
    --- a/include/linux/syscalls.h
    +++ b/include/linux/syscalls.h
    @@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
    asmlinkage long sys_eventfd(unsigned int count);
    asmlinkage long sys_eventfd2(unsigned int count, int flags);
    asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
    +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
    +asmlinkage long sys_restart(int crid, int fd, unsigned long flags);

    int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

    diff --git a/init/Kconfig b/init/Kconfig
    index c11da38..fd5f7bf 100644
    --- a/init/Kconfig
    +++ b/init/Kconfig
    @@ -779,6 +779,8 @@ config MARKERS

    source "arch/Kconfig"

    +source "checkpoint/Kconfig"
    +
    config PROC_PAGE_MONITOR
    default y
    depends on PROC_FS && MMU
    diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
    index 08d6e1b..ca95c25 100644
    --- a/kernel/sys_ni.c
    +++ b/kernel/sys_ni.c
    @@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
    cond_syscall(compat_sys_timerfd_gettime);
    cond_syscall(sys_eventfd);
    cond_syscall(sys_eventfd2);
    +
    +/* checkpoint/restart */
    +cond_syscall(sys_checkpoint);
    +cond_syscall(sys_restart);
    --
    1.5.4.3


  9. Re: [Devel] [RFC v8][PATCH 0/12] Kernel based checkpoint/restart

    Oren,

    Can you please check your git server. I can't update to the latest version:

    # git-pull
    fatal: The remote end hung up unexpectedly

    git-clone exits with the same error.

    Andrey


    On Thursday 30 October 2008 16:51 Oren Laadan wrote:
    > Basic checkpoint-restart [C/R]: v8 adds support for "external" checkpoint
    > and improves documentation. Older announcements below.
    >
    > The git tree tracking v8 (branch 'ckpt-v8'), and older versions, is at:
    > git://gorgona.ncl.cs.columbia.edu/pub/git/linux-cr-dev.git
    >
    > (or for the latest version -
    > git://gorgona.ncl.cs.columbia.edu/pub/git/linux-cr.git)
    >
    > We'd like to see these make their way into -mm.
    > As Dave Hansen put it:
    >
    > --
    > Why do we want it? It allows containers to be moved between physical
    > machines' kernels in the same way that VMWare can move VMs between
    > physical machines' hypervisors. There are currently at least two
    > out-of-tree implementations of this in the commercial world (IBM's
    > Metacluster and Parallels' OpenVZ/Virtuozzo) and several in the academic
    > world like Zap.
    >
    > Why do we need it in mainline now? Because we already have plenty of
    > out-of-tree ones, and want to know what an in-tree one will be like.
    > What *I* want right now is the extra review and scrutiny that comes with
    > a mainline submission to make sure we're not going in a direction
    > contrary to the community.
    >
    > This only supports pretty simple apps. But, I trust Ingo when he says:
    > >> > > Generally, if something works for simple apps already (in a robust,
    > >> > > compatible and supportable way) and users find it "very cool", then
    > >> > > support for more complex apps is not far in the future. but if you
    > >> > > want to support more complex apps straight away, it takes forever
    > >> > > and gets ugly.

    >
    > We're *certainly* going to be changing the ABI (which is the format of
    > the checkpoint). I'd like to follow the model that we used for
    > ext4-dev, which is to make it very clear that this is a development-only
    > feature for now. Perhaps we do that by making the interface only
    > available through debugfs or something similar for now. Or, reserving
    > the syscall numbers but require some runtime switch to be thrown before
    > they can be used. I'm open to suggestions here.
    > --
    >
    > Oren.
    >
    > --
    > Todo:
    > - Add support for x86-64 and improve ABI
    > - Refine or change syscall interface
    > - Extend to handle (multiple) tasks in a container
    > - Handle multiple namespaces in a container (e.g. save the filesystem
    > namespaces state with the file descriptors)
    > - Security (without CAPS_SYS_ADMIN files restore may fail)
    >
    > Changelog:
    >
    > [2008-Oct-29] v8:
    > - Support "external" checkpoint
    > - Include Dave Hansen's 'deny-checkpoint' patch
    > - Split docs in Documentation/checkpoint/..., and improve contents
    >
    > [2008-Oct-17] v7:
    > - Fix save/restore state of FPU
    > - Fix argument given to kunmap_atomic() in memory dump/restore
    >
    > [2008-Oct-07] v6:
    > - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    > (even though it's not really needed)
    > - Add assumptions and what's-missing to documentation
    > - Misc fixes and cleanups
    >
    > [2008-Sep-11] v5:
    > - Config is now 'def_bool n' by default
    > - Improve memory dump/restore code (following Dave Hansen's comments)
    > - Change dump format (and code) to allow chunks of
    > instead of one long list of each
    > - Fix use of follow_page() to avoid faulting in non-present pages
    > - Memory restore now maps user pages explicitly to copy data into them,
    > instead of reading directly to user space; got rid of mprotect_fixup()
    > - Remove preempt_disable() when restoring debug registers
    > - Rename headers files s/ckpt/checkpoint/
    > - Fix misc bugs in files dump/restore
    > - Fixes and cleanups on some error paths
    > - Fix misc coding style
    >
    > [2008-Sep-09] v4:
    > - Various fixes and clean-ups
    > - Fix calculation of hash table size
    > - Fix header structure alignment
    > - Use stand list_... for cr_pgarr
    >
    > [2008-Aug-29] v3:
    > - Various fixes and clean-ups
    > - Use standard hlist_... for hash table
    > - Better use of standard kmalloc/kfree
    >
    > [2008-Aug-20] v2:
    > - Added Dump and restore of open files (regular and directories)
    > - Added basic handling of shared objects, and improve handling of
    > 'parent tag' concept
    > - Added documentation
    > - Improved ABI, 64bit padding for image data
    > - Improved locking when saving/restoring memory
    > - Added UTS information to header (release, version, machine)
    > - Cleanup extraction of filename from a file pointer
    > - Refactor to allow easier reviewing
    > - Remove requirement for CAPS_SYS_ADMIN until we come up with a
    > security policy (this means that file restore may fail)
    > - Other cleanup and response to comments for v1
    >
    > [2008-Jul-29] v1:
    > - Initial version: support a single task with address space of only
    > private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    > argument and act on current process.
    >
    > --
    > At the containers mini-conference before OLS, the consensus among
    > all the stakeholders was that doing checkpoint/restart in the kernel
    > as much as possible was the best approach. With this approach, the
    > kernel will export a relatively opaque 'blob' of data to userspace
    > which can then be handed to the new kernel at restore time.
    >
    > This is different than what had been proposed before, which was
    > that a userspace application would be responsible for collecting
    > all of this data. We were also planning on adding lots of new,
    > little kernel interfaces for all of the things that needed
    > checkpointing. This unites those into a single, grand interface.
    >
    > The 'blob' will contain copies of select portions of kernel
    > structures such as vmas and mm_structs. It will also contain
    > copies of the actual memory that the process uses. Any changes
    > in this blob's format between kernel revisions can be handled by
    > an in-userspace conversion program.
    >
    > This is a similar approach to virtually all of the commercial
    > checkpoint/restart products out there, as well as the research
    > project Zap.
    >
    > These patches basically serialize internal kernel state and write
    > it out to a file descriptor. The checkpoint and restore are done
    > with two new system calls: sys_checkpoint and sys_restart.
    >
    > In this incarnation, they can only checkpoint and restore a
    > single task. The task's address space may consist of only private,
    > simple vma's - anonymous or file-mapped. The open files may consist
    > of only simple files and directories.
    > --
    >
    > _______________________________________________
    > Containers mailing list
    > Containers@lists.linux-foundation.org
    > https://lists.linux-foundation.org/m...nfo/containers
    >
    > _______________________________________________
    > Devel mailing list
    > Devel@openvz.org
    > https://openvz.org/mailman/listinfo/devel

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [Devel] [RFC v8][PATCH 0/12] Kernel based checkpoint/restart


    Andrey Mirkin wrote:
    > Oren,
    >
    > Can you please check your git server. I can't update to the latest version:
    >
    > # git-pull
    > fatal: The remote end hung up unexpectedly
    >
    > git-clone exits with the same error.
    >
    > Andrey


    Not sure what the problem was. It works now.

    Oren.


  11. Re: [RFC v8][PATCH 11/12] External checkpoint of a task other than ourself

    Quoting Oren Laadan (orenl@cs.columbia.edu):
    > Now we can do "external" checkpoint, i.e. act on another task.
    >
    > sys_checkpoint() now looks up the target pid (in our namespace) and
    > checkpoints that corresponding task. That task should be the root of
    > a container.
    >
    > sys_restart() remains the same, as the restart is always done in the
    > context of the restarting task.
    >
    > Signed-off-by: Oren Laadan


    Ok, I'm at a loss right now, and I'm not sure who to blame - Oren,
    Daniel, Matt, or someone else.

    In one terminal I do:

    lxc-execute -n nonet sleep 100

    then in another terminal do

    lxc-checkpoint -s -n nonet > /tmp/o

    or
    lxc-checkpoint -n nonet > /tmp/o
    followed by ctrl-c in the lxc-execute terminal.

    Without fail, the second time I do this (if not the first), I get
    a BUG (see below). It really does look like it should have
    nothing to do with the c/r patches, but I can't reproduce this
    any other way. I've tried doing

    lxc-freeze -n nonet; lxc-unfreeze -n nonet; lxc-stop -n nonet

    I've tried manually doing freeze, checkpoint, unfreeze of containers
    hand-crafted to look like what lxc-execute creates (two tasks in private
    namespaces with a private /proc mount). I've tried killing the container
    inits of populated containers (because it really looks like another
    task-exit-vs-container-cleanup race).

    I can't find any other way to reproduce this.

    (This is using Oren's patchset with freezer on top, and using
    a freshly pulled liblxc from cvs)

    -serge

    login: ------------[ cut here ]------------
    kernel BUG at fs/dcache.c:666!
    invalid opcode: 0000 [#1] SMP
    Modules linked in:

    Pid: 2963, comm: [vinit] Not tainted (2.6.27-rc9-00020-g2265283-dirty #344)
    EIP: 0060:[] EFLAGS: 00010292 CPU: 1
    EIP is at shrink_dcache_for_umount_subtree+0x14b/0x1fe
    EAX: 0000004e EBX: c04d22a7 ECX: 00000001 EDX: de6b5160
    ESI: df4040a0 EDI: ffffffff EBP: de779d7c ESP: de779d4c
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process [vinit] (pid: 2963, ti=de778000 task=de6b5160 task.ti=de778000)
    Stack: c04d1eb2 df4040a0 00000001 df404118 ffffffff c04d22a7 df830698 df404118
    00000004 df830400 c041e788 00000020 de779d88 c0188c3a df830400 de779d98
    c017b521 00000001 c054faf0 de779da4 c017b60b df830400 de779db0 c017b64d
    Call Trace:
    [] ? shrink_dcache_for_umount+0x2d/0x3a
    [] ? generic_shutdown_super+0x15/0xd3
    [] ? kill_anon_super+0xc/0x35
    [] ? kill_litter_super+0x19/0x1c
    [] ? deactivate_super+0x53/0x6b
    [] ? mntput_no_expire+0xc3/0xe7
    [] ? release_mounts+0x6b/0x7a
    [] ? __put_mnt_ns+0x62/0x70
    [] ? free_nsproxy+0x25/0x80
    [] ? switch_task_namespaces+0x44/0x49
    [] ? exit_task_namespaces+0xa/0xc
    [] ? do_exit+0x55f/0x6c9
    [] ? do_group_exit+0x5e/0x85
    [] ? get_signal_to_deliver+0x2ea/0x303
    [] ? do_notify_resume+0x6b/0x715
    [] ? lock_release_holdtime+0x1a/0x153
    [] ? trace_hardirqs_on+0xb/0xd
    [] ? trace_hardirqs_on+0xb/0xd
    [] ? remove_wait_queue+0x30/0x34
    [] ? do_wait+0x1d6/0x284
    [] ? audit_syscall_exit+0x2b1/0x2cc
    [] ? trace_hardirqs_on_caller+0xe1/0x102
    [] ? work_notifysig+0x13/0x19
    =======================
    Code: 1c 8b 18 8b 46 40 89 45 ec 8b 46 28 85 c0 74 03 8b 50 20 8d 81 98 02 00 00 50 53 57 ff 75 ec 52 56 68 b2 1e 4d c0 e8 9a 04 28 00 <0f> 0b 83 c4 1c eb fe 8b 7e 34 39 f7 75 04 31 ff eb 03 f0 ff 0f
    EIP: [] shrink_dcache_for_umount_subtree+0x14b/0x1fe SS:ESP 0068:de779d4c
    ---[ end trace 218551429ab07a44 ]---
    Fixing recursive fault but reboot is needed!


  12. Re: [RFC v8][PATCH 11/12] External checkpoint of a task other than ourself

    Quoting Oren Laadan (orenl@cs.columbia.edu):
    > Now we can do "external" checkpoint, i.e. act on another task.
    >
    > sys_checkpoint() now looks up the target pid (in our namespace) and
    > checkpoints that corresponding task. That task should be the root of
    > a container.
    >
    > sys_restart() remains the same, as the restart is always done in the
    > context of the restarting task.
    >
    > Signed-off-by: Oren Laadan


    (Have looked this up and down, and it looks good, so while it's the
    easiest piece of code to blame for the BUG() I'm getting, it doesn't
    seem possible that it is)

    Acked-by: Serge Hallyn

    thanks, Oren.

    -serge

  13. Re: [RFC v8][PATCH 09/12] Dump open file descriptors

    I'm still trying to figure out the cause of my BUG at dcache.c:666,
    so as I walk through the code a few more nitpicks:

    Quoting Oren Laadan (orenl@cs.columbia.edu):
    > +int cr_scan_fds(struct files_struct *files, int **fdtable)
    > +{
    > + struct fdtable *fdt;
    > + int *fds;
    > + int i, n = 0;
    > + int tot = CR_DEFAULT_FDTABLE;
    > +
    > + fds = kmalloc(tot * sizeof(*fds), GFP_KERNEL);
    > + if (!fds)
    > + return -ENOMEM;
    > +
    > + /*
    > + * We assume that the target task is frozen (or that we checkpoint
    > + * ourselves), so we can safely proceed after krealloc() from where
    > + * we left off; in the worst cases restart will fail.
    > + */
    > +
    > + spin_lock(&files->file_lock);
    > + rcu_read_lock();
    > + fdt = files_fdtable(files);
    > + for (i = 0; i < fdt->max_fds; i++) {
    > + if (!fcheck_files(files, i))
    > + continue;
    > + if (n == tot) {
    > + /*
    > + * fcheck_files() is safe with drop/re-acquire
    > + * of the lock, because it tests: fd < max_fds
    > + */
    > + spin_unlock(&files->file_lock);
    > + rcu_read_unlock();
    > + tot *= 2; /* won't overflow: kmalloc will fail */
    > + fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
    > + if (!fds) {
    > + kfree(fds);


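    If !fds, this kfree(fds) is kfree(NULL), which is a no-op: the original
    buffer is leaked, since krealloc() does not free it on failure.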

    > + return -ENOMEM;
    > + }
    > + rcu_read_lock();
    > + spin_lock(&files->file_lock);
    > + }
    > + fds[n++] = i;
    > + }
    > + rcu_read_unlock();
    > + spin_unlock(&files->file_lock);
    > +
    > + *fdtable = fds;
    > + return n;
    > +}
    > +static int
    > +cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
    > +{
    > + struct cr_hdr h;
    > + struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
    > + struct file *file = NULL;
    > + struct fdtable *fdt;
    > + int objref, new, ret;
    > + int coe = 0; /* avoid gcc warning */
    > +
    > + rcu_read_lock();
    > + fdt = files_fdtable(files);
    > + file = fcheck_files(files, fd);
    > + if (file) {
    > + coe = FD_ISSET(fd, fdt->close_on_exec);
    > + get_file(file);
    > + }
    > + rcu_read_unlock();
    > +
    > + /* sanity check (although this shouldn't happen) */
    > + if (!file) {
    > + ret = -EBADF;


    (As mentioned on irc - and probably already fixed in your v9 - you do an
    fput(NULL) in this case which will bomb)

    > + goto out;
    > + }
    > +
    > + new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
    > + cr_debug("fd %d objref %d file %p c-o-e %d)\n", fd, objref, file, coe);
    > +
    > + if (new < 0) {
    > + ret = new;
    > + goto out;
    > + }
    > +
    > + h.type = CR_HDR_FD_ENT;
    > + h.len = sizeof(*hh);
    > + h.parent = 0;
    > +
    > + hh->objref = objref;
    > + hh->fd = fd;
    > + hh->close_on_exec = coe;
    > +
    > + ret = cr_write_obj(ctx, &h, hh);
    > + if (ret < 0)
    > + goto out;
    > +
    > + /* new==1 if-and-only-if file was newly added to hash */
    > + if (new)
    > + ret = cr_write_fd_data(ctx, file, objref);
    > +
    > +out:
    > + cr_hbuf_put(ctx, sizeof(*hh));
    > + fput(file);
    > + return ret;
    > +}
    > +
    > +int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
    > +{
    > + struct cr_hdr h;
    > + struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
    > + struct files_struct *files;
    > + int *fdtable;
    > + int nfds, n, ret;
    > +
    > + h.type = CR_HDR_FILES;
    > + h.len = sizeof(*hh);
    > + h.parent = task_pid_vnr(t);
    > +
    > + files = get_files_struct(t);
    > +
    > + nfds = cr_scan_fds(files, &fdtable);
    > + if (nfds < 0) {
    > + put_files_struct(files);


    This error path needs a cr_hbuf_put() before returning.

    > + return nfds;
    > + }
    > +


    (Cause of my BUG() doesn't appear to be here )

    thanks,
    -serge
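
    The krealloc() comment above can be illustrated in userspace, where
    realloc() has the same failure behaviour: it returns NULL and leaves the
    original block allocated. The following is only a sketch, not the patch's
    code; scan_present() and its parameters are hypothetical names, and a
    plain byte array stands in for the fd table.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Userspace analogue of the fd-scan growth loop (hypothetical names).
 * Collects the indices of non-zero entries of `present` into a heap
 * array that doubles when full. Returns the number of entries stored,
 * or -1 on allocation failure, in which case nothing is leaked.
 */
static int scan_present(const unsigned char *present, int max, int **out)
{
    int tot = 4, n = 0;
    int *fds;

    assert(present && out);

    fds = malloc(tot * sizeof(*fds));
    if (!fds)
        return -1;

    for (int i = 0; i < max; i++) {
        if (!present[i])
            continue;
        if (n == tot) {
            /*
             * Keep the old pointer in `fds` until the call succeeds:
             * on failure realloc() returns NULL and the original
             * block is still allocated.
             */
            int *tmp = realloc(fds, 2 * (size_t)tot * sizeof(*tmp));
            if (!tmp) {
                free(fds);      /* frees the ORIGINAL buffer */
                return -1;
            }
            fds = tmp;
            tot *= 2;
        }
        fds[n++] = i;
    }

    *out = fds;
    return n;
}
```

    The point of the temporary is that the error path still holds a valid
    pointer to free. Assigning the result straight back to fds, as in the
    quoted hunk, loses that pointer when the allocation fails.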

  14. Re: [RFC v8][PATCH 05/12] x86 support for checkpoint/restart

    Hi Oren,

    I'm now trying to port your patchset to x86_64, and find a tiny
    inconsistency issue.


    On 2008-10-30 at 09:51 -0400, Oren Laadan wrote:
    > +/* dump the thread_struct of a given task */
    > +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
    > +{
    > + struct cr_hdr h;
    > + struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
    > + struct thread_struct *thread;
    > + struct desc_struct *desc;
    > + int ntls = 0;
    > + int n, ret;
    > +
    > + h.type = CR_HDR_THREAD;
    > + h.len = sizeof(*hh);
    > + h.parent = task_pid_vnr(t);
    > +
    > + thread = &t->thread;
    > +
    > + /* calculate no. of TLS entries that follow */
    > + desc = thread->tls_array;
    > + for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
    > + if (desc->a || desc->b)
    > + ntls++;
    > + }
    > +
    > + hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
    > + hh->sizeof_tls_array = sizeof(thread->tls_array);
    > + hh->ntls = ntls;
    > +
    > + ret = cr_write_obj(ctx, &h, hh);
    > + cr_hbuf_put(ctx, sizeof(*hh));
    > + if (ret < 0)
    > + return ret;


    Please add
    if (ntls == 0)
    return ret;
    because, in the restart phase, reading the TLS entries from the image
    file is skipped if hh->ntls == 0. Writing them unconditionally leaves
    the image out of sync with the reader, and restart may fail.

    > + /* for simplicity dump the entire array, cherry-pick upon restart */
    > + ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
    > +
    > + cr_debug("ntls %d\n", ntls);
    > +
    > + /* IGNORE RESTART BLOCKS FOR NOW ... */
    > +
    > + return ret;
    > +}

    (snip)
    > +/* read the thread_struct into the current task */
    > +int cr_read_thread(struct cr_ctx *ctx)
    > +{
    > + struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
    > + struct task_struct *t = current;
    > + struct thread_struct *thread = &t->thread;
    > + int parent, ret;
    > +
    > + parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
    > + if (parent < 0) {
    > + ret = parent;
    > + goto out;
    > + }
    > +
    > + ret = -EINVAL;
    > +
    > +#if 0 /* activate when containers are used */
    > + if (parent != task_pid_vnr(t))
    > + goto out;
    > +#endif
    > + cr_debug("ntls %d\n", hh->ntls);
    > +
    > + if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
    > + hh->sizeof_tls_array != sizeof(thread->tls_array) ||
    > + hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
    > + goto out;
    > +
    > + if (hh->ntls > 0) {
    > + struct desc_struct *desc;
    > + int size, cpu;
    > +
    > + /*
    > + * restore TLS by hand: why convert to struct user_desc if
    > + * sys_set_thread_entry() will convert it back ?
    > + */
    > +
    > + size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
    > + desc = kmalloc(size, GFP_KERNEL);
    > + if (!desc)
    > + return -ENOMEM;
    > +
    > + ret = cr_kread(ctx, desc, size);
    > + if (ret >= 0) {
    > + /*
    > + * FIX: add sanity checks (eg. that values makes
    > + * sense, that we don't overwrite old values, etc
    > + */
    > + cpu = get_cpu();
    > + memcpy(thread->tls_array, desc, size);
    > + load_TLS(thread, cpu);
    > + put_cpu();
    > + }
    > + kfree(desc);
    > + }
    > +
    > + ret = 0;
    > + out:
    > + cr_hbuf_put(ctx, sizeof(*hh));
    > + return ret;
    > +}



    Thanks,

    Masahiko.

    ---
    Masahiko Takahashi / m-takahashi@ex.jp.nec.com
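
    The inconsistency described above is a writer/reader symmetry problem:
    if the reader skips the TLS array when ntls == 0, the writer must skip
    it too, or every later record is read from the wrong offset. A minimal
    userspace sketch of the symmetric version follows. The struct layouts,
    NSLOTS, and all function names are hypothetical stand-ins, not the
    patchset's image format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSLOTS 3                    /* illustrative stand-in for GDT_ENTRY_TLS_ENTRIES */

struct hdr  { int ntls; };          /* hypothetical, reduced header */
struct slot { unsigned a, b; };     /* hypothetical, reduced descriptor */

/* Append len bytes at *off; returns 0, or -1 if the buffer is too small. */
static int put(unsigned char *buf, size_t cap, size_t *off,
               const void *p, size_t len)
{
    if (*off > cap || len > cap - *off)
        return -1;
    memcpy(buf + *off, p, len);
    *off += len;
    return 0;
}

/* Consume len bytes at *off; returns 0, or -1 if the data runs out. */
static int get(const unsigned char *buf, size_t cap, size_t *off,
               void *p, size_t len)
{
    if (*off > cap || len > cap - *off)
        return -1;
    memcpy(p, buf + *off, len);
    *off += len;
    return 0;
}

/* Writer: header first, then the slot array only when ntls > 0. */
static int write_thread(unsigned char *buf, size_t cap, size_t *off,
                        const struct slot slots[NSLOTS])
{
    struct hdr h = { 0 };

    for (int i = 0; i < NSLOTS; i++)
        if (slots[i].a || slots[i].b)
            h.ntls++;

    if (put(buf, cap, off, &h, sizeof(h)) < 0)
        return -1;
    if (h.ntls == 0)                /* the early return suggested above */
        return 0;
    return put(buf, cap, off, slots, NSLOTS * sizeof(*slots));
}

/* Reader: mirrors the writer decision for decision. */
static int read_thread(const unsigned char *buf, size_t cap, size_t *off,
                       struct slot slots[NSLOTS])
{
    struct hdr h;

    if (get(buf, cap, off, &h, sizeof(h)) < 0)
        return -1;
    if (h.ntls < 0 || h.ntls > NSLOTS)
        return -1;
    if (h.ntls == 0)
        return 0;
    return get(buf, cap, off, slots, NSLOTS * sizeof(*slots));
}
```

    With both sides taking the same branch on ntls, a record with no TLS
    entries occupies only the header, and the next record starts where the
    reader expects it.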


  15. Re: [RFC v8][PATCH 05/12] x86 support for checkpoint/restart



    Masahiko Takahashi wrote:
    > Hi Oren,
    >
    > I'm now trying to port your patchset to x86_64, and find a tiny
    > inconsistency issue.
    >
    >
    > On 2008-10-30 at 09:51 -0400, Oren Laadan wrote:
    >> +/* dump the thread_struct of a given task */
    >> +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
    >> +{
    >> + struct cr_hdr h;
    >> + struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
    >> + struct thread_struct *thread;
    >> + struct desc_struct *desc;
    >> + int ntls = 0;
    >> + int n, ret;
    >> +
    >> + h.type = CR_HDR_THREAD;
    >> + h.len = sizeof(*hh);
    >> + h.parent = task_pid_vnr(t);
    >> +
    >> + thread = &t->thread;
    >> +
    >> + /* calculate no. of TLS entries that follow */
    >> + desc = thread->tls_array;
    >> + for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
    >> + if (desc->a || desc->b)
    >> + ntls++;
    >> + }
    >> +
    >> + hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
    >> + hh->sizeof_tls_array = sizeof(thread->tls_array);
    >> + hh->ntls = ntls;
    >> +
    >> + ret = cr_write_obj(ctx, &h, hh);
    >> + cr_hbuf_put(ctx, sizeof(*hh));
    >> + if (ret < 0)
    >> + return ret;

    >
    > Please add
    > if (ntls == 0)
    > return ret;
    > because, in restart phase, reading TLS entries from the image file
    > is skipped if hh->ntls == 0, which may incur inconsistency and fail
    > to restart.
    >


    Will fix, thanks.

    Oren.



  16. Re: [RFC v8][PATCH 0/12] Kernel based checkpoint/restart

    Quoting Oren Laadan (orenl@cs.columbia.edu):
    > Basic checkpoint-restart [C/R]: v8 adds support for "external" checkpoint
    > and improves documentation. Older announcements below.


    The following test-program seems to reliably trigger a bug. Run it in a
    new set of namespaces, i.e.
    ns_exec -cmpiuU ./runme > /tmp/o
    then control-c it. The second time I do that, I get the dcache.c:666
    BUG().

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>

    #define __NR_checkpoint 333
    int main (int argc, char *argv[])
    {
            pid_t pid = getpid();
            int ret;

            close(0); close(2);
            ret = syscall (__NR_checkpoint, pid, STDOUT_FILENO, 0);

            if (ret < 0)
                    perror ("checkpoint");
            else
                    printf ("checkpoint id %d\n", ret);

            sleep(200);
            return (ret > 0 ? 0 : 1);
    }

    -serge

  17. Re: [RFC v8][PATCH 0/12] Kernel based checkpoint/restart

    Quoting Oren Laadan (orenl@cs.columbia.edu):
    > Basic checkpoint-restart [C/R]: v8 adds support for "external" checkpoint
    > and improves documentation. Older announcements below.


    Finally!

    From 8edab186b605f7dddd612e581204f1ad8fd766be Mon Sep 17 00:00:00 2001
    From: Serge Hallyn
    Date: Tue, 4 Nov 2008 15:28:01 -0600
    Subject: [PATCH 1/1] cr: fix use of __d_path()

    __d_path():
    1. should be used under dcache_lock
    2. can change root->{mnt,dentry} without changing refcounts
    The second point was the cause of my BUGs. The ctx->root was passed
    in, and do_checkpoint() had taken a path_get on the vfsroot. So now
    at cleanup it was doing path_put() using another mnt+dentry.

    (Why they are different, I'm not sure - but my guess would be that
    stdin or stdout is inherited from the parent task in parent mntns,
    hence file->mnt is different from root->mnt as it's a different
    namespace.)

    Signed-off-by: Serge Hallyn
    ---
    checkpoint/checkpoint.c | 13 ++++++++++++-
    1 files changed, 12 insertions(+), 1 deletions(-)

    diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
    index 173b637..7f0c1e7 100644
    --- a/checkpoint/checkpoint.c
    +++ b/checkpoint/checkpoint.c
    @@ -70,9 +70,20 @@ static char *
    cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
    {
    char *fname;
    + struct path root2;
    +
    + root2.mnt = root->mnt;
    + root2.dentry = root->dentry;

    BUG_ON(!buf);
    - fname = __d_path(path, root, buf, *n);
    + spin_lock(&dcache_lock);
    + fname = __d_path(path, &root2, buf, *n);
    + spin_unlock(&dcache_lock);
    + if (root2.mnt != root->mnt)
    + printk(KERN_NOTICE "%s: mnt changed\n", __func__);
    + if (root2.dentry != root->dentry)
    + printk(KERN_NOTICE "%s: dentry changed\n", __func__);
    + fname = buf+10;
    if (!IS_ERR(fname))
    *n = (buf + (*n) - fname);
    return fname;
    --
    1.5.6.3
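
    The core of the fix, passing a scratch copy to a helper that may
    overwrite its root argument, can be sketched in userspace. This is an
    illustration only: struct path_ref and both functions are hypothetical
    stand-ins, with plain integers in place of the mnt and dentry pointers,
    and the helper clobbers its argument unconditionally to model the
    behaviour the commit message attributes to __d_path().

```c
#include <assert.h>

/* Hypothetical stand-in for struct path: ints instead of pointers. */
struct path_ref {
    int mnt;
    int dentry;
};

/*
 * Stand-in for a helper that, like __d_path() as described above, may
 * overwrite *root with the place where its walk stopped, without
 * touching any reference counts.
 */
static void walk_may_clobber_root(struct path_ref *root,
                                  int stop_mnt, int stop_dentry)
{
    root->mnt = stop_mnt;
    root->dentry = stop_dentry;
}

/*
 * Call the helper on a scratch copy so that the caller's root, on which
 * a reference is held, is released later with the same values it was
 * acquired with. Returns 1 if the helper changed the root, else 0.
 */
static int resolve_with_scratch_root(const struct path_ref *root,
                                     int stop_mnt, int stop_dentry)
{
    struct path_ref scratch;

    assert(root);
    scratch = *root;
    walk_may_clobber_root(&scratch, stop_mnt, stop_dentry);

    return scratch.mnt != root->mnt || scratch.dentry != root->dentry;
}
```

    Because only the copy is modified, whatever cleanup later drops the
    reference on the caller's root sees the original mnt and dentry, which
    is the property the path_put() at cleanup depends on.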

