[PATCH 0/10] OpenVZ kernel based checkpointing/restart (v2) - Kernel


Thread: [PATCH 0/10] OpenVZ kernel based checkpointing/restart (v2)

  1. [PATCH 0/10] OpenVZ kernel based checkpointing/restart (v2)

    This patchset introduces kernel-based checkpointing/restart as it is
    implemented in the OpenVZ project. This version (v2) supports multiple
    processes with simple private memory and open files (regular files).

    Todo:
    - Create processes with the same PID during restart
    - Add support for x86-64
    - Add support for shared objects

    Changelog:

    18 Oct 2008 (v2):
    - Add support for multiple processes
    - Cleanup and bug fixes

    --

    This patchset introduces kernel-based checkpointing/restart as it is
    implemented in the OpenVZ project. This patchset has limited functionality
    and is able to checkpoint/restart only a single process. Recently Oren Laadan
    sent another kernel-based implementation of checkpoint/restart. The main
    differences between this patchset and Oren's patchset are:

    * In this patchset checkpointing is initiated from outside the process being
    checkpointed (right now we do not have a container, only namespaces); Oren's
    patchset performs checkpointing from the process context.

    * Restart in this patchset is initiated from a process that starts a new
    process (in new namespaces) with the saved state. Oren's patchset uses the
    same process from which restart was initiated and restores the saved state
    over it.

    * Checkpoint/restart functionality in this patchset is implemented as a kernel
    module.


    As checkpointing is not initiated from the process whose state should be
    saved, we must freeze the process before saving its state. Right now the
    Container Freezer from Matt Helsley can be used for this.

    This patchset only introduces the concept of how kernel-based
    checkpointing/restart can be implemented and is able to checkpoint/restart
    only a single process with simple VMAs.

    I've tried to split my patchset into small patches to make review easier.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [PATCH 03/10] Introduce context structure needed during checkpointing/restart

    Add functions for context allocation/destroy.
    Introduce functions to read/write image.
    Introduce image header and object header.

    Signed-off-by: Andrey Mirkin
    ---
    checkpoint/checkpoint.h | 40 +++++++++++++++
    checkpoint/cpt_image.h | 63 ++++++++++++++++++++++++
    checkpoint/sys.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-
    3 files changed, 225 insertions(+), 3 deletions(-)
    create mode 100644 checkpoint/cpt_image.h

    diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
    index 381a9bf..8ea73f5 100644
    --- a/checkpoint/checkpoint.h
    +++ b/checkpoint/checkpoint.h
    @@ -10,6 +10,8 @@
    *
    */

    +#include "cpt_image.h"
    +
    struct cpt_operations
    {
    struct module * owner;
    @@ -17,3 +19,41 @@ struct cpt_operations
    int (*restart)(int ctid, int fd, unsigned long flags);
    };
    extern struct cpt_operations cpt_ops;
    +
    +enum cpt_ctx_state
    +{
    + CPT_CTX_ERROR = -1,
    + CPT_CTX_IDLE = 0,
    + CPT_CTX_DUMPING,
    + CPT_CTX_UNDUMPING
    +};
    +
    +typedef struct cpt_context
    +{
    + pid_t pid; /* should be changed to ctid later */
    + int ctx_id; /* context id */
    + struct list_head ctx_list;
    + int refcount;
    + int ctx_state;
    + struct semaphore main_sem;
    +
    + int errno;
    +
    + struct file *file;
    + loff_t current_object;
    +
    + struct list_head object_array[CPT_OBJ_MAX];
    +
    + int (*write)(const void *addr, size_t count, struct cpt_context *ctx);
    + int (*read)(void *addr, size_t count, struct cpt_context *ctx);
    +} cpt_context_t;
    +
    +extern int debug_level;
    +
    +#define cpt_printk(lvl, fmt, args...) do { \
    + if (lvl <= debug_level) \
    + printk(fmt, ##args); \
    + } while (0)
    +
    +#define eprintk(a...) cpt_printk(1, "CPT ERR: " a)
    +#define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
    diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
    new file mode 100644
    index 0000000..0338dd0
    --- /dev/null
    +++ b/checkpoint/cpt_image.h
    @@ -0,0 +1,63 @@
    +/*
    + * Copyright (C) 2008 Parallels, Inc.
    + *
    + * Author: Andrey Mirkin
    + *
    + * This program is free software; you can redistribute it and/or
    + * modify it under the terms of the GNU General Public License as
    + * published by the Free Software Foundation, version 2 of the
    + * License.
    + *
    + */
    +
    +#ifndef __CPT_IMAGE_H_
    +#define __CPT_IMAGE_H_ 1
    +
    +enum _cpt_object_type
    +{
    + CPT_OBJ_TASK = 0,
    + CPT_OBJ_MAX,
    + /* The objects above are stored in memory while checkpointing */
    +
    + CPT_OBJ_HEAD = 1024,
    +};
    +
    +enum _cpt_content_type {
    + CPT_CONTENT_VOID,
    + CPT_CONTENT_ARRAY,
    + CPT_CONTENT_DATA,
    + CPT_CONTENT_NAME,
    + CPT_CONTENT_REF,
    + CPT_CONTENT_MAX
    +};
    +
    +#define CPT_SIGNATURE0 0x79
    +#define CPT_SIGNATURE1 0x1c
    +#define CPT_SIGNATURE2 0x01
    +#define CPT_SIGNATURE3 0x63
    +
    +struct cpt_head
    +{
    + __u8 cpt_signature[4]; /* Magic number */
    + __u32 cpt_hdrlen; /* Header length */
    + __u16 cpt_image_major; /* Format of this file */
    + __u16 cpt_image_minor; /* Format of this file */
    + __u16 cpt_image_sublevel; /* Format of this file */
    + __u16 cpt_image_extra; /* Format of this file */
    + __u16 cpt_arch; /* Architecture */
    +#define CPT_ARCH_I386 0
    + __u16 cpt_pad1;
    + __u32 cpt_pad2;
    + __u64 cpt_time; /* Time */
    +} __attribute__ ((aligned (8)));
    +
    +/* Common object header. */
    +struct cpt_object_hdr
    +{
    + __u64 cpt_len; /* Size of current chunk of data */
    + __u32 cpt_hdrlen; /* Size of header */
    + __u16 cpt_type; /* Type of object */
    + __u16 cpt_content; /* Content type: array, reference... */
    +} __attribute__ ((aligned (8)));
    +
    +#endif /* __CPT_IMAGE_H_ */
    diff --git a/checkpoint/sys.c b/checkpoint/sys.c
    index 010e4eb..a561a06 100644
    --- a/checkpoint/sys.c
    +++ b/checkpoint/sys.c
    @@ -13,21 +13,140 @@
    #include
    #include
    #include
    -#include
    +#include
    #include

    #include "checkpoint.h"
    +#include "cpt_image.h"

    MODULE_LICENSE("GPL");

    +/* Debug level, constant for now */
    +int debug_level = 1;
    +
    +static int file_write(const void *addr, size_t count, struct cpt_context *ctx)
    +{
    + mm_segment_t oldfs;
    + ssize_t err = -EBADF;
    + struct file *file = ctx->file;
    +
    + oldfs = get_fs(); set_fs(KERNEL_DS);
    + if (file)
    + err = file->f_op->write(file, addr, count, &file->f_pos);
    + set_fs(oldfs);
    + if (err != count)
    + return err >= 0 ? -EIO : err;
    + return 0;
    +}
    +
    +static int file_read(void *addr, size_t count, struct cpt_context *ctx)
    +{
    + mm_segment_t oldfs;
    + ssize_t err = -EBADF;
    + struct file *file = ctx->file;
    +
    + oldfs = get_fs(); set_fs(KERNEL_DS);
    + if (file)
    + err = file->f_op->read(file, addr, count, &file->f_pos);
    + set_fs(oldfs);
    + if (err != count)
    + return err >= 0 ? -EIO : err;
    + return 0;
    +}
    +
    +struct cpt_context * context_alloc(void)
    +{
    + struct cpt_context *ctx;
    + int i;
    +
    + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
    + if (!ctx)
    + return NULL;
    +
    + init_MUTEX(&ctx->main_sem);
    + ctx->refcount = 1;
    +
    + ctx->current_object = -1;
    + ctx->write = file_write;
    + ctx->read = file_read;
    + for (i = 0; i < CPT_OBJ_MAX; i++) {
    + INIT_LIST_HEAD(&ctx->object_array[i]);
    + }
    +
    + return ctx;
    +}
    +
    +void context_release(struct cpt_context *ctx)
    +{
    + ctx->ctx_state = CPT_CTX_ERROR;
    +
    + kfree(ctx);
    +}
    +
    +static void context_put(struct cpt_context *ctx)
    +{
    + if (!--ctx->refcount)
    + context_release(ctx);
    +}
    +
    static int checkpoint(pid_t pid, int fd, unsigned long flags)
    {
    - return -ENOSYS;
    + struct file *file;
    + struct cpt_context *ctx;
    + int err;
    +
    + err = -EBADF;
    + file = fget(fd);
    + if (!file)
    + goto out;
    +
    + err = -ENOMEM;
    + ctx = context_alloc();
    + if (!ctx)
    + goto out_file;
    +
    + ctx->file = file;
    + ctx->ctx_state = CPT_CTX_DUMPING;
    +
    + /* checkpoint */
    + err = -ENOSYS;
    +
    + context_put(ctx);
    +
    +out_file:
    + fput(file);
    +out:
    + return err;
    }

    static int restart(int ctid, int fd, unsigned long flags)
    {
    - return -ENOSYS;
    + struct file *file;
    + struct cpt_context *ctx;
    + int err;
    +
    + err = -EBADF;
    + file = fget(fd);
    + if (!file)
    + goto out;
    +
    + err = -ENOMEM;
    + ctx = context_alloc();
    + if (!ctx)
    + goto out_file;
    +
    + ctx->file = file;
    + ctx->ctx_state = CPT_CTX_UNDUMPING;
    +
    + /* restart */
    + err = -ENOSYS;
    +
    + context_put(ctx);
    +
    +out_file:
    + fput(file);
    +out:
    + return err;
    }

    static int __init init_cptrst(void)
    --
    1.5.6
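
The image layout introduced by this patch can be exercised outside the kernel. The following is a minimal user-space sketch, not code from the series: it mirrors `struct cpt_head` and `struct cpt_object_hdr` with `<stdint.h>` types in place of `__u8`/`__u16`/`__u32`/`__u64`. The helpers `cpt_head_init()`, `cpt_head_valid()` and `cpt_next_object()` are hypothetical names added for illustration, the version fields are simply left at zero because the patch does not assign them, and stepping by `cpt_len` assumes that `cpt_len` covers the object header plus its payload.

```c
#include <stdint.h>
#include <string.h>

#define CPT_SIGNATURE0 0x79
#define CPT_SIGNATURE1 0x1c
#define CPT_SIGNATURE2 0x01
#define CPT_SIGNATURE3 0x63
#define CPT_ARCH_I386  0

/* User-space mirror of the patch's image header (32 bytes, no hidden padding). */
struct cpt_head {
	uint8_t  cpt_signature[4];   /* Magic number */
	uint32_t cpt_hdrlen;         /* Header length */
	uint16_t cpt_image_major;
	uint16_t cpt_image_minor;
	uint16_t cpt_image_sublevel;
	uint16_t cpt_image_extra;
	uint16_t cpt_arch;           /* Architecture */
	uint16_t cpt_pad1;
	uint32_t cpt_pad2;
	uint64_t cpt_time;           /* Time */
} __attribute__((aligned(8)));

/* User-space mirror of the common object header (16 bytes). */
struct cpt_object_hdr {
	uint64_t cpt_len;            /* Size of current chunk of data */
	uint32_t cpt_hdrlen;         /* Size of header */
	uint16_t cpt_type;           /* Type of object */
	uint16_t cpt_content;        /* Content type */
} __attribute__((aligned(8)));

/* Hypothetical helper: fill a header the way a dumper might. */
static void cpt_head_init(struct cpt_head *h, uint64_t now)
{
	memset(h, 0, sizeof(*h));    /* version fields stay 0 in this sketch */
	h->cpt_signature[0] = CPT_SIGNATURE0;
	h->cpt_signature[1] = CPT_SIGNATURE1;
	h->cpt_signature[2] = CPT_SIGNATURE2;
	h->cpt_signature[3] = CPT_SIGNATURE3;
	h->cpt_hdrlen = sizeof(*h);
	h->cpt_arch = CPT_ARCH_I386;
	h->cpt_time = now;
}

/* Hypothetical helper: a restart path would reject images failing this. */
static int cpt_head_valid(const struct cpt_head *h)
{
	return h->cpt_signature[0] == CPT_SIGNATURE0 &&
	       h->cpt_signature[1] == CPT_SIGNATURE1 &&
	       h->cpt_signature[2] == CPT_SIGNATURE2 &&
	       h->cpt_signature[3] == CPT_SIGNATURE3 &&
	       h->cpt_hdrlen == sizeof(*h);
}

/* Assumption: cpt_len counts header plus payload, so it is the stride. */
static uint64_t cpt_next_object(uint64_t pos, const struct cpt_object_hdr *o)
{
	return pos + o->cpt_len;
}
```

Because every object starts with the same four fields, a reader can skip object types it does not understand by stepping `cpt_len` bytes forward.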


  3. [PATCH 05/10] Introduce function to dump process

    Functions to dump task struct, fpu state and registers are added.
    All IDs are saved from the POV of process (container) namespace.

    Signed-off-by: Andrey Mirkin
    ---
    checkpoint/Makefile | 2 +-
    checkpoint/checkpoint.c | 2 +-
    checkpoint/checkpoint.h | 1 +
    checkpoint/cpt_image.h | 123 ++++++++++++++++++++++++
    checkpoint/cpt_process.c | 236 ++++++++++++++++++++++++++++++++++++++++++++++
    5 files changed, 362 insertions(+), 2 deletions(-)
    create mode 100644 checkpoint/cpt_process.c

    diff --git a/checkpoint/Makefile b/checkpoint/Makefile
    index 173346b..457cc96 100644
    --- a/checkpoint/Makefile
    +++ b/checkpoint/Makefile
    @@ -2,4 +2,4 @@ obj-y += sys_core.o

    obj-$(CONFIG_CHECKPOINT) += cptrst.o

    -cptrst-objs := sys.o checkpoint.o
    +cptrst-objs := sys.o checkpoint.o cpt_process.o
    diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
    index c4bddce..aae198d 100644
    --- a/checkpoint/checkpoint.c
    +++ b/checkpoint/checkpoint.c
    @@ -70,7 +70,7 @@ int dump_container(struct cpt_context *ctx)

    /* Dump task here */
    if (!err)
    - err = -ENOSYS;
    + err = cpt_dump_task(root, ctx);

    out:
    ctx->nsproxy = NULL;
    diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
    index 6926aa2..9e46b10 100644
    --- a/checkpoint/checkpoint.h
    +++ b/checkpoint/checkpoint.h
    @@ -60,3 +60,4 @@ extern int debug_level;
    #define dprintk(a...) cpt_printk(1, "CPT DBG: " a)

    int dump_container(struct cpt_context *ctx);
    +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
    diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
    index 0338dd0..cddfe37 100644
    --- a/checkpoint/cpt_image.h
    +++ b/checkpoint/cpt_image.h
    @@ -13,6 +13,9 @@
    #ifndef __CPT_IMAGE_H_
    #define __CPT_IMAGE_H_ 1

    +#include
    +#include
    +
    enum _cpt_object_type
    {
    CPT_OBJ_TASK = 0,
    @@ -20,6 +23,8 @@ enum _cpt_object_type
    /* The objects above are stored in memory while checkpointing */

    CPT_OBJ_HEAD = 1024,
    + CPT_OBJ_X86_REGS,
    + CPT_OBJ_BITS,
    };

    enum _cpt_content_type {
    @@ -28,6 +33,8 @@ enum _cpt_content_type {
    CPT_CONTENT_DATA,
    CPT_CONTENT_NAME,
    CPT_CONTENT_REF,
    + CPT_CONTENT_X86_FPUSTATE,
    + CPT_CONTENT_X86_FPUSTATE_OLD,
    CPT_CONTENT_MAX
    };

    @@ -60,4 +67,120 @@ struct cpt_object_hdr
    __u16 cpt_content; /* Content type: array, reference... */
    } __attribute__ ((aligned (8)));

    +struct cpt_task_image {
    + __u64 cpt_len;
    + __u32 cpt_hdrlen;
    + __u16 cpt_type;
    + __u16 cpt_content;
    +
    + __u64 cpt_state;
    + __u64 cpt_flags;
    +#define CPT_PF_EXITING 0
    +#define CPT_PF_FORKNOEXEC 1
    +#define CPT_PF_SUPERPRIV 2
    +#define CPT_PF_DUMPCORE 3
    +#define CPT_PF_SIGNALED 4
    +#define CPT_PF_USED_MATH 5
    +
    + __u64 cpt_thrflags;
    + __u64 cpt_thrstatus;
    + __u32 cpt_pid;
    + __u32 cpt_tgid;
    + __u32 cpt_ppid;
    + __u32 cpt_rppid;
    + __u32 cpt_pgrp;
    + __u32 cpt_session;
    + __u32 cpt_old_pgrp;
    + __u32 cpt_leader;
    + __u64 cpt_set_tid;
    + __u64 cpt_clear_tid;
    + __u32 cpt_exit_code;
    + __u32 cpt_exit_signal;
    + __u32 cpt_pdeath_signal;
    + __u32 cpt_user;
    + __u32 cpt_uid;
    + __u32 cpt_euid;
    + __u32 cpt_suid;
    + __u32 cpt_fsuid;
    + __u32 cpt_gid;
    + __u32 cpt_egid;
    + __u32 cpt_sgid;
    + __u32 cpt_fsgid;
    + __u8 cpt_comm[TASK_COMM_LEN];
    + __u64 cpt_tls[GDT_ENTRY_TLS_ENTRIES];
    + __u64 cpt_utime;
    + __u64 cpt_stime;
    + __u64 cpt_utimescaled;
    + __u64 cpt_stimescaled;
    + __u64 cpt_gtime;
    + __u64 cpt_prev_utime;
    + __u64 cpt_prev_stime;
    + __u64 cpt_start_time;
    + __u64 cpt_real_start_time;
    + __u64 cpt_nvcsw;
    + __u64 cpt_nivcsw;
    + __u64 cpt_min_flt;
    + __u64 cpt_maj_flt;
    +} __attribute__ ((aligned (8)));
    +
    +struct cpt_obj_bits
    +{
    + __u64 cpt_len;
    + __u32 cpt_hdrlen;
    + __u16 cpt_type;
    + __u16 cpt_content;
    +
    + __u32 cpt_size;
    + __u32 __cpt_pad1;
    +} __attribute__ ((aligned (8)));
    +
    +#define CPT_SEG_ZERO 0
    +#define CPT_SEG_TLS1 1
    +#define CPT_SEG_TLS2 2
    +#define CPT_SEG_TLS3 3
    +#define CPT_SEG_USER32_DS 4
    +#define CPT_SEG_USER32_CS 5
    +#define CPT_SEG_USER64_DS 6
    +#define CPT_SEG_USER64_CS 7
    +#define CPT_SEG_LDT 256
    +
    +struct cpt_x86_regs
    +{
    + __u64 cpt_len;
    + __u32 cpt_hdrlen;
    + __u16 cpt_type;
    + __u16 cpt_content;
    +
    + __u32 cpt_debugreg[8];
    + __u32 cpt_gs;
    +
    + __u32 cpt_bx;
    + __u32 cpt_cx;
    + __u32 cpt_dx;
    + __u32 cpt_si;
    + __u32 cpt_di;
    + __u32 cpt_bp;
    + __u32 cpt_ax;
    + __u32 cpt_ds;
    + __u32 cpt_es;
    + __u32 cpt_fs;
    + __u32 cpt_orig_ax;
    + __u32 cpt_ip;
    + __u32 cpt_cs;
    + __u32 cpt_flags;
    + __u32 cpt_sp;
    + __u32 cpt_ss;
    +} __attribute__ ((aligned (8)));
    +
    +static inline __u64 cpt_timespec_export(struct timespec *tv)
    +{
    + return (((u64)tv->tv_sec) << 32) + tv->tv_nsec;
    +}
    +
    +static inline void cpt_timespec_import(struct timespec *tv, __u64 val)
    +{
    + tv->tv_sec = val >> 32;
    + tv->tv_nsec = (val & 0xFFFFFFFF);
    +}
    +
    #endif /* __CPT_IMAGE_H_ */
    diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
    new file mode 100644
    index 0000000..58f608d
    --- /dev/null
    +++ b/checkpoint/cpt_process.c
    @@ -0,0 +1,236 @@
    +/*
    + * Copyright (C) 2008 Parallels, Inc.
    + *
    + * Author: Andrey Mirkin
    + *
    + * This program is free software; you can redistribute it and/or
    + * modify it under the terms of the GNU General Public License as
    + * published by the Free Software Foundation, version 2 of the
    + * License.
    + *
    + */
    +
    +#include
    +#include
    +#include
    +#include
    +#include
    +
    +#include "checkpoint.h"
    +#include "cpt_image.h"
    +
    +static unsigned int encode_task_flags(unsigned int task_flags)
    +{
    + unsigned int flags = 0;
    +
    + if (task_flags & PF_EXITING)
    + flags |= (1 << CPT_PF_EXITING);
    + if (task_flags & PF_FORKNOEXEC)
    + flags |= (1 << CPT_PF_FORKNOEXEC);
    + if (task_flags & PF_SUPERPRIV)
    + flags |= (1 << CPT_PF_SUPERPRIV);
    + if (task_flags & PF_DUMPCORE)
    + flags |= (1 << CPT_PF_DUMPCORE);
    + if (task_flags & PF_SIGNALED)
    + flags |= (1 << CPT_PF_SIGNALED);
    + if (task_flags & PF_USED_MATH)
    + flags |= (1 << CPT_PF_USED_MATH);
    +
    + return flags;
    +
    +}
    +
    +int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
    +{
    + struct cpt_task_image *t;
    + int i;
    + int err;
    +
    + t = kzalloc(sizeof(*t), GFP_KERNEL);
    + if (!t)
    + return -ENOMEM;
    +
    + t->cpt_len = sizeof(*t);
    + t->cpt_type = CPT_OBJ_TASK;
    + t->cpt_hdrlen = sizeof(*t);
    + t->cpt_content = CPT_CONTENT_ARRAY;
    +
    + t->cpt_state = tsk->state;
    + t->cpt_flags = encode_task_flags(tsk->flags);
    + t->cpt_exit_code = tsk->exit_code;
    + t->cpt_exit_signal = tsk->exit_signal;
    + t->cpt_pdeath_signal = tsk->pdeath_signal;
    + t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns);
    + t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns);
    + t->cpt_ppid = tsk->parent ?
    + task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0;
    + t->cpt_rppid = tsk->real_parent ?
    + task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0;
    + t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns);
    + t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns);
    + t->cpt_old_pgrp = 0;
    + if (tsk->signal->tty_old_pgrp)
    + t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp);
    + t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) : 0;
    + t->cpt_utime = tsk->utime;
    + t->cpt_stime = tsk->stime;
    + t->cpt_utimescaled = tsk->utimescaled;
    + t->cpt_stimescaled = tsk->stimescaled;
    + t->cpt_gtime = tsk->gtime;
    + t->cpt_prev_utime = tsk->prev_utime;
    + t->cpt_prev_stime = tsk->prev_stime;
    + t->cpt_nvcsw = tsk->nvcsw;
    + t->cpt_nivcsw = tsk->nivcsw;
    + t->cpt_start_time = cpt_timespec_export(&tsk->start_time);
    + t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time);
    + t->cpt_min_flt = tsk->min_flt;
    + t->cpt_maj_flt = tsk->maj_flt;
    + memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN);
    + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
    + t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) +
    + tsk->thread.tls_array[i].a;
    + }
    + /* TODO: encode thread flags and status like task flags */
    + t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<
    + t->cpt_thrstatus = task_thread_info(tsk)->status;
    + t->cpt_user = tsk->user->uid;
    + t->cpt_uid = tsk->uid;
    + t->cpt_euid = tsk->euid;
    + t->cpt_suid = tsk->suid;
    + t->cpt_fsuid = tsk->fsuid;
    + t->cpt_gid = tsk->gid;
    + t->cpt_egid = tsk->egid;
    + t->cpt_sgid = tsk->sgid;
    + t->cpt_fsgid = tsk->fsgid;
    +
    + err = ctx->write(t, sizeof(*t), ctx);
    +
    + kfree(t);
    + return err;
    +}
    +
    +static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context *ctx)
    +{
    + struct cpt_obj_bits hdr;
    + int err;
    + int content;
    + unsigned long size;
    +
    + content = CPT_CONTENT_X86_FPUSTATE;
    + size = sizeof(struct i387_fxsave_struct);
    +#ifndef CONFIG_X86_64
    + if (!cpu_has_fxsr) {
    + size = sizeof(struct i387_fsave_struct);
    + content = CPT_CONTENT_X86_FPUSTATE_OLD;
    + }
    +#endif
    +
    + hdr.cpt_len = sizeof(hdr) + size;
    + hdr.cpt_type = CPT_OBJ_BITS;
    + hdr.cpt_hdrlen = sizeof(hdr);
    + hdr.cpt_content = content;
    + hdr.cpt_size = size;
    + err = ctx->write(&hdr, sizeof(hdr), ctx);
    + if (!err)
    + ctx->write(tsk->thread.xstate, size, ctx);
    + return err;
    +}
    +
    +static u32 encode_segment(u32 segreg)
    +{
    + segreg &= 0xFFFF;
    +
    + if (segreg == 0)
    + return CPT_SEG_ZERO;
    + if ((segreg & 3) != 3) {
    + eprintk("Invalid RPL of a segment reg %x\n", segreg);
    + return CPT_SEG_ZERO;
    + }
    +
    + /* LDT descriptor, it is just an index to LDT array */
    + if (segreg & 4)
    + return CPT_SEG_LDT + (segreg >> 3);
    +
    + /* TLS descriptor. */
    + if ((segreg >> 3) >= GDT_ENTRY_TLS_MIN &&
    + (segreg >> 3) <= GDT_ENTRY_TLS_MAX)
    + return CPT_SEG_TLS1 + ((segreg>>3) - GDT_ENTRY_TLS_MIN);
    +
    + /* One of standard desriptors */
    +#ifdef CONFIG_X86_64
    + if (segreg == __USER32_DS)
    + return CPT_SEG_USER32_DS;
    + if (segreg == __USER32_CS)
    + return CPT_SEG_USER32_CS;
    + if (segreg == __USER_DS)
    + return CPT_SEG_USER64_DS;
    + if (segreg == __USER_CS)
    + return CPT_SEG_USER64_CS;
    +#else
    + if (segreg == __USER_DS)
    + return CPT_SEG_USER32_DS;
    + if (segreg == __USER_CS)
    + return CPT_SEG_USER32_CS;
    +#endif
    + eprintk("Invalid segment reg %x\n", segreg);
    + return CPT_SEG_ZERO;
    +}
    +
    +static int cpt_dump_registers(struct task_struct *tsk, struct cpt_context *ctx)
    +{
    + struct cpt_x86_regs ri;
    + struct pt_regs *pt_regs;
    +
    + ri.cpt_len = sizeof(ri);
    + ri.cpt_type = CPT_OBJ_X86_REGS;
    + ri.cpt_hdrlen = sizeof(ri);
    + ri.cpt_content = CPT_CONTENT_VOID;
    +
    + ri.cpt_debugreg[0] = tsk->thread.debugreg0;
    + ri.cpt_debugreg[1] = tsk->thread.debugreg1;
    + ri.cpt_debugreg[2] = tsk->thread.debugreg2;
    + ri.cpt_debugreg[3] = tsk->thread.debugreg3;
    + ri.cpt_debugreg[4] = 0;
    + ri.cpt_debugreg[5] = 0;
    + ri.cpt_debugreg[6] = tsk->thread.debugreg6;
    + ri.cpt_debugreg[7] = tsk->thread.debugreg7;
    +
    + pt_regs = task_pt_regs(tsk);
    +
    + ri.cpt_fs = encode_segment(pt_regs->fs);
    + ri.cpt_gs = encode_segment(tsk->thread.gs);
    +
    + ri.cpt_bx = pt_regs->bx;
    + ri.cpt_cx = pt_regs->cx;
    + ri.cpt_dx = pt_regs->dx;
    + ri.cpt_si = pt_regs->si;
    + ri.cpt_di = pt_regs->di;
    + ri.cpt_bp = pt_regs->bp;
    + ri.cpt_ax = pt_regs->ax;
    + ri.cpt_ds = encode_segment(pt_regs->ds);
    + ri.cpt_es = encode_segment(pt_regs->es);
    + ri.cpt_orig_ax = pt_regs->orig_ax;
    + ri.cpt_ip = pt_regs->ip;
    + ri.cpt_cs = encode_segment(pt_regs->cs);
    + ri.cpt_flags = pt_regs->flags;
    + ri.cpt_sp = pt_regs->sp;
    + ri.cpt_ss = encode_segment(pt_regs->ss);
    +
    + return ctx->write(&ri, sizeof(ri), ctx);
    +}
    +
    +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
    +{
    + int err;
    +
    + err = cpt_dump_task_struct(tsk, ctx);
    +
    + /* Dump task mm */
    +
    + if (!err)
    + cpt_dump_fpustate(tsk, ctx);
    + if (!err)
    + cpt_dump_registers(tsk, ctx);
    +
    + return err;
    +}
    --
    1.5.6
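
The `cpt_timespec_export()`/`cpt_timespec_import()` pair in `cpt_image.h` packs a timespec into one 64-bit word: seconds in the upper 32 bits, nanoseconds in the lower 32. A user-space sketch of the same packing, assuming the userland `struct timespec` from `<time.h>` and `uint64_t` in place of `__u64`, shows the round trip. It works because `tv_nsec` is always below 10^9 and therefore fits in 32 bits; seconds values that do not fit in 32 bits would be truncated.

```c
#include <stdint.h>
#include <time.h>

/* Pack: seconds in bits 63..32, nanoseconds in bits 31..0. */
static inline uint64_t cpt_timespec_export(const struct timespec *tv)
{
	return (((uint64_t)tv->tv_sec) << 32) + (uint64_t)tv->tv_nsec;
}

/* Unpack the word produced by cpt_timespec_export(). */
static inline void cpt_timespec_import(struct timespec *tv, uint64_t val)
{
	tv->tv_sec = (time_t)(val >> 32);
	tv->tv_nsec = (long)(val & 0xFFFFFFFF);
}
```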


  4. Re: [PATCH 08/10] Introduce functions to restart a process

    Hello Andrey !


    > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
    > index 109792b..a4848a3 100644
    > --- a/arch/x86/kernel/entry_32.S
    > +++ b/arch/x86/kernel/entry_32.S
    > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
    > GET_THREAD_INFO(%ebp)
    > popl %eax
    > CFI_ADJUST_CFA_OFFSET -4
    > +ret_from_fork_tail:
    > pushl $0x0202 # Reset kernel eflags
    > CFI_ADJUST_CFA_OFFSET 4
    > popfl
    > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
    > CFI_ENDPROC
    > END(ret_from_fork)
    >
    > +ENTRY(i386_ret_from_resume)
    > + CFI_STARTPROC
    > + pushl %eax
    > + CFI_ADJUST_CFA_OFFSET 4
    > + call schedule_tail
    > + GET_THREAD_INFO(%ebp)
    > + popl %eax
    > + CFI_ADJUST_CFA_OFFSET -4
    > + movl (%esp), %eax
    > + testl %eax, %eax
    > + jz 1f
    > + pushl %esp
    > + call *%eax
    > + addl $4, %esp
    > +1:
    > + addl $256, %esp
    > + jmp ret_from_fork_tail
    > + CFI_ENDPROC
    > +END(i386_ret_from_resume)


    Could you explain why you need to do this

    call *%eax

    Is it related to the freezer code?

    C.

  5. Re: [PATCH 05/10] Introduce function to dump process

    Hi,

    On Sat, Oct 18, 2008 at 03:11:33AM +0400, Andrey Mirkin wrote:
    > Functions to dump task struct, fpu state and registers are added.
    > All IDs are saved from the POV of process (container) namespace.


    Just a couple of little comments, in case this series should keep on living.

    [...]

    > diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
    > new file mode 100644
    > index 0000000..58f608d
    > --- /dev/null
    > +++ b/checkpoint/cpt_process.c
    > @@ -0,0 +1,236 @@
    > +/*
    > + * Copyright (C) 2008 Parallels, Inc.
    > + *
    > + * Author: Andrey Mirkin
    > + *
    > + * This program is free software; you can redistribute it and/or
    > + * modify it under the terms of the GNU General Public License as
    > + * published by the Free Software Foundation, version 2 of the
    > + * License.
    > + *
    > + */
    > +
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +
    > +#include "checkpoint.h"
    > +#include "cpt_image.h"
    > +
    > +static unsigned int encode_task_flags(unsigned int task_flags)
    > +{
    > + unsigned int flags = 0;
    > +
    > + if (task_flags & PF_EXITING)
    > + flags |= (1 << CPT_PF_EXITING);
    > + if (task_flags & PF_FORKNOEXEC)
    > + flags |= (1 << CPT_PF_FORKNOEXEC);
    > + if (task_flags & PF_SUPERPRIV)
    > + flags |= (1 << CPT_PF_SUPERPRIV);
    > + if (task_flags & PF_DUMPCORE)
    > + flags |= (1 << CPT_PF_DUMPCORE);
    > + if (task_flags & PF_SIGNALED)
    > + flags |= (1 << CPT_PF_SIGNALED);
    > + if (task_flags & PF_USED_MATH)
    > + flags |= (1 << CPT_PF_USED_MATH);
    > +
    > + return flags;
    > +
    > +}
    > +
    > +int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
    > +{
    > + struct cpt_task_image *t;
    > + int i;
    > + int err;
    > +
    > + t = kzalloc(sizeof(*t), GFP_KERNEL);
    > + if (!t)
    > + return -ENOMEM;
    > +
    > + t->cpt_len = sizeof(*t);
    > + t->cpt_type = CPT_OBJ_TASK;
    > + t->cpt_hdrlen = sizeof(*t);
    > + t->cpt_content = CPT_CONTENT_ARRAY;
    > +
    > + t->cpt_state = tsk->state;
    > + t->cpt_flags = encode_task_flags(tsk->flags);
    > + t->cpt_exit_code = tsk->exit_code;
    > + t->cpt_exit_signal = tsk->exit_signal;
    > + t->cpt_pdeath_signal = tsk->pdeath_signal;
    > + t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns);
    > + t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns);
    > + t->cpt_ppid = tsk->parent ?
    > + task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0;
    > + t->cpt_rppid = tsk->real_parent ?
    > + task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0;
    > + t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns);
    > + t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns);
    > + t->cpt_old_pgrp = 0;
    > + if (tsk->signal->tty_old_pgrp)
    > + t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp);
    > + t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) :0;


    Why pid_vnr() here, and task_*_nr_ns() above? According to the introducing
    comment, I'd expect something like pid_nr_ns(tsk->signal->tty_old_pgrp,
    tsk->nsproxy->pid_ns), and the same for tsk->group_leader.

    IIUC, pid_vnr() is correct only if ctx->nsproxy->pid_ns == tsk->nsproxy->pid_ns
    == current->nsproxy->pid_ns, and I expect current to live in a different pid_ns.

    Comments?
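
The concern above can be illustrated with a toy user-space model of PID namespaces. This is only a sketch: the structures loosely follow the kernel's `struct pid`/`struct upid` idea of one number per namespace level, the `toy_*` names are hypothetical, and the "current" namespace is passed in explicitly rather than read from `current`.

```c
#include <stddef.h>

#define TOY_MAX_LEVEL 4

struct toy_pid_ns { int level; };                       /* depth in the ns tree */
struct toy_upid   { int nr; const struct toy_pid_ns *ns; };
struct toy_pid    { int level; struct toy_upid numbers[TOY_MAX_LEVEL]; };

/* Number of 'pid' as seen from namespace 'ns', or 0 if not visible there. */
static int toy_pid_nr_ns(const struct toy_pid *pid, const struct toy_pid_ns *ns)
{
	if (pid && ns->level <= pid->level &&
	    pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}

/* "Virtual" number: the number as seen from the caller's own namespace. */
static int toy_pid_vnr(const struct toy_pid *pid,
		       const struct toy_pid_ns *current_ns)
{
	return toy_pid_nr_ns(pid, current_ns);
}
```

In this model, a task that is PID 1 inside a container and PID 4321 on the host yields 1 when queried with the container's namespace, but 4321 when the caller lives in the host namespace and uses the `vnr` variant. If the checkpointing process is outside the container, the two families of helpers would store inconsistent IDs in the image.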

    > + t->cpt_utime = tsk->utime;
    > + t->cpt_stime = tsk->stime;
    > + t->cpt_utimescaled = tsk->utimescaled;
    > + t->cpt_stimescaled = tsk->stimescaled;
    > + t->cpt_gtime = tsk->gtime;
    > + t->cpt_prev_utime = tsk->prev_utime;
    > + t->cpt_prev_stime = tsk->prev_stime;
    > + t->cpt_nvcsw = tsk->nvcsw;
    > + t->cpt_nivcsw = tsk->nivcsw;
    > + t->cpt_start_time = cpt_timespec_export(&tsk->start_time);
    > + t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time);
    > + t->cpt_min_flt = tsk->min_flt;
    > + t->cpt_maj_flt = tsk->maj_flt;
    > + memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN);
    > + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
    > + t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) +
    > + tsk->thread.tls_array[i].a;
    > + }
    > + /* TODO: encode thread flags and status like task flags */
    > + t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<
    > + t->cpt_thrstatus = task_thread_info(tsk)->status;
    > + t->cpt_user = tsk->user->uid;
    > + t->cpt_uid = tsk->uid;
    > + t->cpt_euid = tsk->euid;
    > + t->cpt_suid = tsk->suid;
    > + t->cpt_fsuid = tsk->fsuid;
    > + t->cpt_gid = tsk->gid;
    > + t->cpt_egid = tsk->egid;
    > + t->cpt_sgid = tsk->sgid;
    > + t->cpt_fsgid = tsk->fsgid;
    > +
    > + err = ctx->write(t, sizeof(*t), ctx);
    > +
    > + kfree(t);
    > + return err;
    > +}
    > +
    > +static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context*ctx)
    > +{
    > + struct cpt_obj_bits hdr;
    > + int err;
    > + int content;
    > + unsigned long size;
    > +
    > + content = CPT_CONTENT_X86_FPUSTATE;
    > + size = sizeof(struct i387_fxsave_struct);
    > +#ifndef CONFIG_X86_64
    > + if (!cpu_has_fxsr) {
    > + size = sizeof(struct i387_fsave_struct);
    > + content = CPT_CONTENT_X86_FPUSTATE_OLD;
    > + }
    > +#endif
    > +
    > + hdr.cpt_len = sizeof(hdr) + size;
    > + hdr.cpt_type = CPT_OBJ_BITS;
    > + hdr.cpt_hdrlen = sizeof(hdr);
    > + hdr.cpt_content = content;
    > + hdr.cpt_size = size;
    > + err = ctx->write(&hdr, sizeof(hdr), ctx);
    > + if (!err)
    > + ctx->write(tsk->thread.xstate, size, ctx);


    Should check the error code of the line above, right?
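
One possible shape of that fix, sketched in user space: assign the result of the second write to `err` so a failed payload write is not silently dropped. The `struct sink` and `mock_write()` below are hypothetical stand-ins for `struct cpt_context` and `ctx->write`, and `dump_blob()` stands in for the header-then-payload sequence in `cpt_dump_fpustate()`; none of these names come from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical sink: counts writes and fails on a chosen call (0 = never). */
struct sink {
	int calls;
	int fail_on_call;
};

/* Stand-in for ctx->write(): returns 0 on success, negative errno on failure. */
static int mock_write(const void *addr, size_t count, struct sink *s)
{
	(void)addr;
	(void)count;
	s->calls++;
	return s->calls == s->fail_on_call ? -EIO : 0;
}

/* Write a header, then its payload, propagating the first error seen. */
static int dump_blob(const void *hdr, size_t hdr_len,
		     const void *data, size_t data_len, struct sink *s)
{
	int err;

	err = mock_write(hdr, hdr_len, s);
	if (!err)
		err = mock_write(data, data_len, s);  /* result is kept, not dropped */
	return err;
}
```

The same pattern would apply to the calls in `cpt_dump_task()`, where the return values of `cpt_dump_fpustate()` and `cpt_dump_registers()` are currently discarded.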

    > + return err;
    > +}
    > +
    > +static u32 encode_segment(u32 segreg)
    > +{
    > + segreg &= 0xFFFF;
    > +
    > + if (segreg == 0)
    > + return CPT_SEG_ZERO;
    > + if ((segreg & 3) != 3) {
    > + eprintk("Invalid RPL of a segment reg %x\n", segreg);
    > + return CPT_SEG_ZERO;
    > + }
    > +
    > + /* LDT descriptor, it is just an index to LDT array */
    > + if (segreg & 4)
    > + return CPT_SEG_LDT + (segreg >> 3);
    > +
    > + /* TLS descriptor. */
    > + if ((segreg >> 3) >= GDT_ENTRY_TLS_MIN &&
    > + (segreg >> 3) <= GDT_ENTRY_TLS_MAX)
    > + return CPT_SEG_TLS1 + ((segreg>>3) - GDT_ENTRY_TLS_MIN);
    > +
    > + /* One of standard desriptors */
    > +#ifdef CONFIG_X86_64
    > + if (segreg == __USER32_DS)
    > + return CPT_SEG_USER32_DS;
    > + if (segreg == __USER32_CS)
    > + return CPT_SEG_USER32_CS;
    > + if (segreg == __USER_DS)
    > + return CPT_SEG_USER64_DS;
    > + if (segreg == __USER_CS)
    > + return CPT_SEG_USER64_CS;
    > +#else
    > + if (segreg == __USER_DS)
    > + return CPT_SEG_USER32_DS;
    > + if (segreg == __USER_CS)
    > + return CPT_SEG_USER32_CS;
    > +#endif
    > + eprintk("Invalid segment reg %x\n", segreg);
    > + return CPT_SEG_ZERO;
    > +}
    > +
    > +static int cpt_dump_registers(struct task_struct *tsk, struct cpt_context *ctx)
    > +{
    > + struct cpt_x86_regs ri;
    > + struct pt_regs *pt_regs;
    > +
    > + ri.cpt_len = sizeof(ri);
    > + ri.cpt_type = CPT_OBJ_X86_REGS;
    > + ri.cpt_hdrlen = sizeof(ri);
    > + ri.cpt_content = CPT_CONTENT_VOID;
    > +
    > + ri.cpt_debugreg[0] = tsk->thread.debugreg0;
    > + ri.cpt_debugreg[1] = tsk->thread.debugreg1;
    > + ri.cpt_debugreg[2] = tsk->thread.debugreg2;
    > + ri.cpt_debugreg[3] = tsk->thread.debugreg3;
    > + ri.cpt_debugreg[4] = 0;
    > + ri.cpt_debugreg[5] = 0;
    > + ri.cpt_debugreg[6] = tsk->thread.debugreg6;
    > + ri.cpt_debugreg[7] = tsk->thread.debugreg7;
    > +
    > + pt_regs = task_pt_regs(tsk);
    > +
    > + ri.cpt_fs = encode_segment(pt_regs->fs);
    > + ri.cpt_gs = encode_segment(tsk->thread.gs);
    > +
    > + ri.cpt_bx = pt_regs->bx;
    > + ri.cpt_cx = pt_regs->cx;
    > + ri.cpt_dx = pt_regs->dx;
    > + ri.cpt_si = pt_regs->si;
    > + ri.cpt_di = pt_regs->di;
    > + ri.cpt_bp = pt_regs->bp;
    > + ri.cpt_ax = pt_regs->ax;
    > + ri.cpt_ds = encode_segment(pt_regs->ds);
    > + ri.cpt_es = encode_segment(pt_regs->es);
    > + ri.cpt_orig_ax = pt_regs->orig_ax;
    > + ri.cpt_ip = pt_regs->ip;
    > + ri.cpt_cs = encode_segment(pt_regs->cs);
    > + ri.cpt_flags = pt_regs->flags;
    > + ri.cpt_sp = pt_regs->sp;
    > + ri.cpt_ss = encode_segment(pt_regs->ss);
    > +
    > + return ctx->write(&ri, sizeof(ri), ctx);
    > +}
    > +
    > +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
    > +{
    > + int err;
    > +
    > + err = cpt_dump_task_struct(tsk, ctx);
    > +
    > + /* Dump task mm */
    > +
    > + if (!err)
    > + cpt_dump_fpustate(tsk, ctx);


    error checking...

    > + if (!err)
    > + cpt_dump_registers(tsk, ctx);


    error checking...

    > +
    > + return err;
    > +}
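
The "error checking" remarks on cpt_dump_fpustate() and cpt_dump_task() describe one pattern: each ctx->write() call and each dump step returns a status, and that status has to be assigned to err so the caller can see it. Below is a minimal userspace sketch of that pattern. It assumes a cut-down struct cpt_context with only a write callback; struct obj_hdr, limited_write() and the budget field are hypothetical stand-ins, not names from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the patch's struct cpt_context (assumption). */
struct cpt_context {
	int (*write)(const void *addr, size_t count, struct cpt_context *ctx);
	int budget;       /* hypothetical: number of writes that still succeed */
	size_t written;   /* bytes accepted so far */
};

/* Hypothetical object header; the real layout lives in cpt_image.h. */
struct obj_hdr {
	unsigned long len;
	unsigned int type;
};

/* Demo writer: succeeds while budget lasts, then returns -EIO. */
static int limited_write(const void *addr, size_t count, struct cpt_context *ctx)
{
	(void)addr;
	if (ctx->budget == 0)
		return -EIO;
	ctx->budget--;
	ctx->written += count;
	return 0;
}

/* Header, then payload. The status of the second write is kept. */
static int dump_object(const void *payload, size_t size, struct cpt_context *ctx)
{
	struct obj_hdr hdr = { .len = sizeof(hdr) + size, .type = 1 };
	int err;

	err = ctx->write(&hdr, sizeof(hdr), ctx);
	if (!err)
		err = ctx->write(payload, size, ctx);
	return err;
}

/* Chained steps in the style of cpt_dump_task(): stop at the first error. */
static int dump_task_sketch(struct cpt_context *ctx)
{
	static const char fpu[8] = {0}, regs[16] = {0};
	int err;

	err = dump_object(fpu, sizeof(fpu), ctx);
	if (!err)
		err = dump_object(regs, sizeof(regs), ctx);
	return err;
}
```

With a writer that fails on the last payload, dump_task_sketch() returns -EIO instead of 0, which is the behaviour the review asks for.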
    > --
    > 1.5.6
    >
    > --
    > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    > the body of a message to majordomo@vger.kernel.org
    > More majordomo info at http://vger.kernel.org/majordomo-info.html
    > Please read the FAQ at http://www.tux.org/lkml/


    Louis

    --
    Dr Louis Rilling Kerlabs
    Skype: louis.rilling Batiment Germanium
    Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
    http://www.kerlabs.com/ 35700 Rennes

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.4.6 (GNU/Linux)

    iD8DBQFI/GVCVKcRuvQ9Q1QRAj0HAKCPboBYRdTcoqVGaQpNm2CwyybUAAC gzXm0
    c4x6rxEAKKV9UpKc6PcVkV8=
    =0bvq
    -----END PGP SIGNATURE-----


  6. Re: [PATCH 06/10] Introduce functions to dump mm

    On Sat, Oct 18, 2008 at 03:11:34AM +0400, Andrey Mirkin wrote:
    > Functions to dump mm struct, VMAs and mm context are added.


    Again, a few little comments.

    [...]

    > diff --git a/checkpoint/cpt_mm.c b/checkpoint/cpt_mm.c
    > new file mode 100644
    > index 0000000..8a22c48
    > --- /dev/null
    > +++ b/checkpoint/cpt_mm.c
    > @@ -0,0 +1,434 @@
    > +/*
    > + * Copyright (C) 2008 Parallels, Inc.
    > + *
    > + * Authors: Andrey Mirkin
    > + *
    > + * This program is free software; you can redistribute it and/or
    > + * modify it under the terms of the GNU General Public License as
    > + * published by the Free Software Foundation, version 2 of the
    > + * License.
    > + *
    > + */
    > +
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +
    > +#include "checkpoint.h"
    > +#include "cpt_image.h"
    > +
    > +struct page_area
    > +{
    > + int type;
    > + unsigned long start;
    > + unsigned long end;
    > + pgoff_t pgoff;
    > + loff_t mm;
    > + __u64 list[16];
    > +};
    > +
    > +struct page_desc
    > +{
    > + int type;
    > + pgoff_t index;
    > + loff_t mm;
    > + int shared;
    > +};
    > +
    > +enum {
    > + PD_ABSENT,
    > + PD_COPY,
    > + PD_FUNKEY,
    > +};
    > +
    > +/* 0: page can be obtained from backstore, or still not mapped anonymous page,
    > + or something else, which does not require copy.
    > + 1: page requires copy
    > + 2: page requires copy but its content is zero. Quite useless.
    > + 3: wp page is shared after fork(). It is to be COWed when modified.
    > + 4: page is something unsupported... We copy it right now.
    > + */
    > +
    > +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
    > + struct page_desc *pdesc, cpt_context_t * ctx)
    > +{
    > + struct mm_struct *mm = vma->vm_mm;
    > + pgd_t *pgd;
    > + pud_t *pud;
    > + pmd_t *pmd;
    > + pte_t *ptep, pte;
    > + spinlock_t *ptl;
    > + struct page *pg = NULL;
    > + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
    > +
    > + pdesc->index = linear_index;
    > + pdesc->shared = 0;
    > + pdesc->mm = CPT_NULL;
    > +
    > + if (vma->vm_flags & VM_IO) {
    > + pdesc->type = PD_ABSENT;
    > + return;
    > + }
    > +
    > + pgd = pgd_offset(mm, addr);
    > + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    > + goto out_absent;
    > + pud = pud_offset(pgd, addr);
    > + if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    > + goto out_absent;
    > + pmd = pmd_offset(pud, addr);
    > + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    > + goto out_absent;
    > +#ifdef CONFIG_X86
    > + if (pmd_huge(*pmd)) {
    > + eprintk("page_huge\n");
    > + goto out_unsupported;
    > + }
    > +#endif
    > + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
    > + pte = *ptep;
    > + pte_unmap(ptep);
    > +
    > + if (pte_none(pte))
    > + goto out_absent_unlock;
    > +
    > + if ((pg = vm_normal_page(vma, addr, pte)) == NULL) {
    > + pdesc->type = PD_COPY;
    > + goto out_unlock;
    > + }
    > +
    > + get_page(pg);
    > + spin_unlock(ptl);
    > +
    > + if (pg->mapping && !PageAnon(pg)) {
    > + if (vma->vm_file == NULL) {
    > + eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr);
    > + goto out_unsupported;
    > + }
    > + if (vma->vm_file->f_mapping != pg->mapping) {
    > + eprintk("pg->mapping!=f_mapping: %08lx %p %p\n",
    > + addr, vma->vm_file->f_mapping, pg->mapping);
    > + goto out_unsupported;
    > + }
    > + pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
    > + /* Page is in backstore. For us it is like
    > + * it is not present.
    > + */
    > + goto out_absent;
    > + }
    > +
    > + if (PageReserved(pg)) {
    > + /* Special case: ZERO_PAGE is used, when an
    > + * anonymous page is accessed but not written. */
    > + if (pg == ZERO_PAGE(addr)) {
    > + if (pte_write(pte)) {
    > + eprintk("not funny already, writable ZERO_PAGE\n");
    > + goto out_unsupported;
    > + }
    > + /* Just copy it for now */
    > + pdesc->type = PD_COPY;
    > + goto out_put;
    > + }
    > + eprintk("reserved page %lu at %08lx\n", pg->index, addr);
    > + goto out_unsupported;
    > + }
    > +
    > + if (!pg->mapping) {
    > + eprintk("page without mapping at %08lx\n", addr);
    > + goto out_unsupported;
    > + }
    > +
    > + pdesc->type = PD_COPY;
    > +
    > +out_put:
    > + if (pg)
    > + put_page(pg);
    > + return;
    > +
    > +out_unlock:
    > + spin_unlock(ptl);
    > + goto out_put;
    > +
    > +out_absent_unlock:
    > + spin_unlock(ptl);
    > +
    > +out_absent:
    > + pdesc->type = PD_ABSENT;
    > + goto out_put;
    > +
    > +out_unsupported:
    > + pdesc->type = PD_FUNKEY;
    > + goto out_put;
    > +}
    > +
    > +static int count_vma_pages(struct vm_area_struct *vma, struct cpt_context *ctx)
    > +{
    > + unsigned long addr;
    > + int page_num = 0;
    > +
    > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
    > + struct page_desc pd;
    > +
    > + page_get_desc(vma, addr, &pd, ctx);
    > +
    > + if (pd.type != PD_COPY) {
    > + return -EINVAL;
    > + } else {
    > + page_num += 1;
    > + }
    > +
    > + }
    > + return page_num;
    > +}
    > +
    > +/* ATTN: We give "current" to get_user_pages(). This is wrong, but get_user_pages()
    > + * does not really need this thing. It just stores some page fault stats there.
    > + *
    > + * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache pages
    > + * before accessing vma.
    > + */
    > +static int dump_pages(struct vm_area_struct *vma, unsigned long start,
    > + unsigned long end, struct cpt_context *ctx)
    > +{
    > +#define MAX_PAGE_BATCH 16
    > + struct page *pg[MAX_PAGE_BATCH];
    > + int npages = (end - start)/PAGE_SIZE;
    > + int count = 0;
    > +
    > + while (count < npages) {
    > + int copy = npages - count;
    > + int n;
    > +
    > + if (copy > MAX_PAGE_BATCH)
    > + copy = MAX_PAGE_BATCH;
    > + n = get_user_pages(current, vma->vm_mm, start, copy,
    > + 0, 1, pg, NULL);
    > + if (n == copy) {
    > + int i;
    > + for (i=0; i<n; i++) {
    > + char *maddr = kmap(pg[i]);
    > + ctx->write(maddr, PAGE_SIZE, ctx);
    > + kunmap(pg[i]);


    There is no error handling in this inner loop. Should be fixed imho.

    > + }
    > + } else {
    > + eprintk("get_user_pages fault");
    > + for ( ; n > 0; n--)
    > + page_cache_release(pg[n-1]);
    > + return -EFAULT;
    > + }
    > + start += n*PAGE_SIZE;
    > + count += n;
    > + for ( ; n > 0; n--)
    > + page_cache_release(pg[n-1]);
    > + }
    > + return 0;
    > +}
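
The missing error handling in the inner loop of dump_pages() has one subtlety: when a write fails, the pages already pinned for the current batch still have to be released. Here is a userspace sketch of one way to structure that. It assumes pages can be modelled by a bare reference count; struct fake_page, limited_write() and PAGE_SIZE_SKETCH are hypothetical, and the two ref loops only stand in for get_user_pages() and page_cache_release().

```c
#include <errno.h>
#include <stddef.h>

#define MAX_PAGE_BATCH 16
#define PAGE_SIZE_SKETCH 4096   /* assumed page size for the demo */

/* Hypothetical page stand-in: only the reference count matters here. */
struct fake_page {
	int refs;
};

/* Simplified stand-in for the patch's struct cpt_context (assumption). */
struct cpt_context {
	int (*write)(const void *addr, size_t count, struct cpt_context *ctx);
	int budget;       /* hypothetical: number of writes that still succeed */
	size_t written;
};

/* Demo writer: succeeds while budget lasts, then returns -EIO. */
static int limited_write(const void *addr, size_t count, struct cpt_context *ctx)
{
	(void)addr;
	if (ctx->budget == 0)
		return -EIO;
	ctx->budget--;
	ctx->written += count;
	return 0;
}

/* Write npages pages in batches. Stop at the first failed write, but
 * drop every reference taken for the current batch before returning. */
static int dump_pages_sketch(struct fake_page *pages, int npages,
			     struct cpt_context *ctx)
{
	static const char data[PAGE_SIZE_SKETCH] = {0};
	int count = 0;
	int err = 0;

	while (count < npages && !err) {
		int n = npages - count;
		int i;

		if (n > MAX_PAGE_BATCH)
			n = MAX_PAGE_BATCH;

		/* Stands in for get_user_pages(): pin the batch. */
		for (i = 0; i < n; i++)
			pages[count + i].refs++;

		for (i = 0; i < n && !err; i++)
			err = ctx->write(data, sizeof(data), ctx);

		/* Stands in for page_cache_release(): unpin the whole batch. */
		for (i = 0; i < n; i++)
			pages[count + i].refs--;

		count += n;
	}
	return err;
}
```

Releasing the batch in one place, after the write loop, keeps the success and failure paths from diverging.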
    > +
    > +static int dump_page_block(struct vm_area_struct *vma,
    > + struct cpt_page_block *pgb,
    > + struct cpt_context *ctx)
    > +{
    > + int err;
    > + pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start;
    > + pgb->cpt_type = CPT_OBJ_PAGES;
    > + pgb->cpt_hdrlen = sizeof(*pgb);
    > + pgb->cpt_content = CPT_CONTENT_DATA;
    > +
    > + err = ctx->write(pgb, sizeof(*pgb), ctx);
    > + if (!err)
    > + err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx);
    > +
    > + return err;
    > +}
    > +
    > +static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx)
    > +{
    > + int len;
    > + char *path;
    > + char *buf;
    > + struct cpt_object_hdr o;
    > +
    > + buf = (char *)__get_free_page(GFP_KERNEL);
    > + if (!buf)
    > + return -ENOMEM;
    > +
    > + path = d_path(p, buf, PAGE_SIZE);
    > +
    > + if (IS_ERR(path)) {
    > + free_page((unsigned long)buf);
    > + return PTR_ERR(path);
    > + }
    > +
    > + len = buf + PAGE_SIZE - 1 - path;
    > + o.cpt_len = sizeof(o) + len + 1;
    > + o.cpt_type = CPT_OBJ_NAME;
    > + o.cpt_hdrlen = sizeof(o);
    > + o.cpt_content = CPT_CONTENT_NAME;
    > + path[len] = 0;
    > +
    > + ctx->write(&o, sizeof(o), ctx);
    > + ctx->write(path, len + 1, ctx);


    Error handling?

    > + free_page((unsigned long)buf);
    > +
    > + return 0;
    > +}
    > +
    > +static int dump_one_vma(struct mm_struct *mm,
    > + struct vm_area_struct *vma, struct cpt_context *ctx)
    > +{
    > + struct cpt_vma_image *v;
    > + unsigned long addr;
    > + int page_num;
    > + int err;
    > +
    > + v = kzalloc(sizeof(*v), GFP_KERNEL);
    > + if (!v)
    > + return -ENOMEM;
    > +
    > + v->cpt_len = sizeof(*v);
    > + v->cpt_type = CPT_OBJ_VMA;
    > + v->cpt_hdrlen = sizeof(*v);
    > + v->cpt_content = CPT_CONTENT_ARRAY;
    > +
    > + v->cpt_start = vma->vm_start;
    > + v->cpt_end = vma->vm_end;
    > + v->cpt_flags = vma->vm_flags;
    > + if (vma->vm_flags & VM_HUGETLB) {
    > + eprintk("huge TLB VMAs are still not supported\n");
    > + kfree(v);
    > + return -EINVAL;
    > + }
    > + v->cpt_pgprot = vma->vm_page_prot.pgprot;
    > + v->cpt_pgoff = vma->vm_pgoff;
    > + v->cpt_file = CPT_NULL;
    > + v->cpt_vma_type = CPT_VMA_TYPE_0;
    > +
    > + page_num = count_vma_pages(vma, ctx);
    > + if (page_num < 0) {
    > + kfree(v);
    > + return -EINVAL;
    > + }


    AFAICS, since count_vma_pages() only supports pages with PD_COPY, and since
    page_get_desc() tags text segment pages (file-mapped and not anonymous since
    not written to) as PD_ABSENT, no executable is checkpointable. So, where is the
    trick? Am I completely missing something about page mapping?

    > + v->cpt_page_num = page_num;
    > +
    > + if (vma->vm_file) {
    > + v->cpt_file = 0;
    > + v->cpt_vma_type = CPT_VMA_FILE;
    > + }
    > +
    > + ctx->write(v, sizeof(*v), ctx);


    Error handling?

    > + kfree(v);
    > +
    > + if (vma->vm_file) {
    > + err = cpt_dump_dentry(&vma->vm_file->f_path, ctx);
    > + if (err < 0)
    > + return err;
    > + }
    > +
    > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
    > + struct page_desc pd;
    > + struct cpt_page_block pgb;
    > +
    > + page_get_desc(vma, addr, &pd, ctx);
    > +
    > + if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) {
    > + eprintk("dump_one_vma: funkey page\n");
    > + return -EINVAL;
    > + }
    > +
    > + pgb.cpt_start = addr;
    > + pgb.cpt_end = addr + PAGE_SIZE;
    > + dump_page_block(vma, &pgb, ctx);


    Error handling?

    > + }
    > +
    > + return 0;
    > +}
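
Louis's question about count_vma_pages() can be reproduced without page tables by passing the page types in directly. The first function below mirrors the quoted logic. The second is a hypothetical relaxation that does not come from the patch or the thread; it is included only to show what tolerating PD_ABSENT pages would look like. The _sketch names and the sample arrays are assumptions.

```c
#include <errno.h>
#include <stddef.h>

/* Same three types as the enum in the quoted patch. */
enum { PD_ABSENT, PD_COPY, PD_FUNKEY };

/* Mirrors the quoted count_vma_pages(): any page that is not PD_COPY
 * makes the whole VMA fail with -EINVAL. */
static int count_vma_pages_sketch(const int *types, size_t npages)
{
	int page_num = 0;
	size_t i;

	for (i = 0; i < npages; i++) {
		if (types[i] != PD_COPY)
			return -EINVAL;
		page_num++;
	}
	return page_num;
}

/* Hypothetical alternative (not in the patch): count only pages that
 * need copying and skip PD_ABSENT ones, on the assumption that they can
 * be re-read from the backing file or faulted in again after restart. */
static int count_copy_pages_sketch(const int *types, size_t npages)
{
	int page_num = 0;
	size_t i;

	for (i = 0; i < npages; i++) {
		if (types[i] == PD_FUNKEY)
			return -EINVAL;
		if (types[i] == PD_COPY)
			page_num++;
	}
	return page_num;
}
```

Under the quoted logic a VMA with even one PD_ABSENT page is rejected, and page_get_desc() also returns PD_ABSENT for pte_none() entries, so unfaulted anonymous pages would be rejected the same way as clean text pages.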
    > +
    > +static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx)
    > +{
    > +#ifdef CONFIG_X86
    > + if (mm->context.size) {
    > + struct cpt_obj_bits b;
    > + int size;
    > +
    > + mutex_lock(&mm->context.lock);
    > +
    > + b.cpt_type = CPT_OBJ_BITS;
    > + b.cpt_len = sizeof(b);
    > + b.cpt_content = CPT_CONTENT_MM_CONTEXT;
    > + b.cpt_size = mm->context.size * LDT_ENTRY_SIZE;
    > +
    > + ctx->write(&b, sizeof(b), ctx);
    > +
    > + size = mm->context.size * LDT_ENTRY_SIZE;
    > +
    > + ctx->write(mm->context.ldt, size, ctx);


    Error handling?

    > +
    > + mutex_unlock(&mm->context.lock);
    > + }
    > +#endif
    > + return 0;
    > +}
    > +
    > +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx)
    > +{
    > + struct mm_struct *mm = tsk->mm;
    > + struct cpt_mm_image *v;
    > + struct vm_area_struct *vma;
    > + int err;
    > +
    > + v = kzalloc(sizeof(*v), GFP_KERNEL);
    > + if (!v)
    > + return -ENOMEM;
    > +
    > + v->cpt_len = sizeof(*v);
    > + v->cpt_type = CPT_OBJ_MM;
    > + v->cpt_hdrlen = sizeof(*v);
    > + v->cpt_content = CPT_CONTENT_ARRAY;
    > +
    > + down_read(&mm->mmap_sem);
    > + v->cpt_start_code = mm->start_code;
    > + v->cpt_end_code = mm->end_code;
    > + v->cpt_start_data = mm->start_data;
    > + v->cpt_end_data = mm->end_data;
    > + v->cpt_start_brk = mm->start_brk;
    > + v->cpt_brk = mm->brk;
    > + v->cpt_start_stack = mm->start_stack;
    > + v->cpt_start_arg = mm->arg_start;
    > + v->cpt_end_arg = mm->arg_end;
    > + v->cpt_start_env = mm->env_start;
    > + v->cpt_end_env = mm->env_end;
    > + v->cpt_def_flags = mm->def_flags;
    > + v->cpt_flags = mm->flags;
    > + v->cpt_map_count = mm->map_count;
    > +
    > + err = ctx->write(v, sizeof(*v), ctx);
    > + kfree(v);
    > +
    > + if (err) {
    > + eprintk("error during writing mm\n");
    > + goto err_up;
    > + }
    > +
    > + for (vma = mm->mmap; vma; vma = vma->vm_next) {
    > + if ((err = dump_one_vma(mm, vma, ctx)) != 0)
    > + goto err_up;
    > + }
    > +
    > + err = cpt_dump_mm_context(mm, ctx);
    > +
    > +err_up:
    > + up_read(&mm->mmap_sem);
    > +
    > + return err;
    > +}
    > +


    [...]

    Louis

    --
    Dr Louis Rilling Kerlabs
    Skype: louis.rilling Batiment Germanium
    Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
    http://www.kerlabs.com/ 35700 Rennes

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.4.6 (GNU/Linux)

    iD8DBQFI/HiqVKcRuvQ9Q1QRAtf1AKDStIl9KjEcB2vpy+zbEs84xdViygC gx+h9
    +rZgMWds/3WvkQvqH7TRK84=
    =sY0q
    -----END PGP SIGNATURE-----


  7. Re: [PATCH 08/10] Introduce functions to restart a process

    On Sat, Oct 18, 2008 at 03:11:36AM +0400, Andrey Mirkin wrote:
    > Functions to restart process, restore its state, fpu and registers are added.


    [...]

    > diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
    > new file mode 100644
    > index 0000000..b9f745e
    > --- /dev/null
    > +++ b/checkpoint/rst_process.c
    > @@ -0,0 +1,277 @@
    > +/*
    > + * Copyright (C) 2008 Parallels, Inc.
    > + *
    > + * Author: Andrey Mirkin
    > + *
    > + * This program is free software; you can redistribute it and/or
    > + * modify it under the terms of the GNU General Public License as
    > + * published by the Free Software Foundation, version 2 of the
    > + * License.
    > + *
    > + */
    > +
    > +#include
    > +#include
    > +#include
    > +#include
    > +#include
    > +
    > +#include "checkpoint.h"
    > +#include "cpt_image.h"
    > +
    > +#define HOOK_RESERVE 256
    > +
    > +struct thr_context {
    > + struct completion complete;
    > + int error;
    > + struct cpt_context *ctx;
    > + struct task_struct *tsk;
    > +};
    > +
    > +int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid)
    > +{
    > + pid_t ret;
    > +
    > + if (current->fs == NULL) {
    > + /* do_fork_pid() hates processes without fs, oopses. */
    > + eprintk("local_kernel_thread: current->fs==NULL\n");
    > + return -EINVAL;
    > + }
    > + if (!try_module_get(THIS_MODULE))
    > + return -EBUSY;
    > + ret = kernel_thread(fn, arg, flags);
    > + if (ret < 0)
    > + module_put(THIS_MODULE);
    > + return ret;
    > +}
    > +
    > +static unsigned int decode_task_flags(unsigned int task_flags)
    > +{
    > + unsigned int flags = 0;
    > +
    > + if (task_flags & (1 << CPT_PF_EXITING))
    > + flags |= PF_EXITING;
    > + if (task_flags & (1 << CPT_PF_FORKNOEXEC))
    > + flags |= PF_FORKNOEXEC;
    > + if (task_flags & (1 << CPT_PF_SUPERPRIV))
    > + flags |= PF_SUPERPRIV;
    > + if (task_flags & (1 << CPT_PF_DUMPCORE))
    > + flags |= PF_DUMPCORE;
    > + if (task_flags & (1 << CPT_PF_SIGNALED))
    > + flags |= PF_SIGNALED;
    > +
    > + return flags;
    > +
    > +}
    > +
    > +int rst_restore_task_struct(struct task_struct *tsk, struct cpt_task_image *ti,
    > + struct cpt_context *ctx)
    > +{
    > + int i;
    > +
    > + /* Restore only saved flags, comm and tls for now */
    > + tsk->flags = decode_task_flags(ti->cpt_flags);
    > + clear_tsk_thread_flag(tsk, TIF_FREEZE);
    > + memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN);
    > + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
    > + tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF;
    > + tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32;
    > + }
    > +
    > + return 0;
    > +}
    > +
    > +static int rst_restore_fpustate(struct task_struct *tsk, struct cpt_task_image *ti,
    > + struct cpt_context *ctx)
    > +{
    > + struct cpt_obj_bits hdr;
    > + int err;
    > + char *buf;
    > +
    > + clear_stopped_child_used_math(tsk);
    > +
    > + err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx);
    > + if (err < 0)
    > + return err;
    > +
    > + buf = kmalloc(hdr.cpt_size, GFP_KERNEL);
    > + if (!buf)
    > + return -ENOMEM;
    > +
    > + err = ctx->read(buf, hdr.cpt_size, ctx);
    > + if (err)
    > + goto out;
    > +
    > + if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) {
    > + memcpy(&tsk->thread.xstate, buf,
    > + sizeof(struct i387_fxsave_struct));
    > + if (ti->cpt_flags & CPT_PF_USED_MATH)
    > + set_stopped_child_used_math(tsk);
    > + }
    > +#ifndef CONFIG_X86_64
    > + else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD &&
    > + !cpu_has_fxsr) {
    > + memcpy(&tsk->thread.xstate, buf,
    > + sizeof(struct i387_fsave_struct));
    > + if (ti->cpt_flags & CPT_PF_USED_MATH)
    > + set_stopped_child_used_math(tsk);
    > + }
    > +#endif
    > +
    > +out:
    > + kfree(buf);
    > + return err;
    > +}
    > +
    > +static u32 decode_segment(u32 segid)
    > +{
    > + if (segid == CPT_SEG_ZERO)
    > + return 0;
    > +
    > + /* TLS descriptors */
    > + if (segid <= CPT_SEG_TLS3)
    > + return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3;
    > +
    > + /* LDT descriptor, it is just an index to LDT array */
    > + if (segid >= CPT_SEG_LDT)
    > + return ((segid - CPT_SEG_LDT) << 3) | 7;
    > +
    > + /* Check for one of standard descriptors */
    > + if (segid == CPT_SEG_USER32_DS)
    > + return __USER_DS;
    > + if (segid == CPT_SEG_USER32_CS)
    > + return __USER_CS;
    > +
    > + eprintk("Invalid segment reg %d\n", segid);
    > + return 0;
    > +}
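
decode_segment() is meant to invert encode_segment() from the dump side. Both rely on the x86 selector layout: bits 0-1 are the RPL, bit 2 selects the LDT, and the remaining bits are the table index. The userspace round trip below is a sketch under assumptions. The i386 values (GDT_ENTRY_TLS_MIN = 6, user CS = 0x73, user DS = 0x7b) are not quoted in the thread, and the CPT_SEG_* numbers are placeholders because cpt_image.h is not shown.

```c
#include <stdint.h>

/* Assumed i386 values (not shown in the thread). */
#define GDT_ENTRY_TLS_MIN 6
#define GDT_ENTRY_TLS_MAX 8
#define USER_CS 0x73   /* GDT entry 14, RPL 3 */
#define USER_DS 0x7b   /* GDT entry 15, RPL 3 */

/* Placeholder image ids; the real ones are defined in cpt_image.h. */
enum {
	CPT_SEG_ZERO,
	CPT_SEG_TLS1,
	CPT_SEG_TLS2,
	CPT_SEG_TLS3,
	CPT_SEG_USER32_DS,
	CPT_SEG_USER32_CS,
	CPT_SEG_LDT = 256,
};

/* 32-bit branch of the quoted encode_segment(), without the logging. */
static uint32_t encode_segment_sketch(uint32_t segreg)
{
	segreg &= 0xFFFF;
	if (segreg == 0 || (segreg & 3) != 3)   /* null or non-user RPL */
		return CPT_SEG_ZERO;
	if (segreg & 4)                         /* TI bit set: LDT selector */
		return CPT_SEG_LDT + (segreg >> 3);
	if ((segreg >> 3) >= GDT_ENTRY_TLS_MIN &&
	    (segreg >> 3) <= GDT_ENTRY_TLS_MAX)
		return CPT_SEG_TLS1 + ((segreg >> 3) - GDT_ENTRY_TLS_MIN);
	if (segreg == USER_DS)
		return CPT_SEG_USER32_DS;
	if (segreg == USER_CS)
		return CPT_SEG_USER32_CS;
	return CPT_SEG_ZERO;
}

/* Same branch order as the quoted decode_segment(). */
static uint32_t decode_segment_sketch(uint32_t segid)
{
	if (segid == CPT_SEG_ZERO)
		return 0;
	if (segid <= CPT_SEG_TLS3)
		return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3;
	if (segid >= CPT_SEG_LDT)
		return ((segid - CPT_SEG_LDT) << 3) | 7;
	if (segid == CPT_SEG_USER32_DS)
		return USER_DS;
	if (segid == CPT_SEG_USER32_CS)
		return USER_CS;
	return 0;
}
```

User CS/DS, TLS and LDT selectors survive the round trip. A selector with RPL other than 3 is encoded as CPT_SEG_ZERO and therefore comes back as 0.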
    > +
    > +static int rst_restore_registers(struct task_struct *tsk, struct cpt_context *ctx)
    > +{
    > + struct cpt_x86_regs ri;
    > + struct pt_regs *regs = task_pt_regs(tsk);
    > + extern char i386_ret_from_resume;
    > + int err;
    > +
    > + err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx);
    > + if (err < 0)
    > + return err;
    > +
    > + tsk->thread.sp = (unsigned long) regs;
    > + tsk->thread.sp0 = (unsigned long) (regs+1);
    > + tsk->thread.ip = (unsigned long) &i386_ret_from_resume;
    > +
    > + tsk->thread.gs = decode_segment(ri.cpt_gs);
    > + tsk->thread.debugreg0 = ri.cpt_debugreg[0];
    > + tsk->thread.debugreg1 = ri.cpt_debugreg[1];
    > + tsk->thread.debugreg2 = ri.cpt_debugreg[2];
    > + tsk->thread.debugreg3 = ri.cpt_debugreg[3];
    > + tsk->thread.debugreg6 = ri.cpt_debugreg[6];
    > + tsk->thread.debugreg7 = ri.cpt_debugreg[7];
    > +
    > + regs->bx = ri.cpt_bx;
    > + regs->cx = ri.cpt_cx;
    > + regs->dx = ri.cpt_dx;
    > + regs->si = ri.cpt_si;
    > + regs->di = ri.cpt_di;
    > + regs->bp = ri.cpt_bp;
    > + regs->ax = ri.cpt_ax;
    > + regs->orig_ax = ri.cpt_orig_ax;
    > + regs->ip = ri.cpt_ip;
    > + regs->flags = ri.cpt_flags;
    > + regs->sp = ri.cpt_sp;
    > +
    > + regs->cs = decode_segment(ri.cpt_cs);
    > + regs->ss = decode_segment(ri.cpt_ss);
    > + regs->ds = decode_segment(ri.cpt_ds);
    > + regs->es = decode_segment(ri.cpt_es);
    > + regs->fs = decode_segment(ri.cpt_fs);
    > +
    > + tsk->thread.sp -= HOOK_RESERVE;
    > + memset((void*)tsk->thread.sp, 0, HOOK_RESERVE);
    > +
    > + return 0;
    > +}
    > +
    > +static int restart_thread(void *arg)
    > +{
    > + struct thr_context *thr_ctx = arg;
    > + struct cpt_context *ctx;
    > + struct cpt_task_image *ti;
    > + int err;
    > +
    > + current->state = TASK_UNINTERRUPTIBLE;
    > +
    > + ctx = thr_ctx->ctx;
    > + ti = kmalloc(sizeof(*ti), GFP_KERNEL);
    > + if (!ti)
    > + return -ENOMEM;
    > +
    > + err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
    > + if (!err)
    > + err = rst_restore_task_struct(current, ti, ctx);
    > + /* Restore mm here */
    > + if (!err)
    > + err = rst_restore_fpustate(current, ti, ctx);
    > + if (!err)
    > + err = rst_restore_registers(current, ctx);
    > +
    > + thr_ctx->error = err;
    > + complete(&thr_ctx->complete);
    > +
    > + if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) {
    > + do_exit(ti->cpt_exit_code);
    > + } else {
    > + __set_current_state(TASK_UNINTERRUPTIBLE);
    > + }
    > +
    > + kfree(ti);
    > + schedule();
    > +
    > + eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm);
    > +
    > + module_put(THIS_MODULE);


    I'm sorry, I still do not understand what you are doing with this self-module
    pinning stuff. AFAICS, we should not get here unless there is a bug. So the
    checkpoint module ref count is never decreased, right?

    Could you detail what this self-module pinning is for? As I already told you,
    this looks like a bogus solution to avoid unloading the checkpoint module during
    restart.

    Thanks!

    Louis

    [...]

    --
    Dr Louis Rilling Kerlabs
    Skype: louis.rilling Batiment Germanium
    Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
    http://www.kerlabs.com/ 35700 Rennes

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.4.6 (GNU/Linux)

    iD8DBQFI/IbQVKcRuvQ9Q1QRAiGBAKC4Z8k/6JoKv8ZUPiICCENVm/3WzACff24m
    6BQS1z6dHnMA8QzMK7MMuk8=
    =iJ08
    -----END PGP SIGNATURE-----


  8. Re: [PATCH 02/10] Make checkpoint/restart functionality modular

    On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
    > +struct cpt_operations
    > +{
    > + struct module * owner;
    > + int (*checkpoint)(pid_t pid, int fd, unsigned long flags);
    > + int (*restart)(int ctid, int fd, unsigned long flags);
    > +};


    I think this is pretty useless obfuscation. We're not going to have
    pluggable checkpoint/restart implementations, are we? So, why bother
    putting it in a module?

    I can understand that it's easier to develop your code when it's in a
    module and you don't have to reboot the machine to load a new kernel
    each time. But, that's an individual developer thing, and doesn't
    belong in an upstream submission.

    I know people have given you a hard time for this in the past. Why is
    it still here?

    -- Dave


  9. Re: [PATCH 03/10] Introduce context structure needed during checkpointing/restart

    On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
    > +typedef struct cpt_context
    > +{
    > + pid_t pid; /* should be changed to ctid later */
    > + int ctx_id; /* context id */
    > + struct list_head ctx_list;
    > + int refcount;
    > + int ctx_state;
    > + struct semaphore main_sem;


    Does this really need to be a semaphore or is a mutex OK?

    > + int errno;


    Could you hold off on adding these things to the struct until the patch
    where they're actually used? It's hard to judge this without seeing
    what you do with it.

    > + struct file *file;
    > + loff_t current_object;
    > +
    > + struct list_head object_array[CPT_OBJ_MAX];
    > +
    > + int (*write)(const void *addr, size_t count, struct cpt_context *ctx);
    > + int (*read)(void *addr, size_t count, struct cpt_context *ctx);
    > +} cpt_context_t;


    Man, this is hard to review. I was going to try and make sure that your
    refcounting was right and atomic, but there's no use of it in this patch
    except for the initialization and accessor functions. Darn.

    > +extern int debug_level;


    I'm going to go out on a limb here and say that "debug_level" is
    probably a wee bit too generic of a variable name.

    > +#define cpt_printk(lvl, fmt, args...) do { \
    > + if (lvl <= debug_level) \
    > + printk(fmt, ##args); \
    > + } while (0)


    I think you can use pr_debug() here, too, just like Oren did.

    > +struct cpt_context * context_alloc(void)
    > +{
    > + struct cpt_context *ctx;
    > + int i;
    > +
    > + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
    > + if (!ctx)
    > + return NULL;
    > +
    > + init_MUTEX(&ctx->main_sem);
    > + ctx->refcount = 1;
    > +
    > + ctx->current_object = -1;
    > + ctx->write = file_write;
    > + ctx->read = file_read;
    > + for (i = 0; i < CPT_OBJ_MAX; i++) {
    > + INIT_LIST_HEAD(&ctx->object_array[i]);
    > + }
    > +
    > + return ctx;
    > +}
    > +
    > +void context_release(struct cpt_context *ctx)
    > +{
    > + ctx->ctx_state = CPT_CTX_ERROR;
    > +
    > + kfree(ctx);
    > +}
    > +
    > +static void context_put(struct cpt_context *ctx)
    > +{
    > + if (!--ctx->refcount)
    > + context_release(ctx);
    > +}
    > +
    > static int checkpoint(pid_t pid, int fd, unsigned long flags)
    > {
    > - return -ENOSYS;
    > + struct file *file;
    > + struct cpt_context *ctx;
    > + int err;
    > +
    > + err = -EBADF;
    > + file = fget(fd);
    > + if (!file)
    > + goto out;
    > +
    > + err = -ENOMEM;
    > + ctx = context_alloc();
    > + if (!ctx)
    > + goto out_file;
    > +
    > + ctx->file = file;
    > + ctx->ctx_state = CPT_CTX_DUMPING;
    > +
    > + /* checkpoint */
    > + err = -ENOSYS;
    > +
    > + context_put(ctx);
    > +
    > +out_file:
    > + fput(file);
    > +out:
    > + return err;
    > }


    So, where is context_get()? Is there only single-threaded access to the
    refcount? If so, why do we even need it? We should probably just use
    context_release() directly.

    If there is multithreaded access to context_put() or the refcount, then
    they're unsafe without additional locking.
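
For the multithreaded case Dave describes, the increment and the decrement-and-test each have to be a single atomic step. The sketch below shows this in userspace with C11 atomics; in-kernel code would more likely use atomic_t or struct kref. context_get() does not exist in the quoted patch, and the contexts_released counter is a hypothetical addition that only makes the release observable.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Cut-down context: only the reference count is modelled (assumption). */
struct cpt_context {
	atomic_int refcount;
};

static int contexts_released;   /* hypothetical counter, for the demo only */

static struct cpt_context *context_alloc(void)
{
	struct cpt_context *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return NULL;
	atomic_init(&ctx->refcount, 1);   /* the creator holds the first ref */
	return ctx;
}

static void context_get(struct cpt_context *ctx)
{
	atomic_fetch_add(&ctx->refcount, 1);
}

static void context_put(struct cpt_context *ctx)
{
	/* fetch_sub returns the previous value; 1 means this was the last ref. */
	if (atomic_fetch_sub(&ctx->refcount, 1) == 1) {
		contexts_released++;
		free(ctx);
	}
}
```

Because the decrement and the zero test happen in one atomic operation, two concurrent context_put() calls cannot both see the count reach zero.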

    -- Dave


  10. Re: [PATCH 02/10] Make checkpoint/restart functionality modular

    Quoting Andrey Mirkin (major@openvz.org):
    > A config option CONFIG_CHECKPOINT is introduced.
    > New structure cpt_operations is introduced to store pointers to
    > checkpoint/restart functions from module.


    I thought we had decided not to use a kernel module?

    Louis' comments on your patch 8 regarding module pinning suggest that
    details about using a module will detract from proper review of the core
    c/r functionality...

    -serge

  11. Re: [PATCH 06/10] Introduce functions to dump mm

    On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
    > +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
    > + struct page_desc *pdesc, cpt_context_t * ctx)
    > +{
    > + struct mm_struct *mm = vma->vm_mm;
    > + pgd_t *pgd;
    > + pud_t *pud;
    > + pmd_t *pmd;
    > + pte_t *ptep, pte;
    > + spinlock_t *ptl;
    > + struct page *pg = NULL;
    > + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
    > +
    > + pdesc->index = linear_index;
    > + pdesc->shared = 0;
    > + pdesc->mm = CPT_NULL;
    > +
    > + if (vma->vm_flags & VM_IO) {
    > + pdesc->type = PD_ABSENT;
    > + return;
    > + }
    > +
    > + pgd = pgd_offset(mm, addr);
    > + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    > + goto out_absent;
    > + pud = pud_offset(pgd, addr);
    > + if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    > + goto out_absent;
    > + pmd = pmd_offset(pud, addr);
    > + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    > + goto out_absent;
    > +#ifdef CONFIG_X86
    > + if (pmd_huge(*pmd)) {
    > + eprintk("page_huge\n");
    > + goto out_unsupported;
    > + }
    > +#endif


    I take it you know that this breaks with the 1GB (x86_64) and 16GB (ppc)
    large pages.

    Since you have the VMA, why not use is_vm_hugetlb_page()?

    -- Dave


  12. Re: [PATCH 05/10] Introduce function to dump process

    Quoting Andrey Mirkin (major@openvz.org):
    > + t->cpt_uid = tsk->uid;
    > + t->cpt_euid = tsk->euid;
    > + t->cpt_suid = tsk->suid;
    > + t->cpt_fsuid = tsk->fsuid;
    > + t->cpt_gid = tsk->gid;
    > + t->cpt_egid = tsk->egid;
    > + t->cpt_sgid = tsk->sgid;
    > + t->cpt_fsgid = tsk->fsgid;


    I don't see where any of these are restored. (Obviously, I wanted
    to think about how you're verifying the restarter's authorization
    to do so)

    thanks,
    -serge

  13. Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

    On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
    > Hello Andrey !
    >
    > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
    > > index 109792b..a4848a3 100644
    > > --- a/arch/x86/kernel/entry_32.S
    > > +++ b/arch/x86/kernel/entry_32.S
    > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
    > > GET_THREAD_INFO(%ebp)
    > > popl %eax
    > > CFI_ADJUST_CFA_OFFSET -4
    > > +ret_from_fork_tail:
    > > pushl $0x0202 # Reset kernel eflags
    > > CFI_ADJUST_CFA_OFFSET 4
    > > popfl
    > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
    > > CFI_ENDPROC
    > > END(ret_from_fork)
    > >
    > > +ENTRY(i386_ret_from_resume)
    > > + CFI_STARTPROC
    > > + pushl %eax
    > > + CFI_ADJUST_CFA_OFFSET 4
    > > + call schedule_tail
    > > + GET_THREAD_INFO(%ebp)
    > > + popl %eax
    > > + CFI_ADJUST_CFA_OFFSET -4
    > > + movl (%esp), %eax
    > > + testl %eax, %eax
    > > + jz 1f
    > > + pushl %esp
    > > + call *%eax
    > > + addl $4, %esp
    > > +1:
    > > + addl $256, %esp
    > > + jmp ret_from_fork_tail
    > > + CFI_ENDPROC
    > > +END(i386_ret_from_resume)

    >
    > Could you explain why you need to do this
    >
    > call *%eax
    >
    > is it related to the freezer code ?


    Actually, it is not related to the freezer code.
    It is needed to restart syscalls. Right now my patchset has no code
    that restarts a syscall, but I plan to add it later.
    In OpenVZ checkpointing we restart a syscall if the process was caught
    inside it during checkpointing.

    Andrey

  14. Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm

    On Monday 20 October 2008 16:25 Louis Rilling wrote:
    > On Sat, Oct 18, 2008 at 03:11:34AM +0400, Andrey Mirkin wrote:
    > > Functions to dump mm struct, VMAs and mm context are added.

    >
    > Again, a few little comments.
    >
    > [...]
    >
    > > diff --git a/checkpoint/cpt_mm.c b/checkpoint/cpt_mm.c
    > > new file mode 100644
    > > index 0000000..8a22c48
    > > --- /dev/null
    > > +++ b/checkpoint/cpt_mm.c
    > > @@ -0,0 +1,434 @@
    > > +/*
    > > + * Copyright (C) 2008 Parallels, Inc.
    > > + *
    > > + * Authors: Andrey Mirkin
    > > + *
    > > + * This program is free software; you can redistribute it and/or
    > > + * modify it under the terms of the GNU General Public License as
    > > + * published by the Free Software Foundation, version 2 of the
    > > + * License.
    > > + *
    > > + */
    > > +
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +#include
    > > +
    > > +#include "checkpoint.h"
    > > +#include "cpt_image.h"
    > > +
    > > +struct page_area
    > > +{
    > > + int type;
    > > + unsigned long start;
    > > + unsigned long end;
    > > + pgoff_t pgoff;
    > > + loff_t mm;
    > > + __u64 list[16];
    > > +};
    > > +
    > > +struct page_desc
    > > +{
    > > + int type;
    > > + pgoff_t index;
    > > + loff_t mm;
    > > + int shared;
    > > +};
    > > +
    > > +enum {
    > > + PD_ABSENT,
    > > + PD_COPY,
    > > + PD_FUNKEY,
    > > +};
    > > +
    > > +/* 0: page can be obtained from backstore, or still not mapped anonymous page,
    > > + or something else, which does not require copy.
    > > + 1: page requires copy
    > > + 2: page requires copy but its content is zero. Quite useless.
    > > + 3: wp page is shared after fork(). It is to be COWed when modified.
    > > + 4: page is something unsupported... We copy it right now.
    > > + */
    > > +
    > > +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
    > > + struct page_desc *pdesc, cpt_context_t * ctx)
    > > +{
    > > + struct mm_struct *mm = vma->vm_mm;
    > > + pgd_t *pgd;
    > > + pud_t *pud;
    > > + pmd_t *pmd;
    > > + pte_t *ptep, pte;
    > > + spinlock_t *ptl;
    > > + struct page *pg = NULL;
    > > + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
    > > +
    > > + pdesc->index = linear_index;
    > > + pdesc->shared = 0;
    > > + pdesc->mm = CPT_NULL;
    > > +
    > > + if (vma->vm_flags & VM_IO) {
    > > + pdesc->type = PD_ABSENT;
    > > + return;
    > > + }
    > > +
    > > + pgd = pgd_offset(mm, addr);
    > > + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
    > > + goto out_absent;
    > > + pud = pud_offset(pgd, addr);
    > > + if (pud_none(*pud) || unlikely(pud_bad(*pud)))
    > > + goto out_absent;
    > > + pmd = pmd_offset(pud, addr);
    > > + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
    > > + goto out_absent;
    > > +#ifdef CONFIG_X86
    > > + if (pmd_huge(*pmd)) {
    > > + eprintk("page_huge\n");
    > > + goto out_unsupported;
    > > + }
    > > +#endif
    > > + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
    > > + pte = *ptep;
    > > + pte_unmap(ptep);
    > > +
    > > + if (pte_none(pte))
    > > + goto out_absent_unlock;
    > > +
    > > + if ((pg = vm_normal_page(vma, addr, pte)) == NULL) {
    > > + pdesc->type = PD_COPY;
    > > + goto out_unlock;
    > > + }
    > > +
    > > + get_page(pg);
    > > + spin_unlock(ptl);
    > > +
    > > + if (pg->mapping && !PageAnon(pg)) {
    > > + if (vma->vm_file == NULL) {
    > > + eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr);
    > > + goto out_unsupported;
    > > + }
    > > + if (vma->vm_file->f_mapping != pg->mapping) {
    > > + eprintk("pg->mapping!=f_mapping: %08lx %p %p\n",
    > > + addr, vma->vm_file->f_mapping, pg->mapping);
    > > + goto out_unsupported;
    > > + }
    > > + pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
    > > + /* Page is in backstore. For us it is like
    > > + * it is not present.
    > > + */
    > > + goto out_absent;
    > > + }
    > > +
    > > + if (PageReserved(pg)) {
    > > + /* Special case: ZERO_PAGE is used, when an
    > > + * anonymous page is accessed but not written. */
    > > + if (pg == ZERO_PAGE(addr)) {
    > > + if (pte_write(pte)) {
    > > + eprintk("not funny already, writable ZERO_PAGE\n");
    > > + goto out_unsupported;
    > > + }
    > > + /* Just copy it for now */
    > > + pdesc->type = PD_COPY;
    > > + goto out_put;
    > > + }
    > > + eprintk("reserved page %lu at %08lx\n", pg->index, addr);
    > > + goto out_unsupported;
    > > + }
    > > +
    > > + if (!pg->mapping) {
    > > + eprintk("page without mapping at %08lx\n", addr);
    > > + goto out_unsupported;
    > > + }
    > > +
    > > + pdesc->type = PD_COPY;
    > > +
    > > +out_put:
    > > + if (pg)
    > > + put_page(pg);
    > > + return;
    > > +
    > > +out_unlock:
    > > + spin_unlock(ptl);
    > > + goto out_put;
    > > +
    > > +out_absent_unlock:
    > > + spin_unlock(ptl);
    > > +
    > > +out_absent:
    > > + pdesc->type = PD_ABSENT;
    > > + goto out_put;
    > > +
    > > +out_unsupported:
    > > + pdesc->type = PD_FUNKEY;
    > > + goto out_put;
    > > +}
    > > +
    > > +static int count_vma_pages(struct vm_area_struct *vma, struct cpt_context *ctx)
    > > +{
    > > + unsigned long addr;
    > > + int page_num = 0;
    > > +
    > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
    > > + struct page_desc pd;
    > > +
    > > + page_get_desc(vma, addr, &pd, ctx);
    > > +
    > > + if (pd.type != PD_COPY) {
    > > + return -EINVAL;
    > > + } else {
    > > + page_num += 1;
    > > + }
    > > +
    > > + }
    > > + return page_num;
    > > +}
    > > +
    > > +/* ATTN: We give "current" to get_user_pages(). This is wrong, but get_user_pages()
    > > + * does not really need this thing. It just stores some page fault stats there.
    > > + *
    > > + * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache pages
    > > + * before accessing vma.
    > > + */
    > > +static int dump_pages(struct vm_area_struct *vma, unsigned long start,
    > > + unsigned long end, struct cpt_context *ctx)
    > > +{
    > > +#define MAX_PAGE_BATCH 16
    > > + struct page *pg[MAX_PAGE_BATCH];
    > > + int npages = (end - start)/PAGE_SIZE;
    > > + int count = 0;
    > > +
    > > + while (count < npages) {
    > > + int copy = npages - count;
    > > + int n;
    > > +
    > > + if (copy > MAX_PAGE_BATCH)
    > > + copy = MAX_PAGE_BATCH;
    > > + n = get_user_pages(current, vma->vm_mm, start, copy,
    > > + 0, 1, pg, NULL);
    > > + if (n == copy) {
    > > + int i;
    > > + for (i=0; i<n; i++) {
    > > + char *maddr = kmap(pg[i]);
    > > + ctx->write(maddr, PAGE_SIZE, ctx);
    > > + kunmap(pg[i]);

    >
    > There is no error handling in this inner loop. Should be fixed imho.


    Yes, you are right. This is already fixed in the next version. I'll try to
    send it out shortly.
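
For readers following along, here is a userspace sketch of what error handling in that inner loop could look like. It is not the fix from the next version of the patchset, which is not shown in this thread. `struct mock_ctx`, `mock_write()` and `dump_batch()` are hypothetical stand-ins for the context's write callback and for the per-batch loop; the `released` counter models the page_cache_release() calls.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096

/* Hypothetical model of the checkpoint context and its write callback. */
struct mock_ctx {
    int (*write)(const void *buf, size_t len, struct mock_ctx *ctx);
    int writes_ok;  /* writes that still succeed before the mock fails */
    size_t bytes;   /* total bytes accepted so far */
};

/* Mock writer: succeeds writes_ok times, then reports -EIO. */
static int mock_write(const void *buf, size_t len, struct mock_ctx *ctx)
{
    (void)buf;
    if (ctx->writes_ok == 0)
        return -EIO;
    ctx->writes_ok--;
    ctx->bytes += len;
    return 0;
}

/* Write n pages and stop at the first failed write. All n pages of the
 * batch are released on both the success path and the error path. */
static int dump_batch(char (*pages)[MOCK_PAGE_SIZE], int n,
                      struct mock_ctx *ctx, int *released)
{
    int err = 0;
    int i;

    assert(ctx != NULL && ctx->write != NULL && released != NULL);

    for (i = 0; i < n; i++) {
        err = ctx->write(pages[i], MOCK_PAGE_SIZE, ctx);
        if (err)
            break;
    }

    *released += n;
    return err;
}
```

The point of the sketch is that a failed write ends the batch early and is reported to the caller, while every page reference taken for the batch is still dropped.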

    >
    > > + }
    > > + } else {
    > > + eprintk("get_user_pages fault");
    > > + for ( ; n > 0; n--)
    > > + page_cache_release(pg[n-1]);
    > > + return -EFAULT;
    > > + }
    > > + start += n*PAGE_SIZE;
    > > + count += n;
    > > + for ( ; n > 0; n--)
    > > + page_cache_release(pg[n-1]);
    > > + }
    > > + return 0;
    > > +}
    > > +
    > > +static int dump_page_block(struct vm_area_struct *vma,
    > > + struct cpt_page_block *pgb,
    > > + struct cpt_context *ctx)
    > > +{
    > > + int err;
    > > + pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start;
    > > + pgb->cpt_type = CPT_OBJ_PAGES;
    > > + pgb->cpt_hdrlen = sizeof(*pgb);
    > > + pgb->cpt_content = CPT_CONTENT_DATA;
    > > +
    > > + err = ctx->write(pgb, sizeof(*pgb), ctx);
    > > + if (!err)
    > > + err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx);
    > > +
    > > + return err;
    > > +}
    > > +
    > > +static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx)
    > > +{
    > > + int len;
    > > + char *path;
    > > + char *buf;
    > > + struct cpt_object_hdr o;
    > > +
    > > + buf = (char *)__get_free_page(GFP_KERNEL);
    > > + if (!buf)
    > > + return -ENOMEM;
    > > +
    > > + path = d_path(p, buf, PAGE_SIZE);
    > > +
    > > + if (IS_ERR(path)) {
    > > + free_page((unsigned long)buf);
    > > + return PTR_ERR(path);
    > > + }
    > > +
    > > + len = buf + PAGE_SIZE - 1 - path;
    > > + o.cpt_len = sizeof(o) + len + 1;
    > > + o.cpt_type = CPT_OBJ_NAME;
    > > + o.cpt_hdrlen = sizeof(o);
    > > + o.cpt_content = CPT_CONTENT_NAME;
    > > + path[len] = 0;
    > > +
    > > + ctx->write(&o, sizeof(o), ctx);
    > > + ctx->write(path, len + 1, ctx);

    >
    > Error handling?

    Will fix it, thanks.
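
A common shape for this kind of fix is a single cleanup label, so that the buffer is freed on every path and the first write error is returned. The userspace sketch below shows that pattern only; it is not the author's fix. `struct mock_hdr`, `struct mock_sink`, `sink_write()` and `dump_name()` are hypothetical names, and malloc()/free() stand in for __get_free_page()/free_page().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object header written before the name itself. */
struct mock_hdr {
    unsigned long len;
};

/* Mock output sink that can be told to fail after a number of writes. */
struct mock_sink {
    int fail_after; /* writes that succeed before failing; < 0 never fails */
    int calls;      /* successful writes so far */
};

static int sink_write(struct mock_sink *s, const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    if (s->fail_after >= 0 && s->calls >= s->fail_after)
        return -EIO;
    s->calls++;
    return 0;
}

/* Write a header followed by the NUL-terminated name. The temporary
 * buffer is freed on all paths and the first error is propagated. */
static int dump_name(struct mock_sink *s, const char *name)
{
    struct mock_hdr h;
    size_t len;
    char *buf;
    int err;

    assert(s != NULL && name != NULL);

    len = strlen(name);
    buf = malloc(len + 1);
    if (!buf)
        return -ENOMEM;
    memcpy(buf, name, len + 1);

    h.len = sizeof(h) + len + 1;
    err = sink_write(s, &h, sizeof(h));
    if (err)
        goto out_free;
    err = sink_write(s, buf, len + 1);

out_free:
    free(buf);
    return err;
}
```

The same structure would apply to the other unchecked ctx->write() calls pointed out in this review.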

    >
    > > + free_page((unsigned long)buf);
    > > +
    > > + return 0;
    > > +}
    > > +
    > > +static int dump_one_vma(struct mm_struct *mm,
    > > + struct vm_area_struct *vma, struct cpt_context *ctx)
    > > +{
    > > + struct cpt_vma_image *v;
    > > + unsigned long addr;
    > > + int page_num;
    > > + int err;
    > > +
    > > + v = kzalloc(sizeof(*v), GFP_KERNEL);
    > > + if (!v)
    > > + return -ENOMEM;
    > > +
    > > + v->cpt_len = sizeof(*v);
    > > + v->cpt_type = CPT_OBJ_VMA;
    > > + v->cpt_hdrlen = sizeof(*v);
    > > + v->cpt_content = CPT_CONTENT_ARRAY;
    > > +
    > > + v->cpt_start = vma->vm_start;
    > > + v->cpt_end = vma->vm_end;
    > > + v->cpt_flags = vma->vm_flags;
    > > + if (vma->vm_flags & VM_HUGETLB) {
    > > + eprintk("huge TLB VMAs are still not supported\n");
    > > + kfree(v);
    > > + return -EINVAL;
    > > + }
    > > + v->cpt_pgprot = vma->vm_page_prot.pgprot;
    > > + v->cpt_pgoff = vma->vm_pgoff;
    > > + v->cpt_file = CPT_NULL;
    > > + v->cpt_vma_type = CPT_VMA_TYPE_0;
    > > +
    > > + page_num = count_vma_pages(vma, ctx);
    > > + if (page_num < 0) {
    > > + kfree(v);
    > > + return -EINVAL;
    > > + }

    >
    > AFAICS, since count_vma_pages only supports pages with PD_COPY, and since
    > page_get_desc() tags text segment pages (file-mapped and not anonymous
    > since not written to) as PD_ABSENT, no executable is checkpointable. So,
    > where is the trick? Am I completely missing something about page mapping?

    Oh, that's my fault, I sent the wrong version. I will send a new patchset with
    the correct page mapping today.
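
To illustrate the problem Louis describes, here is a userspace sketch of a page count that tolerates PD_ABSENT pages and fails only on unsupported ones. It is a guess at the intended behaviour, not the corrected patch, which is not shown in this thread. The PD_* names come from the quoted patch; `enum mock_pd_type` and `count_pages_to_copy()` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Page classes as named in the quoted patch. */
enum mock_pd_type {
    PD_ABSENT,
    PD_COPY,
    PD_FUNKEY,
};

/* Count the pages whose content has to be written to the image.
 * PD_ABSENT pages (for example untouched file-backed text pages) are
 * skipped rather than treated as an error; anything else that is not
 * PD_COPY is unsupported and fails the whole VMA. */
static int count_pages_to_copy(const enum mock_pd_type *types, size_t n)
{
    size_t i;
    int page_num = 0;

    assert(types != NULL || n == 0);

    for (i = 0; i < n; i++) {
        switch (types[i]) {
        case PD_COPY:
            page_num++;
            break;
        case PD_ABSENT:
            break;
        default:
            return -EINVAL;
        }
    }
    return page_num;
}
```

With the version quoted above, the first PD_ABSENT page makes count_vma_pages() return -EINVAL, which is why a VMA mapping an executable's text could not be dumped.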

    >
    > > + v->cpt_page_num = page_num;
    > > +
    > > + if (vma->vm_file) {
    > > + v->cpt_file = 0;
    > > + v->cpt_vma_type = CPT_VMA_FILE;
    > > + }
    > > +
    > > + ctx->write(v, sizeof(*v), ctx);

    >
    > Error handling?

    Yes, will add it.

    >
    > > + kfree(v);
    > > +
    > > + if (vma->vm_file) {
    > > + err = cpt_dump_dentry(&vma->vm_file->f_path, ctx);
    > > + if (err < 0)
    > > + return err;
    > > + }
    > > +
    > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
    > > + struct page_desc pd;
    > > + struct cpt_page_block pgb;
    > > +
    > > + page_get_desc(vma, addr, &pd, ctx);
    > > +
    > > + if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) {
    > > + eprintk("dump_one_vma: funkey page\n");
    > > + return -EINVAL;
    > > + }
    > > +
    > > + pgb.cpt_start = addr;
    > > + pgb.cpt_end = addr + PAGE_SIZE;
    > > + dump_page_block(vma, &pgb, ctx);

    >
    > Error handling?

    Yep, thanks.

    >
    > > + }
    > > +
    > > + return 0;
    > > +}
    > > +
    > > +static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx)
    > > +{
    > > +#ifdef CONFIG_X86
    > > + if (mm->context.size) {
    > > + struct cpt_obj_bits b;
    > > + int size;
    > > +
    > > + mutex_lock(&mm->context.lock);
    > > +
    > > + b.cpt_type = CPT_OBJ_BITS;
    > > + b.cpt_len = sizeof(b);
    > > + b.cpt_content = CPT_CONTENT_MM_CONTEXT;
    > > + b.cpt_size = mm->context.size * LDT_ENTRY_SIZE;
    > > +
    > > + ctx->write(&b, sizeof(b), ctx);
    > > +
    > > + size = mm->context.size * LDT_ENTRY_SIZE;
    > > +
    > > + ctx->write(mm->context.ldt, size, ctx);

    >
    > Error handling?

    Thanks again!

    >
    > > +
    > > + mutex_unlock(&mm->context.lock);
    > > + }
    > > +#endif
    > > + return 0;
    > > +}
    > > +
    > > +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx)
    > > +{
    > > + struct mm_struct *mm = tsk->mm;
    > > + struct cpt_mm_image *v;
    > > + struct vm_area_struct *vma;
    > > + int err;
    > > +
    > > + v = kzalloc(sizeof(*v), GFP_KERNEL);
    > > + if (!v)
    > > + return -ENOMEM;
    > > +
    > > + v->cpt_len = sizeof(*v);
    > > + v->cpt_type = CPT_OBJ_MM;
    > > + v->cpt_hdrlen = sizeof(*v);
    > > + v->cpt_content = CPT_CONTENT_ARRAY;
    > > +
    > > + down_read(&mm->mmap_sem);
    > > + v->cpt_start_code = mm->start_code;
    > > + v->cpt_end_code = mm->end_code;
    > > + v->cpt_start_data = mm->start_data;
    > > + v->cpt_end_data = mm->end_data;
    > > + v->cpt_start_brk = mm->start_brk;
    > > + v->cpt_brk = mm->brk;
    > > + v->cpt_start_stack = mm->start_stack;
    > > + v->cpt_start_arg = mm->arg_start;
    > > + v->cpt_end_arg = mm->arg_end;
    > > + v->cpt_start_env = mm->env_start;
    > > + v->cpt_end_env = mm->env_end;
    > > + v->cpt_def_flags = mm->def_flags;
    > > + v->cpt_flags = mm->flags;
    > > + v->cpt_map_count = mm->map_count;
    > > +
    > > + err = ctx->write(v, sizeof(*v), ctx);
    > > + kfree(v);
    > > +
    > > + if (err) {
    > > + eprintk("error during writing mm\n");
    > > + goto err_up;
    > > + }
    > > +
    > > + for (vma = mm->mmap; vma; vma = vma->vm_next) {
    > > + if ((err = dump_one_vma(mm, vma, ctx)) != 0)
    > > + goto err_up;
    > > + }
    > > +
    > > + err = cpt_dump_mm_context(mm, ctx);
    > > +
    > > +err_up:
    > > + up_read(&mm->mmap_sem);
    > > +
    > > + return err;
    > > +}
    > > +

    >
    > [...]
    >
    > Louis


  15. Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

    On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
    > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
    > > Hello Andrey !
    > >
    > > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
    > > > index 109792b..a4848a3 100644
    > > > --- a/arch/x86/kernel/entry_32.S
    > > > +++ b/arch/x86/kernel/entry_32.S
    > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
    > > > GET_THREAD_INFO(%ebp)
    > > > popl %eax
    > > > CFI_ADJUST_CFA_OFFSET -4
    > > > +ret_from_fork_tail:
    > > > pushl $0x0202 # Reset kernel eflags
    > > > CFI_ADJUST_CFA_OFFSET 4
    > > > popfl
    > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
    > > > CFI_ENDPROC
    > > > END(ret_from_fork)
    > > >
    > > > +ENTRY(i386_ret_from_resume)
    > > > + CFI_STARTPROC
    > > > + pushl %eax
    > > > + CFI_ADJUST_CFA_OFFSET 4
    > > > + call schedule_tail
    > > > + GET_THREAD_INFO(%ebp)
    > > > + popl %eax
    > > > + CFI_ADJUST_CFA_OFFSET -4
    > > > + movl (%esp), %eax
    > > > + testl %eax, %eax
    > > > + jz 1f
    > > > + pushl %esp
    > > > + call *%eax
    > > > + addl $4, %esp
    > > > +1:
    > > > + addl $256, %esp
    > > > + jmp ret_from_fork_tail
    > > > + CFI_ENDPROC
    > > > +END(i386_ret_from_resume)

    > >
    > > Could you explain why you need to do this
    > >
    > > call *%eax
    > >
    > > is it related to the freezer code ?

    >
    > It is not related to the freezer code actually.
    > That is needed to restart syscalls. Right now I don't have a code in my
    > patchset which restarts a syscall, but later I plan to add it.
    > In OpenVZ checkpointing we restart syscalls if process was caught in syscall
    > during checkpointing.


    Do you checkpoint uninterruptible syscalls as well? If only interruptible
    syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
    ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
    this syscall does not restart itself when interrupted, and well, this is life,
    userspace just sees -EINTR, which is allowed by the syscall spec.
    Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
    Kerrighed and this works.

    Louis

    --
    Dr Louis Rilling Kerlabs
    Skype: louis.rilling Batiment Germanium
    Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
    http://www.kerlabs.com/ 35700 Rennes



  16. Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

    On Wed, 2008-10-22 at 11:25 +0200, Louis Rilling wrote:
    > Do you checkpoint uninterruptible syscalls as well? If only interruptible
    > syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
    > ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
    > this syscall does not restart itself when interrupted, and well, this is life,
    > userspace just sees -EINTR, which is allowed by the syscall spec.
    > Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
    > Kerrighed and this works.
    >
    > Louis
    >


    I don't know Kerrighed internals, but I understand you perform the checkpoint
    from a signal handler. Right? This approach has a huge benefit: the
    signal handling code does all the arch-dependent work of saving registers
    in user memory.

    --
    Gregory Kurz gkurz@fr.ibm.com
    Software Engineer @ IBM/Meiosys http://www.ibm.com
    Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

    "Anarchy is about taking complete responsibility for yourself."
    Alan Moore.


  17. Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

    On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
    > On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
    > > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
    > > > Hello Andrey !
    > > >
    > > > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
    > > > > index 109792b..a4848a3 100644
    > > > > --- a/arch/x86/kernel/entry_32.S
    > > > > +++ b/arch/x86/kernel/entry_32.S
    > > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
    > > > > GET_THREAD_INFO(%ebp)
    > > > > popl %eax
    > > > > CFI_ADJUST_CFA_OFFSET -4
    > > > > +ret_from_fork_tail:
    > > > > pushl $0x0202 # Reset kernel eflags
    > > > > CFI_ADJUST_CFA_OFFSET 4
    > > > > popfl
    > > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
    > > > > CFI_ENDPROC
    > > > > END(ret_from_fork)
    > > > >
    > > > > +ENTRY(i386_ret_from_resume)
    > > > > + CFI_STARTPROC
    > > > > + pushl %eax
    > > > > + CFI_ADJUST_CFA_OFFSET 4
    > > > > + call schedule_tail
    > > > > + GET_THREAD_INFO(%ebp)
    > > > > + popl %eax
    > > > > + CFI_ADJUST_CFA_OFFSET -4
    > > > > + movl (%esp), %eax
    > > > > + testl %eax, %eax
    > > > > + jz 1f
    > > > > + pushl %esp
    > > > > + call *%eax
    > > > > + addl $4, %esp
    > > > > +1:
    > > > > + addl $256, %esp
    > > > > + jmp ret_from_fork_tail
    > > > > + CFI_ENDPROC
    > > > > +END(i386_ret_from_resume)
    > > >
    > > > Could you explain why you need to do this
    > > >
    > > > call *%eax
    > > >
    > > > is it related to the freezer code ?

    > >
    > > It is not related to the freezer code actually.
    > > That is needed to restart syscalls. Right now I don't have a code in my
    > > patchset which restarts a syscall, but later I plan to add it.
    > > In OpenVZ checkpointing we restart syscalls if process was caught in
    > > syscall during checkpointing.

    >
    > Do you checkpoint uninterruptible syscalls as well? If only interruptible
    > syscalls are checkpointed, I'd say that either this syscall uses
    > ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal handling code already
    > does the trick, or this syscall does not restart itself when interrupted,
    > and well, this is life, userspace just sees -EINTR, which is allowed by the
    > syscall spec.
    > Actually this is how we checkpoint/migrate tasks in interruptible syscalls
    > in Kerrighed and this works.


    We checkpoint only interruptible syscalls. Some syscalls do not restart
    themselves, which is why, after restarting a process, we restart the syscall
    to complete it.

    Andrey

  18. Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

    On Wed, Oct 22, 2008 at 02:12:12PM +0400, Andrey Mirkin wrote:
    > On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
    > > On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
    > > > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
    > > > > Hello Andrey !
    > > > >
    > > > > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
    > > > > > index 109792b..a4848a3 100644
    > > > > > --- a/arch/x86/kernel/entry_32.S
    > > > > > +++ b/arch/x86/kernel/entry_32.S
    > > > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
    > > > > > GET_THREAD_INFO(%ebp)
    > > > > > popl %eax
    > > > > > CFI_ADJUST_CFA_OFFSET -4
    > > > > > +ret_from_fork_tail:
    > > > > > pushl $0x0202 # Reset kernel eflags
    > > > > > CFI_ADJUST_CFA_OFFSET 4
    > > > > > popfl
    > > > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
    > > > > > CFI_ENDPROC
    > > > > > END(ret_from_fork)
    > > > > >
    > > > > > +ENTRY(i386_ret_from_resume)
    > > > > > + CFI_STARTPROC
    > > > > > + pushl %eax
    > > > > > + CFI_ADJUST_CFA_OFFSET 4
    > > > > > + call schedule_tail
    > > > > > + GET_THREAD_INFO(%ebp)
    > > > > > + popl %eax
    > > > > > + CFI_ADJUST_CFA_OFFSET -4
    > > > > > + movl (%esp), %eax
    > > > > > + testl %eax, %eax
    > > > > > + jz 1f
    > > > > > + pushl %esp
    > > > > > + call *%eax
    > > > > > + addl $4, %esp
    > > > > > +1:
    > > > > > + addl $256, %esp
    > > > > > + jmp ret_from_fork_tail
    > > > > > + CFI_ENDPROC
    > > > > > +END(i386_ret_from_resume)
    > > > >
    > > > > Could you explain why you need to do this
    > > > >
    > > > > call *%eax
    > > > >
    > > > > is it related to the freezer code ?
    > > >
    > > > It is not related to the freezer code actually.
    > > > That is needed to restart syscalls. Right now I don't have a code in my
    > > > patchset which restarts a syscall, but later I plan to add it.
    > > > In OpenVZ checkpointing we restart syscalls if process was caught in
    > > > syscall during checkpointing.

    > >
    > > Do you checkpoint uninterruptible syscalls as well? If only interruptible
    > > syscalls are checkpointed, I'd say that either this syscall uses
    > > ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal handling code already
    > > does the trick, or this syscall does not restart itself when interrupted,
    > > and well, this is life, userspace just sees -EINTR, which is allowed by the
    > > syscall spec.
    > > Actually this is how we checkpoint/migrate tasks in interruptible syscalls
    > > in Kerrighed and this works.

    >
    > We checkpoint only interruptible syscalls. Some syscalls do not restart
    > themself, that is why after restarting a process we restart syscall to
    > complete it.


    I guess you do that to avoid breaking applications that are badly written and do
    not handle -EINTR correctly with interruptible syscalls. Right?

    Louis

    --
    Dr Louis Rilling Kerlabs
    Skype: louis.rilling Batiment Germanium
    Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
    http://www.kerlabs.com/ 35700 Rennes



  19. Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

    On Wed, Oct 22, 2008 at 12:06:19PM +0200, Greg Kurz wrote:
    > On Wed, 2008-10-22 at 11:25 +0200, Louis Rilling wrote:
    > > Do you checkpoint uninterruptible syscalls as well? If only interruptible
    > > syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
    > > ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
    > > this syscall does not restart itself when interrupted, and well, this is life,
    > > userspace just sees -EINTR, which is allowed by the syscall spec.
    > > Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
    > > Kerrighed and this works.
    > >
    > > Louis
    > >

    >
    > I don't know Kerrighed internals but I understand you perform checkpoint
    > with a signal handler. Right ?


    Right. This is a kernel-internal-only signal, so all signals remain available
    for userspace.

    > This approach has a huge benefit: the
    > signal handling code do all the arch dependant stuff to save registers
    > in user memory.


    Hm, I'm not sure I understand what you mean here. We just rely on the arch code
    that jumps to signal handling to set up struct pt_regs correctly, which is then
    passed to the checkpoint code. So yes, userspace registers are mostly saved by
    existing arch code. But on x86-64, for instance, segment registers still need to
    be saved by the checkpoint code (a bit like copy_thread() does), and I don't
    know of any arch-independent function that does this.

    Louis

    --
    Dr Louis Rilling Kerlabs
    Skype: louis.rilling Batiment Germanium
    Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
    http://www.kerlabs.com/ 35700 Rennes



  20. Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

    >>> +ENTRY(i386_ret_from_resume)
    >>> + CFI_STARTPROC
    >>> + pushl %eax
    >>> + CFI_ADJUST_CFA_OFFSET 4
    >>> + call schedule_tail
    >>> + GET_THREAD_INFO(%ebp)
    >>> + popl %eax
    >>> + CFI_ADJUST_CFA_OFFSET -4
    >>> + movl (%esp), %eax
    >>> + testl %eax, %eax
    >>> + jz 1f
    >>> + pushl %esp
    >>> + call *%eax
    >>> + addl $4, %esp
    >>> +1:
    >>> + addl $256, %esp
    >>> + jmp ret_from_fork_tail
    >>> + CFI_ENDPROC
    >>> +END(i386_ret_from_resume)

    >> Could you explain why you need to do this
    >>
    >> call *%eax
    >>
    >> is it related to the freezer code ?

    >
    > It is not related to the freezer code actually.
    > That is needed to restart syscalls. Right now I don't have a code in my
    > patchset which restarts a syscall, but later I plan to add it.
    > In OpenVZ checkpointing we restart syscalls if process was caught in syscall
    > during checkpointing.


    OK, I get it now. Why 256 bytes of extra stack? I'm sure it's not random.

    Thanks,

    C.
