[RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem - Kernel

This is a discussion on [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem - Kernel ; At present, as discussed in this LKML thread, http://marc.info/?l=linux-kernel&m=117607695406580 , when a dirty ext3 filesystem is mounted read-only it writes to the disk while replaying the journal log and cleaning up the orphan list. This behaviour may surprise users and ...

+ Reply to Thread
Results 1 to 11 of 11

Thread: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

  1. [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

    At present, as discussed in this LKML thread,
    http://marc.info/?l=linux-kernel&m=117607695406580, when a dirty ext3
    filesystem is mounted read-only it writes to the disk while replaying the
    journal log and cleaning up the orphan list. This behaviour may surprise users
    and can potentially cause data corruption/loss (e.g. if a system is suspended,
    booted into a different OS, then resumed).

    This patch series attempts to address this by using a block translation table
    instead of replaying the journal on a read-only filesystem.

    Patches 1-3 are independent cleanups/bug-fixes for things I came across while
    working on this. They could be submitted separately and are not required for
    following patches.

    Patch 4 is a refactoring change that simplifies the code prior to later
    substantive changes.

    Patch 5 introduces the translation table and support for a truly read-only
    journal into jbd.

    Patch 6 uses the facility introduced in patch 5 to add support for true
    read-only ext3.

    For testing I've been using qemu VMs to create and mount dirtied filesystems. I
    have a set of scripts that fully automates creating a dirty filesystem then
    checking mounting read-only and read-write produces consistent results. On my
    system it can get through around ~30 iteration overnight. If anyone is
    interested in the scripts please let me know. Any suggestions for additional
    tests or enhancements that could be made to the scripts would be gratefully
    received.

    TODO:
    * Add R/W remount support
    * Port to ext4

    Cheers,
    Duane Griffin.

    > git diff --stat origin

    fs/ext3/balloc.c | 2 +-
    fs/ext3/ialloc.c | 2 +-
    fs/ext3/inode.c | 8 +-
    fs/ext3/resize.c | 2 +-
    fs/ext3/super.c | 123 ++++++++++-----
    fs/ext3/xattr.c | 8 +-
    fs/jbd/checkpoint.c | 2 +-
    fs/jbd/commit.c | 2 +-
    fs/jbd/journal.c | 68 +++++---
    fs/jbd/recovery.c | 402 +++++++++++++++++++++++++++++++++--------------
    fs/jbd/revoke.c | 133 ++++++----------
    fs/ocfs2/journal.c | 4 +-
    include/linux/ext3_fs.h | 7 +
    include/linux/jbd.h | 41 +++++-
    14 files changed, 516 insertions(+), 288 deletions(-)

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. [RFC, PATCH 1/6] jbd: eliminate duplicated code in revocation table init/destroy functions

    The revocation table initialisation/destruction code is repeated for each of
    the two revocation tables stored in the journal. Refactoring the duplicated
    code into functions is tidier, simplifies the logic in initialisation in
    particular, and slightly reduces the code size.

    There should not be any functional change.

    Signed-off-by: Duane Griffin
    ---

    This change is an independent cleanup which is not required by following
    patches in this series. Also it should perhaps be two separate patches?

    fs/jbd/revoke.c | 126 +++++++++++++++++++++++--------------------------------
    1 files changed, 52 insertions(+), 74 deletions(-)

    diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
    index ad2eacf..5f64df4 100644
    --- a/fs/jbd/revoke.c
    +++ b/fs/jbd/revoke.c
    @@ -195,108 +195,86 @@ void journal_destroy_revoke_caches(void)
    revoke_table_cache = NULL;
    }

    -/* Initialise the revoke table for a given journal to a given size. */
    -
    -int journal_init_revoke(journal_t *journal, int hash_size)
    +static int journal_init_revoke_table(struct jbd_revoke_table_s *table, int size)
    {
    - int shift, tmp;
    -
    - J_ASSERT (journal->j_revoke_table[0] == NULL);
    -
    - shift = 0;
    - tmp = hash_size;
    + int shift = 0;
    + int tmp = size;
    while((tmp >>= 1UL) != 0UL)
    shift++;

    - journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
    - if (!journal->j_revoke_table[0])
    - return -ENOMEM;
    - journal->j_revoke = journal->j_revoke_table[0];
    -
    - /* Check that the hash_size is a power of two */
    - J_ASSERT(is_power_of_2(hash_size));
    -
    - journal->j_revoke->hash_size = hash_size;
    -
    - journal->j_revoke->hash_shift = shift;
    -
    - journal->j_revoke->hash_table =
    - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
    - if (!journal->j_revoke->hash_table) {
    - kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
    - journal->j_revoke = NULL;
    + table->hash_size = size;
    + table->hash_shift = shift;
    + table->hash_table = kmalloc(
    + size * sizeof(struct list_head), GFP_KERNEL);
    + if (!table->hash_table)
    return -ENOMEM;
    - }
    -
    - for (tmp = 0; tmp < hash_size; tmp++)
    - INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);

    - journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
    - if (!journal->j_revoke_table[1]) {
    - kfree(journal->j_revoke_table[0]->hash_table);
    - kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
    - return -ENOMEM;
    - }
    + for (tmp = 0; tmp < size; tmp++)
    + INIT_LIST_HEAD(&table->hash_table[tmp]);

    - journal->j_revoke = journal->j_revoke_table[1];
    + return 0;
    +}

    - /* Check that the hash_size is a power of two */
    +/* Initialise the revoke table for a given journal to a given size. */
    +int journal_init_revoke(journal_t *journal, int hash_size)
    +{
    + J_ASSERT(journal->j_revoke_table[0] == NULL);
    J_ASSERT(is_power_of_2(hash_size));

    - journal->j_revoke->hash_size = hash_size;
    + journal->j_revoke_table[0] = kmem_cache_alloc(
    + revoke_table_cache, GFP_KERNEL);
    + if (!journal->j_revoke_table[0])
    + goto failed_alloc1;
    + if (journal_init_revoke_table(journal->j_revoke_table[0], hash_size))
    + goto failed_init1;

    - journal->j_revoke->hash_shift = shift;
    + journal->j_revoke_table[1] = kmem_cache_alloc(
    + revoke_table_cache, GFP_KERNEL);
    + if (!journal->j_revoke_table[1])
    + goto failed_alloc2;
    + if (journal_init_revoke_table(journal->j_revoke_table[1], hash_size))
    + goto failed_init2;

    - journal->j_revoke->hash_table =
    - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
    - if (!journal->j_revoke->hash_table) {
    - kfree(journal->j_revoke_table[0]->hash_table);
    - kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
    - kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
    - journal->j_revoke = NULL;
    - return -ENOMEM;
    - }
    -
    - for (tmp = 0; tmp < hash_size; tmp++)
    - INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
    + journal->j_revoke = journal->j_revoke_table[1];

    spin_lock_init(&journal->j_revoke_lock);

    return 0;
    -}

    -/* Destoy a journal's revoke table. The table must already be empty! */
    +failed_init2:
    + kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
    +failed_alloc2:
    + kfree(journal->j_revoke_table[0]->hash_table);
    +failed_init1:
    + kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
    +failed_alloc1:
    + return -ENOMEM;
    +}

    -void journal_destroy_revoke(journal_t *journal)
    +static void journal_destroy_revoke_table(struct jbd_revoke_table_s *table)
    {
    - struct jbd_revoke_table_s *table;
    - struct list_head *hash_list;
    int i;
    -
    - table = journal->j_revoke_table[0];
    - if (!table)
    - return;
    -
    - for (i=0; ihash_size; i++) {
    + struct list_head *hash_list;
    + for (i = 0; i < table->hash_size; i++) {
    hash_list = &table->hash_table[i];
    - J_ASSERT (list_empty(hash_list));
    + J_ASSERT(list_empty(hash_list));
    }

    kfree(table->hash_table);
    kmem_cache_free(revoke_table_cache, table);
    - journal->j_revoke = NULL;
    +}

    - table = journal->j_revoke_table[1];
    - if (!table)
    +/* Destroy a journal's revoke table. The table must already be empty! */
    +void journal_destroy_revoke(journal_t *journal)
    +{
    + if (!journal->j_revoke_table[0])
    return;
    + journal_destroy_revoke_table(journal->j_revoke_table[0]);
    + journal->j_revoke = NULL;

    - for (i=0; ihash_size; i++) {
    - hash_list = &table->hash_table[i];
    - J_ASSERT (list_empty(hash_list));
    - }
    -
    - kfree(table->hash_table);
    - kmem_cache_free(revoke_table_cache, table);
    + if (!journal->j_revoke_table[1])
    + return;
    + journal_destroy_revoke_table(journal->j_revoke_table[1]);
    journal->j_revoke = NULL;
    }

    --
    1.5.3.7

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. [RFC, PATCH 4/6] jbd: refactor nested journal log recovery loop into separate functions

    The do_one_pass function iterates through the jbd log operating on blocks
    depending on the pass. During the replay pass descriptor blocks are processed
    by an inner loop which replays them. The nesting makes the code difficult to
    follow or to modify. Splitting the inner loop and replay code into separate
    functions simplifies matters.

    The refactored code no longer unnecessarily reads revoked data blocks, so
    potential read errors from such blocks will cause differing behaviour. Aside
    from that there should be no functional effect.

    Signed-off-by: Duane Griffin
    ---

    Note the TODO before the block read in the outer loop below. We could possibly
    attempt to continue after a failed read as we do when replaying descriptor
    blocks. However if it was a descriptor block we'd likely bomb out on the next
    iteration anyway, so I'm not sure it would be worth it. Thoughts?

    fs/jbd/recovery.c | 243 ++++++++++++++++++++++++++++-------------------------
    1 files changed, 127 insertions(+), 116 deletions(-)

    diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
    index 2b8edf4..e2ac78f 100644
    --- a/fs/jbd/recovery.c
    +++ b/fs/jbd/recovery.c
    @@ -307,6 +307,101 @@ int journal_skip_recovery(journal_t *journal)
    return err;
    }

    +static int replay_data_block(
    + journal_t *journal, struct buffer_head *obh, char *data,
    + int flags, unsigned long blocknr)
    +{
    + struct buffer_head *nbh;
    +
    + /* Find a buffer for the new data being restored */
    + nbh = __getblk(journal->j_fs_dev, blocknr, journal->j_blocksize);
    + if (nbh == NULL)
    + return -ENOMEM;
    +
    + lock_buffer(nbh);
    + memcpy(nbh->b_data, obh->b_data, journal->j_blocksize);
    + if (flags & JFS_FLAG_ESCAPE)
    + *((__be32 *)data) = cpu_to_be32(JFS_MAGIC_NUMBER);
    +
    + BUFFER_TRACE(nbh, "marking dirty");
    + set_buffer_uptodate(nbh);
    + mark_buffer_dirty(nbh);
    + BUFFER_TRACE(nbh, "marking uptodate");
    + /* ll_rw_block(WRITE, 1, &nbh); */
    + unlock_buffer(nbh);
    + brelse(nbh);
    +
    + return 0;
    +}
    +
    +/* Replay a descriptor block: write all of the data blocks. */
    +static int replay_descriptor_block(
    + journal_t *journal, char *data,
    + unsigned int next_commit_ID,
    + unsigned long *next_log_block,
    + struct recovery_info *info)
    +{
    + int err;
    + int success = 0;
    + char *tagp;
    + unsigned long io_block;
    + struct buffer_head *obh;
    + unsigned long next = *next_log_block;
    +
    + tagp = &data[sizeof(journal_header_t)];
    + while ((tagp - data + sizeof(journal_block_tag_t)) <=
    + journal->j_blocksize) {
    + unsigned long blocknr;
    + journal_block_tag_t *tag = (journal_block_tag_t *) tagp;
    + int flags = be32_to_cpu(tag->t_flags);
    +
    + io_block = next++;
    + wrap(journal, next);
    + blocknr = be32_to_cpu(tag->t_blocknr);
    +
    + /* If the block has been revoked, then we're all done here. */
    + if (journal_test_revoke(journal, blocknr, next_commit_ID)) {
    + ++info->nr_revoke_hits;
    + goto skip_write;
    + }
    +
    + err = jread(&obh, journal, io_block);
    + if (err) {
    +
    + /* Recover what we can, but
    + * report failure at the end. */
    + success = err;
    + printk(KERN_ERR "JBD: IO error %d recovering "
    + "block %ld in log\n", err, io_block);
    + continue;
    + }
    + J_ASSERT(obh != NULL);
    +
    + err = replay_data_block(journal, obh, data, flags, blocknr);
    + if (err) {
    + brelse(obh);
    + printk(KERN_ERR "JBD: Out of memory during recovery\n");
    + success = err;
    + goto failed;
    + }
    +
    + ++info->nr_replays;
    + brelse(obh);
    +
    +skip_write:
    + tagp += sizeof(journal_block_tag_t);
    + if (!(flags & JFS_FLAG_SAME_UUID))
    + tagp += 16;
    +
    + if (flags & JFS_FLAG_LAST_TAG)
    + break;
    + }
    +
    + *next_log_block = next;
    +failed:
    + return success;
    +}
    +
    static int do_one_pass(journal_t *journal,
    struct recovery_info *info, enum passtype pass)
    {
    @@ -316,8 +411,6 @@ static int do_one_pass(journal_t *journal,
    journal_superblock_t * sb;
    journal_header_t * tmp;
    struct buffer_head * bh;
    - unsigned int sequence;
    - int blocktype;

    /* Precompute the maximum metadata descriptors in a descriptor block */
    int MAX_BLOCKS_PER_DESC;
    @@ -348,11 +441,8 @@ static int do_one_pass(journal_t *journal,
    */

    while (1) {
    - int flags;
    - char * tagp;
    - journal_block_tag_t * tag;
    - struct buffer_head * obh;
    - struct buffer_head * nbh;
    + int blocktype;
    + unsigned int sequence;

    cond_resched();

    @@ -371,10 +461,13 @@ static int do_one_pass(journal_t *journal,
    * either the next descriptor block or the final commit
    * record. */

    + /* TODO: Should we continue on error? */
    jbd_debug(3, "JBD: checking block %ld\n", next_log_block);
    err = jread(&bh, journal, next_log_block);
    - if (err)
    + if (err) {
    + success = err;
    goto failed;
    + }

    next_log_block++;
    wrap(journal, next_log_block);
    @@ -406,137 +499,57 @@ static int do_one_pass(journal_t *journal,
    * all of the sequence number checks. What are we going
    * to do with it? That depends on the pass... */

    - switch(blocktype) {
    + switch (blocktype) {
    case JFS_DESCRIPTOR_BLOCK:
    /* If it is a valid descriptor block, replay it
    * in pass REPLAY; otherwise, just skip over the
    * blocks it describes. */
    - if (pass != PASS_REPLAY) {
    + if (pass == PASS_REPLAY) {
    + err = replay_descriptor_block(
    + journal,
    + bh->b_data,
    + next_commit_ID,
    + &next_log_block,
    + info);
    + } else {
    next_log_block +=
    count_tags(bh, journal->j_blocksize);
    wrap(journal, next_log_block);
    - brelse(bh);
    - continue;
    }
    -
    - /* A descriptor block: we can now write all of
    - * the data blocks. Yay, useful work is finally
    - * getting done here! */
    -
    - tagp = &bh->b_data[sizeof(journal_header_t)];
    - while ((tagp - bh->b_data +sizeof(journal_block_tag_t))
    - <= journal->j_blocksize) {
    - unsigned long io_block;
    -
    - tag = (journal_block_tag_t *) tagp;
    - flags = be32_to_cpu(tag->t_flags);
    -
    - io_block = next_log_block++;
    - wrap(journal, next_log_block);
    - err = jread(&obh, journal, io_block);
    - if (err) {
    - /* Recover what we can, but
    - * report failure at the end. */
    - success = err;
    - printk (KERN_ERR
    - "JBD: IO error %d recovering "
    - "block %ld in log\n",
    - err, io_block);
    - } else {
    - unsigned long blocknr;
    -
    - J_ASSERT(obh != NULL);
    - blocknr = be32_to_cpu(tag->t_blocknr);
    -
    - /* If the block has been
    - * revoked, then we're all done
    - * here. */
    - if (journal_test_revoke
    - (journal, blocknr,
    - next_commit_ID)) {
    - brelse(obh);
    - ++info->nr_revoke_hits;
    - goto skip_write;
    - }
    -
    - /* Find a buffer for the new
    - * data being restored */
    - nbh = __getblk(journal->j_fs_dev,
    - blocknr,
    - journal->j_blocksize);
    - if (nbh == NULL) {
    - printk(KERN_ERR
    - "JBD: Out of memory "
    - "during recovery.\n");
    - err = -ENOMEM;
    - brelse(bh);
    - brelse(obh);
    - goto failed;
    - }
    -
    - lock_buffer(nbh);
    - memcpy(nbh->b_data, obh->b_data,
    - journal->j_blocksize);
    - if (flags & JFS_FLAG_ESCAPE) {
    - *((__be32 *)bh->b_data) =
    - cpu_to_be32(JFS_MAGIC_NUMBER);
    - }
    -
    - BUFFER_TRACE(nbh, "marking dirty");
    - set_buffer_uptodate(nbh);
    - mark_buffer_dirty(nbh);
    - BUFFER_TRACE(nbh, "marking uptodate");
    - ++info->nr_replays;
    - /* ll_rw_block(WRITE, 1, &nbh); */
    - unlock_buffer(nbh);
    - brelse(obh);
    - brelse(nbh);
    - }
    -
    - skip_write:
    - tagp += sizeof(journal_block_tag_t);
    - if (!(flags & JFS_FLAG_SAME_UUID))
    - tagp += 16;
    -
    - if (flags & JFS_FLAG_LAST_TAG)
    - break;
    - }
    -
    - brelse(bh);
    - continue;
    + break;

    case JFS_COMMIT_BLOCK:
    /* Found an expected commit block: not much to
    * do other than move on to the next sequence
    * number. */
    - brelse(bh);
    next_commit_ID++;
    - continue;
    + break;

    case JFS_REVOKE_BLOCK:
    /* If we aren't in the REVOKE pass, then we can
    * just skip over this block. */
    - if (pass != PASS_REVOKE) {
    - brelse(bh);
    - continue;
    + if (pass == PASS_REVOKE) {
    + err = scan_revoke_records(
    + journal, bh, next_commit_ID, info);
    }
    -
    - err = scan_revoke_records(journal, bh,
    - next_commit_ID, info);
    - brelse(bh);
    - if (err)
    - goto failed;
    - continue;
    + break;

    default:
    jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
    blocktype);
    - brelse(bh);
    - goto done;
    + break;
    }
    +
    + brelse(bh);
    +
    + /* Immediately fail on OOM; on other errors try to recover
    + * as much as we can, but report the failure at the end. */
    + if (err)
    + success = err;
    + if (err == -ENOMEM)
    + goto failed;
    }

    - done:
    /*
    * We broke out of the log scan loop: either we came to the
    * known end of the log or we found an unexpected block in the
    @@ -558,10 +571,8 @@ static int do_one_pass(journal_t *journal,
    }
    }

    - return success;
    -
    failed:
    - return err;
    + return success;
    }


    --
    1.5.3.7

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. [RFC, PATCH 2/6] jbd: replace potentially false assertion with if block

    If an error occurs during jbd cache initialisation it is possible for the
    journal_head_cache to be NULL when journal_destroy_journal_head_cache is
    called. Replace the J_ASSERT with an if block to handle the situation
    correctly.

    Note that even with this fix things will break badly if ext3 is statically
    compiled in and cache initialisation fails.

    Signed-off-by: Duane Griffin
    ---

    At the moment the cache destruction methods are inconsistent in whether they
    set the cache pointers to NULL after destroying them. To be precise,
    journal_destroy_handle_cache doesn't, the others do. I haven't changed that,
    although I was sorely tempted. I think it would be better to be consistent and
    that NULLing the pointers is preferable. Any objections?

    This patch is an independent bug fix and following patches in this series do
    not depend on it.

    fs/jbd/journal.c | 7 ++++---
    1 files changed, 4 insertions(+), 3 deletions(-)

    diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
    index 3943a89..3f334a3 100644
    --- a/fs/jbd/journal.c
    +++ b/fs/jbd/journal.c
    @@ -1635,9 +1635,10 @@ static int journal_init_journal_head_cache(void)

    static void journal_destroy_journal_head_cache(void)
    {
    - J_ASSERT(journal_head_cache != NULL);
    - kmem_cache_destroy(journal_head_cache);
    - journal_head_cache = NULL;
    + if (journal_head_cache) {
    + kmem_cache_destroy(journal_head_cache);
    + journal_head_cache = NULL;
    + }
    }

    /*
    --
    1.5.3.7

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. [RFC, PATCH 3/6] jbd: only create debugfs entries if cache initialisation is successful

    JBD debugfs entries should only be created if cache initialisation is
    successful. At the moment they are being created unconditionally which will
    leave them dangling if cache (and hence module) initialisation fails.

    Signed-off-by: Duane Griffin
    ---

    This patch is an independent bug fix and following patches in this series do
    not depend on it.

    fs/jbd/journal.c | 5 +++--
    1 files changed, 3 insertions(+), 2 deletions(-)

    diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
    index 3f334a3..a868277 100644
    --- a/fs/jbd/journal.c
    +++ b/fs/jbd/journal.c
    @@ -1942,9 +1942,10 @@ static int __init journal_init(void)
    BUILD_BUG_ON(sizeof(struct journal_superblock_s) != 1024);

    ret = journal_init_caches();
    - if (ret != 0)
    + if (ret == 0)
    + jbd_create_debugfs_entry();
    + else
    journal_destroy_caches();
    - jbd_create_debugfs_entry();
    return ret;
    }

    --
    1.5.3.7

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. [RFC, PATCH 5/6] jbd: add support for read-only log recovery

    Read-only log recovery allows a dirty journalled filesystem to be mounted and
    provide a consistent view of its data without writing to the disk. Instead of
    replaying the log a mapping is constructed between modified filesystem blocks
    and the journal blocks containing their data.

    The mapping is stored in a hashtable that is created and populated during
    journal recovery. The hash function used is the same one used by the journal
    revocation code. The size of the hashtable is currently being arbitrarily set
    to 256 entries. Given that we know the number of block mappings when we create
    the table it may be worth dynamically calculating an appropriate size instead
    of hard-coding it.

    Signed-off-by: Duane Griffin
    ---

    I'm tempted to add warnings on attempts to modify a read-only journal, does
    that sound useful? I'd also be grateful to anyone with suggestions for better
    member names.

    fs/ext3/super.c | 2 +-
    fs/jbd/checkpoint.c | 2 +-
    fs/jbd/commit.c | 2 +-
    fs/jbd/journal.c | 56 ++++++++++------
    fs/jbd/recovery.c | 187 +++++++++++++++++++++++++++++++++++++++++++++------
    fs/jbd/revoke.c | 7 +--
    fs/ocfs2/journal.c | 4 +-
    include/linux/jbd.h | 41 ++++++++++-
    8 files changed, 246 insertions(+), 55 deletions(-)

    diff --git a/fs/ext3/super.c b/fs/ext3/super.c
    index 18769cc..4b9ff65 100644
    --- a/fs/ext3/super.c
    +++ b/fs/ext3/super.c
    @@ -2168,7 +2168,7 @@ static int ext3_load_journal(struct super_block *sb,
    if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
    err = journal_wipe(journal, !really_read_only);
    if (!err)
    - err = journal_load(journal);
    + err = journal_load(journal, !really_read_only);

    if (err) {
    printk(KERN_ERR "EXT3-fs: error loading journal.\n");
    diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
    index a5432bb..84d4579 100644
    --- a/fs/jbd/checkpoint.c
    +++ b/fs/jbd/checkpoint.c
    @@ -454,7 +454,7 @@ int cleanup_journal_tail(journal_t *journal)
    journal->j_tail = blocknr;
    spin_unlock(&journal->j_state_lock);
    if (!(journal->j_flags & JFS_ABORT))
    - journal_update_superblock(journal, 1);
    + journal_update_superblock(journal, 1, 1);
    return 0;
    }

    diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
    index a38c718..c63598b 100644
    --- a/fs/jbd/commit.c
    +++ b/fs/jbd/commit.c
    @@ -310,7 +310,7 @@ void journal_commit_transaction(journal_t *journal)
    /* Do we need to erase the effects of a prior journal_flush? */
    if (journal->j_flags & JFS_FLUSHED) {
    jbd_debug(3, "super block updated\n");
    - journal_update_superblock(journal, 1);
    + journal_update_superblock(journal, 1, 1);
    } else {
    jbd_debug(3, "superblock not updated\n");
    }
    diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
    index a868277..b2723a0 100644
    --- a/fs/jbd/journal.c
    +++ b/fs/jbd/journal.c
    @@ -822,7 +822,7 @@ static void journal_fail_superblock (journal_t *journal)
    * subsequent use.
    */

    -static int journal_reset(journal_t *journal)
    +static int journal_reset(journal_t *journal, int write)
    {
    journal_superblock_t *sb = journal->j_superblock;
    unsigned long first, last;
    @@ -843,8 +843,8 @@ static int journal_reset(journal_t *journal)

    journal->j_max_transaction_buffers = journal->j_maxlen / 4;

    - /* Add the dynamic fields and write it to disk. */
    - journal_update_superblock(journal, 1);
    + /* Add the dynamic fields and write it to disk if not read-only. */
    + journal_update_superblock(journal, write, 1);
    return journal_start_thread(journal);
    }

    @@ -916,21 +916,21 @@ int journal_create(journal_t *journal)
    journal->j_flags &= ~JFS_ABORT;
    journal->j_format_version = 2;

    - return journal_reset(journal);
    + return journal_reset(journal, 1);
    }

    /**
    - * void journal_update_superblock() - Update journal sb on disk.
    + * void journal_update_superblock() - Update journal sb in memory and on disk.
    * @journal: The journal to update.
    + * @write: Set to '0' if you don't want to write to the disk.
    * @wait: Set to '0' if you don't want to wait for IO completion.
    *
    * Update a journal's dynamic superblock fields and write it to disk,
    * optionally waiting for the IO to complete.
    */
    -void journal_update_superblock(journal_t *journal, int wait)
    +void journal_update_superblock(journal_t *journal, int write, int wait)
    {
    journal_superblock_t *sb = journal->j_superblock;
    - struct buffer_head *bh = journal->j_sb_buffer;

    /*
    * As a special case, if the on-disk copy is already marked as needing
    @@ -957,12 +957,16 @@ void journal_update_superblock(journal_t *journal, int wait)
    sb->s_errno = cpu_to_be32(journal->j_errno);
    spin_unlock(&journal->j_state_lock);

    - BUFFER_TRACE(bh, "marking dirty");
    - mark_buffer_dirty(bh);
    - if (wait)
    - sync_dirty_buffer(bh);
    - else
    - ll_rw_block(SWRITE, 1, &bh);
    + if (write) {
    + struct buffer_head *bh = journal->j_sb_buffer;
    +
    + BUFFER_TRACE(bh, "marking dirty");
    + mark_buffer_dirty(bh);
    + if (wait)
    + sync_dirty_buffer(bh);
    + else
    + ll_rw_block(SWRITE, 1, &bh);
    + }

    out:
    /* If we have just flushed the log (by marking s_start==0), then
    @@ -1066,12 +1070,13 @@ static int load_superblock(journal_t *journal)
    /**
    * int journal_load() - Read journal from disk.
    * @journal: Journal to act on.
    + * @write: Whether the journal may be modified on-disk.
    *
    * Given a journal_t structure which tells us which disk blocks contain
    * a journal, read the journal from disk to initialise the in-memory
    * structures.
    */
    -int journal_load(journal_t *journal)
    +int journal_load(journal_t *journal, int write)
    {
    int err;
    journal_superblock_t *sb;
    @@ -1097,13 +1102,13 @@ int journal_load(journal_t *journal)

    /* Let the recovery code check whether it needs to recover any
    * data from the journal. */
    - if (journal_recover(journal))
    + if (journal_recover(journal, write))
    goto recovery_error;

    /* OK, we've finished with the dynamic journal bits:
    * reinitialise the dynamic contents of the superblock in memory
    * and reset them on disk. */
    - if (journal_reset(journal))
    + if (journal_reset(journal, write))
    goto recovery_error;

    journal->j_flags &= ~JFS_ABORT;
    @@ -1150,7 +1155,12 @@ void journal_destroy(journal_t *journal)
    journal->j_tail = 0;
    journal->j_tail_sequence = ++journal->j_transaction_sequence;
    if (journal->j_sb_buffer) {
    - journal_update_superblock(journal, 1);
    +
    + /*
    + * Ugly, but what to do?
    + * We don't have a superblock to test for read-only.
    + */
    + journal_update_superblock(journal, !journal->j_bt_hash, 1);
    brelse(journal->j_sb_buffer);
    }

    @@ -1158,6 +1168,8 @@ void journal_destroy(journal_t *journal)
    iput(journal->j_inode);
    if (journal->j_revoke)
    journal_destroy_revoke(journal);
    + if (journal->j_bt_hash)
    + journal_destroy_translations(journal);
    kfree(journal->j_wbuf);
    kfree(journal);
    }
    @@ -1374,7 +1386,7 @@ int journal_flush(journal_t *journal)
    old_tail = journal->j_tail;
    journal->j_tail = 0;
    spin_unlock(&journal->j_state_lock);
    - journal_update_superblock(journal, 1);
    + journal_update_superblock(journal, 1, 1);
    spin_lock(&journal->j_state_lock);
    journal->j_tail = old_tail;

    @@ -1420,8 +1432,7 @@ int journal_wipe(journal_t *journal, int write)
    write ? "Clearing" : "Ignoring");

    err = journal_skip_recovery(journal);
    - if (write)
    - journal_update_superblock(journal, 1);
    + journal_update_superblock(journal, write, 1);

    no_recovery:
    return err;
    @@ -1489,7 +1500,7 @@ static void __journal_abort_soft (journal_t *journal, int errno)
    __journal_abort_hard(journal);

    if (errno)
    - journal_update_superblock(journal, 1);
    + journal_update_superblock(journal, !journal->j_bt_hash, 1);
    }

    /**
    @@ -1922,6 +1933,8 @@ static int __init journal_init_caches(void)

    ret = journal_init_revoke_caches();
    if (ret == 0)
    + ret = journal_init_recover_cache();
    + if (ret == 0)
    ret = journal_init_journal_head_cache();
    if (ret == 0)
    ret = journal_init_handle_cache();
    @@ -1931,6 +1944,7 @@ static int __init journal_init_caches(void)
    static void journal_destroy_caches(void)
    {
    journal_destroy_revoke_caches();
    + journal_destroy_recover_cache();
    journal_destroy_journal_head_cache();
    journal_destroy_handle_cache();
    }
    diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
    index e2ac78f..a73b062 100644
    --- a/fs/jbd/recovery.c
    +++ b/fs/jbd/recovery.c
    @@ -17,12 +17,15 @@
    #include "jfs_user.h"
    #else
    #include
    +#include
    #include
    #include
    #include
    #include
    #endif

    +static struct kmem_cache *bt_record_cache;
    +
    /*
    * Maintain information about the progress of the recovery job, so that
    * the different passes can carry information between them.
    @@ -123,6 +126,71 @@ failed:

    #endif /* __KERNEL__ */

    +int __init journal_init_recover_cache(void)
    +{
    + bt_record_cache = kmem_cache_create(
    + "block_translation_record",
    + sizeof(struct journal_block_translation),
    + 0, 0, NULL);
    + if (bt_record_cache == NULL) {
    + printk(KERN_EMERG "JBD: failed to create recover cache\n");
    + return -ENOMEM;
    + }
    + return 0;
    +}
    +
    +void journal_destroy_recover_cache(void)
    +{
    + if (bt_record_cache) {
    + kmem_cache_destroy(bt_record_cache);
    + bt_record_cache = NULL;
    + }
    +}
    +
    +bool journal_translate_block(journal_t *journal, sector_t blocknr,
    + sector_t *retp)
    +{
    + int hash;
    + struct journal_block_translation *jbt;
    + struct hlist_node *node;
    +
    + if (!journal || !journal->j_bt_hash)
    + goto unmapped;
    +
    + hash = jbd_hash(blocknr, journal->j_bt_hash_bits,
    + 1 << journal->j_bt_hash_bits);
    + hlist_for_each_entry(jbt, node, journal->j_bt_hash + hash, jbt_list) {
    + if (jbt->original == blocknr) {
    + *retp = jbt->target;
    + return true;
    + }
    + }
    +
    +unmapped:
    + *retp = blocknr;
    + return false;
    +}
    +
    +void journal_destroy_translations(journal_t *journal)
    +{
    + int ii;
    + struct journal_block_translation *jbt;
    + struct hlist_node *tmp;
    + struct hlist_node *node;
    +
    + if (!journal->j_bt_hash)
    + return;
    +
    + for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++) {
    + hlist_for_each_entry_safe(jbt, node, tmp,
    + journal->j_bt_hash + ii, jbt_list) {
    + kmem_cache_free(bt_record_cache, jbt);
    + }
    + }
    +
    + kfree(journal->j_bt_hash);
    + journal->j_bt_hash = NULL;
    +}

    /*
    * Read a block from the journal
    @@ -173,6 +241,45 @@ static int jread(struct buffer_head **bhp, journal_t *journal,
    return 0;
    }

    +static int journal_record_bt(journal_t *journal, sector_t original,
    + sector_t target)
    +{
    + int err;
    + int hash;
    + struct hlist_node *node;
    + struct journal_block_translation *jbt;
    + sector_t blocknr;
    +
    + /* Store the physical block to avoid repeated lookups. */
    + err = journal_bmap(journal, target, &blocknr);
    + if (err)
    + return err;
    +
    + /*
    + * If the same block has been translated multiple times,
    + * just use the last.
    + */
    + hash = jbd_hash(original, journal->j_bt_hash_bits,
    + 1 << journal->j_bt_hash_bits);
    + hlist_for_each_entry(jbt, node, journal->j_bt_hash + hash, jbt_list) {
    + if (jbt->original == original) {
    + jbt->target = blocknr;
    + return 0;
    + }
    + }
    +
    + jbt = kmem_cache_alloc(bt_record_cache, GFP_NOFS);
    + if (!jbt)
    + return -ENOMEM;
    +
    + INIT_HLIST_NODE(&jbt->jbt_list);
    + jbt->original = original;
    + jbt->target = blocknr;
    +
    + hlist_add_head(&jbt->jbt_list, journal->j_bt_hash + hash);
    +
    + return 0;
    +}

    /*
    * Count the number of in-use tags in a journal descriptor block.
    @@ -201,6 +308,24 @@ static int count_tags(struct buffer_head *bh, int size)
    return nr;
    }

    +#define JOURNAL_DEFAULT_BT_HASHBITS 8
    +static int bt_hash_init(journal_t *journal)
    +{
    + int ii;
    + BUG_ON(journal->j_bt_hash);
    +
    + /* Vary size depending on the number of mappings/device size? */
    + journal->j_bt_hash_bits = JOURNAL_DEFAULT_BT_HASHBITS;
    + journal->j_bt_hash = kmalloc(sizeof(struct hlist_head) *
    + 1 << journal->j_bt_hash_bits, GFP_NOFS);
    + if (!journal->j_bt_hash)
    + return -ENOMEM;
    +
    + for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++)
    + INIT_HLIST_HEAD(journal->j_bt_hash + ii);
    +
    + return 0;
    +}

    /* Make sure we wrap around the log correctly! */
    #define wrap(journal, var) \
    @@ -212,6 +337,7 @@ do { \
    /**
    * journal_recover - recovers a on-disk journal
    * @journal: the journal to recover
    + * @write: whether to write to disk
    *
    * The primary function for recovering the log contents when mounting a
    * journaled device.
    @@ -220,8 +346,11 @@ do { \
    * end of the log. In the second, we assemble the list of revoke
    * blocks. In the third and final pass, we replay any un-revoked blocks
    * in the log.
    + *
    + * If the journal is read-only a translation table will be built up instead
    + * of the journal being replayed on disk.
    */
    -int journal_recover(journal_t *journal)
    +int journal_recover(journal_t *journal, int write)
    {
    int err;
    journal_superblock_t * sb;
    @@ -244,6 +373,16 @@ int journal_recover(journal_t *journal)
    return 0;
    }

    + if (!write) {
    + err = bt_hash_init(journal);
    + if (err) {
    + printk(KERN_ERR
    + "JBD: out of memory allocating"
    + " journal block translation hash\n");
    + return err;
    + }
    + }
    +
    err = do_one_pass(journal, &info, PASS_SCAN);
    if (!err)
    err = do_one_pass(journal, &info, PASS_REVOKE);
    @@ -256,12 +395,15 @@ int journal_recover(journal_t *journal)
    jbd_debug(1, "JBD: Replayed %d and revoked %d/%d blocks\n",
    info.nr_replays, info.nr_revoke_hits, info.nr_revokes);

    - /* Restart the log at the next transaction ID, thus invalidating
    - * any existing commit records in the log. */
    - journal->j_transaction_sequence = ++info.end_transaction;
    + if (write) {
    +
    + /* Restart the log at the next transaction ID, thus invalidating
    + * any existing commit records in the log. */
    + journal->j_transaction_sequence = ++info.end_transaction;
    + journal_clear_revoke(journal);
    + sync_blockdev(journal->j_fs_dev);
    + }

    - journal_clear_revoke(journal);
    - sync_blockdev(journal->j_fs_dev);
    return err;
    }

    @@ -345,7 +487,6 @@ static int replay_descriptor_block(
    int success = 0;
    char *tagp;
    unsigned long io_block;
    - struct buffer_head *obh;
    unsigned long next = *next_log_block;

    tagp = &data[sizeof(journal_header_t)];
    @@ -365,28 +506,34 @@ static int replay_descriptor_block(
    goto skip_write;
    }

    - err = jread(&obh, journal, io_block);
    - if (err) {
    + if (journal->j_bt_hash) {
    + err = journal_record_bt(journal, blocknr, io_block);
    + } else {
    + struct buffer_head *obh;

    - /* Recover what we can, but
    - * report failure at the end. */
    - success = err;
    - printk(KERN_ERR "JBD: IO error %d recovering "
    - "block %ld in log\n", err, io_block);
    - continue;
    - }
    - J_ASSERT(obh != NULL);
    + err = jread(&obh, journal, io_block);
    + if (err) {

    - err = replay_data_block(journal, obh, data, flags, blocknr);
    - if (err) {
    + /* Recover what we can, but
    + * report failure at the end. */
    + success = err;
    + printk(KERN_ERR "JBD: IO error %d recovering "
    + "block %ld in log\n", err, io_block);
    + continue;
    + }
    + J_ASSERT(obh != NULL);
    +
    + err = replay_data_block(
    + journal, obh, data, flags, blocknr);
    brelse(obh);
    + }
    + if (err) {
    printk(KERN_ERR "JBD: Out of memory during recovery\n");
    success = err;
    goto failed;
    }

    ++info->nr_replays;
    - brelse(obh);

    skip_write:
    tagp += sizeof(journal_block_tag_t);
    diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
    index 5f64df4..7f57d03 100644
    --- a/fs/jbd/revoke.c
    +++ b/fs/jbd/revoke.c
    @@ -105,15 +105,10 @@ static void flush_descriptor(journal_t *, struct journal_head *, int);

    /* Utility functions to maintain the revoke table */

    -/* Borrowed from buffer.c: this is a tried and tested block hash function */
    static inline int hash(journal_t *journal, unsigned long block)
    {
    struct jbd_revoke_table_s *table = journal->j_revoke;
    - int hash_shift = table->hash_shift;
    -
    - return ((block << (hash_shift - 6)) ^
    - (block >> 13) ^
    - (block << (hash_shift - 12))) & (table->hash_size - 1);
    + return jbd_hash(block, table->hash_shift, table->hash_size);
    }

    static int insert_revoke_hash(journal_t *journal, unsigned long blocknr,
    diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
    index f31c7e8..cbd53aa 100644
    --- a/fs/ocfs2/journal.c
    +++ b/fs/ocfs2/journal.c
    @@ -591,7 +591,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local)

    osb = journal->j_osb;

    - status = journal_load(journal->j_journal);
    + status = journal_load(journal->j_journal, true);
    if (status < 0) {
    mlog(ML_ERROR, "Failed to load journal!\n");
    goto done;
    @@ -1014,7 +1014,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
    goto done;
    }

    - status = journal_load(journal);
    + status = journal_load(journal, 1);
    if (status < 0) {
    mlog_errno(status);
    if (!igrab(inode))
    diff --git a/include/linux/jbd.h b/include/linux/jbd.h
    index b18fd3b..6154082 100644
    --- a/include/linux/jbd.h
    +++ b/include/linux/jbd.h
    @@ -558,6 +558,18 @@ struct transaction_s
    };

    /**
    + * struct journal_block_translation - hash bucket recording block translations.
    + * @jbt_list: list of translated blocks in this bucket.
    + * @original: the (logical?) filesystem block being translation.
    + * @target: the physical journal block holding the original's data.
    + */
    +struct journal_block_translation {
    + struct hlist_node jbt_list;
    + sector_t original;
    + sector_t target;
    +};
    +
    +/**
    * struct journal_s - The journal_s type is the concrete type associated with
    * journal_t.
    * @j_flags: General journaling state flags
    @@ -618,6 +630,8 @@ struct transaction_s
    * @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
    * number that will fit in j_blocksize
    * @j_last_sync_writer: most recent pid which did a synchronous write
    + * @j_bt_hash: hash table mapping uncommitted blocks to their targets
    + * @j_bt_hash_bits: number of bits in block translation hash table
    * @j_private: An opaque pointer to fs-private information.
    */

    @@ -810,6 +824,12 @@ struct journal_s
    pid_t j_last_sync_writer;

    /*
    + * Remap uncommitted transactions in a read-only filesystem.
    + */
    + struct hlist_head *j_bt_hash;
    + int j_bt_hash_bits;
    +
    + /*
    * An opaque pointer to fs-private information. ext3 puts its
    * superblock pointer here
    */
    @@ -916,12 +936,12 @@ extern int journal_check_available_features
    extern int journal_set_features
    (journal_t *, unsigned long, unsigned long, unsigned long);
    extern int journal_create (journal_t *);
    -extern int journal_load (journal_t *journal);
    +extern int journal_load (journal_t *journal, int);
    extern void journal_destroy (journal_t *);
    -extern int journal_recover (journal_t *journal);
    +extern int journal_recover (journal_t *journal, int);
    extern int journal_wipe (journal_t *, int);
    extern int journal_skip_recovery (journal_t *);
    -extern void journal_update_superblock (journal_t *, int);
    +extern void journal_update_superblock (journal_t *, int, int);
    extern void journal_abort (journal_t *, int);
    extern int journal_errno (journal_t *);
    extern void journal_ack_err (journal_t *);
    @@ -970,6 +990,12 @@ extern int journal_test_revoke(journal_t *, unsigned long, tid_t);
    extern void journal_clear_revoke(journal_t *);
    extern void journal_switch_revoke_table(journal_t *journal);

    +extern bool journal_translate_block(journal_t *journal, sector_t blocknr,
    + sector_t *retp);
    +extern void journal_destroy_translations(journal_t *journal);
    +extern int journal_init_recover_cache(void);
    +extern void journal_destroy_recover_cache(void);
    +
    /*
    * The log thread user interface:
    *
    @@ -989,6 +1015,15 @@ void __log_wait_for_space(journal_t *journal);
    extern void __journal_drop_transaction(journal_t *, transaction_t *);
    extern int cleanup_journal_tail(journal_t *);

    +/* Borrowed from buffer.c: this is a tried and tested block hash function */
    +static inline int jbd_hash(unsigned long block, int shift, int size)
    +{
    + return ((block << (shift - 6)) ^
    + (block >> 13) ^
    + (block << (shift - 12))) &
    + (size - 1);
    +}
    +
    /* Debugging code only: */

    #define jbd_ENOSYS() \
    --
    1.5.3.7

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. [RFC, PATCH 6/6] ext3: do not write to the disk when mounting a dirty read-only filesystem

    If a filesystem is being mounted read-only then load the journal read-only
    (i.e. do not replay the log) and do not process the orphan list.

    At present, when a dirty ext3 filesystem is mounted read-only, it writes to the
    disk while replaying the journal log and cleaning up the orphan list. This
    behaviour may surprise users and can potentially cause data corruption/loss
    (e.g. if a system is suspended, booted into a different OS, then resumed).

    Introduce block accessor wrapper functions that check for blocks requiring
    translation and access them from the journal as needed. Convert the ext3 code
    to use the wrappers anywhere that may be called on a read-only filesystem. Code
    that is only called when writing to the filesystem is not converted.

    For a discussion of the need for this feature, see:
    http://marc.info/?l=linux-kernel&m=117607695406580

    Tested by creating dirty filesystems, mounting a copy read-write, taking the
    md5sum of all its files, mounting a different copy read-only, confirming the
    md5sums match, unmounting it, then confirming the block device mounted
    read-only has not been modified. Testing has been done with both internal and
    external journals.

    NOTE: For now I'm simply preventing filesystems requiring recovery from being
    remounted read-write. This breaks booting with an uncleanly mounted root
    filesystem!

    Signed-off-by: Duane Griffin
    ---

    I went back and forth on converting all code to use the accessors. In the end
    it felt silly to shunt write code through a wrapper that was only useful in a
    read-only filesystem. On the other hand, the inconsistency bothers me and
    presumably increases the risks of future changes directly operating on
    filesysem blocks when they need to go through the wrappers. I'd appreciate
    second opinions on this point, in particular.

    Also, would it perhaps be better to split this into separate patches to
    introduce the accessors and add the no-replay code?

    fs/ext3/balloc.c | 2 +-
    fs/ext3/ialloc.c | 2 +-
    fs/ext3/inode.c | 8 ++--
    fs/ext3/resize.c | 2 +-
    fs/ext3/super.c | 123 +++++++++++++++++++++++++++++++----------------
    fs/ext3/xattr.c | 8 ++--
    include/linux/ext3_fs.h | 7 +++
    7 files changed, 99 insertions(+), 53 deletions(-)

    diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
    index da0cb2c..1b42154 100644
    --- a/fs/ext3/balloc.c
    +++ b/fs/ext3/balloc.c
    @@ -145,7 +145,7 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
    if (!desc)
    return NULL;
    bitmap_blk = le32_to_cpu(desc->bg_block_bitmap);
    - bh = sb_getblk(sb, bitmap_blk);
    + bh = ext3_sb_getblk(sb, bitmap_blk);
    if (unlikely(!bh)) {
    ext3_error(sb, __FUNCTION__,
    "Cannot read block bitmap - "
    diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
    index 4f4020c..7de791c 100644
    --- a/fs/ext3/ialloc.c
    +++ b/fs/ext3/ialloc.c
    @@ -60,7 +60,7 @@ read_inode_bitmap(struct super_block * sb, unsigned long block_group)
    if (!desc)
    goto error_out;

    - bh = sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));
    + bh = ext3_sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));
    if (!bh)
    ext3_error(sb, "read_inode_bitmap",
    "Cannot read inode bitmap - "
    diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
    index eb95670..75fcf9e 100644
    --- a/fs/ext3/inode.c
    +++ b/fs/ext3/inode.c
    @@ -364,7 +364,7 @@ static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
    if (!p->key)
    goto no_block;
    while (--depth) {
    - bh = sb_bread(sb, le32_to_cpu(p->key));
    + bh = ext3_sb_bread(sb, le32_to_cpu(p->key));
    if (!bh)
    goto failure;
    /* Reader: pointers */
    @@ -1009,7 +1009,7 @@ struct buffer_head *ext3_getblk(handle_t *handle, struct inode *inode,
    *errp = err;
    if (!err && buffer_mapped(&dummy)) {
    struct buffer_head *bh;
    - bh = sb_getblk(inode->i_sb, dummy.b_blocknr);
    + bh = ext3_sb_getblk(inode->i_sb, dummy.b_blocknr);
    if (!bh) {
    *errp = -EIO;
    goto err;
    @@ -2514,7 +2514,7 @@ static int __ext3_get_inode_loc(struct inode *inode,
    if (!block)
    return -EIO;

    - bh = sb_getblk(inode->i_sb, block);
    + bh = ext3_sb_getblk(inode->i_sb, block);
    if (!bh) {
    ext3_error (inode->i_sb, "ext3_get_inode_loc",
    "unable to read inode block - "
    @@ -2557,7 +2557,7 @@ static int __ext3_get_inode_loc(struct inode *inode,
    if (!desc)
    goto make_io;

    - bitmap_bh = sb_getblk(inode->i_sb,
    + bitmap_bh = ext3_sb_getblk(inode->i_sb,
    le32_to_cpu(desc->bg_inode_bitmap));
    if (!bitmap_bh)
    goto make_io;
    diff --git a/fs/ext3/resize.c b/fs/ext3/resize.c
    index 9397d77..2f9799b 100644
    --- a/fs/ext3/resize.c
    +++ b/fs/ext3/resize.c
    @@ -236,7 +236,7 @@ static int setup_new_group_blocks(struct super_block *sb,
    if (err)
    goto exit_bh;

    - gdb = sb_getblk(sb, block);
    + gdb = ext3_sb_getblk(sb, block);
    if (!gdb) {
    err = -EIO;
    goto exit_bh;
    diff --git a/fs/ext3/super.c b/fs/ext3/super.c
    index 4b9ff65..c333dac 100644
    --- a/fs/ext3/super.c
    +++ b/fs/ext3/super.c
    @@ -64,6 +64,40 @@ static void ext3_unlockfs(struct super_block *sb);
    static void ext3_write_super (struct super_block * sb);
    static void ext3_write_super_lockfs(struct super_block *sb);

    +struct buffer_head *ext3_sb_bread(struct super_block *sb, sector_t block)
    +{
    + sector_t blocknr;
    + journal_t *journal = EXT3_SB(sb)->s_journal;
    + if (journal_translate_block(journal, block, &blocknr))
    + return __bread(journal->j_dev, blocknr, journal->j_blocksize);
    + else
    + return sb_bread(sb, block);
    +}
    +
    +struct buffer_head *ext3_sb_getblk(struct super_block *sb, sector_t block)
    +{
    + sector_t blocknr;
    + journal_t *journal = EXT3_SB(sb)->s_journal;
    + if (journal_translate_block(journal, block, &blocknr))
    + return __getblk(journal->j_dev, blocknr, journal->j_blocksize);
    + else
    + return sb_getblk(sb, block);
    +}
    +
    +void ext3_map_bh(struct buffer_head *bh, struct super_block *sb, sector_t block)
    +{
    + sector_t blocknr;
    + journal_t *journal = EXT3_SB(sb)->s_journal;
    + if (journal_translate_block(journal, block, &blocknr)) {
    + set_buffer_mapped(bh);
    + bh->b_bdev = journal->j_dev;
    + bh->b_blocknr = blocknr;
    + bh->b_size = journal->j_blocksize;
    + } else {
    + map_bh(bh, sb, block);
    + }
    +}
    +
    /*
    * Wrappers for journal_start/end.
    *
    @@ -1328,7 +1362,6 @@ static int ext3_check_descriptors(struct super_block *sb)
    static void ext3_orphan_cleanup (struct super_block * sb,
    struct ext3_super_block * es)
    {
    - unsigned int s_flags = sb->s_flags;
    int nr_orphans = 0, nr_truncates = 0;
    #ifdef CONFIG_QUOTA
    int i;
    @@ -1338,9 +1371,9 @@ static void ext3_orphan_cleanup (struct super_block * sb,
    return;
    }

    - if (bdev_read_only(sb->s_bdev)) {
    - printk(KERN_ERR "EXT3-fs: write access "
    - "unavailable, skipping orphan cleanup.\n");
    + if (sb->s_flags & MS_RDONLY) {
    + printk(KERN_INFO "EXT3-fs: %s: skipping orphan cleanup on "
    + "readonly fs\n", sb->s_id);
    return;
    }

    @@ -1353,11 +1386,6 @@ static void ext3_orphan_cleanup (struct super_block * sb,
    return;
    }

    - if (s_flags & MS_RDONLY) {
    - printk(KERN_INFO "EXT3-fs: %s: orphan cleanup on readonly fs\n",
    - sb->s_id);
    - sb->s_flags &= ~MS_RDONLY;
    - }
    #ifdef CONFIG_QUOTA
    /* Needed for iput() to work correctly and not trash data */
    sb->s_flags |= MS_ACTIVE;
    @@ -1418,7 +1446,6 @@ static void ext3_orphan_cleanup (struct super_block * sb,
    vfs_quota_off(sb, i);
    }
    #endif
    - sb->s_flags = s_flags; /* Restore MS_RDONLY status */
    }

    /*
    @@ -1838,8 +1865,8 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
    goto failed_mount3;
    }

    - /* We have now updated the journal if required, so we can
    - * validate the data journaling mode. */
    + /* We have now updated the journal or mapped unrecovered blocks as
    + * required, so we can validate the data journaling mode. */
    switch (test_opt(sb, DATA_FLAGS)) {
    case 0:
    /* No mode set, assume a default based on the journal
    @@ -1872,8 +1899,8 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
    }
    }
    /*
    - * The journal_load will have done any necessary log recovery,
    - * so we can safely mount the rest of the filesystem now.
    + * The journal_load will have done any necessary log recovery or block
    + * remapping, so we can safely mount the rest of the filesystem now.
    */

    root = ext3_iget(sb, EXT3_ROOT_INO);
    @@ -1907,9 +1934,16 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
    EXT3_SB(sb)->s_mount_state |= EXT3_ORPHAN_FS;
    ext3_orphan_cleanup(sb, es);
    EXT3_SB(sb)->s_mount_state &= ~EXT3_ORPHAN_FS;
    - if (needs_recovery)
    - printk (KERN_INFO "EXT3-fs: recovery complete.\n");
    - ext3_mark_recovery_complete(sb, es);
    +
    + if (needs_recovery) {
    + if (sb->s_flags & MS_RDONLY) {
    + printk(KERN_INFO "EXT3-fs: recovery skipped on "
    + "read-only filesystem.\n");
    + } else {
    + ext3_mark_recovery_complete(sb, es);
    + printk(KERN_INFO "EXT3-fs: recovery complete.\n");
    + }
    + }
    printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
    test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
    test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
    @@ -2110,7 +2144,6 @@ static int ext3_load_journal(struct super_block *sb,
    unsigned int journal_inum = le32_to_cpu(es->s_journal_inum);
    dev_t journal_dev;
    int err = 0;
    - int really_read_only;

    if (journal_devnum &&
    journal_devnum != le32_to_cpu(es->s_journal_dev)) {
    @@ -2120,26 +2153,16 @@ static int ext3_load_journal(struct super_block *sb,
    } else
    journal_dev = new_decode_dev(le32_to_cpu(es->s_journal_dev));

    - really_read_only = bdev_read_only(sb->s_bdev);
    -
    /*
    * Are we loading a blank journal or performing recovery after a
    * crash? For recovery, we need to check in advance whether we
    * can get read-write access to the device.
    */

    - if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER)) {
    - if (sb->s_flags & MS_RDONLY) {
    - printk(KERN_INFO "EXT3-fs: INFO: recovery "
    - "required on readonly filesystem.\n");
    - if (really_read_only) {
    - printk(KERN_ERR "EXT3-fs: write access "
    - "unavailable, cannot proceed.\n");
    - return -EROFS;
    - }
    - printk (KERN_INFO "EXT3-fs: write access will "
    - "be enabled during recovery.\n");
    - }
    + if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER) &&
    + sb->s_flags & MS_RDONLY) {
    + printk(KERN_INFO "EXT3-fs: INFO: skipping recovery "
    + "on readonly filesystem.\n");
    }

    if (journal_inum && journal_dev) {
    @@ -2156,7 +2179,7 @@ static int ext3_load_journal(struct super_block *sb,
    return -EINVAL;
    }

    - if (!really_read_only && test_opt(sb, UPDATE_JOURNAL)) {
    + if (!(sb->s_flags & MS_RDONLY) && test_opt(sb, UPDATE_JOURNAL)) {
    err = journal_update_format(journal);
    if (err) {
    printk(KERN_ERR "EXT3-fs: error updating journal.\n");
    @@ -2166,9 +2189,9 @@ static int ext3_load_journal(struct super_block *sb,
    }

    if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
    - err = journal_wipe(journal, !really_read_only);
    + err = journal_wipe(journal, !(sb->s_flags & MS_RDONLY));
    if (!err)
    - err = journal_load(journal, !really_read_only);
    + err = journal_load(journal, !(sb->s_flags & MS_RDONLY));

    if (err) {
    printk(KERN_ERR "EXT3-fs: error loading journal.\n");
    @@ -2177,15 +2200,17 @@ static int ext3_load_journal(struct super_block *sb,
    }

    EXT3_SB(sb)->s_journal = journal;
    - ext3_clear_journal_err(sb, es);
    + if (!(sb->s_flags & MS_RDONLY)) {
    + ext3_clear_journal_err(sb, es);

    - if (journal_devnum &&
    - journal_devnum != le32_to_cpu(es->s_journal_dev)) {
    - es->s_journal_dev = cpu_to_le32(journal_devnum);
    - sb->s_dirt = 1;
    + if (journal_devnum &&
    + journal_devnum != le32_to_cpu(es->s_journal_dev)) {
    + es->s_journal_dev = cpu_to_le32(journal_devnum);
    + sb->s_dirt = 1;

    - /* Make sure we flush the recovery flag to disk. */
    - ext3_commit_super(sb, es, 1);
    + /* Make sure we flush the recovery flag to disk. */
    + ext3_commit_super(sb, es, 1);
    + }
    }

    return 0;
    @@ -2494,6 +2519,20 @@ static int ext3_remount (struct super_block * sb, int * flags, char * data)
    }

    /*
    + * If we have unrecovered journal transactions hanging
    + * around require a full umount/remount for now.
    + */
    + if (sbi->s_journal->j_bt_hash) {
    + printk(KERN_WARNING "EXT3-fs: %s: couldn't "
    + "remount RDWR because of unrecovered "
    + "journal transactions. Please "
    + "umount/remount instead.\n",
    + sb->s_id);
    + err = -EINVAL;
    + goto restore_opts;
    + }
    +
    + /*
    * Mounting a RDONLY partition read-write, so reread
    * and store the current valid flag. (It may have
    * been changed by e2fsck since we originally mounted
    diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
    index fb89c29..6d0ef6f 100644
    --- a/fs/ext3/xattr.c
    +++ b/fs/ext3/xattr.c
    @@ -226,7 +226,7 @@ ext3_xattr_block_get(struct inode *inode, int name_index, const char *name,
    if (!EXT3_I(inode)->i_file_acl)
    goto cleanup;
    ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
    - bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
    + bh = ext3_sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
    if (!bh)
    goto cleanup;
    ea_bdebug(bh, "b_count=%d, refcount=%d",
    @@ -367,7 +367,7 @@ ext3_xattr_block_list(struct inode *inode, char *buffer, size_t buffer_size)
    if (!EXT3_I(inode)->i_file_acl)
    goto cleanup;
    ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
    - bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
    + bh = ext3_sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
    error = -EIO;
    if (!bh)
    goto cleanup;
    @@ -641,7 +641,7 @@ ext3_xattr_block_find(struct inode *inode, struct ext3_xattr_info *i,

    if (EXT3_I(inode)->i_file_acl) {
    /* The inode already has an extended attribute block. */
    - bs->bh = sb_bread(sb, EXT3_I(inode)->i_file_acl);
    + bs->bh = ext3_sb_bread(sb, EXT3_I(inode)->i_file_acl);
    error = -EIO;
    if (!bs->bh)
    goto cleanup;
    @@ -1213,7 +1213,7 @@ again:
    goto again;
    break;
    }
    - bh = sb_bread(inode->i_sb, ce->e_block);
    + bh = ext3_sb_bread(inode->i_sb, ce->e_block);
    if (!bh) {
    ext3_error(inode->i_sb, __FUNCTION__,
    "inode %lu: block %lu read error",
    diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
    index 36c5403..c5732e1 100644
    --- a/include/linux/ext3_fs.h
    +++ b/include/linux/ext3_fs.h
    @@ -864,6 +864,13 @@ extern void ext3_abort (struct super_block *, const char *, const char *, ...)
    extern void ext3_warning (struct super_block *, const char *, const char *, ...)
    __attribute__ ((format (printf, 3, 4)));
    extern void ext3_update_dynamic_rev (struct super_block *sb);
    +extern struct buffer_head *ext3_sb_bread(struct super_block *sb,
    + sector_t block);
    +extern struct buffer_head *ext3_sb_getblk(struct super_block *sb,
    + sector_t block);
    +extern void ext3_map_bh(struct buffer_head *bh,
    + struct super_block *sb, sector_t block);
    +

    #define ext3_std_error(sb, errno) \
    do { \
    --
    1.5.3.7

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

    On Thu, 6 Mar 2008 01:59:08 +0000 "Duane Griffin" wrote:

    > At present, as discussed in this LKML thread,
    > http://marc.info/?l=linux-kernel&m=117607695406580, when a dirty ext3
    > filesystem is mounted read-only it writes to the disk while replaying the
    > journal log and cleaning up the orphan list. This behaviour may surprise users
    > and can potentially cause data corruption/loss (e.g. if a system is suspended,
    > booted into a different OS, then resumed).
    >
    > This patch series attempts to address this by using a block translation table
    > instead of replaying the journal on a read-only filesystem.
    >
    > Patches 1-3 are independent cleanups/bug-fixes for things I came across while
    > working on this. They could be submitted separately and are not required for
    > following patches.
    >
    > Patch 4 is a refactoring change that simplifies the code prior to later
    > substantive changes.
    >
    > Patch 5 introduces the translation table and support for a truly read-only
    > journal into jbd.
    >
    > Patch 6 uses the facility introduced in patch 5 to add support for true
    > read-only ext3.


    I'll grab the first three for now, thanks.

    Someone(tm) should do the jbd2 versions..
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

    On 06/03/2008, Andrew Morton wrote:
    > I'll grab the first three for now, thanks.


    Great, thanks!

    > Someone(tm) should do the jbd2 versions..


    I'll do them tonight, unless someone beats me to it.

    Cheers,
    Duane.

    --
    "I never could learn to drink that blood and call it wine" - Bob Dylan
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

    Hi Duane,

    Thanks for doing this. Some perhaps not so obvious fallout from the bad
    old way of doing things is that ddnap (zumastor) hits an issue in
    replication. Since ddsnap allows journal replay on the downstream
    server and also needs to have an unaltered snapshot to apply deltas
    against, if we do not take special care, Ext3 will come along and
    modify the downstream snapshot even when told not to. Our solution:
    take two snapshots per replication cycle (pretty cheap) so that one can
    be clean and the other can be stepped on at will by the journal replay.
    Ugh.

    With your hack, we can eventually drop the double snapshot, provided no
    other filesystem is similarly badly behaved.

    Re your page translation table: we already have a page translation
    table, it is called the page cache. If you could figure out which file
    (or metadata) each journal block belongs to, you could just load the
    page table pages back in and presto, done. No need to replay the
    journal at all, you are already back to journal+disk = consistent
    state.

    I probably have missed a detail or two since I haven't looked closely at
    how orphan inodes work, revokes, probably other things, but there is
    the basic idea. SCT, does my reasoning hold water? (In fact,
    ddsnap "replays" its own journal in exactly this way. Cache state is
    reconstructed and no actual journal flush is performed.)

    Anyway, this is just a theoretical comment, it is in no way a suggestion
    for a rewrite. The reason for that being, you do not have any
    convenient way to map physical journal blocks back to files and
    metadata. Maybe if we do implement reverse mapping for Ext3/4 later
    (not just a pipe dream) we could revisit this and lose your extra
    mapping. As it stands your solution seems well built, after a quick
    readthrough. Nice looking code. I think you added about 250 lines
    overall, so tight too. Thanks again.

    Daniel
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

    On 13/03/2008, Daniel Phillips wrote:
    > Hi Duane,
    >
    > Thanks for doing this. Some perhaps not so obvious fallout from the bad
    > old way of doing things is that ddnap (zumastor) hits an issue in
    > replication. Since ddsnap allows journal replay on the downstream
    > server and also needs to have an unaltered snapshot to apply deltas
    > against, if we do not take special care, Ext3 will come along and
    > modify the downstream snapshot even when told not to. Our solution:
    > take two snapshots per replication cycle (pretty cheap) so that one can
    > be clean and the other can be stepped on at will by the journal replay.
    > Ugh.


    Ah, good to know, thanks. It looks like you folks are doing some
    interesting things there.

    > With your hack, we can eventually drop the double snapshot, provided no
    > other filesystem is similarly badly behaved.


    Excellent, I'm really pleased to hear that.

    > Re your page translation table: we already have a page translation
    > table, it is called the page cache. If you could figure out which file
    > (or metadata) each journal block belongs to, you could just load the
    > page table pages back in and presto, done. No need to replay the
    > journal at all, you are already back to journal+disk = consistent
    > state.


    Hmm, interesting. I'll have to have a think about that, thanks.

    > mapping. As it stands your solution seems well built, after a quick
    > readthrough. Nice looking code. I think you added about 250 lines
    > overall, so tight too. Thanks again.


    Thanks very much, I appreciate it!

    Cheers,
    Duane.

    --
    "I never could learn to drink that blood and call it wine" - Bob Dylan
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread