consistency - Storage

  1. consistency

    For a custom database implementation, I'm generating 128-byte
    pre-image log records for sections of a page. I need to pair each
    record's data with its file offset (4 bytes).

    For this problem, assume:
    1) 512-byte or larger binary disk sectors (very reasonably portable)
    2) meta-data won't need updating (I precommit file extensions)
    3) the file is precommitted with invalid addresses
    5) data written to the file is aligned on disk sector size
    6) OS file cache buffers are bypassed (i.e. O_DIRECT)

    To maintain consistency my log records look like:
    unsigned long address1;
    char data[128];
    unsigned long address2;

    I surround the log data with the address, giving me 136-byte log
    records. I assume that if both addresses match, then I'm guaranteed
    that the data in the middle is completely consistent (providing there
    is no outside corruption). Is this assumption correct?

    I feel it's safe to make this assumption. The reason: a log record will
    reside on, at most, 2 disk sectors. If the first part of the log record
    is written but not the second part, then address2 won't match. If the
    last part of the log record is written but not the first part, then
    address1 won't match. The only time a match occurs is if both ends of
    the log record are on disk. This means that the entire log record
    (including data) made it to disk, giving me the consistency.
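The layout and check described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the fields use fixed-width uint32_t (on many 64-bit platforms unsigned long is 8 bytes, which would give a 144-byte record rather than 136), and NULL_ADDRESS, its all-ones value, and the function names are hypothetical rather than taken from the post.

```c
#include <stdint.h>
#include <string.h>

#define LOG_DATA_SIZE 128
#define NULL_ADDRESS  0xFFFFFFFFu   /* assumed "invalid address" marker */

struct log_record {
    uint32_t address1;              /* file offset, leading copy  */
    char     data[LOG_DATA_SIZE];   /* pre-image bytes            */
    uint32_t address2;              /* file offset, trailing copy */
};

/* The post relies on the record being exactly 136 bytes. */
_Static_assert(sizeof(struct log_record) == 136, "unexpected padding");

/* Build a record with the same address on both sides of the data. */
static void log_record_fill(struct log_record *r, uint32_t addr,
                            const void *data)
{
    r->address1 = addr;
    memcpy(r->data, data, LOG_DATA_SIZE);
    r->address2 = addr;
}

/* Returns 1 if both addresses are present and equal, else 0.
   Only the two ends of the record are inspected, never the data. */
static int log_record_looks_complete(const struct log_record *r)
{
    return r->address1 != NULL_ADDRESS && r->address1 == r->address2;
}
```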

    So.. again.. is this assumption correct? If not, what will break it? -
    as I clearly don't understand file systems as well as I thought I did.


  2. Re: consistency

    From what I know, any single-sector update is atomic on modern disks.
    Multi-sector updates are not guaranteed to be. So it is better to keep a
    "generation" count at the beginning of each sector, and keep this count
    the same for all sectors participating in a multi-sector write (a
    database page or the like). Before writing, increment the counter in all
    sectors, then write them all. If a later read shows different counters in
    these sectors, then the record or page was damaged by a disk drive or
    power failure.

    At least this is how NTFS works.
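A rough C sketch of the generation-count idea, working on an in-memory buffer. It is a simplified illustration, not the actual NTFS on-disk format; the 512-byte sector size, the 4-byte counter placed at the start of each sector, and the function names are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Stamp the same generation count into the first 4 bytes of each sector
   of a multi-sector page, before the page is written out. */
static void stamp_generation(unsigned char *buf, size_t nsectors,
                             uint32_t gen)
{
    for (size_t i = 0; i < nsectors; i++)
        memcpy(buf + i * SECTOR_SIZE, &gen, sizeof gen);
}

/* After reading the page back: returns 1 if every sector carries the
   same generation count, 0 if the counts differ (a torn write). */
static int generations_match(const unsigned char *buf, size_t nsectors)
{
    uint32_t first, cur;

    if (nsectors == 0)
        return 1;
    memcpy(&first, buf, sizeof first);
    for (size_t i = 1; i < nsectors; i++) {
        memcpy(&cur, buf + i * SECTOR_SIZE, sizeof cur);
        if (cur != first)
            return 0;
    }
    return 1;
}
```

The counter occupies payload space in every sector, so the usable page size shrinks by 4 bytes per sector in this layout.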

    BTW - I don't think O_DIRECT will allow non-sector-aligned IO on the file.
    At least its Windows counterpart does not allow this. Windows just locks
    these pages into an MDL structure, then splits this MDL into several child
    MDLs according to the file's run list, and sends all of them down the disk
    stack in parallel.

    I don't know about UNIXen, but I'd expect them to do the same in this case.

    --
    Maxim Shatskih, Windows DDK MVP
    StorageCraft Corporation
    maxim@storagecraft.com
    http://www.storagecraft.com


  3. Re: consistency

    lindahlb@hotmail.com wrote:

    ....

    > I surround the log data with the address, giving me 136-byte log
    > records. I assume that if both addresses match, then I'm guaranteed
    > that the data in the middle is completely consistent (providing there
    > is no outside corruption). Is this assumption correct?


    Technically, no - but it may well be adequate.

    >
    > I feel it's safe to make this assumption. The reason: a log record will
    > reside on, at most, 2 disk sectors. If the first part of the log record
    > is written but not the second part, then address2 won't match. If the
    > last part of the log record is written but not the first part, then
    > address1 won't match. The only time a match occurs is if both ends of
    > the log record are on disk. This means that the entire log record
    > (including data) made it to disk, giving me the consistency.
    >
    > So.. again.. is this assumption correct? If not, what will break it? -
    > as I clearly don't understand file systems as well as I thought I did.


    1. Obviously, should one of the two sectors fail to be written but just
    happen to contain the address value you're looking for, you'll have a
    garbage log that looks good.

    2. Another (*very* low probability) possibility is that you'll get a
    partial sector write - not because the disk didn't finish writing the
    sector (most modern disks will, even if power fails), but because the
    failing power caused RAM or bus errors (undetected by the disk) that
    corrupted the end of the sector transfer. I've seen first-person reports
    of such occurrences from multiple people, though they appear to be very
    rare.

    Now suppose the disk also happened to write the two sectors out of
    order. Disks sometimes optimize writes by starting with the first of the
    target sectors that's writable after the head stabilizes and finishing
    up at the end of the disk revolution, though for just two sectors this
    should be rather improbable. But they also revector bad sectors, and if
    the first of the two sectors was so revectored it could easily wind up
    being written out of order. If the later-written sector (the first of
    the log record) was then corrupted as described, there could be garbage
    within the log record even if both addresses checked out.

    Processors are powerful enough these days that including a CRC for the
    entire log record shouldn't be unreasonably expensive - and it validates
    the entire record (at least to the probability of just happening to
    match the CRC in the same way described in point 1 above, but you can
    make the CRC as long as you want to minimize that).
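As an illustration of the CRC suggestion, here is a small C sketch that seals a log record with a CRC-32 over the address and data. The record layout (one address followed by the data and a trailing CRC), the field widths, and the function names are assumptions made for the example; the CRC itself is the common reflected CRC-32 (polynomial 0xEDB88320) computed bit by bit, which is simple but slower than a table-driven version.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_bytes(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical record layout: the CRC covers everything before it. */
struct crc_log_record {
    uint32_t address;
    char     data[128];
    uint32_t crc;
};

/* Compute and store the CRC just before the record is written. */
static void crc_record_seal(struct crc_log_record *r)
{
    r->crc = crc32_bytes(r, offsetof(struct crc_log_record, crc));
}

/* During recovery: returns 1 if the stored CRC matches the contents. */
static int crc_record_valid(const struct crc_log_record *r)
{
    return r->crc == crc32_bytes(r, offsetof(struct crc_log_record, crc));
}
```

Unlike a pair of bracketing addresses, this check covers the data bytes in the middle of the record as well as its ends.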

    Other things to watch out for when logging include stumbling upon
    unwritten data during recovery that just happens to look like valid log
    data. This is especially likely if you're reusing space by treating the
    log as a ring buffer, so that 'unwritten' space will in fact be space
    containing obsolete log records. In that case, when the log record size
    is fixed, you should at a minimum ensure that the total ring buffer size
    is not an integral multiple of the log record size - but that only works
    when you have the luxury of batching up log writes such that each
    exactly fills an integral number of sectors. A CRC over the rest of the
    record helps minimize this probability as well.

    If you write multiple log records at a time, it significantly increases
    the probability that a situation like that described in point 2 above
    might leave a 'hole' in the middle of a long log write. The normal
    mechanisms you use to detect the end of the log should safely stop you
    at the start of the hole, but if another quick failure occurred such
    that the new portion of the log ended within the sectors that were
    written before the *first* failure - and which therefore look like valid
    this-pass sectors - you could be in trouble on the second recovery. One
    way to guard against this is, during recovery, to clear out a run of
    sectors just after the end of the log, equal in size to the longest log
    write you ever perform.
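The clearing step suggested above could look something like the following C sketch, shown on an in-memory image of a ring-buffer log. The zero fill pattern, the 512-byte sector size, the wraparound handling, and the function name are assumptions for illustration; a real implementation would write the cleared sectors to the log file and sync them before accepting new log writes.

```c
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Clear max_write_sectors sectors starting at end_sector (the first
   sector past the last valid log data), wrapping around the ring.
   max_write_sectors is the size of the longest log write ever issued. */
static void clear_after_log_end(unsigned char *log, size_t log_sectors,
                                size_t end_sector,
                                size_t max_write_sectors)
{
    for (size_t i = 0; i < max_write_sectors && i < log_sectors; i++) {
        size_t s = (end_sector + i) % log_sectors;
        memset(log + s * SECTOR_SIZE, 0, SECTOR_SIZE);
    }
}
```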

    Logs are what make ensuring the integrity (and sometimes availability
    as well) of the rest of the system easy. But ensuring the integrity
    (and if applicable availability) of the log itself is anything but.

    - bill

  4. Re: consistency

    > 1. Obviously, should one of the two sectors fail to be written but
    > just happen to contain the address value you're looking for, you'll
    > have a garbage log that looks good.


    I guess I wasn't quite clear enough about condition 3. This is
    impossible because I split the log file into 'extents' (of customizable
    and usually large size, 1MB+). When a log buffer sync would reach or go
    beyond the next extent, that extent is first filled with null addresses
    (followed by an fsync, so this also prevents metadata problems if the OS
    file size isn't large enough). Essentially, when recycling the log file,
    log record space is precommitted with null addresses. Under this
    assumption we protect against case 1, correct?

    > failing power caused RAM or bus errors (undetected by the disk) that
    > corrupted the end of the sector transfer


    So in this scenario, given the address '1' and data 'logdata' and
    corruption '*', and after an out-of-order write of sector 2 before
    sector 1, with a power failure while writing sector 2, causing RAM to
    be corrupted in the second half of the sector buffer, you could see
    something like this?

    sector 1: 1lo**
    sector 2: ata1

    It sounds like the only way to prevent this would be a CRC, correct?
    But it sounds like this is very rare and can be ignored for all but the
    most critical of data (the user can specify to the database what is
    'critical')?

    > From what I know, any single-sector update is atomic on modern disks.
    > As about multi sector - it is not guaranteed so. So, it is better to
    > keep some "generation" count in the beginning of the sectors, and
    > maintain this count always the same for all sectors participating in a
    > multi-sector write (a database page or such). Before writing, increment
    > the counter in all sectors, then write them all. If the future reads
    > will show different counters in these sectors - then the record or page
    > is damaged by disk drive or power failure.

    This is a good idea. It should help detect outside corruption, so that
    the database can be restored from a backup or through user
    intervention. However, I think it is unnecessary for other types of
    problems.

    > I don't think O_DIRECT will allow non-sector-aligned IO on the file.


    To be more clear about this, I keep a log record ring buffer that is
    512-byte aligned. The buffer is committed to the log file in 512-byte
    blocks and if a partial block needs to be written, the rest of it is
    filled with null addresses. So we often write out many sectors at once.
    However, as I've stated above, outside corruption (RAM/bus) aside,
    out-of-order writes are not a problem, because of precommitting null
    addresses - so I don't think this introduces any problems.
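The padding of a partial block described above might be sketched in C like this. It assumes the null address is the all-ones value 0xFFFFFFFF (so the tail can be filled byte by byte), a 512-byte sector, and a buffer whose capacity is a multiple of the sector size; the function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Round `used` bytes up to a whole number of sectors and fill the tail
   with 0xFF bytes, i.e. with all-ones "null address" words.
   Returns the padded length, which is what gets written to the log. */
static size_t pad_to_sector(unsigned char *buf, size_t used)
{
    size_t padded = (used + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;

    memset(buf + used, 0xFF, padded - used);
    return padded;
}
```

With 136-byte records, three records occupy 408 bytes, so the remaining 104 bytes of the sector would be filled with the null pattern before the write.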

