Why is there difference between sum sizes of individual files and output by du? - Linux


  1. Why is there difference between sum sizes of individual files and output by du?

    Hi All,

    This is what I have noticed. When I manually add the sizes of
    individual files in a directory the sum is different than what is
    produced by du on the same directory.

    ( /home/Prasad/programs/c/lsp/IPC/message_queue/ contains no
    directories within)
    [Prasad@prasadjoshi my_du]$ ls -al
    /home/Prasad/programs/c/lsp/IPC/message_queue/
    total 32
    drwxrwxr-x 2 Prasad Prasad 4096 Jan 1 18:59 .
    drwxrwxr-x 4 Prasad Prasad 4096 Jan 8 18:47 ..
    -rw-rw-r-- 1 Prasad Prasad 971 Jan 1 18:59 first_receiver.c
    -rw-rw-r-- 1 Prasad Prasad 1505 Jan 1 18:58 first_sender.c
    -rwxrwxr-x 1 Prasad Prasad 5642 Jan 1 15:09 rcv
    -rwxrwxr-x 1 Prasad Prasad 6115 Jan 1 15:09 send

    971 + 1505 + 5642 + 6115 = 14233
    14233 / 1024 = 13

    [Prasad@prasadjoshi my_du]$ du
    /home/Prasad/programs/c/lsp/IPC/message_queue
    28 /home/Prasad/programs/c/lsp/IPC/message_queue

    Why is there so much of difference?

    Thanks and regards,
    Prasad.


  2. Re: Why is there difference between sum sizes of individual files and output by du?

    Prasad wrote:
    > Hi All,
    >
    > This is what I have noticed. When I manually add the sizes of
    > individual files in a directory the sum is different than what is
    > produced by du on the same directory.
    >
    > ( /home/Prasad/programs/c/lsp/IPC/message_queue/ contains no
    > directories within)
    > [Prasad@prasadjoshi my_du]$ ls -al
    > /home/Prasad/programs/c/lsp/IPC/message_queue/
    > total 32
    > drwxrwxr-x 2 Prasad Prasad 4096 Jan 1 18:59 .
    > drwxrwxr-x 4 Prasad Prasad 4096 Jan 8 18:47 ..
    > -rw-rw-r-- 1 Prasad Prasad 971 Jan 1 18:59 first_receiver.c
    > -rw-rw-r-- 1 Prasad Prasad 1505 Jan 1 18:58 first_sender.c
    > -rwxrwxr-x 1 Prasad Prasad 5642 Jan 1 15:09 rcv
    > -rwxrwxr-x 1 Prasad Prasad 6115 Jan 1 15:09 send
    >
    > 971 + 1505 + 5642 + 6115 = 14233
    > 14233 / 1024 = 13
    >
    > [Prasad@prasadjoshi my_du]$ du
    > /home/Prasad/programs/c/lsp/IPC/message_queue
    > 28 /home/Prasad/programs/c/lsp/IPC/message_queue
    >
    > Why is there so much of difference?


    Educated guess: your version of du prints in units of 512 bytes rather
    than in units of 1024 bytes, as the standard du would do.

    --
    Josef Möllers (Pinguinpfleger bei FSC)
    If failure had no penalty success would not be a prize
    -- T. Pratchett


  3. Re: Why is there difference between sum sizes of individual files and output by du?

    "Prasad" writes:

    > Hi All,
    >
    > This is what I have noticed. When I manually add the sizes of
    > individual files in a directory the sum is different than what is
    > produced by du on the same directory.


    Because du sums the space used on disk, not the size of the contents.
    Since the file system is expected to behave with good time
    performance, some space is used to optimize time.

    For example, we allocate disk space in units of blocks instead of
    units of bytes, and we write the list of blocks allocated into some
    other (meta) blocks.
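
The distinction between content size and space used can be observed directly. The following is a minimal sketch, assuming a Linux-like system where Python's `os.stat` exposes `st_blocks` (counted in 512-byte units); the helper name `apparent_and_allocated` and the 971-byte test file are illustrative choices, not part of the original post. The allocated figure depends on the filesystem, so no particular value is claimed for it.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, allocated size) in bytes for one file."""
    st = os.stat(path)
    # st_size is what "ls -l" shows; st_blocks counts 512-byte units
    # actually allocated, which is what du adds up.
    return st.st_size, st.st_blocks * 512


# Write a 971-byte file (the size of first_receiver.c in the listing).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 971)
    path = f.name

apparent, allocated = apparent_and_allocated(path)
print("apparent:", apparent, "allocated:", allocated)
os.remove(path)
```

On a filesystem with 4096-byte blocks the allocated value would typically be 4096, but some filesystems store very small files differently.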

    The first means that when a file's size is not an exact multiple of
    the block size, it actually consumes some extra bytes, up to the end
    of its final block. On average half a block is lost per file (if file
    sizes were random). In the directory listing you give, the block size
    seems to be 4096, and we have 6 entries of sizes
    (+ 4096 4096 971 1505 5642 6115) = 22425 ; don't forget the directories!
    but the number of blocks used is (+ 1 1 1 1 2 2) = 8, for a total size of
    (* 8 4096) = 32768 bytes.

    The second means that in addition to these blocks, each file will need
    (in this case, only) one additional meta block, for a total of 14
    blocks and 57344 bytes.


    You could design a file system where files use only their own bytes
    plus a small number of bytes for their metadata (names, access
    rights, etc.), for example by writing the files sequentially on the
    disk as if it were a tape. But of course, accessing files would be
    extremely slow, since you'd have to read (on average) half of the
    whole disk to find them. You could do some optimization in RAM,
    keeping a cache and the list of blocks of the files already found;
    perhaps it could even be usable, but it would be slower than the
    current designs. Append and deletion would also be problematic: to
    append, you'd have to copy the file to the end of the disk; to
    delete, you could mark the file deleted, but then you'd have trouble
    recovering the space. Either file creation would have to scan the
    whole disk for free space (and lose the end of the free blocks!), or
    you'd have to implement a very slow compacting procedure, copying all
    the files to the beginning of the disk. What a cost, just to spare a
    few bytes!


    Some file systems try to avoid too much overhead, for example,
    reiserfs will write the final bytes of the files together with those
    of other files in a single block (or meta-block) to avoid too much
    overhead. But there are still quite a number of meta-blocks needed.

    --
    __Pascal Bourguignon__ http://www.informatimago.com/

    "Logiciels libres : nourris au code source sans farine animale."

  4. Re: Why is there difference between sum sizes of individual files and output by du?

    On 10 Jan 2007 02:57:52 -0800 Prasad wrote:
    | Hi All,
    |
    | This is what I have noticed. When I manually add the sizes of
    | individual files in a directory the sum is different than what is
    | produced by du on the same directory.
    |
    | ( /home/Prasad/programs/c/lsp/IPC/message_queue/ contains no
    | directories within)
    | [Prasad@prasadjoshi my_du]$ ls -al
    | /home/Prasad/programs/c/lsp/IPC/message_queue/
    | total 32
    | drwxrwxr-x 2 Prasad Prasad 4096 Jan 1 18:59 .
    | drwxrwxr-x 4 Prasad Prasad 4096 Jan 8 18:47 ..
    | -rw-rw-r-- 1 Prasad Prasad 971 Jan 1 18:59 first_receiver.c
    | -rw-rw-r-- 1 Prasad Prasad 1505 Jan 1 18:58 first_sender.c
    | -rwxrwxr-x 1 Prasad Prasad 5642 Jan 1 15:09 rcv
    | -rwxrwxr-x 1 Prasad Prasad 6115 Jan 1 15:09 send
    |
    | 971 + 1505 + 5642 + 6115 = 14233
    | 14233 / 1024 = 13
    |
    | [Prasad@prasadjoshi my_du]$ du
    | /home/Prasad/programs/c/lsp/IPC/message_queue
    | 28 /home/Prasad/programs/c/lsp/IPC/message_queue
    |
    | Why is there so much of difference?

    Note the "4096" you get for "." and ".." indicating the block size.

    ( 971 + 4095 ) / 4096 = 1 (actual 4K blocks occupied by first_receiver.c)
    ( 1505 + 4095 ) / 4096 = 1 (actual 4K blocks occupied by first_sender.c)
    ( 5642 + 4095 ) / 4096 = 2 (actual 4K blocks occupied by rcv)
    ( 6115 + 4095 ) / 4096 = 2 (actual 4K blocks occupied by send)
    directory itself = 1 (actual 4K blocks occupied by the directory)
    1 + 1 + 2 + 2 + 1 = 7 (total 4K blocks occupied)
    7 * 4096 = 28672 (total bytes in 7 4K blocks)
    28672 / 1024 = 28 (total K's occupied)

    Looks to me like du gave you the correct answer for a filesystem with 4K
    units of allocation. The difference is the waste from having larger size
    allocation blocks for lots of smaller files. It's a tradeoff between waste
    of space and I/O performance. Matching I/O blocks and system page size is
    often the optimal choice.
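
The arithmetic above can be replayed as a short sketch. It assumes a 4096-byte allocation unit (inferred from the directory entries, not confirmed) and takes the file sizes from the original listing; the name `blocks_needed` is only illustrative.

```python
BLOCK = 4096  # assumed allocation unit, inferred from the "." and ".." sizes


def blocks_needed(size, block=BLOCK):
    """Round a byte count up to whole blocks (integer ceiling division)."""
    return (size + block - 1) // block


# File sizes taken from the "ls -al" output in the original post.
sizes = {
    "first_receiver.c": 971,
    "first_sender.c": 1505,
    "rcv": 5642,
    "send": 6115,
}

file_blocks = sum(blocks_needed(s) for s in sizes.values())
total_blocks = file_blocks + 1             # plus one block for the directory
total_kib = total_blocks * BLOCK // 1024   # du's 1 KiB units

print(total_blocks, total_kib)  # prints: 7 28
```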

    --
    |---------------------------------------/----------------------------------|
    | Phil Howard KA9WGN (ka9wgn.ham.org) / Do not send to the address below |
    | first name lower case at ipal.net / spamtrap-2007-01-10-0645@ipal.net |
    |------------------------------------/-------------------------------------|

  5. Re: Why is there difference between sum sizes of individual files and output by du?

    Pascal Bourguignon wrote:
    > "Prasad" writes:
    >
    > > Hi All,
    > >
    > > This is what I have noticed. When I manually add the sizes of
    > > individual files in a directory the sum is different than what is
    > > produced by du on the same directory.

    >
    > Because du sums the space used on disk, not the size of the contents.
    > Since the file system is expected to behave with good time
    > performance, some space is used to optimize time.
    >
    >
    > Some file systems try to avoid too much overhead, for example,
    > reiserfs will write the final bytes of the files together with those
    > of other files in a single block (or meta-block) to avoid too much
    > overhead. But there are still quite a number of meta-blocks needed.
    >


    I have a stunning example of that: I copied the FreeBSD ports
    collection onto a Linux machine with an ext3 filesystem. To my great
    surprise, du reports 30% more space used on Linux than on FreeBSD. I
    assume the explanation is that the ports collection consists of a
    large number of small files, and while Linux ext2 and FreeBSD UFS are
    mostly similar filesystems, there is one difference: UFS uses
    fragments where ext2 doesn't, which allows better packing of small files.
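
To give a rough feel for why a smaller allocation granularity helps with small files, here is a simplified sketch. It assumes a hypothetical 1024-byte fragment size and a 4096-byte block size, neither taken from the machines described above, and it reuses the four file sizes from the original post purely as sample data. Real UFS allocation is more involved; this only models rounding each file up to the allocation unit.

```python
def allocated(size, unit):
    """Bytes consumed when a file is rounded up to a multiple of unit."""
    return -(-size // unit) * unit  # ceiling division, then scale back up


# Sample sizes borrowed from the first post in the thread.
sizes = [971, 1505, 5642, 6115]

whole_blocks = sum(allocated(s, 4096) for s in sizes)    # no fragments
with_fragments = sum(allocated(s, 1024) for s in sizes)  # assumed 1 KiB fragments

print(whole_blocks, with_fragments)  # prints: 24576 15360
```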


    --

    Michel TALON


  6. Re: Why is there difference between sum sizes of individual files and output by du?

    On Jan 10, 2:47 pm, t...@lpthe.jussieu.fr (Michel Talon) wrote:
    > Pascal Bourguignon wrote:
    > > "Prasad" writes:


    > > > This is what I have noticed. When I manually add the sizes of
    > > > individual files in a directory the sum is different than what is
    > > > produced by du on the same directory.

    >
    > > Because du sums the space used on disk, not the size of the contents.
    > > Since the file system is expected to behave with good time
    > > performance, some space is used to optimize time.

    >
    > > Some file systems try to avoid too much overhead, for example,
    > > reiserfs will write the final bytes of the files together with those
    > > of other files in a single block (or meta-block) to avoid too much
    > > overhead. But there are still quite a number of meta-blocks needed.

    > I have a stunning example of that, i have copied the FreeBSD ports
    > collection on a Linux machine with ext3 filesystem. To my great surprise
    > du reports 30% more occupation on Linux than on FreeBSD. I assume that
    > the explanation is that the ports collection consists in a large number
    > of small files, and while Linux ext2 and FreeBSD UFS share a mostly
    > similar filesystem, there is a difference, UFS uses fragments when ext2
    > doesn't, which allows better packing of small files.


    For the record, sparse files may also affect the way du shows sizes.

    dd if=/dev/zero of=big count=1 seek=1000000000 => creates a file with an
    apparent size of about 477 GiB (1000000000 blocks of 512 bytes, plus one)
    du big => reports only a few kilobytes.
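
The same effect can be reproduced on a smaller scale. This is a sketch assuming a Unix-like system and a filesystem that supports sparse files; the 100 MiB hole size and the name `make_sparse` are arbitrary choices for illustration.

```python
import os
import tempfile

HOLE = 100 * 1024 * 1024  # 100 MiB gap left unwritten (arbitrary choice)


def make_sparse(path, hole=HOLE):
    """Seek past a hole, write one byte, return (apparent, allocated) bytes."""
    with open(path, "wb") as f:
        f.seek(hole)       # skip ahead without writing anything
        f.write(b"\0")     # one real byte at the very end
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks is in 512-byte units


tmpdir = tempfile.mkdtemp()
sparse_path = os.path.join(tmpdir, "big")
apparent, allocated = make_sparse(sparse_path)
print("apparent:", apparent, "allocated:", allocated)
os.remove(sparse_path)
os.rmdir(tmpdir)
```

Where sparse files are supported, the allocated figure should be far smaller than the apparent size, which is the gap between `ls -l` and `du`.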

    --
    Pierre


  7. Re: Why is there difference between sum sizes of individual files and output by du?

    Josef Moellers wrote:
    >
    >Educated guess: your version of du prints in units of 512 bytes rather
    >than in units of 1024 bytes, as the standard du would do.


    I guess this depends on your definition of "standard". The POSIX
    definition and Open Group Base Specification definition of "du" default to
    units of 512 bytes.

    GNU du, as shipped by most Linux distributions, defaults to units of
    1024 bytes (as if "-k" were given) unless POSIXLY_CORRECT is set.
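
To make the unit question concrete, here is a small sketch converting the same allocation into both unit sizes. The 28672-byte figure is the 7 x 4096 total worked out earlier in the thread; the function name `du_units` is made up for this example.

```python
def du_units(allocated_bytes, unit):
    """Express an allocated byte count in du-style units, rounding up."""
    return -(-allocated_bytes // unit)  # ceiling division


allocated_bytes = 7 * 4096  # seven 4 KiB blocks, as computed in the thread

print(du_units(allocated_bytes, 1024))  # prints: 28 (1024-byte units)
print(du_units(allocated_bytes, 512))   # prints: 56 (POSIX 512-byte units)
```

Read the other way, 28 units of 512 bytes would be 14336 bytes, which is close to the 14233-byte sum of the file contents; that coincidence may be why the 512-byte guess looked plausible here.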
    --
    Tim Roberts, timr@probo.com
    Providenza & Boekelheide, Inc.
