Making existing process efficient - Unix

Thread: Making existing process efficient

  1. Making existing process efficient

    Problem
    --------
    I have a file in which there are millions of records. I need to read
    those records and output records to different
    files depending on the value of fields in the record. There would be
    several records that would end up getting
    inserted into the same file due to similar value of fields in the
    records.

    Current scenario
    ----------------
    Our process runs on a multiprocessor machine. Currently it does things
    in the simplest possible way. It opens several output streams (as and
    when required) and keeps writing records to the streams one by one. It
    appends records when the corresponding stream is already open.

    I am _trying_ to do things efficiently here. It's important to
    preserve the ordering of records in the generated
    files i.e. if record A1 is prior to record A2 in the input file, then
    A1 should still be prior to A2 in the result
    file (if they land up in the same file).

    I have thought of some ways -

    a) Keep writing N (a sufficiently large number of) records to an
    internal buffer (B1). Fork a child process and open a
    pipe with the child. The child would inherit the buffer (B1) from the
    parent. The child would sort the records by the stream they would be
    written to. It would then write the records to the destination files.
    It would keep reporting over the pipe the last record it had written
    successfully. It would exit after writing the last record. This would
    give the advantage of writing records belonging to the same file in
    one go. This would reduce the number of seeks that would have
    otherwise been caused by random access of the files.

    The parent would continue writing the next set of N records to another
    internal buffer (B2). It would wait for the child
    process to complete before forking again. When the child has finished,
    it would rename B2 as B1 and fork another
    process. In case the child crashes abruptly, the parent would write the
    records that were not written successfully from buffer B1 to the
    output files.

    b) (Complex approach) Let the parent write N records to an internal
    buffer (B1) and fork a child process C1. It
    would continue writing N more records to buffer B2 and fork another
    child process. It would open a pipe with both
    the child processes.

    The child process C1 would sort the internal buffer B1 by the output
    stream. It would then lock all the output files
    before writing. It would communicate to the parent that it has locked
    all the files. It would then write and relinquish the
    locks one by one as it finishes with each file.

    The child process C2 would sort the internal buffer B2 by the output
    stream. It would wait for the parent to
    communicate that C1 has locked all the files. C2 would then attempt
    to grab the locks on the files and would write to them as it acquires
    them.

    I think approach (b) is unduly complex (and that is without even
    mentioning crash recovery of the child) and may or may not bring any
    additional performance benefit. Only measurements would tell me how
    much performance gain I get with either approach. (I feel that I/O is
    probably the bottleneck.)

    Is there any other approach that could help me here?

    Thanks..
    P.S. - I am on FreeBSD 4.10

  2. Re: Making existing process efficient

    Kelvin Moss wrote:
    > Problem
    > --------
    > I have a file in which there are millions of records. I need to read
    > those records and output records to different
    > files depending on the value of fields in the record. There would be
    > several records that would end up getting
    > inserted into the same file due to similar value of fields in the
    > records.
    >
    > Current scenario
    > ----------------
    > Our process runs on a multiprocessor machine. Currently it does things in
    > the simplest possible way. It opens several
    > output streams (as and when required) and keeps writing records to the
    > streams one by one. It appends records when the corresponding stream
    > is already open.


    Unless the decision of which output file (files?)
    should receive a record is incredibly complicated, the
    task will be completely I/O-bound. Implication: Fancy
    schemes to use the CPU more efficiently or to apply the
    power of additional CPU's/cores will help very little.

    That said, for a really large amount of data it may
    be worth while avoiding the need to copy everything from
    kernel space out to user space and back to kernel space
    again, as in a simple read()/write() scheme. Consider
    using mmap() instead, with madvise() to tell the virtual
    memory system that your accesses will be sequential. At
    the very least, use mmap() for the input file even if
    you continue to use write() for the outputs.
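
    To make that concrete, here is a rough, untested sketch of the input
    side; "input.dat" is a made-up name and the record-routing logic is
    left as a comment:

    /* Minimal sketch: map the whole input read-only and hint that the
     * access pattern is sequential.  "input.dat" is a placeholder name. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("input.dat", O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror("input.dat");
            return 1;
        }

        char *base = mmap(NULL, (size_t)st.st_size, PROT_READ,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Tell the VM system we will walk the file front to back. */
        madvise(base, (size_t)st.st_size, MADV_SEQUENTIAL);

        /* Scan the records in place between base and base + st.st_size
         * and route each one to its output file; no read() buffer needed. */

        munmap(base, (size_t)st.st_size);
        close(fd);
        return 0;
    }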

    > I have thought of some ways -
    >
    > a) Keep writing N (sufficiently large) number of records to an
    > internal buffer (B1). Fork a child process and open a
    > pipe with the child. The child would inherit the buffer (B1) from
    > parent. Child would sort the records by the stream where they would be
    > written. It would then write the records to the destination file. It
    > would keep writing to pipe the last record it could write
    > successfully. It would exit after writing the last record. This would
    > give the advantage of writing records belonging to the same file in
    > one go. This would reduce the number of seeks that would have
    > otherwise been caused by random access of files.
    >
    > Parent would continue writing the next set of N records to another
    > internal buffer (B2). It would wait for the child
    > process to complete before forking again. When the child has finished
    > it would rename B2 as B1 and fork another
    > process. In case the child crashes abruptly, parent would write the
    > unsuccessful records from buffer B1 to output
    > files.


    "It is a capital mistake to theorize before one has
    data," and I lack data. Still, it seems to me that this
    is much too involved for any benefit it might bring --
    indeed, the overhead of all those fork() calls might more
    than offset any gain.

    If you're using mmap() for output, there's no need
    for any of this. Just mmap() the buffers to their output
    files, copy records into them, and let the virtual memory
    system take care of flushing the pages to disk. As I
    mentioned before, madvise() may help the VM system make
    good choices.

    If you're using write() for output, the only delay
    you incur in the write() call itself is the time to copy
    the data from user space to kernel space; unless you've
    explicitly asked for data synchronization, the file system
    will perform the physical writes at some later time. If
    delays on the output side really are a concern, you could
    use multiple writer threads to gain parallelism and keep
    the input side running more or less unimpeded. This is
    likely to be a good deal more efficient than using
    multiple processes, especially a process-per-write()!

    If you're using write(), changing to writev() could
    save some time. With write() you'd copy data from the
    input buffer to an output buffer and thence to the kernel,
    while with writev() you could copy a whole bunch of
    discontiguous records straight from the input buffer to
    the kernel, thus saving one copy step. For "millions of
    records," the savings might be worth while. Of course,
    it requires that you not re-fill the input buffer until
    all the output streams are no longer interested in any
    of its content, which may complicate things a little if
    you're using multiple writer threads. "It stands to
    reason" (that is, "trust, but verify") that the performance
    of mmap()/writev() should be better than mmap()/copy/write(),
    but probably not quite as good as mmap()/mmap()/copy.
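
    For concreteness, a tiny untested sketch of the gathering write; the
    offsets and lengths are invented:

    /* Sketch: push three records that are scattered through the (possibly
     * mmap()ed) input buffer to one output file in a single system call.
     * The offsets and lengths here are invented for illustration. */
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    ssize_t flush_three(int out_fd, char *in_buf)
    {
        struct iovec iov[3];

        iov[0].iov_base = in_buf +   0;  iov[0].iov_len = 80;
        iov[1].iov_base = in_buf + 240;  iov[1].iov_len = 75;
        iov[2].iov_base = in_buf + 990;  iov[2].iov_len = 82;

        /* One trip into the kernel, no copy into a private output buffer. */
        return writev(out_fd, iov, 3);
    }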

    The big take-aways:

    - You're almost certainly I/O bound, so schemes to
    optimize the CPU are optimizing the wrong thing.

    - And yet, with "millions of records" the copy steps
    may be worth a look. Plain read()/copy/write()
    moves each byte three times; using mmap() on the
    input eliminates one move; using mmap() or writev()
    on the output eliminates another.

    - If more parallelism is needed, use multiple threads
    rather than multiple processes. The synchronization
    issues are the same, but the overheads are lower.

    - Along with mmap(), use madvise() to inform the
    system of the access patterns for your buffers.

    - And, as always: Measure, measure, measure!

    --
    Eric.Sosman@sun.com

  3. Re: Making existing process efficient

    In article ,
    James Antill wrote:

    > On Wed, 05 Dec 2007 11:38:31 -0800, Kelvin Moss wrote:
    >
    > > Problem
    > > --------
    > > I have a file in which there are millions of records. I need to read
    > > those records and output records to different files depending on the
    > > value of fields in the record. There would be several records that would
    > > end up getting inserted into the same file due to similar value of
    > > fields in the records.
    > >
    > > Current scenario
    > > ----------------
    > > Our process runs on a multiprocessor machine. Currently it does things in the
    > > simplest possible way. It opens several output streams (as and when
    > > required) and keeps writing records to the streams one by one. It
    > > appends records when the corresponding stream is already open.
    > >
    > > I am _trying_ to do things efficiently here. It's important to preserve
    > > the ordering of records in the generated files i.e. if record A1 is
    > > prior to record A2 in the input file, then A1 should still be prior to
    > > A2 in the result file (if they land up in the same file).

    >
    > . Create X children
    > . Try read Y records, until EOF
    > . Find next free child[1]
    > . Give read records to child, along with start/end record number
    > . Child writes data into child specific files, along with record number.
    > . Mergesort child specific files into real ones, via record number.


    Sort? The data is already in order; it seems unlikely that using a
    sorting algorithm could possibly be more efficient than what he's
    already doing.

    --
    Barry Margolin, barmar@alum.mit.edu
    Arlington, MA
    *** PLEASE post questions in newsgroups, not directly to me ***
    *** PLEASE don't copy me on replies, I'll read them in the group ***

  4. Re: Making existing process efficient

    On Dec 5, 12:32 pm, Eric Sosman wrote:
    > Kelvin Moss wrote:
    > The big take-aways:
    >
    > - You're almost certainly I/O bound, so schemes to
    > optimize the CPU are optimizing the wrong thing.
    >
    > - And yet, with "millions of records" the copy steps
    > may be worth a look. Plain read()/copy/write()
    > moves each byte three times; using mmap() on the
    > input eliminates one move; using mmap() or writev()
    > on the output eliminates another.
    >
    > - If more parallelism is needed, use multiple threads
    > rather than multiple processes. The synchronization
    > issues are the same, but the overheads are lower.
    >
    > - Along with mmap(), use madvise() to inform the
    > system of the access patterns for your buffers.
    >
    > - And, as always: Measure, measure, measure!


    Thanks Eric. This all makes a lot of sense. Here are a few questions -

    1. I had thought about mmap()ing the input file. Unfortunately the file
    in question is very big (> 4 GB). I believe it won't help much to mmap
    the file on a 32-bit system.
    2. I don't use writev, but as I understand it, writev makes
    writing more efficient if the records to be written are scattered. If
    I sort my data then I won't have the problem of scattered inputs.
    Would writev still bring me any advantages?
    3. I am thinking that opening too many files in one single
    process is probably also slowing down the I/O. So how about this approach -
    Let the parent hash the output file descriptors into M buckets and
    accordingly write to M buffers. Once a buffer is full, it would fork a
    child process that would sort the records based on fd and write to the
    files. It would effectively cut down the number of open file
    descriptors per process, and disk seeks too (by sorting on fd).

    Your comments would definitely help me.

    Thanks ..







  5. Re: Making existing process efficient

    Kelvin Moss wrote:
    > On Dec 5, 12:32 pm, Eric Sosman wrote:
    >> Kelvin Moss wrote:
    >> The big take-aways:
    >>
    >> - You're almost certainly I/O bound, so schemes to
    >> optimize the CPU are optimizing the wrong thing.
    >>
    >> - And yet, with "millions of records" the copy steps
    >> may be worth a look. Plain read()/copy/write()
    >> moves each byte three times; using mmap() on the
    >> input eliminates one move; using mmap() or writev()
    >> on the output eliminates another.
    >>
    >> - If more parallelism is needed, use multiple threads
    >> rather than multiple processes. The synchronization
    >> issues are the same, but the overheads are lower.
    >>
    >> - Along with mmap(), use madvise() to inform the
    >> system of the access patterns for your buffers.
    >>
    >> - And, as always: Measure, measure, measure!

    >
    > Thanks Eric. This all makes a lot of sense. Here are a few questions -
    >
    > 1. I had thought about mmaping the input file. Unfortunately the file
    > in question is very big (> 4 GB). I believe it won't help much to mmap
    > the file on a 32 bit system.


    Map the first chunk of the file, process it, unmap it,
    map the next chunk, process it, unmap it, lather, rinse,
    repeat. Choice of chunk size is up to you.
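
    An untested sketch of that loop; CHUNK and process_chunk() are made-up
    names, and a record that straddles a window boundary needs extra care:

    /* Sketch: walk a file that is too big to map at once in fixed-size
     * windows.  CHUNK and process_chunk() are illustrative names only. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define CHUNK (64UL * 1024 * 1024)   /* 64 MB windows, page-aligned */

    static void process_chunk(const char *p, size_t len)
    {
        /* route the records found in p[0..len) to their output files */
        (void)p; (void)len;
    }

    int main(void)
    {
        int fd = open("input.dat", O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror("input.dat");
            return 1;
        }

        for (off_t off = 0; off < st.st_size; off += CHUNK) {
            size_t len = (st.st_size - off > (off_t)CHUNK)
                       ? CHUNK : (size_t)(st.st_size - off);
            char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
            if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            madvise(p, len, MADV_SEQUENTIAL);
            process_chunk(p, len);   /* a record crossing the boundary must
                                        be carried over to the next window
                                        or handled specially */
            munmap(p, len);
        }
        close(fd);
        return 0;
    }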

    > 2. I don't use writev but as I understand writev probably makes
    > writing more efficient if the records to be written are scattered. If
    > I sort my data then I won't have the problem of scattered inputs.
    > Would writev still bring me any advantages?


    You keep talking about sorting your data, but in the
    original post you said

    >>> It's important to
    >>> preserve the ordering of records in the generated
    >>> files i.e. if record A1 is prior to record A2 in the
    >>> input file, then A1 should still be prior to A2 in the
    >>> result file (if they land up in the same file).


    ... so I don't understand why any sorting is necessary or
    desirable. Unless I've misunderstood, you're taking in one
    long stream of records and splitting them between multiple
    output files, preserving their original order. Maybe you
    have an input file listing every United States citizen,
    sorted by name, and you want to produce fifty-something
    output files also sorted by name, one for each State or
    territory. If that's not close to what you're attempting,
    describe your intentions more fully.

    Anyhow, the important thing about this framework is
    that you don't need to rearrange the records, just route
    each to the proper output (or outputs). You could do this
    by visiting each input record in turn, still in the input
    buffer, and doing a write() to the chosen file (or files).
    But with a large number of records that will be a lot of
    trips in and out of the kernel, and it's probably a good
    idea to try to write more than one record per trip.

    First idea: Set aside a buffer area for each output
    file, copy each record to its proper buffer(s) as you
    encounter it, and do a write() whenever a buffer fills up.
    Advantages: Fewer round-trips to the kernel, simple to code.
    Drawbacks: Every record gets copied at least twice, once
    to "marshal" into its buffer, and a second time to send
    the data to the kernel's file system cache.
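
    A rough, untested sketch of this first idea; BUFSZ and the struct are
    invented, and short writes are glossed over:

    /* Sketch of the first idea: one staging buffer per output file,
     * flushed with write() when it fills.  BUFSZ and the struct layout
     * are invented; a record is assumed to be no larger than BUFSZ. */
    #include <sys/types.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BUFSZ (256 * 1024)

    struct outbuf {
        int    fd;            /* open descriptor for this output file */
        size_t used;          /* bytes currently staged */
        char   buf[BUFSZ];
    };

    static void flush_out(struct outbuf *o)
    {
        if (o->used > 0) {
            if (write(o->fd, o->buf, o->used) != (ssize_t)o->used)
                perror("write");   /* real code must handle short writes */
            o->used = 0;
        }
    }

    static void put_record(struct outbuf *o, const char *rec, size_t len)
    {
        if (o->used + len > BUFSZ)            /* no room left: flush first */
            flush_out(o);
        memcpy(o->buf + o->used, rec, len);   /* the extra copy mentioned above */
        o->used += len;
    }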

    Second idea: Use writev() instead of write(), so you
    can send all the Alaska records to the Alaska output
    directly from the input buffer, without copying. Advantage:
    Fewer round-trips to the kernel, simple to code, only one
    copy. Drawbacks: Must remember to *do* the pending writev()
    calls before changing the mapping of (or reading more data
    into) the input buffer.
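
    Again a rough, untested sketch; MAXIOV and the struct are invented,
    and IOV_MAX is the real per-call ceiling:

    /* Sketch of the second idea: queue pointers into the input buffer per
     * output file and flush each queue with one writev().  MAXIOV and the
     * struct are invented; keep MAXIOV at or below IOV_MAX. */
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MAXIOV 64

    struct outq {
        int          fd;
        int          cnt;
        struct iovec iov[MAXIOV];
    };

    static void flush_q(struct outq *q)
    {
        if (q->cnt > 0) {
            if (writev(q->fd, q->iov, q->cnt) < 0)
                perror("writev");  /* real code must handle short writes */
            q->cnt = 0;
        }
    }

    static void queue_record(struct outq *q, char *rec, size_t len)
    {
        if (q->cnt == MAXIOV)              /* iovec array full */
            flush_q(q);
        q->iov[q->cnt].iov_base = rec;     /* points INTO the input buffer */
        q->iov[q->cnt].iov_len  = len;
        q->cnt++;
    }

    /* Every outq must be flushed before the input buffer is unmapped or
     * refilled, or the queued iovecs will point at vanished data. */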

    Third idea: Like the first, but with the output buffers
    mmap()'ed to their output files so all you need to do is
    copy the record to the buffer and let the virtual memory
    system do its magic; no write() or writev(). Advantages:
    Fewest round-trips to the kernel, only one copy. Drawbacks:
    Using mmap() while a file changes size is a little more work.
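
    One untested way to handle the growing file; EXTENT and the names are
    invented, and the descriptor must be opened O_RDWR:

    /* Sketch of the third idea: extend the output file with ftruncate(),
     * map the new extent, and memcpy() records into it.  EXTENT and the
     * names are invented; the fd must have been opened O_RDWR. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define EXTENT (1UL * 1024 * 1024)   /* grow 1 MB at a time (page-aligned) */

    struct mout {
        int    fd;
        char  *win;       /* currently mapped window, or NULL */
        size_t used;      /* bytes filled in the window */
        off_t  filepos;   /* file offset where the window starts */
    };

    static int grow(struct mout *m)
    {
        if (m->win != NULL) {             /* retire the filled window */
            munmap(m->win, EXTENT);
            m->filepos += EXTENT;
        }
        if (ftruncate(m->fd, m->filepos + EXTENT) < 0)   /* extend the file */
            return -1;
        m->win = mmap(NULL, EXTENT, PROT_READ | PROT_WRITE, MAP_SHARED,
                      m->fd, m->filepos);
        m->used = 0;
        if (m->win == MAP_FAILED) {
            m->win = NULL;
            return -1;
        }
        return 0;
    }

    static void put_record(struct mout *m, const char *rec, size_t len)
    {
        while (len > 0) {                 /* split records across windows */
            if (m->win == NULL || m->used == EXTENT) {
                if (grow(m) < 0) { perror("grow"); return; }
            }
            size_t room = EXTENT - m->used;
            size_t n = (len < room) ? len : room;
            memcpy(m->win + m->used, rec, n);   /* VM system flushes later */
            m->used += n;
            rec += n;
            len -= n;
        }
    }

    /* When done, ftruncate() the file back to filepos + used so it is not
     * left zero-padded out to the last extent boundary. */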

    > 3. I am thinking that probably opening too many files in one single
    > process is also slowing down the I/O.


    How many is "too many?" I'm not familiar with the limits
    of FreeBSD, but Solaris and Linux can handle tens of thousands
    of open files simultaneously without trouble. How many do
    you need?

    > So how about this approach -
    > Let the parent hash the output file descriptors into M buckets and
    > accordingly write to M buffers. Once a buffer is full, it would fork a
    > child process that would sort the records based on fd and write to
    > files. It would effectively cut down the number of open file
    > descriptors per process and disk seeks too (by sorting on fd).


    This idea of forking seems to fascinate you in a way I
    can only think of as unhealthy. Besides, it now sounds like
    you need to copy every record at least twice: Once to dump
    it into one of the M buffers, and again when the child "sorts"
    (that word, again) the records.

    As for avoiding disk seeks -- Well, I've been assuming
    that the input and output are files, not raw devices. If
    so, the file system takes care of managing the physical I/O,
    combining multiple writes to adjacent areas into single
    physical writes, optimizing the seek patterns, and so on.
    You can exercise some control over the FS behavior from the
    application level, but it's a weakish sort of control, kind
    of like pushing on a rope.

    > Your comments would definitely help me.


    More background on what you're trying to do would help me.
    What are these millions of records, how big are they, how
    many output files are there, why all this talk about sorting?
    (And why do you think fork() is free?)

    --
    Eric.Sosman@sun.com

  6. Re: Making existing process efficient

    On Dec 7, 3:14 pm, Eric Sosman wrote:
    > > 1. I had thought about mmaping the input file. Unfortunately the file
    > > in question is very big (> 4 GB). I believe it won't help much to mmap
    > > the file on a 32 bit system.

    >
    > Map the first chunk of the file, process it, unmap it,
    > map the next chunk, process it, unmap it, lather, rinse,
    > repeat. Choice of chunk size is up to you.


    OK, so I would need to keep track of the bytes read through the mapped
    region since I would not get any EOF-like error, right?


    > > 2. I don't use writev but as I understand writev probably makes
    > > writing more efficient if the records to be written are scattered. If
    > > I sort my data then I won't have the problem of scattered inputs.
    > > Would writev still bring me any advantages?


    > You keep talking about sorting your data, but in the
    > original post you said
    >
    > >>> It's important to
    > >>> preserve the ordering of records in the generated
    > >>> files i.e. if record A1 is prior to record A2 in the
    > >>> input file, then A1 should still be prior to A2 in the
    > >>> result file (if they land up in the same file).

    >
    > ... so I don't understand why any sorting is necessary or
    > desirable. Unless I've misunderstood, you're taking in one
    > long stream of records and splitting them between multiple
    > output files, preserving their original order. Maybe you
    > have an input file listing every United States citizen,
    > sorted by name, and you want to produce fifty-something
    > output files also sorted by name, one for each State or
    > territory. If that's not close to what you're attempting,
    > describe your intentions more fully.


    That's more or less correct. Let me give a simplified example (using
    the first two fields as the criterion for the output file) - say the
    input file contains records like

    A1 B1 C1 D1
    A1 B2 C1 D2
    A1 B1 C3 D4
    A1 B3 C2 D4
    A1 B3 C6 C7

    So at the end, we would have output files like -
    A1_B1.xtn
    ---------
    A1 B1 C1 D1
    A1 B1 C3 D4

    A1_B2.xtn
    ---------
    A1 B2 C1 D2

    A1_B3.xtn
    ---------
    A1 B3 C2 D4
    A1 B3 C6 C7



    > Anyhow, the important thing about this framework is
    > that you don't need to rearrange the records, just route
    > each to the proper output (or outputs). You could do this
    > by visiting each input record in turn, still in the input
    > buffer, and doing a write() to the chosen file (or files).
    > But with a large number of records that will be a lot of
    > trips in and out of the kernel, and it's probably a good
    > idea to try to write more than one record per trip.


    True

    > First idea: Set aside a buffer area for each output
    > file, copy each record to its proper buffer(s) as you
    > encounter it, and do a write() whenever a buffer fills up.
    > Advantages: Fewer round-trips to the kernel, simple to code.
    > Drawbacks: Every record gets copied at least twice, once
    > to "marshal" into its buffer, and a second time to send
    > the data to the kernel's file system cache.


    I do not know in advance the number of files that are going to
    get opened. Hence that would mean resorting to malloc/new, which I
    would like to avoid if I can. The overhead of dynamic memory
    management for so many buffers would probably outweigh the benefits of
    this approach.

    > Second idea: Use writev() instead of write(), so you
    > can send all the Alaska records to the Alaska output
    > directly from the input buffer, without copying. Advantage:
    > Fewer round-trips to the kernel, simple to code, only one
    > copy. Drawbacks: Must remember to *do* the pending writev()
    > calls before changing the mapping of (or reading more data
    > into) the input buffer.


    Seems like a good approach to me. Another issue I have is that the
    legacy code uses C++ iostream for doing I/O. Would there be any issue
    in mixing writev with legacy code in such a case?


    > Third idea: Like the first, but with the output buffers
    > mmap()'ed to their output files so all you need to do is
    > copy the record to the buffer and let the virtual memory
    > system do its magic; no write() or writev(). Advantages:
    > Fewest round-trips to the kernel, only one copy. Drawbacks:
    > Using mmap() while a file changes size is a little more work.


    A good approach, but it again requires resorting to dynamic memory
    management (aggravated when the file size changes). I guess it's hard
    to get the best of both worlds :-)


    > > 3. I am thinking that probably opening too many files in one single
    > > process is also slowing down the I/O.

    >
    > How many is "too many?" I'm not familiar with the limits
    > of FreeBSD, but Solaris and Linux can handle tens of thousands
    > of open files simultaneously without trouble. How many do
    > you need?


    I am expecting some 5000 file handles. I am not sure about the BSD
    limits, but going by what you say it should not become a bottleneck.


    > > So how about this approach -
    > > Let the parent hash the output file descriptors into M buckets and
    > > accordingly write to M buffers. Once a buffer is full, it would fork a
    > > child process that would sort the records based on fd and write to
    > > files. It would effectively cut down the number of open file
    > > descriptors per process and disk seeks too (by sorting on fd).

    >
    > This idea of forking seems to fascinate you in a way I
    > can only think of as unhealthy. Besides, it now sounds like
    > you need to copy every record at least twice: Once to dump
    > it into one of the M buffers, and again when the child "sorts"
    > (that word, again) the records.
    > As for avoiding disk seeks -- Well, I've been assuming
    > that the input and output are files, not raw devices. If


    That's correct.

    > so, the file system takes care of managing the physical I/O,
    > combining multiple writes to adjacent areas into single
    > physical writes, optimizing the seek patterns, and so on.


    Any idea if older Unices (like FreeBSD 4.x) also employ such
    optimizations?

    > You can exercise some control over the FS behavior from the
    > application level, but it's a weakish sort of control, kind
    > of like pushing on a rope.


    OK

    > > Your comments would definitely help me.

    >
    > More background on what you're trying to do would help me.
    > What are these millions of records, how big are they, how
    > many output files are there, why all this talk about sorting?
    > (And why do you think fork() is free?)


    I have tried to give some background on the problem. The records are
    essentially user data records where each record's size is variable
    (unlike the example that I gave above). I agree with you that it's
    essentially an I/O-bound process. But we don't seem to make any use of
    the multiple processors even though they're available. Indeed fork
    isn't free, but it might pay off with multiple processors around.

    Thanks again, your comments would definitely help.





  7. Re: Making existing process efficient

    In article
    <1b3b4120-20dd-4a14-acdf-00f45a09ca21@a35g2000prf.googlegroups.com>,
    Kelvin Moss wrote:

    > On Dec 7, 3:14 pm, Eric Sosman wrote:
    > > so, the file system takes care of managing the physical I/O,
    > > combining multiple writes to adjacent areas into single
    > > physical writes, optimizing the seek patterns, and so on.

    >
    > Any idea if older Unices (like FreeBSD 4.x) also employ such
    > optimizations?


    I think most OSes in the past 15-20 years are similar.

    > > You can exercise some control over the FS behavior from the
    > > application level, but it's a weakish sort of control, kind
    > > of like pushing on a rope.

    >
    > OK
    >
    > > > Your comments would definitely help me.

    > >
    > > More background on what you're trying to do would help me.
    > > What are these millions of records, how big are they, how
    > > many output files are there, why all this talk about sorting?
    > > (And why do you think fork() is free?)

    >
    > I have tried to give some background on the problem. The records are
    > essentially user data records where each record's size is variable
    > (unlike the example that I gave above). I agree with you that it's
    > essentially an I/O-bound process. But we don't seem to make any use of
    > the multiple processors even though they're available. Indeed fork
    > isn't free, but it might pay off with multiple processors around.


    Your application is totally sequential, so there's not really any way to
    benefit from multiple processors at this level.

    Behind the scenes the kernel will make use of multiprocessors when
    actually performing the disk I/O. When your process calls write(), the
    data will simply be copied to a kernel buffer, and any available CPU
    will eventually perform the actual write to the disk. There may also be
    prefetching, so while your CPU is processing one record, another CPU may
    start the disk read for the next record, and it will already be in the
    kernel buffer by the time your application looks for it.

    --
    Barry Margolin, barmar@alum.mit.edu
    Arlington, MA
    *** PLEASE post questions in newsgroups, not directly to me ***
    *** PLEASE don't copy me on replies, I'll read them in the group ***
