In coarse level parallelisation, how to handle output? - Linux



Thread: In coarse level parallelisation, how to handle output?

  1. In coarse level parallelisation, how to handle output?

    Hi all,

    If my non-parallel executable needs to write its output to a data
    file, what happens if I run 100 independent copies of it, using MPI
    as a dispatcher? The simplest way is to change my program so that
    each copy writes to a different output file, say data001 through
    data100.

    Then I have to post-process all the data files myself.

    Is there a way they can all write into one file, by appending? Maybe
    there will be conflicts, I guess...



  2. Re: In coarse level parallelisation, how to handle output?

    Mike wrote:
    > Hi all,
    >
    > If my non-parallel executable needs to write its output to a data
    > file, what happens if I run 100 independent copies of it, using MPI
    > as a dispatcher? The simplest way is to change my program so that
    > each copy writes to a different output file, say data001 through
    > data100.
    >
    > Then I have to post-process all the data files myself.
    >
    > Is there a way they can all write into one file, by appending?


    Yes, but that's a lot more difficult than just writing into separate
    files. You'll have to make sure that your updates to the file don't
    stomp on each other. To put it in concrete terms, if two processes
    each have, say, 5000 bytes to write to one file, and if they both do
    it at once, there's a very good chance that one of them will write
    only part of the 5000 bytes before the other writes some. You will
    end up with one process's output arbitrarily interleaved with others'.
    It's just not a good way to go.

    How to combine the separate outputs depends on the type of output,
    though. For instance, if all 100 processes are trying to find
    members of a set, have each process write the members it finds into
    its own output file, then when all are done, concatenate the files
    and (if duplicates are possible) eliminate duplicates. That's
    pretty trivial. For example, if you
    had 100 processes trying to find all the prime numbers between
    1 and 1,000,000, you might have one process handling the range
    1 to 10,000, another process handling the range 10,001 to 20,000,
    and so on. So that would be really easy to merge together in the
    end.

    Of course, with other types of output, things won't be as easy,
    or maybe they will. It all depends on the type of output you are
    going to have.

    For what it's worth, in Unix it is trivial to concatenate 100 files
    together. If you have file00 through file99, you just do this:

    cat file* > combined-output

    If you want to sort them and eliminate duplicates on a line-by-line
    basis, that's also trivially easy in Unix:

    sort -u file* > combined-output-without-duplicates

    By the way, starting 4 separate threads for this same basic subject
    is bordering on excessive...

    - Logan

  3. Re: In coarse level parallelisation, how to handle output?

    Mike wrote:
    > Hi all,
    >
    > If my non-parallel executable needs to write its output to a data
    > file, what happens if I run 100 independent copies of it, using MPI
    > as a dispatcher? The simplest way is to change my program so that
    > each copy writes to a different output file, say data001 through
    > data100.
    >
    > Then I have to post-process all the data files myself.
    >
    > Is there a way they can all write into one file, by appending? Maybe
    > there will be conflicts, I guess...


    It depends on how much control you have over the calls to write().
    From man 2 open:

    O_APPEND
    The file is opened in append mode. Before each
    write, the file pointer is positioned at the end of
    the file, as if with lseek. O_APPEND may lead to
    corrupted files on NFS file systems if more than
    one process appends data to a file at once. This
    is because NFS does not support appending to a
    file, so the client kernel has to simulate it,
    which can't be done without a race condition.


    --
    Josef Möllers (penguin keeper at FSC)
    If failure had no penalty success would not be a prize
    -- T. Pratchett


  4. Re: In coarse level parallelisation, how to handle output?

    "Logan Shaw" wrote:
    >
    > By the way, starting 4 separate threads for this same basic
    > subject is bordering on excessive...
    >

    Hey, it's about parallelisation, no? :-)

    SCNR
    Martin



  5. Re: In coarse level parallelisation, how to handle output?

    Mike wrote:
    > Hi all,
    >
    > If my non-parallel executable needs to write its output to a data
    > file, what happens if I run 100 independent copies of it, using MPI
    > as a dispatcher? The simplest way is to change my program to write


    There is something called MPI-IO, the parallel I/O interface added
    in the MPI-2.0 spec (thanks to IBM). Do some googling for:
    - MPI_File_open ( MPI_COMM_WORLD...
    - MPI_File_set_view ( ...
    - MPI_File_write ( ...
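
    A minimal sketch of how those calls fit together, assuming each rank
    writes a fixed-size chunk at its own offset in one shared file (the
    chunk size and file name are illustrative):

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 5000   /* bytes each rank contributes; illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Fill this rank's chunk with its own output. */
    char buf[CHUNK];
    memset(buf, 0, sizeof buf);
    snprintf(buf, sizeof buf, "output of rank %d\n", rank);

    /* All ranks open the same file collectively. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "combined-output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Give each rank a disjoint window starting at rank * CHUNK ... */
    MPI_File_set_view(fh, (MPI_Offset)rank * CHUNK, MPI_CHAR, MPI_CHAR,
                      "native", MPI_INFO_NULL);

    /* ... so the concurrent writes cannot stomp on each other. */
    MPI_File_write(fh, buf, CHUNK, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```

    Run with e.g. `mpirun -np 100 ./a.out`; every rank's bytes land in
    its own region of the file, with no interleaving and no
    post-processing merge step.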

    --
    Best regards
    Mateusz Pabis
