
Thread: sort, forced to use mostly disk IO, how?

  1. sort, forced to use mostly disk IO, how?

    I'm likely going to need to sort a very large file soon, on the order of
    60 GB (3 billion x 21-byte records), on a Solaris 9 machine with plenty
    of disk space but only 8 GB of memory. Since it's already known that
    this data won't fit into memory (in fact, won't even come close), it
    would be nice to be able to tell sort to do a merge sort, or something
    similar, and so spare everything else in the system from being forced
    into swap.

    sort has an -S parameter to restrict "swap-based memory" usage. It is
    not clear to me exactly what the quoted term means. If -S is set to
    1 GB, for instance, will sort use just 1 GB of memory total, or will it
    use ALL of physical memory plus 1 GB of swap space?

    Is Solaris's sort the appropriate tool for this job, or is there another
    (free) sort around which is better optimized for a disk intensive sort?
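
    If sort itself won't cooperate, a fallback would be to do the external
    merge by hand: split the file, sort the pieces, and merge them with
    sort -m. A rough sketch, assuming newline-terminated 21-byte records, a
    sort that accepts -T for the temporary-file directory, and a
    hypothetical disk-backed scratch area /scratch:

      # Split into ~2.1 GB chunks; 2100000000 is a multiple of 21,
      # so no record is ever cut in half.
      split -b 2100000000 bigfile /scratch/chunk.

      # Sort each chunk on its own; each fits comfortably in RAM.
      for f in /scratch/chunk.??; do
          sort -T /scratch -o $f.sorted $f && rm $f
      done

      # One merge pass over the sorted chunks (sequential I/O only).
      sort -m -T /scratch -o bigfile.sorted /scratch/chunk.??.sorted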

    Thanks,

    David Mathog

  2. Re: sort, forced to use mostly disk IO, how?

    David Mathog wrote:
    > I'm likely going to need to sort a very large file soon, on the order of
    > 60 GB (3 billion x 21-byte records), on a Solaris 9 machine with plenty
    > of disk space but only 8 GB of memory. Since it's already known that
    > this data won't fit into memory (in fact, won't even come close), it
    > would be nice to be able to tell sort to do a merge sort, or something
    > similar, and so spare everything else in the system from being forced
    > into swap.
    >
    > sort has an -S parameter to restrict "swap-based memory" usage. It is
    > not clear to me exactly what the quoted term means. If -S is set to
    > 1 GB, for instance, will sort use just 1 GB of memory total, or will it
    > use ALL of physical memory plus 1 GB of swap space?
    >
    > Is Solaris's sort the appropriate tool for this job, or is there another
    > (free) sort around which is better optimized for a disk intensive sort?
    >
    > Thanks,
    >
    > David Mathog


    If you want to finish this in your own lifetime, I'd suggest trying to
    run this in parallel on as many machines as you can devote to the task.
    Do something like pick out all the A's and have machine 1 sort them,
    pick out all the B's and have machine 2 sort them, etc.
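
    Something like this one-pass bucketing would do the pick-out step. The
    paths are made up, and it assumes the sort key starts in column one and
    a C locale, so that bucket order matches sort order:

      # One pass over the input, one bucket file per leading byte of
      # the key.  (Watch awk's open-file limit for large alphabets.)
      awk '{ print > ("/scratch/bucket." substr($0, 1, 1)) }' bigfile

      # After each machine sorts its bucket, concatenating the sorted
      # buckets in key order already gives a fully sorted file: every
      # record in bucket.A precedes every record in bucket.B, etc.
      cat /scratch/bucket.?.sorted > bigfile.sorted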

    The I/O will make you or break you! Read BIG blocks and write BIG blocks.

    You might want to have a look at "Sorting and Searching" by Donald E.
    Knuth; it's volume three of "The Art of Computer Programming". IIRC
    there's a lot of theory and some practical advice.

    I've sorted a few things but never a job anywhere near the magnitude of
    this one.


  3. Re: sort, forced to use mostly disk IO, how?

    On Nov 18, 9:14 pm, "Richard B. Gilbert"
    wrote:
    >
    > If you want to finish this in your own lifetime, I'd suggest trying to
    > run this in parallel on as many machines as you can devote to the task.
    > Do something like pick out all the A's and have machine 1 sort them,
    > pick out all the B's and have machine 2 sort them, etc.
    >


    It's probably worth remembering that sort's ancestors ran on machines
    with rather small amounts of memory. It worked then.

  4. Re: sort, forced to use mostly disk IO, how?

    David Mathog wrote:

    > sort has an -S parameter to restrict "swap-based memory" usage. It is
    > not clear to me exactly what the quoted term means. If -S is set to
    > 1 GB, for instance, will sort use just 1 GB of memory total, or will it
    > use ALL of physical memory plus 1 GB of swap space?


    So I ran a smaller (but still largish) sort as a test with the
    parameter "-S 3000000", and top showed this:

      PID USERNAME LWP PRI NICE SIZE   RES  STATE   TIME    CPU COMMAND
    14135 root       1   0   15  10G 4162M  run    10:18 99.81% sort

    Which doesn't quite make sense under either interpretation of -S.
    The machine has 8G of physical memory and 24G of swap. 8 + 3 is 11,
    not 10, and 3M kilobytes (about 3G) isn't 4162M. Maybe
    8 + 3 - delta = 11 - delta, which rounds down to 10G?
    If that's what is happening then -S isn't going to do the trick.

    Thanks,

    David Mathog

  5. Re: sort, forced to use mostly disk IO, how?

    Tim Bradshaw wrote:
    > On Nov 18, 9:14 pm, "Richard B. Gilbert"
    > wrote:
    >
    >>If you want to finish this in your own lifetime, I'd suggest trying to
    >>run this in parallel on as many machines as you can devote to the task.
    >>Do something like pick out all the A's and have machine 1 sort them,
    >>pick out all the B's and have machine 2 sort them, etc.
    >
    > It's probably worth remembering that sort's ancestors ran on machines
    > with rather small amounts of memory. It worked then.


    It's probably worth remembering that sort's ancestors were not asked to
    sort 60,000,000,000 records!


  6. Re: sort, forced to use mostly disk IO, how?

    On Nov 19, 6:51 pm, "Richard B. Gilbert"
    wrote:

    > It's probably worth remembering that sort's ancestors were not asked to
    > sort 60,000,000,000 records!


    (3 billion, not 60 billion.) The point I was trying to make is that
    there's a long history (a very long history!) of sort routines which
    could cope with sorting data sets much larger than physical memory and
    this history includes sort's direct ancestors. There's no particular
    reason to believe that sort will not deal reasonably well with large
    data sets even today.

    There's a separate question of algorithmic complexity etc, but this is
    largely distinct.

    Basically what I'm trying to say is that there is no reason to think
    that sort will just drive the machine into catastrophic paging rather
    than doing what it always used to do, namely make reasonably smart use
    of temporary files (it strikes me that, if it still uses /tmp, this
    might be an issue!). It may still be impractical to sort this much
    data, but possibly not because sort is naive.
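
    On that last point: Solaris mounts /tmp on tmpfs, which is backed by
    swap, so letting sort spill 60 GB of run files there would defeat the
    purpose. If the sort supports -T (or honors TMPDIR), the temporaries
    can be pointed at real disk; a one-line sketch, with /scratch standing
    in for a hypothetical large disk-backed filesystem:

      # Keep sort's temporary run files off swap-backed /tmp.
      sort -T /scratch -o bigfile.sorted bigfile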

  7. Re: sort, forced to use mostly disk IO, how?

    Tim Bradshaw wrote:
    > On Nov 19, 6:51 pm, "Richard B. Gilbert"
    > wrote:
    >
    >>It's probably worth remembering that sort's ancestors were not asked to
    >>sort 60,000,000,000 records!

    >
    > (3 billion, not 60 billion.) The point I was trying to make is that
    > there's a long history (a very long history!) of sort routines which
    > could cope with sorting data sets much larger than physical memory and
    > this history includes sort's direct ancestors. There's no particular
    > reason to believe that sort will not deal reasonably well with large
    > data sets even today.
    >
    > There's a separate question of algorithmic complexity etc, but this is
    > largely distinct.
    >
    > Basically what I'm trying to say is that there is no reason to think
    > that sort will just drive the machine into catastrophic paging rather
    > than doing what it always used to do, namely make reasonably smart use
    > of temporary files (it strikes me that, if it still uses /tmp, this
    > might be an issue!). It may still be impractical to sort this much
    > data, but possibly not because sort is naive.


    I think the point is not processing power or RAM or paging. It's going
    to be I/O limited. That's why I think it would run a lot faster on
    multiple machines.



  8. Re: sort, forced to use mostly disk IO, how?

    Richard B. Gilbert wrote:
    > Tim Bradshaw wrote:
    >
    >> On Nov 19, 6:51 pm, "Richard B. Gilbert"
    >> wrote:
    >>
    >>> It's probably worth remembering that sort's ancestors were not asked to
    >>> sort 60,000,000,000 records!

    >>
    >> (3 billion, not 60 billion.) The point I was trying to make is that
    >> there's a long history (a very long history!) of sort routines which
    >> could cope with sorting data sets much larger than physical memory and
    >> this history includes sort's direct ancestors. There's no particular
    >> reason to believe that sort will not deal reasonably well with large
    >> data sets even today.
    >>
    >> There's a separate question of algorithmic complexity etc, but this is
    >> largely distinct.
    >>
    >> Basically what I'm trying to say is that there is no reason to think
    >> that sort will just drive the machine into catastrophic paging rather
    >> than doing what it always used to do, namely make reasonably smart use
    >> of temporary files (it strikes me that, if it still uses /tmp, this
    >> might be an issue!). It may still be impractical to sort this much
    >> data, but possibly not because sort is naive.

    >
    > I think the point is not processing power or RAM or paging. It's going
    > to be I/O limited. That's why I think it would run a lot faster on
    > multiple machines.

    60 GB isn't all that much any more. I/O is buffered -- you can move a
    lot of records in one go.
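
    A quick check of what the disks will actually sustain with big
    sequential reads, before committing to a strategy (the 8 MB block size
    and the path are arbitrary):

      # Rough sequential-read throughput test with large blocks.
      time dd if=/scratch/bigfile of=/dev/null bs=8192k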


  9. Re: sort, forced to use mostly disk IO, how?

    On Nov 19, 11:12 pm, "Richard B. Gilbert"
    wrote:
    >
    > I think the point is not processing power or RAM or paging. It's going
    > to be I/O limited. That's why I think it would run a lot faster on
    > multiple machines.


    Well, that's kind of what I meant. If sort is clever it can arrange
    for fairly good I/O patterns (sequential reads and writes, really),
    and a decent I/O system can do pretty well on that kind of load. (One
    definition of a properly-designed system is one that can saturate
    memory with I/O, and the processor with memory; that's a bit naive,
    of course, but it's not that wrong.)

    --tim

  10. Re: sort, forced to use mostly disk IO, how?

    On Nov 19, 6:22 pm, David Mathog wrote:

    > Which doesn't quite make sense under either interpretation of -S.
    > The machine has 8G of physical memory and 24G of swap. 8 + 3 is 11,
    > not 10, and 3M kilobytes (about 3G) isn't 4162M. Maybe
    > 8 + 3 - delta = 11 - delta, which rounds down to 10G?
    > If that's what is happening then -S isn't going to do the trick.


    I think you need to look at the memory figures more carefully. It may
    be mapping the input file, which would completely distort the numbers.
    pmap is probably your friend here. It does appear from a (casual)
    inspection of the source that -S is controlling the amount of
    anonymous memory allocated.
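
    For instance, on Solaris pmap -x splits each mapping into resident,
    anonymous, and locked pages, so a mapped input file shows up clearly:

      # Per-mapping breakdown: resident vs. anonymous vs. locked pages.
      # 14135 is the sort PID from the top output quoted upthread.
      pmap -x 14135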
