FP, MMX and Friends on Minix 3 - Minix



Thread: FP, MMX and Friends on Minix 3

  1. FP, MMX and Friends on Minix 3

    Hi,

    I read in the documentation section that Minix 3 does not yet support
    native floating-point, MMX, 3DNow!, etc. instructions. Someone states
    there that supporting them would increase the RAM required to store the
    corresponding registers on task switches.
    Since Minix 3 is a very modular system, why not use the following
    components:
    - an FP scheduling process (single instance)
    - an FP/MMX/etc. executing process (one per CPU core, bound to that
    core)

    The scheduling process receives a request for a 'floating point' job (the
    job description should look the same on all native platforms, e.g.
    x86, 386, PowerPC, AMD64). It looks for a free executing process and
    asks that process to execute the job. The executing process interprets
    the job description and executes native FP, MMX, or whatever
    instructions. When finished, it sends the results back to the scheduling
    process, which then returns them to the requesting process.

    No state needs to be stored at any task switch, because only one process
    per CPU would execute native FP instructions. Systems with no native FP
    support could have one or more emulating execution process(es).
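
    The dispatch step described above (find a free executing process, hand it
    the job, free it again when the result comes back) can be sketched as
    follows. This is only an illustration: `struct fp_executor`,
    `fp_sched_pick` and `fp_sched_done` are hypothetical names, and the
    message passing between the processes is left out entirely.

```c
#include <stddef.h>

/* Hypothetical bookkeeping the scheduling process keeps per executor. */
struct fp_executor {
    int cpu_core;   /* core this executing process is bound to */
    int busy;       /* nonzero while a job is running on it */
};

/* Return the index of the first idle executor and mark it busy,
 * or -1 if every executor is occupied and the job has to wait. */
int fp_sched_pick(struct fp_executor *ex, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!ex[i].busy) {
            ex[i].busy = 1;
            return (int)i;
        }
    }
    return -1;
}

/* Called when an executor has reported its result back to the scheduler. */
void fp_sched_done(struct fp_executor *ex, int idx)
{
    ex[idx].busy = 0;
}
```

    With two executors, a third simultaneous request gets -1 and would have
    to be queued until `fp_sched_done` frees a slot.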

    Compiling floating point statements on different platforms would result
    in the same code, so no compiler changes are required for this part of
    the code.

    I believe that the time needed for communication between these
    processes plus the time for native execution may be less than the time
    needed to emulate those instructions on platforms that have native FP
    support.


    Kind regards
    Joerg

  2. Re: FP, MMX and Friends on Minix 3

    Joerg Knura wrote:

    > I believe, that the time needed for communication between these
    > processes plus time for native execution may be faster than emulating
    > those commands on platforms that have native FP support.


    I have a better idea: why not let applications execute FP instructions on
    platforms with native FP support and catch the exceptions generated on
    platforms that lack it?

    Also, I believe that you will end up using more memory with a separate
    process than you would by saving the FP registers for each process.

    --
    João Jerónimo

    "Computer are composed of software, hardware, and other stuff terminated
    in "ware", like firmware, tupperware, (...)" - by JJ.

  3. Re: FP, MMX and Friends on Minix 3

    Going his way, you need to reserve RAM for every running task, whether
    it uses FP or not, and you need to check whether to save the FP
    registers. This would increase the time spent on task switches.
    The other reasons why I think separate processes are the more
    convenient way are:

    1) If no FP applications are running, the processes can be stopped and
    swapped out of RAM or removed from it.

    2) The type of FP processor doesn't matter: e.g. some graphics card
    vendors provide libraries for running heavy math, such as physics
    simulations, on their GPUs. The scheduler process doesn't need to
    change; only the execution process needs to be developed for the
    corresponding GPU. Or think of the HyperTransport co-processors that,
    for example, fit in a Socket F (Opteron).

    The problem with handling it via exceptions is that, with no FPU, each
    FP instruction would cause an exception, switching from user to kernel
    space and back after emulation. Another is that you cannot use the
    features of other devices (see 2) when your CPU has no FPU.

    João Jerónimo wrote:
    > Joerg Knura wrote:
    > I have a better idea: why not let applications execute FP instructions on
    > platforms with native FP support and catch the exceptions generated on
    > platforms which don't have?
    >
    > Also, I believe that you will end up using more memory if you used a
    > separate process than if you saved the FP registers for each process.
    >


  4. Re: FP, MMX and Friends on Minix 3

    Going his way, you need to reserve RAM for every running task, whether
    it uses FP or not, and you need to check whether to save the FP
    registers. This would increase the time spent on task switches.
    The other reasons why I think separate processes are the more
    convenient way are:

    1) If no FP applications are running, the processes can be stopped and
    swapped out of RAM or removed from it.

    2) The type of FP processor doesn't matter: e.g. some graphics card
    vendors provide libraries for running heavy math, such as physics
    simulations, on their GPUs. The scheduler process doesn't need to
    change; only the execution process needs to be developed for the
    corresponding GPU. Or think of the HyperTransport co-processors that,
    for example, fit in a Socket F (Opteron).

    The problem with handling it via exceptions is that, with no FPU, each
    FP instruction would cause an exception, switching from user to kernel
    space and back after emulation.

    João Jerónimo wrote:
    > Joerg Knura wrote:
    > I have a better idea: why not let applications execute FP instructions on
    > platforms with native FP support and catch the exceptions generated on
    > platforms which don't have?
    >
    > Also, I believe that you will end up using more memory if you used a
    > separate process than if you saved the FP registers for each process.
    >


  5. Re: FP, MMX and Friends on Minix 3

    Joerg Knura wrote:

    > Going his way you need to reserve RAM for each task running whether it
    > uses FP or not and you need to check whether to store the FP registers
    > or not. This would increase time during task switches.


    Your solution involves an extra process, which surely wastes more RAM than
    reserving some extra bytes in the process table.
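
    To put a number on "some extra bytes": on x86, FSAVE writes a 108-byte
    image of the x87 state in 32-bit protected mode (the MMX registers are
    aliased onto the x87 registers, so it covers them too), and FXSAVE
    writes a 512-byte image that also includes SSE state. The sketch below
    just multiplies that by the number of process-table slots; the function
    name and the slot count used in the note underneath are assumptions for
    illustration.

```c
#include <stddef.h>

#define X87_FSAVE_BYTES 108  /* x87/MMX state written by FSAVE (32-bit protected mode) */
#define FXSAVE_BYTES    512  /* x87/MMX/SSE state written by FXSAVE */

/* Extra process-table bytes if every slot reserves an FP save area. */
size_t fp_table_overhead(size_t nr_procs, size_t save_area_bytes)
{
    return nr_procs * save_area_bytes;
}
```

    For a hypothetical table of 100 slots this gives 10,800 bytes with FSAVE
    and 51,200 bytes with FXSAVE, which is the figure to compare against the
    footprint of one scheduling process plus one executing process per core.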

    > The problem when handling it via exceptions is that each FP-instruction
    > would cause an exception (in case of no FPU) switching from user to
    > kernel space and back after emulating. Another would be that you are not
    > able to use the features of other devices (2) when your CPU has no FP


    When your CPU has no FPU, you emulate it in the exception handler.
    What you are proposing is creating a "floating point library" in the form of
    a server, which I think is overkill and a misuse of the concept.

    --
    João Jerónimo

    "Computer are composed of software, hardware, and other stuff terminated
    in "ware", like firmware, tupperware, (...)" - by JJ.

  6. Re: FP, MMX and Friends on Minix 3



    João Jerónimo wrote:
    > Joerg Knura wrote:
    >
    >> Going his way you need to reserve RAM for each task running whether it
    >> uses FP or not and you need to check whether to store the FP registers
    >> or not. This would increase time during task switches.

    >
    > Your solution involves an extra process, which surely wastes more RAM than
    > reserving some extra bytes in the process table.
    >

    Yes, you are right, it uses more RAM, but that amount is more or less
    static, since the number of processes is limited to the number of FPU
    cores plus one. If no other process uses FP, these processes can be
    killed and then use no RAM at all. In that same situation, with the
    approach you suggest, the handling of the FP registers must still be
    executed, which costs time on every task switch.
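
    The per-switch handling being argued about can be sketched as a flag
    test followed by a conditional save and restore. This is a user-space
    simulation under stated assumptions, not Minix kernel code: `struct
    proc`, `uses_fp`, `fpu_regs` and `switch_fp_context` are hypothetical,
    and the `memcpy` calls stand in for the FSAVE and FRSTOR instructions.

```c
#include <string.h>

#define FP_STATE_BYTES 108   /* size of an x87 FSAVE image */

struct proc {
    int uses_fp;                            /* set once the process has run FP code */
    unsigned char fp_save[FP_STATE_BYTES];  /* per-process save area */
};

/* Stand-in for the hardware FP register file (simulation only). */
unsigned char fpu_regs[FP_STATE_BYTES];
unsigned long fp_saves;   /* counts how many saves were actually performed */

/* Task switch from prev to next. A process that never touches the FPU
 * pays only for the two flag tests. */
void switch_fp_context(struct proc *prev, struct proc *next)
{
    if (prev->uses_fp) {
        memcpy(prev->fp_save, fpu_regs, FP_STATE_BYTES);  /* models FSAVE */
        fp_saves++;
    }
    if (next->uses_fp)
        memcpy(fpu_regs, next->fp_save, FP_STATE_BYTES);  /* models FRSTOR */
}
```

    So the cost on a system with no FP users is the check itself, while the
    copy is paid only when an FP-using process is switched out or in.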
    >> The problem when handling it via exceptions is that each FP-instruction
    >> would cause an exception (in case of no FPU) switching from user to
    >> kernel space and back after emulating. Another would be that you are not
    >> able to use the features of other devices (2) when your CPU has no FP

    >
    > When your CPU has no FPU, you emulate it in the exception handler.
    > What you are proposing is creating a "floating point library" in the form of
    > a server, which I think is overkill and a misuse of the concept.
    >

    Yes, it will produce something like an illegal-instruction exception,
    forcing a switch to the kernel and then to the code that emulates FP,
    since Minix 3 has a microkernel and will not emulate it in the kernel
    itself. This happens for every FP instruction in the code.

    What I am proposing is not a library at all. It is a kind of driver for
    floating point calculations. The difference from other drivers is that
    it cannot be addressed directly, only via the scheduler.

    I don't claim that either solution is better than the other; both have
    advantages and disadvantages.

    One advantage of my proposed solution is that the kernel doesn't need
    to be changed and doesn't need to keep track of FPU registers at all.

    Another is that the type (and native code) of the co-processor doesn't
    matter from the application's point of view. This also implies that one
    could swap the driver at runtime, for example from a native driver to an
    emulating driver.

    It's comparable to Java code that is handled by a JVM. The difference
    is that the pseudo FP code is embedded in the native non-FP code and is
    sent to the scheduler.

    regards
    Joerg

  7. Re: FP, MMX and Friends on Minix 3

    Joerg Knura wrote:

    > What I am proposing is not a library at all - it is a kind of driver for
    > some kind of floating point calculations. The difference to other
    > drivers is, that it cannot be addressed directly, but via the scheduler


    If I understood you correctly, a program that wants to do FP has to
    communicate with the FP server. If so, the only difference from a library
    is that the function is not called directly, but rather by sending a
    message.

    --
    João Jerónimo

    "Computer are composed of software, hardware, and other stuff terminated
    in "ware", like firmware, tupperware, (...)" - by JJ.

  8. Re: FP, MMX and Friends on Minix 3

    You understood it correctly: a program that wants to do FP sends a
    request with pseudo-FP code to the FP scheduler, which then sends the
    request to a free FP execution process (this doesn't need to be an x87
    FPU execution process; it may also be an execution process that uses
    the geometry engine within a graphics card).

    If you think that this is the only difference from a library, then every
    device driver (graphics, disk, keyboard etc.) is just a library. They
    are all called via system call, not directly.

    João Jerónimo wrote:
    > Joerg Knura wrote:
    >
    >> What I am proposing is not a library at all - it is a kind of driver for
    >> some kind of floating point calculations. The difference to other
    >> drivers is, that it cannot be addressed directly, but via the scheduler

    >
    > If I understood you correctly, a program that wants to do FP has to
    > communicate with the FP server. If so, the only difference from a library
    > is that the function is not called directly, but rather by sending a
    > message.
    >


  9. Re: FP, MMX and Friends on Minix 3

    Joerg Knura writes:
    > You understood it correctly - a program that wants to do FP sends a
    > request with pseudo-FP code to the FP-scheduler that then sends the
    > request to a free FP-Execution process (this doesn't need to be an X87
    > FPU execution process - it may also be an execution process that uses the
    > geometry engine within a graphics card)


    Interesting idea, but it is not suitable for typical scalar FP
    applications. If you look at such code, you will see there are no long
    instruction sequences of FP-only commands. In fact, given a good FPU,
    the problem is to keep it busy. It is hard enough to do that as it is,
    but near impossible if you introduce additional latencies like message
    passing and copying data from one process to another and back.
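
    The latency argument can be made concrete with a little arithmetic: if
    one request carries a run of k FP operations, the message round trip is
    spread over those k operations, so short runs are dominated by the
    overhead. The sketch below is a rough model with hypothetical function
    names; every cycle count is an assumption supplied by the caller, not a
    measurement.

```c
/* Average cycles per FP operation when a run of `ops` operations
 * (ops >= 1) is shipped to an executor in a single request. */
double cycles_per_op(double ipc_round_trip, double native_per_op, unsigned ops)
{
    return native_per_op + ipc_round_trip / (double)ops;
}

/* Smallest run length for which the message-passing path is strictly
 * cheaper per operation than in-process emulation costing
 * `emulated_per_op` cycles. Returns 0 if emulation is never slower
 * than native execution. */
unsigned break_even_batch(double ipc_round_trip, double native_per_op,
                          double emulated_per_op)
{
    if (emulated_per_op <= native_per_op)
        return 0;
    double gain = emulated_per_op - native_per_op;
    /* need ipc_round_trip / k < gain, i.e. k > ipc_round_trip / gain */
    return (unsigned)(ipc_round_trip / gain) + 1;
}
```

    With an assumed 2000-cycle round trip and 5 cycles per native operation,
    a run of 4 operations averages 505 cycles each, while a run of 1000
    averages 7. Against an assumed 105-cycle emulated operation, the server
    only wins once a request carries at least 21 operations.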

    What you propose might work for vector operations, in particular if
    the vector unit is offchip and the vectors are large. There will be
    significant overhead already and saving vector registers certainly screws
    fast context switches.

    Aiming at high performance FP, the best solution is to offload typical OS
    functions to its own CPU and adapt the scheduler to the particular needs.
    The GPU is an example of that. The interface you proposed might work
    well for GPU code, and the longer the GPU code passed to the driver runs,
    the better.

    Michael

  10. Re: FP, MMX and Friends on Minix 3



    Michael Haardt wrote:
    > Interesting idea, but it is not suitable for typical scalar FP
    > applications. If you look at such code, you will see there are no long
    > instruction sequences of FP-only commands. In fact, given a good FPU,
    > the problem is to keep it busy. It is hard enough to do that as it is,
    > but near impossible if you introduce additional latencies like message
    > passing and copying data from one process to another and back.
    >


    What if the compiler generates pseudo code for structural operators like
    'for', 'while' etc., or produces native code that generates pseudo code
    on the fly? For example:

    int calculateList(int n, double *input1, double *input2, double *input3,
                      double *result) {
        int counter;

        /* result[i] = input1[i] + input2[i] * input3[i] for each element */
        for (counter = 0; counter < n; counter++) {
            result[counter] = input1[counter] + (input2[counter] * input3[counter]);
        }
        return n;   /* number of results written */
    }

    leads to executed ops like (not in a real asm-language)

    ..
    ..
    ..
    push64 input1[0]
    call appendIN // add input1 as input data to the pseudo-code
    push64 input2[0]
    call appendIN // add input2 as input data to the pseudo-code
    push64 input3[0]
    call appendIN // add input3 as input data to the pseudo-code
    call incrementOUT // one more result required
    push #2
    push #LOAD
    call appendOP      // create a pseudo op for loading input data
                       // at offset 2 to the FP stack
    push #1
    push #LOAD
    call appendOP      // create a pseudo op for loading input data
                       // at offset 1 to the FP stack
    push #MULDOUBLE
    call appendOP      // create a pseudo op for multiplying the top
                       // two FP stack entries, delete both and put
                       // the result back on
    push #0
    push #LOAD
    call appendOP      // create a pseudo op for loading input data
                       // at offset 0 to the FP stack
    push #ADDDOUBLE
    call appendOP      // create a pseudo op for adding the top
                       // two FP stack entries, delete both and put
                       // the result back on
    push #0
    push #STORE
    call appendOP      // create a pseudo op for storing the top
                       // entry to output data at offset 0 and
                       // delete the top
    ..
    ..
    ..
    push64 input1[n-1]
    call appendIN
    push64 input2[n-1]
    call appendIN
    push64 input3[n-1]
    call appendIN
    call incrementOUT
    push #(((n-1)*3)+2)
    push #LOAD
    call appendOP
    push #(((n-1)*3)+1)
    push #LOAD
    call appendOP
    push #MULDOUBLE
    call appendOP
    push #((n-1)*3)
    push #LOAD
    call appendOP
    push #ADDDOUBLE
    call appendOP
    push #(n-1)
    push #STORE
    call appendOP

    push fpRequest
    call sendFPRequest

    push addr(result[0])
    push #0
    call getOUT
    ..
    ..
    ..
    push addr(result[n-1])
    push #(n-1)
    call getOUT
    ..
    ..
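
    On the receiving side, an executing process would have to interpret
    these pseudo ops. A minimal sketch of such an interpreter, assuming a
    small operand stack and the four ops used above (LOAD, MULDOUBLE,
    ADDDOUBLE, STORE), could look like this. The type and function names and
    the stack depth are hypothetical; a native executor would map the ops to
    FPU instructions instead of C arithmetic.

```c
#include <stddef.h>

enum fp_opcode { FP_LOAD, FP_MULDOUBLE, FP_ADDDOUBLE, FP_STORE };

struct fp_op {
    enum fp_opcode code;
    size_t arg;            /* input index for LOAD, output index for STORE */
};

#define FP_STACK_DEPTH 8   /* assumption: same depth as the x87 register stack */

/* Run a pseudo-op program against in[] and write results to out[].
 * Returns 0 on success, -1 on stack overflow/underflow or a bad index. */
int fp_execute(const struct fp_op *ops, size_t nops,
               const double *in, size_t nin,
               double *out, size_t nout)
{
    double stack[FP_STACK_DEPTH];
    size_t sp = 0;   /* number of entries currently on the stack */

    for (size_t i = 0; i < nops; i++) {
        switch (ops[i].code) {
        case FP_LOAD:
            if (sp == FP_STACK_DEPTH || ops[i].arg >= nin) return -1;
            stack[sp++] = in[ops[i].arg];
            break;
        case FP_MULDOUBLE:
        case FP_ADDDOUBLE:
            if (sp < 2) return -1;
            sp--;   /* pop the top entry, combine it into the one below */
            if (ops[i].code == FP_MULDOUBLE) stack[sp - 1] *= stack[sp];
            else                             stack[sp - 1] += stack[sp];
            break;
        case FP_STORE:
            if (sp < 1 || ops[i].arg >= nout) return -1;
            out[ops[i].arg] = stack[--sp];
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

    Feeding it the first iteration from the listing (LOAD 2, LOAD 1,
    MULDOUBLE, LOAD 0, ADDDOUBLE, STORE 0) with inputs 1.5, 2.0 and 4.0
    leaves 9.5 in output slot 0.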


    > What you propose might work for vector operations, in particular if
    > the vector unit is offchip and the vectors are large. There will be
    > significant overhead already and saving vector registers certainly screws
    > fast context switches.
    >
    > Aiming at high performance FP, the best solution is to offload typical OS
    > functions to its own CPU and adapt the scheduler to the particular needs.
    > The GPU is an example of that. The interface you proposed might work
    > well for GPU code, and the longer the GPU code passed to the driver runs,
    > the better.
    >
    > Michael


  11. Re: FP, MMX and Friends on Minix 3

    > What if the compiler generates pseudo code for structural operators like
    > 'for', 'while' etc. or produces native code that produces pseudo code on
    > the fly e.g. :
    >
    > int calculateList(int n, double *input1, double *input2, double *input3,
    >                   double *result) {
    >     int counter;
    >
    >     for (counter = 0; counter < n; counter++) {
    >         result[counter] = input1[counter] + (input2[counter] * input3[counter]);
    >     }
    >     return n;
    > }
    >
    > leads to executed ops like (not in a real asm-language)


    [compile to bytecode]

    You mixed up two things here. First, it's a "what if the compiler
    could vectorise the code". Ask HPC people about that and the answer is:
    Write your code accordingly, and the compiler will. The above vectorises
    nicely. Second, it's a "what if the compiler compiles parts to bytecode,
    sending that to an interpreter". You unrolled the loop entirely to avoid
    the problem of loops, but you can't always do that, and all of a sudden
    you do need context switches and have the same problem as to begin with.
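
    One way to see the difference between the two cases: for a loop that
    vectorises, the request can name the operation once and carry the
    element count, so its size does not grow with n the way the unrolled
    bytecode does. A sketch, with hypothetical names, and with plain
    pointers standing in for however the data would actually be transferred
    between processes:

```c
#include <stddef.h>

/* Hypothetical request: one descriptor covers all n elements. */
struct fp_vec_request {
    size_t n;
    const double *a, *b, *c;   /* inputs */
    double *result;            /* output: a[i] + b[i] * c[i] */
};

/* What the executing process would run on receipt of the request. */
void fp_vec_muladd(const struct fp_vec_request *req)
{
    for (size_t i = 0; i < req->n; i++)
        req->result[i] = req->a[i] + req->b[i] * req->c[i];
}
```

    This only covers loops whose body is a fixed arithmetic expression. A
    loop with data-dependent control flow would still need either branching
    in the bytecode or a round trip per iteration.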

    Your approach works if you dedicate a CPU (or GPU) to HPC stuff, extending
    Minix for multiprocessors and run a kernel on the dedicated CPU with
    a special scheduler. As I said, if you want the most, then get the OS
    (mostly) out of the way.

    Michael

  12. Re: FP, MMX and Friends on Minix 3

    Joerg Knura wrote:

    > If you think that this is the only difference from a library, then every
    > device-driver - graphics, disk, keyboard etc. - is just a library. They
    > are all called via System Call ... not directly


    Yes. Somewhat.
    But I forgot that they can mix data related to different processes, unlike
    a library.

    --
    João Jerónimo

    "Computer are composed of software, hardware, and other stuff terminated
    in "ware", like firmware, tupperware, (...)" - by JJ.
