|
#1
|
| Hi,

I read in the documentation section that Minix 3 does not yet support native floating-point, MMX, 3DNow!, and similar instructions. Someone states that supporting them would increase the RAM required to store the corresponding registers on task switches.

Since Minix 3 is a very modular system, why not use the following components:

- an FP scheduling process (single instance)
- an FP/MMX/etc. executing process (one per CPU core, bound to that core)

The scheduling process gets a request for a 'floating point' job (the job description should look the same on all native platforms, e.g. x86, 386, PowerPC, AMD64). It looks for a free executing process and asks it to run the job. The executing process interprets the job description and executes native FP, MMX, or whatever instructions. When finished, it sends the results back to the scheduling process, which then returns them to the requesting process.

No state needs to be stored on any task switch, because only one process per CPU would execute native FP instructions. Systems with no native FP support could have one or more emulating execution processes. Compiling floating-point statements on different platforms would result in the same code; no compiler changes are required in this part of the code.

I believe that the time needed for communication between these processes, plus the time for native execution, may be less than the time needed to emulate those instructions on platforms that have native FP support.

Kind regards
Joerg |
|
#2
|
| Joerg Knura wrote:
> I believe, that the time needed for communication between these
> processes plus time for native execution may be faster than emulating
> those commands on platforms that have native FP support.

I have a better idea: why not let applications execute FP instructions directly on platforms with native FP support, and catch the exceptions generated on platforms that lack it?

Also, I believe you will end up using more memory with a separate process than by saving the FP registers for each process.

--
João Jerónimo
"Computers are composed of software, hardware, and other stuff terminated in "ware", like firmware, tupperware, (...)" - by JJ. |
|
#3
|
| Going his way, you need to reserve RAM for every running task whether it uses FP or not, and you need to check on each task switch whether to store the FP registers. This increases the time spent on task switches.

The other reasons why I think separate processes are the more comfortable way:

1) If no FP applications are running, the processes could be stopped and swapped out of RAM or removed from it.
2) The type of FP processor doesn't matter: e.g. some graphics card vendors provide libraries to execute heavy-math things like physical simulations on their GPUs. The scheduler process doesn't need to be changed; only the execution process needs to be developed for the corresponding GPU. Or think about the HyperTransport coprocessors that, for example, fit in a Socket F (Opteron).

The problem with handling it via exceptions is that each FP instruction would cause an exception (if there is no FPU), switching from user to kernel space and back after emulating. Another is that you cannot use the features of other devices (see 2) when your CPU has no FPU.

João Jerónimo wrote:
> Joerg Knura wrote:
> I have a better idea: why not let applications execute FP instructions on
> platforms with native FP support and catch the exceptions generated on
> platforms which don't have?
>
> Also, I believe that you will end up using more memory if you used a
> separate process than if you saved the FP registers for each process.
> |
|
#5
|
| Joerg Knura wrote:
> Going his way you need to reserve RAM for each task running whether it
> uses FP or not and you need to check whether to store the FP registers
> or not. This would increase time during task switches.

Your solution involves an extra process, which surely wastes more RAM than reserving some extra bytes in the process table.

> The problem when handling it via exceptions is that each FP-instruction
> would cause an exception (in case of no FPU) switching from user to
> kernel space and back after emulating. Another would be that you are not
> able to use the features of other devices (2) when your CPU has no FP

When your CPU has no FPU, you emulate it in the exception handler.

What you are proposing is a "floating point library" in the form of a server, which I think is overkill and a misuse of the concept.

--
João Jerónimo |
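The trade-off argued over here (extra bytes in the process table versus extra work on every task switch) is often softened by saving FP state lazily. Neither poster spells this out, so the following is only a minimal simulation of that general idea, not Minix code: `struct proc`, `struct cpu`, `context_switch`, and `fp_trap` are hypothetical names, and `fp_trap_armed` merely stands in for a hardware flag such as x86's CR0.TS.

```c
#include <assert.h>
#include <string.h>

/* 108 bytes is the size of an x87 FSAVE area in 32-bit protected mode. */
#define FP_STATE_BYTES 108

/* Hypothetical process-table entry: the "extra bytes" per process. */
struct proc {
    int used_fp;                            /* has this process ever run FP code? */
    unsigned char fp_state[FP_STATE_BYTES]; /* saved FP registers */
};

/* Hypothetical per-CPU bookkeeping; fp_regs stands in for the real registers. */
struct cpu {
    unsigned char fp_regs[FP_STATE_BYTES];
    struct proc *fp_owner;  /* process whose state is currently in fp_regs */
    int fp_trap_armed;      /* stand-in for a "trap on next FP instruction" flag */
    unsigned saves;         /* how many register saves actually happened */
};

/* Task switch: copies nothing, only arms the trap if 'next' does not own the FPU. */
void context_switch(struct cpu *c, struct proc *next)
{
    assert(c != NULL && next != NULL);
    c->fp_trap_armed = (c->fp_owner != next);
}

/* Called when 'cur' executes an FP instruction while the trap is armed. */
void fp_trap(struct cpu *c, struct proc *cur)
{
    assert(c != NULL && cur != NULL);
    if (c->fp_owner != NULL && c->fp_owner != cur) {
        /* Only now is the previous owner's state written back. */
        memcpy(c->fp_owner->fp_state, c->fp_regs, FP_STATE_BYTES);
        c->saves++;
    }
    if (cur->used_fp)
        memcpy(c->fp_regs, cur->fp_state, FP_STATE_BYTES);
    else
        memset(c->fp_regs, 0, FP_STATE_BYTES); /* fresh FP state */
    cur->used_fp = 1;
    c->fp_owner = cur;
    c->fp_trap_armed = 0;
}
```

In this sketch a switch between processes that never touch FP costs one comparison, and registers are copied only when a second FP-using process actually runs. The per-process memory cost remains, which is the point João concedes is small.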
|
#6
|
| João Jerónimo wrote:
> Joerg Knura wrote:
>
>> Going his way you need to reserve RAM for each task running whether it
>> uses FP or not and you need to check whether to store the FP registers
>> or not. This would increase time during task switches.
>
> Your solution involves an extra process, which surely wastes more RAM than
> reserving some extra bytes in the process table.
>

Yes, you are right, it wastes more RAM, but that amount is more or less static, since the number of processes is limited to the number of FPU cores plus one. If no other process uses FP, these processes can be killed and then they use no RAM at all. In the same situation, when doing it as you suggest, the handling for the FP registers must be executed anyway, which wastes time during task switches.

>> The problem when handling it via exceptions is that each FP-instruction
>> would cause an exception (in case of no FPU) switching from user to
>> kernel space and back after emulating. Another would be that you are not
>> able to use the features of other devices (2) when your CPU has no FP
>
> When your CPU has no FPU, you emulate it in the exception handler.
> What you are proposing is creating a "floating point library" in the form of
> a server, which I think is overkill and a misuse of the concept.
>

Yes, it will produce something like an illegal-instruction exception, forcing a switch to the kernel and then to the code that emulates FP, since Minix 3 has a microkernel and will not emulate it itself. This happens for every FP instruction in the code.

What I am proposing is not a library at all; it is a kind of driver for a certain class of floating-point calculations. The difference from other drivers is that it cannot be addressed directly, only via the scheduler.

I don't claim that either solution is better than the other; both have advantages and disadvantages.

One advantage of my proposed solution is that the kernel doesn't need to be changed and doesn't need to keep track of FPU registers at all. Another is that the type (and native code) of the coprocessor doesn't matter from the application's point of view. This also implies that one could change the driver at runtime, for example from a native driver to an emulating driver.

It's comparable to Java code that is handled by a JVM. The difference is that the pseudo FP code is embedded in the native non-FP code and is sent to the scheduler.

regards
Joerg |
|
#7
|
| Joerg Knura wrote:
> What I am proposing is not a library at all - it is a kind of driver for
> some kind of floating point calculations. The difference to other
> drivers is, that it cannot be addressed directly, but via the scheduler

If I understood you correctly, a program that wants to do FP has to communicate with the FP server. If so, the only difference from a library is that the function is not called directly, but rather by sending a message.

--
João Jerónimo |
|
#8
|
| You understood it correctly: a program that wants to do FP sends a request with pseudo-FP code to the FP scheduler, which then sends the request to a free FP execution process. (This doesn't need to be an x87 FPU execution process; it may also be an execution process that uses the geometry engine within a graphics card.)

If you think that this is the only difference from a library, then every device driver (graphics, disk, keyboard etc.) is just a library. They are all called via system call, not directly.

João Jerónimo wrote:
> Joerg Knura wrote:
>
>> What I am proposing is not a library at all - it is a kind of driver for
>> some kind of floating point calculations. The difference to other
>> drivers is, that it cannot be addressed directly, but via the scheduler
>
> If I understood you correctly, a program who wants to do FP has to
> communicate with the FP server. If so, the only difference from a library
> is that the function in not called diretly, but rather by sending a
> message.
> |
|
#9
|
| Joerg Knura wrote:
> You understood it correctly - a program that wants to do FP sends an
> request with psuedo-FP code to the FP-scheduler that then send the
> request to a free FP-Execution process (this doesn't need to be an X87
> FPU execution process - it may also be a execution process that uses the
> geometry engine within a graphics card)

Interesting idea, but it is not suitable for typical scalar FP applications. If you look at such code, you will see there are no long sequences of FP-only instructions. In fact, given a good FPU, the problem is to keep it busy. That is hard enough as it is, but near impossible if you introduce additional latencies like message passing and copying data from one process to another and back.

What you propose might work for vector operations, in particular if the vector unit is off-chip and the vectors are large. There will be significant overhead already, and saving vector registers certainly hurts fast context switches.

If you are aiming at high-performance FP, the best solution is to offload typical OS functions to their own CPU and adapt the scheduler to the particular needs. The GPU is an example of that. The interface you proposed might work well for GPU code, and the longer the GPU code passed to the driver runs, the better.

Michael |
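The latency argument can be put as simple arithmetic: a fixed round-trip cost is spread over however many FP operations travel in one request. The sketch below is only an illustration of that reasoning; the function names are hypothetical and the costs are abstract units, not measurements of Minix or of any real hardware.

```c
#include <assert.h>

/* Average cost of one FP operation when ops_per_request operations
 * share a single message round trip (all costs in the same abstract unit). */
double cost_per_op(double msg_overhead, double native_op_cost,
                   unsigned ops_per_request)
{
    assert(ops_per_request > 0);
    return native_op_cost + msg_overhead / ops_per_request;
}

/* Smallest number of operations per request for which offloading is
 * strictly cheaper per operation than emulating each one in place.
 * Returns 0 if offloading can never win. */
unsigned min_batch_to_beat_emulation(double msg_overhead,
                                     double native_op_cost,
                                     double emulated_op_cost)
{
    assert(msg_overhead >= 0.0);
    if (emulated_op_cost <= native_op_cost)
        return 0;
    /* Need native + overhead / n < emulated, i.e. n > overhead / (emulated - native). */
    return (unsigned)(msg_overhead / (emulated_op_cost - native_op_cost)) + 1;
}
```

With made-up figures of 1000 units per round trip, 1 unit per native operation, and 51 units per emulated operation, a request has to carry at least 21 operations before it beats emulation, and scalar code with short FP sequences rarely offers that. Large vectors or long-running GPU kernels do, which matches Michael's conclusion.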
|
#10
|
| Michael Haardt wrote:
> Interesting idea, but it is not suitable for typical scalar FP
> applications. If you look at such code, you will see there are no long
> instruction sequences of FP-only commands. In fact, given a good FPU,
> the problem is to keep it busy. It is hard enough to do that as it is,
> but near impossible if you introduce additional latencies like message
> passing and copying data from one process to another and back.
>

What if the compiler generates pseudo code for structural operators like 'for', 'while' etc., or produces native code that produces pseudo code on the fly, e.g.:

    int calculateList (int n, double *input1, double *input2, double *input3,
                       double *output) {
        int counter;

        for (counter = 0; counter < n; counter++) {
            output[counter] = input1[counter] + input2[counter] * input3[counter];
        }
    }

leads to executed ops like (not in a real asm language):

    ..
    push64 input1[0]
    call appendIN        // add input1 as input data to the pseudo-code
    push64 input2[0]
    call appendIN        // add input2 as input data to the pseudo-code
    push64 input3[0]
    call appendIN        // add input3 as input data to the pseudo-code
    call incrementOUT    // one more result required
    push #2
    push #LOAD
    call appendOP        // create a pseudo op that loads the input data
                         // at offset 2 onto the FP stack
    push #1
    push #LOAD
    call appendOP        // create a pseudo op that loads the input data
                         // at offset 1 onto the FP stack
    push #MULDOUBLE
    call appendOP        // create a pseudo op that multiplies the top
                         // two FP stack entries, deletes both and puts
                         // the result back on
    push #0
    push #LOAD
    call appendOP        // create a pseudo op that loads the input data
                         // at offset 0 onto the FP stack
    push #ADDDOUBLE
    call appendOP        // create a pseudo op that adds the top
                         // two FP stack entries, deletes both and puts
                         // the result back on
    push #0
    push #STORE
    call appendOP        // create a pseudo op that stores the top
                         // entry to the output data at offset 0 and
                         // deletes the top
    ..
    push64 input1[n-1]
    call appendIN
    push64 input2[n-1]
    call appendIN
    push64 input3[n-1]
    call appendIN
    call incrementOUT
    push #(((n-1)*3)+2)
    push #LOAD
    call appendOP
    push #(((n-1)*3)+1)
    push #LOAD
    call appendOP
    push #MULDOUBLE
    call appendOP
    push #((n-1)*3)
    push #LOAD
    call appendOP
    push #ADDDOUBLE
    call appendOP
    push #(n-1)
    push #STORE
    call appendOP
    push fpRequest
    call sendFPRequest
    push addr(result[0])
    push #0
    call getOUT
    ..
    push addr(result[n-1])
    push #(n-1)
    call getOUT
    ..

> What you propose might work for vector operations, in particular if
> the vector unit is offchip and the vectors are large. There will be
> significant overhead already and saving vector registers certainly screws
> fast context switches.
>
> Aiming at high performance FP, the best solution is to offload typical OS
> functions to its own CPU and adapt the scheduler to the particular needs.
> The GPU is an example of that. The interface you proposed might work
> well for GPU code, and the longer the GPU code passed to the driver runs,
> the better.
>
> Michael |
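On the receiving side, an execution process would have to interpret such a request. The following is a minimal sketch of what that could look like, assuming a small stack machine with the four pseudo ops named in the post (LOAD, STORE, MULDOUBLE, ADDDOUBLE). The types `fp_op` and `fp_opcode`, the function `fp_execute`, and the fixed stack depth are hypothetical; the thread defines no concrete request format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding of the pseudo ops from the post. */
enum fp_opcode { FP_LOAD, FP_STORE, FP_MULDOUBLE, FP_ADDDOUBLE };

struct fp_op {
    enum fp_opcode code;
    size_t arg;          /* input offset for LOAD, output offset for STORE */
};

#define FP_STACK_DEPTH 8  /* assumption: a small fixed-size evaluation stack */

/* Interprets one request. Returns 0 on success, -1 if the request is
 * malformed (bad offset, stack overflow or underflow, unknown op). */
int fp_execute(const struct fp_op *ops, size_t n_ops,
               const double *in, size_t n_in,
               double *out, size_t n_out)
{
    double stack[FP_STACK_DEPTH];
    size_t sp = 0;  /* number of entries currently on the stack */

    assert(ops != NULL || n_ops == 0);
    for (size_t i = 0; i < n_ops; i++) {
        switch (ops[i].code) {
        case FP_LOAD:
            if (ops[i].arg >= n_in || sp == FP_STACK_DEPTH)
                return -1;
            stack[sp++] = in[ops[i].arg];
            break;
        case FP_STORE:
            if (ops[i].arg >= n_out || sp == 0)
                return -1;
            out[ops[i].arg] = stack[--sp];
            break;
        case FP_MULDOUBLE:
        case FP_ADDDOUBLE:
            if (sp < 2)
                return -1;
            sp--;  /* pop both operands, push one result */
            if (ops[i].code == FP_MULDOUBLE)
                stack[sp - 1] = stack[sp - 1] * stack[sp];
            else
                stack[sp - 1] = stack[sp - 1] + stack[sp];
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

Feeding it the first unrolled iteration (LOAD 2, LOAD 1, MULDOUBLE, LOAD 0, ADDDOUBLE, STORE 0) with inputs 1.0, 2.0, 3.0 stores 3.0 * 2.0 + 1.0 in output slot 0. Every pseudo op costs a dispatch through the `switch`, which is the interpretation overhead on top of the message passing discussed earlier.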
|
#11
|
| > What if the compiler generates pseudo code for structural operators like
> 'for', 'while' etc. or produces native code that produces pseudo code on
> the fly e.g. :
>
> int calculateList (int n, double *input1, double *input2, double *input3,
>                    double *output) {
>     int counter;
>
>     for (counter = 0; counter < n; counter++) {
>         output[counter] = input1[counter] + input2[counter] * input3[counter];
>     }
> }
>
> leads to executed ops like (not in a real asm-language)
[compile to bytecode]

You mixed up two things here.

First, it's a "what if the compiler could vectorise the code". Ask HPC people about that and the answer is: write your code accordingly, and the compiler will. The above vectorises nicely.

Second, it's a "what if the compiler compiles parts to bytecode, sending that to an interpreter". You unrolled the loop entirely to avoid the problem of loops, but you can't always do that, and all of a sudden you do need context switches and have the same problem you started with.

Your approach works if you dedicate a CPU (or GPU) to HPC stuff, extending Minix for multiprocessors and running a kernel on the dedicated CPU with a special scheduler. As I said, if you want the most, then get the OS (mostly) out of the way.

Michael |
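"Write your code accordingly" usually means giving the compiler a simple counted loop over arrays that it can prove do not overlap. A minimal sketch follows. The function name `calculate_list` is hypothetical, and the loop body is an assumption inferred from the pseudo ops in the quoted post (multiply two inputs, add the third), since the original body did not survive.

```c
#include <assert.h>
#include <stddef.h>

/* 'restrict' promises the compiler that the four arrays do not overlap,
 * which removes the aliasing concern that often blocks auto-vectorisation. */
void calculate_list(size_t n,
                    const double *restrict in1,
                    const double *restrict in2,
                    const double *restrict in3,
                    double *restrict out)
{
    assert(n == 0 || (in1 != NULL && in2 != NULL && in3 != NULL && out != NULL));
    /* Each iteration is independent of the others, so several elements
     * can be processed per SIMD instruction. */
    for (size_t i = 0; i < n; i++)
        out[i] = in1[i] + in2[i] * in3[i];
}
```

Whether a given compiler actually emits SIMD instructions for this depends on the target and the optimisation flags (with GCC, for example, `-O3` enables the loop vectoriser), so it is worth checking the generated assembly rather than assuming it.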
|
#12
|
| Joerg Knura wrote:
> If you think that this is the only difference from a library, then every
> device-driver - graphics, disk, keyboard etc. - is just a library. They
> are all called via System Call ... not directly

Yes. Somewhat. But I forgot that, unlike a library, they can mix data belonging to different processes.

--
João Jerónimo |