gcc inline cause performance drop drastically - Unix

  1. gcc inline cause performance drop drastically

    I have a program that calculates some floating-point values. The
    algorithm is a bit complex, and for some reason I can't paste the
    code here. The strange thing is that when I compile it with -O3, which
    turns on inlining automatically, the program runs very slowly, but
    when I force gcc not to inline the functions, it becomes a lot
    faster.

    Has anybody had a similar experience? What is the most probable
    cause?

    Thanks.
    Bin
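
    Since the original program was not posted, the following is only a
    minimal sketch of what "forcing gcc not to inline" can look like for a
    single function, using GCC's `noinline` function attribute. The
    function names (scale_noinline, scale_auto, sum_scaled_noinline,
    sum_scaled_auto) are hypothetical and stand in for the real
    floating-point code.

```c
#include <stddef.h>

/* Hypothetical helper: the GCC-specific noinline attribute asks the
 * compiler to keep this function out of line, even at -O3. */
__attribute__((noinline))
static double scale_noinline(double x, double factor)
{
    return x * factor;
}

/* Same computation, left to the compiler's own inlining heuristics. */
static double scale_auto(double x, double factor)
{
    return x * factor;
}

/* Sums the scaled values through the out-of-line helper. */
double sum_scaled_noinline(const double *values, size_t count, double factor)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += scale_noinline(values[i], factor);
    return total;
}

/* Sums the scaled values through the helper the compiler may inline. */
double sum_scaled_auto(const double *values, size_t count, double factor)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += scale_auto(values[i], factor);
    return total;
}
```

    Both variants compute the same result; only the generated code may
    differ. To affect a whole build instead of one function, gcc also
    accepts -fno-inline on the command line. Timing both builds, or
    comparing the assembly produced with gcc -S, is one way to see which
    form is faster for a given program.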

  2. Re: gcc inline cause performance drop drastically

    Bin Chen wrote:
    > I have a program that calculates some floating-point values. The
    > algorithm is a bit complex, and for some reason I can't paste the
    > code here. The strange thing is that when I compile it with -O3, which
    > turns on inlining automatically, the program runs very slowly, but
    > when I force gcc not to inline the functions, it becomes a lot
    > faster.
    >
    > Has anybody had a similar experience? What is the most probable
    > cause?
    >

    Larger executable not fitting in the cache?

    --
    Ian Collins

  3. Re: gcc inline cause performance drop drastically

    Bin Chen wrote:

    > I have a program that calculates some floating-point values. The
    > algorithm is a bit complex, and for some reason I can't paste the
    > code here. The strange thing is that when I compile it with -O3, which
    > turns on inlining automatically, the program runs very slowly, but
    > when I force gcc not to inline the functions, it becomes a lot
    > faster.
    >
    > Has anybody had a similar experience? What is the most probable
    > cause?


    My experience with gcc has been that -O and -O2 generally perform
    better than -O3. I base this on C-based image processing code built
    with gcc 3.3.x and 4.1. Your experience may vary.

    I know that -O3 is or was the level that enables -finline-functions, so when
    I want inlined functions I generally use -finline-functions with -O2 or -O.

    Some people take extreme measures and generate an entire program or library
    by concatenating a bunch of C files together, so that the compiler can do
    more analysis on the program, and produce better code by reordering
    functions, and other methods. In some cases this can improve performance.
    SQLite is one example of this[1].

    There are also some tools[2] you can use with gcc to automate the process of
    discovering the best compilation flags.

    1. http://www.sqlite.org/amalgamation.html
    2. http://www.coyotegulch.com/products/acovea/

    --George

  4. Re: gcc inline cause performance drop drastically

    Bin Chen writes:
    > I have a program that calculates some floating-point values. The
    > algorithm is a bit complex, and for some reason I can't paste the
    > code here. The strange thing is that when I compile it with -O3, which
    > turns on inlining automatically, the program runs very slowly, but
    > when I force gcc not to inline the functions, it becomes a lot
    > faster.
    >
    > Has anybody had a similar experience?


    Yes, abundantly. Automatic inlining is crap. Like any 'fine-tuning' of
    machine code without taking the properties of the execution
    environment into account. You can determine the exact reason why it
    blew up in your case by comparing the generated code in detail with
    the code generated for the 'better' case. A much easier solution is to
    turn this "feature" generally off and optimize code paths which are
    demonstrably too slow as the need arises.

    Premature optimization is as premature when done by a program.

  5. Re: gcc inline cause performance drop drastically

    Rainer Weikusat wrote:

    > Premature optimization is as premature when done by a program.


    I disagree. The problem is in this particular version of GCC thinking
    that inlining is good in cases where clearly it isn't.

    Obviously, a sufficiently advanced compiler would be able to make better
    decisions, and in that case higher optimization values would generate
    progressively faster (but potentially more complicated) code.

    Chris

  6. Re: gcc inline cause performance drop drastically

    Chris Friesen writes:
    > Rainer Weikusat wrote:
    >
    >> Premature optimization is as premature when done by a program.

    >
    > I disagree.


    There is always someone who considers disagreeing with observable
    reality in favor of a beloved ideology a sensible choice. Otherwise,
    there would be a lot fewer problems with compilers in general and gcc
    in particular.

    > The problem is in this particular version of GCC thinking
    > that inlining is good in cases where clearly it isn't.


    Creating unstructured code out of structured code just because the
    latter is believed to possibly run faster, generously ignoring that
    programs need not be 'as fast as possible' but just 'fast enough' is
    NEVER a good idea. Not to mention that the opposite (code runs slower
    now) is completely possible. Actually, most so-called 'advanced
    optimizations' only benefit code explicitly written to exploit them.


  7. Re: gcc inline cause performance drop drastically

    Rainer Weikusat wrote:
    > Chris Friesen writes:
    >>Rainer Weikusat wrote:
    >>>Premature optimization is as premature when done by a program.

    >>
    >>I disagree.

    >
    > There is always someone who considers disagreeing with observable
    > reality in favor of a beloved ideology a sensible choice. Otherwise,
    > there would be a lot fewer problems with compilers in general and gcc
    > in particular.


    There are certainly downsides to optimised compiling. It makes the
    compile stage take longer, and the resulting code is often (though not
    always) more complicated. However, optimised compiling should not make
    the resulting code slower--if it does, I consider that a bug in the
    compiler.

    >>The problem is in this particular version of GCC thinking
    >>that inlining is good in cases where clearly it isn't.


    > Creating unstructured code out of structured code just because the
    > latter is believed to possibly run faster, generously ignoring that
    > programs need not be 'as fast as possible' but just 'fast enough' is
    > NEVER a good idea.


    Fair enough, programs need to run "fast enough".

    > Not to mention that the opposite (code runs slower
    > now) is completely possible. Actually, most so-called 'advanced
    > optimizations' only benefit code explicitly written to exploit them.


    This is a fault of the compiler. If I tell the compiler "optimize this
    code for speed", and the result runs slower, then the compiler isn't
    actually doing what I told it to.

    Note that I didn't tell the compiler, "inline this code", or "unroll
    these loops", or "optimise for size", I said "make it faster". If the
    compiler isn't smart enough to do that properly, then it's not the fault
    of the user, it's the fault of the compiler writer. (Of course, the
    user has to give the compiler enough information to actually do that...)

    Chris

  8. Re: gcc inline cause performance drop drastically

    Ian Collins wrote:
    > Larger executable not fitting in the cache?


    Or not fitting in a given level of "the" cache.

    rick jones
    --
    oxymoron n, Hummer H2 with California Save Our Coasts and Oceans plates
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  9. Re: gcc inline cause performance drop drastically

    Rainer Weikusat wrote:
    > Bin Chen writes:
    > > I have a program that calculates some floating-point values. The
    > > algorithm is a bit complex, and for some reason I can't paste the
    > > code here. The strange thing is that when I compile it with -O3, which
    > > turns on inlining automatically, the program runs very slowly, but
    > > when I force gcc not to inline the functions, it becomes a lot
    > > faster.
    > >
    > > Has anybody had a similar experience?


    > Yes, abundantly. Automatic inlining is crap. Like any 'fine-tuning'
    > of machine code without taking the properties of the execution
    > environment into account.


    Thus Profile Based Optimization?-) (aka feedback directed compilation
    IIRC)

    rick jones
    --
    The computing industry isn't as much a game of "Follow The Leader" as
    it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
    - Rick Jones
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  10. Re: gcc inline cause performance drop drastically

    >This is a fault of the compiler. If I tell the compiler "optimize this
    >code for speed", and the result runs slower, then the compiler isn't
    >actually doing what I told it to.


    Well, strictly speaking, if you tell the compiler to optimize for speed,
    it should produce the fastest possible code.

    On the other hand, if what you told it was "-O3", well that may or
    may not make your code faster. But it has more "O" than "-O2".


    --
    mac the naïf

  11. Re: gcc inline cause performance drop drastically

    Bin Chen wrote:

    > I have a program that calculates some floating-point values. The
    > algorithm is a bit complex, and for some reason I can't paste the
    > code here. The strange thing is that when I compile it with -O3, which
    > turns on inlining automatically, the program runs very slowly, but
    > when I force gcc not to inline the functions, it becomes a lot
    > faster.
    >
    > Has anybody had a similar experience? What is the most probable
    > cause?


    Which version of gcc did you use?

    What flags did you specify?

    What micro-architecture did you compile for?

  12. Re: gcc inline cause performance drop drastically

    On Nov 6, 3:44 PM, Boon wrote:
    > Bin Chen wrote:
    > > I have a program that calculates some floating-point values. The
    > > algorithm is a bit complex, and for some reason I can't paste the
    > > code here. The strange thing is that when I compile it with -O3, which
    > > turns on inlining automatically, the program runs very slowly, but
    > > when I force gcc not to inline the functions, it becomes a lot
    > > faster.

    >
    > > Has anybody had a similar experience? What is the most probable
    > > cause?

    >
    > Which version of gcc did you use?
    >
    > What flags did you specify?
    >
    > What micro-architecture did you compile for?


    The other posters have answered my question, thanks.

  13. Re: gcc inline cause performance drop drastically

    Rick Jones writes:
    > Rainer Weikusat wrote:
    >> Bin Chen writes:
    >> > I have a program that calculates some floating-point values. The
    >> > algorithm is a bit complex, and for some reason I can't paste the
    >> > code here. The strange thing is that when I compile it with -O3, which
    >> > turns on inlining automatically, the program runs very slowly, but
    >> > when I force gcc not to inline the functions, it becomes a lot
    >> > faster.
    >> >
    >> > Has anybody had a similar experience?

    >
    >> Yes, abundantly. Automatic inlining is crap. Like any 'fine-tuning'
    >> of machine code without taking the properties of the execution
    >> environment into account.

    >
    > Thus Profile Based Optimization?-) (aka feedback directed compilation
    > IIRC)


    In theory, 'profile information' can be used to get an idea regarding
    how often each branch in some piece of code is taken, which, in turn,
    could be used to re-generate machine code from the corresponding
    source code such that fewer branches are executed for the more
    frequently taken code paths than were in the profiled code.

    In practice, doing simulation runs with real-world data is often
    difficult to impossible, branches need not have a clear asymmetric
    distribution of either 'taken' or 'not taken' and the distribution
    during the profile runs need not match any real-world use of a
    program. An example would be branches to code intended to handle
    system call errors. These are (usually) trivial to predict manually,
    but could be taken much more frequently than usual in case some
    'transient problem condition' happened to exist on the test system
    during the simulation run. Additionally, modifying a program to fail
    more efficiently is (at best) of dubious value.
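
    As a rough illustration of predicting such error branches manually
    rather than from profile data, here is a sketch using GCC's
    __builtin_expect builtin. The UNLIKELY macro and the write_all helper
    are hypothetical names chosen for this example, and the sketch assumes
    a POSIX system.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical convenience macro: tells GCC the condition is
 * expected to be false most of the time. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: writes the whole buffer to fd, retrying when
 * the call is interrupted. Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (UNLIKELY(n < 0)) {
            /* Error path: marked as rarely taken by hand, so no
             * profiling run is needed to classify it. */
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

    The hint only affects how the compiler lays out the code; the program
    behaves the same whether or not the prediction turns out to be right.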

    In essence, I consider profile based (or driven) optimization a
    basically good idea which will not matter much because of practical
    limitations, while I only know about gcc auto-inlining because either
    I had to disable it in order to facilitate machine-code level
    debugging or because it had a negative performance impact for some
    code where performance was actually critical. The latter was because,
    while this originally was one of the traditional -O3-problems, it has
    crept into -O2 as 'automatic inlining of functions called once'.

    It remains to be determined if the 'function call
    overhead'-witchhunters will take their creed to such extremes as
    preventing this mechanism from being disabled. It is already at
    the unfortunate point where I have some Makefiles which must
    perform build-time gcc version detection, because the newer "do not do
    this"-options lead to compilation refusal by older compilers :-(.
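
    An in-source variation on the version detection described above,
    offered only as a sketch: rather than probing the compiler from a
    Makefile, the preprocessor can check GCC's predefined __GNUC__ and
    __GNUC_MINOR__ macros and fall back to nothing on compilers that lack
    the attribute. This is a different technique from the Makefile
    approach in the post. The NOINLINE macro, the 3.1 version threshold,
    and the mix_byte and checksum functions are assumptions made for
    illustration, and the threshold should be checked against the
    documentation of the compilers actually in use.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the noinline attribute is available from GCC 3.1 on.
 * Other or older compilers get an empty macro and inline as usual. */
#if defined(__GNUC__) && \
    (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1))
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* Hypothetical helper called from a single place; NOINLINE keeps it
 * as a separate function in the generated code where supported. */
NOINLINE static uint32_t mix_byte(uint32_t acc, unsigned char byte)
{
    /* Rotate the accumulator left by one bit, then XOR in the byte. */
    return ((acc << 1) | (acc >> 31)) ^ byte;
}

/* Folds all bytes of the buffer into a 32-bit value. */
uint32_t checksum(const unsigned char *data, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc = mix_byte(acc, data[i]);
    return acc;
}
```

    This keeps the per-function decision in the source and avoids passing
    an option that an older compiler would reject, but it does not replace
    Makefile-level detection for command-line options.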

  14. Re: gcc inline cause performance drop drastically

    Rainer Weikusat wrote:

    > In essence, I consider profile based (or driven) optimization a
    > basically good idea which will not matter much because of practical
    > limitations, while I only know about gcc auto-inlining because either
    > I had to disable it in order to facilitate machine-code level
    > debugging or because it had a negative performance impact for some
    > code where performance was actually critical. The latter was because,
    > while this originally was one of the traditional -O3-problems, it has
    > crept into -O2 as 'automatic inlining of functions called once'.


    This is interesting...if a function is only called by one other
    function, there is no overall size penalty from the inlining and you
    save the cost of the jump.

    The only way I can think of for this to matter is that the original
    function layout was more optimal, and the inlining throws this off. If
    so, this is technically not a problem with the inlining itself, but with
    the code generated for the now-larger function.

    Did you ever track down exactly what the inlining was doing that caused
    problems?

    Chris

  15. Re: gcc inline cause performance drop drastically

    Chris Friesen writes:
    > Rainer Weikusat wrote:
    >> In essence, I consider profile based (or driven) optimization a
    >> basically good idea which will not matter much because of practical
    >> limitations, while I only know about gcc auto-inlining because either
    >> I had to disable it in order to facilitate machine-code level
    >> debugging or because it had a negative performance impact for some
    >> code where performance was actually critical. The latter was because,
    >> while this originally was one of the traditional -O3-problems, it has
    >> crept into -O2 as 'automatic inlining of functions called once'.

    >
    > This is interesting...if a function is only called by one other
    > function, there is no overall size penalty from the inlining and you
    > save the cost of the jump.


    There is an overall size penalty: Parts of the inlined subroutine will
    most likely be loaded into the i-cache instead of the code to be
    executed next when control flow would otherwise have passed the
    function call. Depending on the architecture, there can be other
    relevant differences, i.e. instead of using two instructions to save and
    restore some set of registers (e.g. stmdb sp!, {r4 - r6, lr}/ ldmia sp!,
    {r4 - r6, pc} on ARM), values may need to be stored into memory in numerous
    places inside the 'combined' subroutine and later reloaded (and maybe
    stored again and reloaded again, and so on).

    > The only way I can think of for this to matter is that the original
    > function layout was more optimal, and the inlining throws this off.
    > If so, this is technically not a problem with the inlining itself, but
    > with the code generated for the now-larger function.


    That is 'technically a problem with the inlining itself': It may
    result in 'worse' code being generated (according to some defined
    metric) than without it.

    But that's not actually the real issue: Ever since the invention of
    functional decomposition, people have (without any hard evidence)
    believed that they must not structure their code because it would
    otherwise run too slowly[*]. This superstition has caused uncountable
    amounts of programming errors which could have been avoided by
    deconstructing complex operations into successively less complex procedures,
    with each identifiable part ideally being simple enough so that its
    correctness is obvious to the reader. Not to mention the time which
    has been lost, e.g., because of the need to use comments excessively in
    order to document the otherwise inaccessible internal structure of the
    code (and the time lost by people who stumbled over comments which
    have 'traveled' unchanged alongside copied and pasted code that has
    been changed, to arbitrary distances from the locations where they made
    any sense, and ... etc).
    [*] That's the friendly assumption. The less friendly one is
    that lots of people are seeking excuses for continuing to
    write 'moderately large functions of a few thousand lines',
    because they have never learnt how to structure any text and
    had never wanted to learn this (after all, this is one of
    these traditionally despised 'soft science' thingies relating
    to use of language Real Men[tm] don't care for).

    Completely independently of any 'performance effects' I do not want a
    compiler to take my structured code (I know that this does not cause
    relevant performance problems by virtue of experience dating
    back to 4.77 MHz 8088 processors) and throw all the structure away
    in order to create next to incomprehensible 'blobs' of hundreds, if
    not thousands (or even tens of thousands) of machine instructions. How am I
    supposed to optimize an algorithm when I never see how it is presently
    translated? And how am I supposed to debug issues below the
    'high-level language'-level, eg segmentation violations, or detect
    effects of compiler bugs etc, in such an organic conglomerate?

    > Did you ever track down exactly what the inlining was doing that
    > caused problems?


    In this particular case, I noticed that throughput (VPN
    application) became completely erratic and often lower when compiling
    with optimization enabled. I then tried to take a look at the generated
    code and detected 'the blob'. Next step, "Ok, I am not going to
    understand THAT completely this year, so let's first put it back into
    the intended structure.". After having done that, the problem was
    gone.

    Since the 'obvious solution' for this is to make the auto-inliner more
    complicated in order to avoid the known problems with it, I am going
    to close this soap box speech with one of the finest sentences I have
    ever read regarding composition of software:

    "There are two ways of constructing a software design: One way is to
    make it so simple that there are obviously no deficiencies, and the
    other way is to make it so complicated that there are no obvious
    deficiencies."

  16. Re: gcc inline cause performance drop drastically

    Rainer Weikusat wrote:
    > Chris Friesen writes:


    >>This is interesting...if a function is only called by one other
    >>function, there is no overall size penalty from the inlining and you
    >>save the cost of the jump.


    > There is an overall size penalty: Parts of the inlined subroutine will
    > most likely be loaded into the i-cache instead of the code to be
    > executed next when control flow would otherwise have passed the
    > function call.


    I may be splitting hairs here, but that's not an overall size penalty in
    the executable itself, but rather a runtime caching penalty due to the
    binary code layout as I mentioned later on.

    > Depending on the architecture, there can be other
    > relevant differences, i.e. instead of using two instructions to save and
    > restore some set of registers (e.g. stmdb sp!, {r4 - r6, lr}/ ldmia sp!,
    > {r4 - r6, pc} on ARM), values may need to be stored into memory in numerous
    > places inside the 'combined' subroutine and later reloaded (and maybe
    > stored again and reloaded again, and so on).


    Again, this isn't actually due directly to the inlining, but rather due
    to choices made by the compiler. (Where the choices are now available
    because of the inlining).

    >>The only way I can think of for this to matter is that the original
    >>function layout was more optimal, and the inlining throws this off.
    >>If so, this is technically not a problem with the inlining itself, but
    >>with the code generated for the now-larger function.


    > That is 'technically a problem with the inlining itself': It may
    > result in 'worse' code being generated (according to some defined
    > metric) than without it.


    The inlining isn't actually the cause of the "worse" code though, it's
    the choices made by the compiler. If I forced inlining in multiple call
    sites (causing the binary executable size to bloat) then the problem
    could be the inlining itself.

    > But that's not actually the real issue: Ever since the invention of
    > functional decomposition, people have (without any hard evidence)
    > believed that they must not structure their code because it would
    > otherwise run too slowly[*]. This superstition has caused uncountable
    > amounts of programming errors which could have been avoided by
    > deconstructing complex operations into successively less complex procedures,
    > with each identifiable part ideally being simple enough so that its
    > correctness is obvious to the reader.


    I'd have to agree. When looking at code I like to be able to see an
    entire function on my screen at once.

    When dealing with compiler-generated instructions I'm a bit more
    ambivalent, since I need to look at it far less frequently. (And I'm
    doing mostly kernel development, which means that many app developers
    should need to look at machine code even less frequently.)

    > Since the 'obvious solution' for this is to make the auto-inliner more
    > complicated in order to avoid the known problems with it, I am going
    > to close this soap box speech with one of the finest sentences I have
    > ever read regarding composition of software:
    >
    > "There are two ways of constructing a software design: One way is to
    > make it so simple that there are obviously no deficiencies, and the
    > other way is to make it so complicated that there are no obvious
    > deficiencies."


    There is definitely truth in that statement. However, automated
    optimization is complicated by nature, so I'm not sure it's possible to
    make it simple enough to be obviously correct.

    Anyways, it's been an interesting discussion. Thanks...

    Chris
