Suggestions on optimizing driver code ? - Embedded

This is a discussion on Suggestions on optimizing driver code ? - Embedded ; Hi, We're implementing a video sensor driver for an OmniVision 6620(connected via AverLogic AL422b fifo memory) that runs a 2.6.17 kernel on an Intel XScale PXA255(200MHz). We're stuck at a data retrieval speed of approximately 440 microseconds/row. Can any one ...

+ Reply to Thread
Results 1 to 12 of 12

Thread: Suggestions on optimizing driver code ?

  1. Suggestions on optimizing driver code ?

    Hi,
    We're implementing a video sensor driver for an OmniVision
    6620(connected via AverLogic AL422b fifo memory) that runs a 2.6.17
    kernel on an Intel XScale PXA255(200MHz). We're stuck at a data
    retrieval speed of approximately 440 microseconds/row. Can any one
    give a pointer on how we can speed up the frame grabbing process. We
    found out that 280 microseconds/row of that time is being taken by the
    incrementation of the read clock, so advice on reduction of that time
    would be great. The code to grab a row is below.

    Thanks in advance,
    Yaman.

    for (i=0; i<176; i++) {
    // increment read clock
    GPCR1 = 0X4000;
    GPSR1 = 0X4000;
    // read fifo data from GPIO registers
    tmp2 = GPLR2;
    tmp1 = GPLR1;
    tmp0 = GPLR0;
    ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)
    | ((tmp1 & 0x400) >> 9)
    | ((tmp2 & 0x20000) >> 15)
    | ((tmp1 & 0x4000000) >> 23)
    | ((tmp2 & 0x20) >> 1)
    | ((tmp2 & 0x4) << 3)
    | ((tmp0 & 0x40000000) >> 24)
    | ((tmp1 & 0x10000000) >> 21);
    // skip a pixel
    GPCR1 = 0X4000;
    GPSR1 = 0X4000;

    GPCR1 = 0X4000;
    GPSR1 = 0X4000;
    // read fifo data from GPIO registers
    tmp2 = GPLR2;
    tmp1 = GPLR1;
    tmp0 = GPLR0;
    ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)
    | ((tmp1 & 0x400) >> 9)
    | ((tmp2 & 0x20000) >> 15)
    | ((tmp1 & 0x4000000) >> 23)
    | ((tmp2 & 0x20) >> 1)
    | ((tmp2 & 0x4) << 3)
    | ((tmp0 & 0x40000000) >> 24)
    | ((tmp1 & 0x10000000) >> 21);
    // skip another pixel
    GPCR1 = 0X4000;
    GPSR1 = 0X4000;
    }


  2. Re: Suggestions on optimizing driver code ?

    > for (i=0; i<176; i++) {
    > // increment read clock
    > GPCR1 = 0X4000;

    ....
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    > }
    >


    try optimizing the invariants aout of your loops




  3. Re: Suggestions on optimizing driver code ?

    On 2006-12-04, A. Melinte wrote:
    >> for (i=0; i<176; i++) {
    >> // increment read clock
    >> GPCR1 = 0X4000;

    > ...
    >> GPCR1 = 0X4000;
    >> GPSR1 = 0X4000;
    >> }
    >>

    >
    > try optimizing the invariants aout of your loops


    I'm _guessing_ those are volatile variables mapped to HW
    registers so that those lines are not invariants that can be
    removed from the loop.

    I could be wrong...

    --
    Grant Edwards grante Yow! I've been WRITING
    at to SOPHIA LOREN every 45
    visi.com MINUTES since JANUARY 1ST!!

  4. Re: Suggestions on optimizing driver code ?

    You're definately right on this one. We have to reset the read clock to
    move the read pointer in the fifo.
    Grant Edwards wrote:
    > On 2006-12-04, A. Melinte wrote:
    > >> for (i=0; i<176; i++) {
    > >> // increment read clock
    > >> GPCR1 = 0X4000;

    > > ...
    > >> GPCR1 = 0X4000;
    > >> GPSR1 = 0X4000;
    > >> }
    > >>

    > >
    > > try optimizing the invariants aout of your loops

    >
    > I'm _guessing_ those are volatile variables mapped to HW
    > registers so that those lines are not invariants that can be
    > removed from the loop.
    >
    > I could be wrong...
    >
    > --
    > Grant Edwards grante Yow! I've been WRITING
    > at to SOPHIA LOREN every 45
    > visi.com MINUTES since JANUARY 1ST!!



  5. Re: Suggestions on optimizing driver code ?

    In article <1165257795.602337.231000@l12g2000cwl.googlegroups. com>,
    yamanc@gmail.com wrote:
    >Hi,
    >We're implementing a video sensor driver for an OmniVision
    >6620(connected via AverLogic AL422b fifo memory) that runs a 2.6.17
    >kernel on an Intel XScale PXA255(200MHz). We're stuck at a data
    >retrieval speed of approximately 440 microseconds/row. Can any one
    >give a pointer on how we can speed up the frame grabbing process. We
    >found out that 280 microseconds/row of that time is being taken by the
    >incrementation of the read clock, so advice on reduction of that time
    >would be great. The code to grab a row is below.
    >
    >Thanks in advance,
    >Yaman.
    >
    > for (i=0; i<176; i++) {
    > // increment read clock
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    > // read fifo data from GPIO registers
    > tmp2 = GPLR2;
    > tmp1 = GPLR1;
    > tmp0 = GPLR0;
    > ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19) // bit 0
    > | ((tmp1 & 0x400) >> 9) // bit 1
    > | ((tmp2 & 0x20000) >> 15) // bit 2
    > | ((tmp1 & 0x4000000) >> 23) // bit 3
    > | ((tmp2 & 0x20) >> 1) // bit 4
    > | ((tmp2 & 0x4) << 3) // bit 5
    > | ((tmp0 & 0x40000000) >> 24) // bit 6
    > | ((tmp1 & 0x10000000) >> 21); // bit 7
    > // skip a pixel
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;


    First let's look at what bits are used in which variables:

    tmp0 uses 0x40000000
    tmp1 uses 0x14000400
    tmp2 uses 0x000a0024

    // let's assign byte pointers to the variables so we can pick out individual
    // bytes as needed (I'm assuming big-endian - correct me if I'm wrong?)

    unsigned char *t0 = (unsigned char *)&tmp0;
    unsigned char *t1 = (unsigned char *)&tmp1;
    unsigned char *t2 = (unsigned char *)&tmp2;

    // now let's build lookup tables for the various bit patterns (this should
    // done at initialization time)

    unsigned char lut0[65];
    unsigned char lut1[21];
    unsigned char lut2[5];
    unsigned char lut3[10];
    unsigned char lut4[37];

    // initialize lut0 to deal with t0[0]

    lut0[0x00] = 0x00;
    lut0[0x40] = 0x40;

    // initialize lut1 to deal with t1[0]

    lut1[0x00] = 0x00;
    lut1[0x04] = 0x08;
    lut1[0x10] = 0x80;
    lut1[0x14] = 0x88;

    // initialize lut2 to deal with t1[2]

    lut2[0x00] = 0x00;
    lut2[0x04] = 0x02;

    // initialize lut3 to deal with t2[1]

    lut3[0x00] = 0x00;
    lut3[0x02] = 0x04;
    lut3[0x08] = 0x01;
    lut3[0x0a] = 0x05;

    // initialize lut4 to deal with t2[3]

    lut4[0x00] = 0x00;
    lut4[0x04] = 0x10;
    lut4[0x20] = 0x20;
    lut4[0x24] = 0x30;

    Now the code in the loop only has to do this:

    ov6620_dev0->bitarray[i][0] =
    lut0[t0[0]&0x40] |
    lut1[t1[0]&0x14] |
    lut2[t1[2]&0x04] |
    lut3[t2[1]&0x0a] |
    lut4[t2[3]&0x24];

    Of course, lut0 isn't really necessary since the index matches the data,
    so we can optimize that out:

    ov6620_dev0->bitarray[i][0] =
    t0[0] |
    lut1[t1[0]&0x14] |
    lut2[t1[2]&0x04] |
    lut3[t2[1]&0x0a] |
    lut4[t2[3]&0x24];

    That's gotta be quicker than doing all that shifting, no?!?

    (please double check the values, but I think you get the idea??)

    Patrick
    ========= For LAN/WAN Protocol Analysis, check out PacketView Pro! =========
    Patrick Klos Email: patrick@klos.com
    Klos Technologies, Inc. Web: http://www.klos.com/
    ================================================== ==========================

  6. Re: Suggestions on optimizing driver code ?

    On 2006-12-05, Patrick Klos wrote:

    > tmp0 uses 0x40000000
    > tmp1 uses 0x14000400
    > tmp2 uses 0x000a0024
    >
    > // let's assign byte pointers to the variables so we can pick out individual
    > // bytes as needed (I'm assuming big-endian - correct me if I'm wrong?)


    Assuming any-endian is always wrong.

    > ov6620_dev0->bitarray[i][0] =
    > t0[0] |
    > lut1[t1[0]&0x14] |
    > lut2[t1[2]&0x04] |
    > lut3[t2[1]&0x0a] |
    > lut4[t2[3]&0x24];
    >
    > That's gotta be quicker than doing all that shifting, no?!?


    Possibly. Shifting by mutliples of 8 usually generates exact
    same code as indexing off of a byte pointer.

    In my experience, tricks like that rarely provide noticable
    benefit. Modern compilers are pretty good.

    --
    Grant Edwards grante Yow! I OWN six pink
    at HIPPOS!!
    visi.com

  7. Re: Suggestions on optimizing driver code ?

    yamanc@gmail.com wrote:
    > Hi,
    > We're implementing a video sensor driver for an OmniVision
    > 6620(connected via AverLogic AL422b fifo memory) that runs a 2.6.17
    > kernel on an Intel XScale PXA255(200MHz). We're stuck at a data
    > retrieval speed of approximately 440 microseconds/row. Can any one
    > give a pointer on how we can speed up the frame grabbing process. We
    > found out that 280 microseconds/row of that time is being taken by the
    > incrementation of the read clock, so advice on reduction of that time
    > would be great. The code to grab a row is below.
    >


    One question to ask is what is it that makes it take so long to
    increment the FIFO read clock? The device claims it'll do 50 MHz.

    Is your hardware design cast in stone? Of course the hardware guys will
    always tell you it is, but if the bit assignments of GPLR2, GPLR1, and
    GPLR0 can be juggled to match what you want to end up with in bitarray
    after all that ugly masking and shifting, you'll get some cycles back.
    You're only pulling out a byte - can that byte's bits be grouped into
    one contiguous bitfield within one GPLRx register? Now you're down to
    one read, one shift(at most), and a type cast.

    Is it possible to use a FIFO that doesn't require silly babysitting? A
    FIFO read should auto advance. Can the FIFO be memory mapped and
    hardware added to auto advance the read pointer?

    Back to the software: Have the compiler dump the assembly and see how
    many instructions are needed to index into ov6620->bitarray[i][0]. You
    might consider a pointer that references bitarray[0][0] initially and is
    incremented within the loop.


    Good luck,
    Dave

    > Thanks in advance,
    > Yaman.
    >
    > for (i=0; i<176; i++) {
    > // increment read clock
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    > // read fifo data from GPIO registers
    > tmp2 = GPLR2;
    > tmp1 = GPLR1;
    > tmp0 = GPLR0;
    > ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)
    > | ((tmp1 & 0x400) >> 9)
    > | ((tmp2 & 0x20000) >> 15)
    > | ((tmp1 & 0x4000000) >> 23)
    > | ((tmp2 & 0x20) >> 1)
    > | ((tmp2 & 0x4) << 3)
    > | ((tmp0 & 0x40000000) >> 24)
    > | ((tmp1 & 0x10000000) >> 21);
    > // skip a pixel
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    >
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    > // read fifo data from GPIO registers
    > tmp2 = GPLR2;
    > tmp1 = GPLR1;
    > tmp0 = GPLR0;
    > ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)
    > | ((tmp1 & 0x400) >> 9)
    > | ((tmp2 & 0x20000) >> 15)
    > | ((tmp1 & 0x4000000) >> 23)
    > | ((tmp2 & 0x20) >> 1)
    > | ((tmp2 & 0x4) << 3)
    > | ((tmp0 & 0x40000000) >> 24)
    > | ((tmp1 & 0x10000000) >> 21);
    > // skip another pixel
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    > }
    >


  8. Re: Suggestions on optimizing driver code ?

    In article <12nb3en96623o26@corp.supernews.com>,
    Grant Edwards wrote:
    >On 2006-12-05, Patrick Klos wrote:
    >
    >> tmp0 uses 0x40000000
    >> tmp1 uses 0x14000400
    >> tmp2 uses 0x000a0024
    >>
    >> // let's assign byte pointers to the variables so we can pick out individual
    >> // bytes as needed (I'm assuming big-endian - correct me if I'm wrong?)

    >
    >Assuming any-endian is always wrong.


    Grant,

    I know you do a lot of posting here, and I'm sure it's appreciated. I see
    no need to shoot down my assumption. I was simply stating that so anyone
    looking at the code would understand which bytes of the 32-bit variable I
    was extracting. If the OP doesn't have his processor in big-endian mode,
    he'll know to take that into account when trying the code. At least I took
    the time to come up with an answer and state my assumptions.

    >> ov6620_dev0->bitarray[i][0] =
    >> t0[0] |
    >> lut1[t1[0]&0x14] |
    >> lut2[t1[2]&0x04] |
    >> lut3[t2[1]&0x0a] |
    >> lut4[t2[3]&0x24];
    >>
    >> That's gotta be quicker than doing all that shifting, no?!?

    >
    >Possibly. Shifting by mutliples of 8 usually generates exact
    >same code as indexing off of a byte pointer.


    Sometimes. And only in some compilers. And only with optimzations
    enabled. One can't always rely on the compiler to optimize our code -
    sometimes it's best to optimize it ourselves so WE KNOW what is going
    on.

    Also, not all the shifting is done in multiples of 8 bits.

    >In my experience, tricks like that rarely provide noticable
    >benefit.


    Well, I guess your experience does not match my experience. It happens...

    >Modern compilers are pretty good.


    Now who's making assumptions?? Let's let the OP try it and find out what
    difference he sees??

    ========= For LAN/WAN Protocol Analysis, check out PacketView Pro! =========
    Patrick Klos Email: patrick@klos.com
    Klos Technologies, Inc. Web: http://www.klos.com/
    ================================================== ==========================

  9. Re: Suggestions on optimizing driver code ?

    "Grant Edwards" wrote in message
    news:12nb3en96623o26@corp.supernews.com...
    > On 2006-12-05, Patrick Klos wrote:
    >
    >> tmp0 uses 0x40000000
    >> tmp1 uses 0x14000400
    >> tmp2 uses 0x000a0024
    >>
    >> // let's assign byte pointers to the variables so we can pick out
    >> individual
    >> // bytes as needed (I'm assuming big-endian - correct me if I'm wrong?)

    >
    > Assuming any-endian is always wrong.
    >
    >> ov6620_dev0->bitarray[i][0] =
    >> t0[0] |
    >> lut1[t1[0]&0x14] |
    >> lut2[t1[2]&0x04] |
    >> lut3[t2[1]&0x0a] |
    >> lut4[t2[3]&0x24];
    >>
    >> That's gotta be quicker than doing all that shifting, no?!?

    >
    > Possibly. Shifting by mutliples of 8 usually generates exact
    > same code as indexing off of a byte pointer.
    >
    > In my experience, tricks like that rarely provide noticable
    > benefit. Modern compilers are pretty good.


    My understanding is that the PXA255 is very slow accessing I/O regs - a
    friend measured ~400 ns per access (using 400MHz part). So this sort of bit
    fiddling port bits will be slow. You want all your input bits on the same
    port to start... It looks like 2 us/byte min.

    Peter



  10. Re: Suggestions on optimizing driver code ?

    Greetings,

    The lookup table method didn't really introduce a significant change,
    though thanks for the advice. In the new design we'll pack all the
    video output to the same register so that things will be faster, and
    probably go on without using the FIFO hardware. Dave, you have any
    suggestions for a better FIFO just in case we might decide to use a
    better one?

    Cheers,
    Yaman


  11. Re: Suggestions on optimizing driver code ?

    yamanc@gmail.com wrote:
    > Greetings,
    >
    > The lookup table method didn't really introduce a significant change,
    > though thanks for the advice. In the new design we'll pack all the
    > video output to the same register so that things will be faster, and
    > probably go on without using the FIFO hardware. Dave, you have any
    > suggestions for a better FIFO just in case we might decide to use a
    > better one?
    >


    I'm sorry if I gave the impression that I thought the AverLogic was bad.
    I have no experience with that particular device. If you guys like it
    well enough, I'd definitely suggest that it be memory mapped with some
    control logic that will auto-tickle the read clock when the processor
    runs the read cycle. This approach will very likely yield much better
    performance that the GPIO-based mechanism.

    You may have other hardware constraints, but staying with the
    GPIO-driven approach will never be as fast as you might like. Ideally,
    you'd be able to use a DMA-based design, but that would require hardware
    assets you may not have available.


    Good luck,
    Dave

  12. Re: Suggestions on optimizing driver code ?



    Most optimising compilers should be able to do a more than adequate job

    on the code below. The array lookup code suggested might run just a
    fraction slower but you will be better sticking with the bitwise code.

    One suggestion might be to look at the address generation code for
    ov6220->... but it will probably also be reasonable and an alternative
    scheme may not prove that much faster either.

    On balance the slow times you are reporting will almost certainly be
    due to
    the I/Os. If this is impacting your overall elapsed times maybe you can
    set the device into a more lossy frame grab?


    dmctek.googlepages.com

    -----------------------


    yamanc@gmail.com wrote:

    > Hi,
    > We're implementing a video sensor driver for an OmniVision
    > 6620(connected via AverLogic AL422b fifo memory) that runs a 2.6.17
    > kernel on an Intel XScale PXA255(200MHz). We're stuck at a data
    > retrieval speed of approximately 440 microseconds/row. Can any one
    > give a pointer on how we can speed up the frame grabbing process. We
    > found out that 280 microseconds/row of that time is being taken by the
    > incrementation of the read clock, so advice on reduction of that time
    > would be great. The code to grab a row is below.
    >
    > Thanks in advance,
    > Yaman.
    >
    > for (i=0; i<176; i++) {
    > // increment read clock
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    > // read fifo data from GPIO registers
    > tmp2 = GPLR2;
    > tmp1 = GPLR1;
    > tmp0 = GPLR0;
    > ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)
    > | ((tmp1 & 0x400) >> 9)
    > | ((tmp2 & 0x20000) >> 15)
    > | ((tmp1 & 0x4000000) >> 23)
    > | ((tmp2 & 0x20) >> 1)
    > | ((tmp2 & 0x4) << 3)
    > | ((tmp0 & 0x40000000) >> 24)
    > | ((tmp1 & 0x10000000) >> 21);
    > // skip a pixel
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    >
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    > // read fifo data from GPIO registers
    > tmp2 = GPLR2;
    > tmp1 = GPLR1;
    > tmp0 = GPLR0;
    > ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)
    > | ((tmp1 & 0x400) >> 9)
    > | ((tmp2 & 0x20000) >> 15)
    > | ((tmp1 & 0x4000000) >> 23)
    > | ((tmp2 & 0x20) >> 1)
    > | ((tmp2 & 0x4) << 3)
    > | ((tmp0 & 0x40000000) >> 24)
    > | ((tmp1 & 0x10000000) >> 21);
    > // skip another pixel
    > GPCR1 = 0X4000;
    > GPSR1 = 0X4000;
    > }



+ Reply to Thread