Re: PCI Bursting with PIO - Kernel




  1. Re: PCI Bursting with PIO

    Dan Gora wrote:
    > Hi,
    >
    > I am trying to optimize a driver for a slave only PCI device and am
    > having a lot of trouble getting any kind of PCI burst transactions in
    > either the read or the write direction. Using bcopy/memcpy or even a
    > hand-crafted while (len) { *pdst++ = *psrc++; } (with pdst and psrc
    > unsigned long*) I can only get writes to burst and even in that case
    > only for 2 data phases (8 bytes) and only on 64 bit machines. The
    > best that I have managed is to use a hand crafted asm function which
    > copies the data through mmx registers on i386 machines, but that still
    > only bursts a maximum of 16 bytes in the write direction and not at
    > all in the read direction. The source and destination pointers are
    > both aligned to 8 byte boundaries, so I don't think that it's an
    > alignment issue.
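For reference, here is a runnable userspace version of the word-copy loop Dan describes. It adds the `len` decrement that the one-line version leaves out. It is a sketch, not his code: `copy_words64` is a hypothetical name, and in a kernel driver the existing `memcpy_toio()` helper on an `ioremap()`ed pointer is the usual choice. `volatile` only guarantees that each store is issued. Whether consecutive stores burst on the bus depends on the memory type of the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes as 64-bit words. The destination is volatile so the
 * compiler emits every store, in order, instead of merging or dropping
 * them; on an MMIO mapping each store becomes one CPU access.
 * Pointers and len are assumed to be 8-byte aligned, as in the post.
 */
static void copy_words64(volatile uint64_t *pdst, const uint64_t *psrc,
                         size_t len)
{
    assert(len % sizeof(uint64_t) == 0);
    while (len) {
        *pdst++ = *psrc++;
        len -= sizeof(uint64_t); /* the decrement the quoted loop omits */
    }
}
```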


    The chipset is being limited by what the CPU is giving it. If the CPU
    sends only a small amount of data in one access then the chipset usually
    does not try to burst more than that.
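The toy model below gives a rough illustration of this point. It assumes a 32-bit PCI bus (4 bytes per data phase) and a host bridge that turns each CPU access into exactly one transaction, never merging neighbouring accesses. Under those assumptions an 8-byte store produces at most 2 data phases, which matches the 8-byte write bursts reported above. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 32-bit PCI: each data phase moves 4 bytes. */
#define PCI_BUS_BYTES 4u

/* Toy model: the bridge forwards each CPU access as one transaction
 * and never merges separate accesses into a longer burst. */
struct pio_cost {
    size_t transactions;   /* address phases on the bus */
    size_t phases_per_txn; /* data phases in each transaction */
};

static struct pio_cost pio_write_cost(size_t len, size_t access_bytes)
{
    struct pio_cost c;

    assert(access_bytes % PCI_BUS_BYTES == 0 && len % access_bytes == 0);
    c.transactions = len / access_bytes;
    c.phases_per_txn = access_bytes / PCI_BUS_BYTES;
    return c;
}
```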

    >
    > Is there any way to get PIO to burst over the PCI bus in the read and
    > write direction? My device has 4 BAR registers, but the area where I
    > am transferring data is marked 'prefetchable' (although the others are
    > not). I read here: http://lkml.org/lkml/2004/9/23/393 that this was a
    > prerequisite, but it is apparently not sufficient. He also mentioned
    > that the area had to be marked as write-back, but it's not clear how
    > you can tell (no /proc/mtrr doesn't tell you) or that it has anything
    > to do with bursting reads.
    >
    > Any ideas would be really appreciated,


    Well, in order for the CPU to batch up more writes you'd have to map the
    BAR as either write-combining or write-back. If it's not listed in
    /proc/mtrr it will be the default setting of uncacheable. X has code to
    set up the video memory on the video card as write-combining so it can
    get better write performance, you could do something similar.
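A toy model shows why write-combining helps. It assumes a single 64-byte write-combining buffer; real CPUs have several, and the size is CPU specific. With an uncacheable mapping every store is its own bus transaction. With write-combining, stores that fall in the same 64-byte line are merged and sent as one burst when the buffer is flushed. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WC_LINE 64u /* assumed write-combining buffer size in bytes */

/* Count bus transactions for a sequential run of stores of `width`
 * bytes starting at `base`. Uncacheable (wc == 0): one per store.
 * Write-combining (toy model, one buffer): stores to the same line
 * merge, and the buffer is flushed when a store leaves that line. */
static size_t count_txns(uintptr_t base, size_t len, size_t width, int wc)
{
    size_t txns = 0;
    uintptr_t open_line = UINTPTR_MAX; /* no line buffered yet */

    assert(width > 0 && len % width == 0);
    for (size_t off = 0; off < len; off += width) {
        uintptr_t line = (base + off) / WC_LINE;
        if (!wc) {
            txns++;            /* every uncached store goes out alone */
        } else if (line != open_line) {
            txns++;            /* new line: the old buffer is one burst */
            open_line = line;
        }
    }
    return txns;
}
```

In this model, 4096 bytes written with 8-byte stores from a line-aligned address take 512 transactions uncached, against 64 with write-combining.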

    Setting it as write-back might allow you to get the reads to do bursting
    as well (since the CPU will do a cache-line fill instead of individual
    accesses), but if the device is modifying this memory area, then unless
    you add code to invalidate those cache lines before reading the data,
    you'll get stale data back. You could run into some other less obvious
    issues as well, as normally device memory regions are not mapped write-back.
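The stale-data hazard can be sketched with a single-line model of a write-back mapping. The model assumes nothing keeps the CPU's cached copy coherent with the device's memory. A read miss fetches the whole line in one burst, and later reads are served from the cache. A device update is only seen after the line is invalidated. `cpu_invalidate()` is a hypothetical stand-in for a platform cache flush/invalidate primitive such as `clflush` on x86. The 32-byte line size is also an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_WORDS 8 /* assumed 32-byte cache line of 32-bit words */

/* Toy model of a write-back-mapped device buffer with one cache line. */
struct wb_model {
    uint32_t device[LINE_WORDS]; /* what the device's memory holds */
    uint32_t cache[LINE_WORDS];  /* the CPU's cached copy of that line */
    int valid;                   /* is the cached copy present? */
    unsigned line_fills;         /* read bursts issued to the device */
};

/* CPU read: a miss fills the whole line in one burst; a hit never
 * touches the bus, even if the device has changed the data since. */
static uint32_t cpu_read(struct wb_model *m, unsigned i)
{
    assert(i < LINE_WORDS);
    if (!m->valid) {
        memcpy(m->cache, m->device, sizeof(m->cache));
        m->valid = 1;
        m->line_fills++;
    }
    return m->cache[i];
}

/* The device updates its own memory; the CPU cache is not told. */
static void device_write(struct wb_model *m, unsigned i, uint32_t v)
{
    assert(i < LINE_WORDS);
    m->device[i] = v;
}

/* Hypothetical stand-in for a platform cache invalidate. */
static void cpu_invalidate(struct wb_model *m)
{
    m->valid = 0;
}
```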

    In general, especially if you need to read data back from the device,
    implementing a DMA engine would be by far the better option. Most
    chipsets seem not at all optimized for handling sequential reads from
    PCI memory from the CPU. (Even in the DMA case, you have to be careful
    with what type of memory read transaction you use when transferring from
    host memory - some chipsets don't like to burst more than one cycle if
    you use normal Memory Read instead of Memory Read Line or Memory Read
    Multiple.)
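The three read commands differ in their bus command encodings and in what they tell the target about the master's intent. Memory Read carries no hint about how much data the master wants. Memory Read Line says it intends to fetch a complete cache line, and Memory Read Multiple says it may fetch more than one. The sketch below picks a command from a transfer's length and alignment. The selection rule is a simple heuristic derived from those semantics, not the PCI specification's own usage recommendations, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* PCI bus command encodings (C/BE[3:0]#) for memory reads. */
enum pci_read_cmd {
    PCI_CMD_MEM_READ          = 0x6, /* Memory Read */
    PCI_CMD_MEM_READ_MULTIPLE = 0xC, /* Memory Read Multiple */
    PCI_CMD_MEM_READ_LINE     = 0xE, /* Memory Read Line */
};

/* Heuristic for a master reading `len` bytes at `addr` from host
 * memory: a single DWORD -> Memory Read; a transfer that ends at or
 * before the next cache-line boundary -> Memory Read Line; one that
 * crosses it -> Memory Read Multiple. `line` is the cache line size in
 * bytes (the config-space register holds it in DWORDs). */
static enum pci_read_cmd pick_read_cmd(uint64_t addr, uint32_t len,
                                       uint32_t line)
{
    assert(len > 0 && line > 0 && (line & (line - 1)) == 0);
    if (len <= 4)
        return PCI_CMD_MEM_READ;

    uint64_t next_boundary = (addr | (line - 1)) + 1;
    if (addr + len <= next_boundary)
        return PCI_CMD_MEM_READ_LINE;
    return PCI_CMD_MEM_READ_MULTIPLE;
}
```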
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: PCI Bursting with PIO

    On Feb 15, 2008 10:00 PM, Robert Han**** wrote:
    >
    > Well, in order for the CPU to batch up more writes you'd have to map the
    > BAR as either write-combining or write-back. If it's not listed in
    > /proc/mtrr it will be the default setting of uncacheable.


    Ok, this is pretty much what I thought, but I still don't really have
    any idea how to do this. ioremap() doesn't take any flags and I'm not
    using ioremap_uncacheable(), plus the BAR is marked prefetchable...

    > X has code to
    > set up the video memory on the video card as write-combining so it can
    > get better write performance, you could do something similar.


    Alan mentioned this as well, but I haven't tried to hunt down this code
    yet. If you have any pointers as to where I might find it, I would
    appreciate it.
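On x86, a write-combining range is typically set up by adding an MTRR. Inside the kernel, a driver of this era could call `mtrr_add(base, size, MTRR_TYPE_WRCOMB, 1)`. From userspace, X goes through `/proc/mtrr`. The kernel's Documentation/mtrr.txt also documents a plain-text form written to that file, e.g. `base=0xf8000000 size=0x400000 type=write-combining`. The sketch below only builds and validates such a line. It assumes the range must fit one variable-range MTRR: a power-of-two size of at least 4 KiB, with the base aligned to the size. `mtrr_wc_request` is a hypothetical helper, and the BAR's base and size would come from the device.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Build the text line to write to /proc/mtrr for a write-combining
 * range. Returns 0 on success, -1 if the range cannot be one variable
 * MTRR (power-of-two size >= 4 KiB, base aligned to size) or if the
 * buffer is too small. */
static int mtrr_wc_request(uint64_t base, uint64_t size,
                           char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    if (size < 4096 || (size & (size - 1)) != 0 || (base & (size - 1)) != 0)
        return -1;

    int n = snprintf(buf, buflen,
                     "base=0x%" PRIx64 " size=0x%" PRIx64
                     " type=write-combining\n", base, size);
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}
```

Writing the resulting line to /proc/mtrr requires root.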

    > Setting it as write-back might allow you to get the reads to do bursting
    > as well (since the CPU will do a cache-line fill instead of individual
    > accesses)


    I don't see what the cache write policy has to do with the reads. If
    the region is marked cacheable, then all reads should try to read a
    whole cache line, right? The write-back or write-through policy only
    affects the writes. If it's write-through, then writes go directly to
    RAM; if it's write-back, then they hit the cache and are written out
    when the line is flushed (LRU replacement, explicit cache line flush,
    etc.), right?
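Dan's description of the two policies can be checked against a single-line toy model. The model assumes write-allocate for write-back and no write-allocate for write-through, as on x86. The names are hypothetical. Reads behave identically under both policies; only the way stores reach memory differs.

```c
#include <assert.h>
#include <stddef.h>

enum wpolicy { WRITE_THROUGH, WRITE_BACK };

struct bus_traffic {
    size_t line_fills; /* read bursts (one per line miss) */
    size_t writes;     /* write transactions to memory */
};

/* `ops` is a string of 'r' (load) and 'w' (store) to one line,
 * followed by an eviction. A read miss fills the line under either
 * policy. Write-through sends each store to memory without allocating;
 * write-back allocates, marks the line dirty, and writes it once. */
static struct bus_traffic simulate(enum wpolicy p, const char *ops)
{
    struct bus_traffic t = {0, 0};
    int valid = 0, dirty = 0;

    assert(ops != NULL);
    for (; *ops; ops++) {
        int allocates = (*ops == 'r') || (*ops == 'w' && p == WRITE_BACK);
        if (allocates && !valid) {
            t.line_fills++;
            valid = 1;
        }
        if (*ops == 'w') {
            if (p == WRITE_THROUGH)
                t.writes++;
            else
                dirty = 1;
        }
    }
    if (dirty)
        t.writes++; /* eviction writes back the dirty line */
    return t;
}
```

In this model, write-back stores to a line not yet cached trigger a line fill. On a device BAR, that fill is a read from the device.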

    > but if the device is modifying this memory area, then unless
    > you add code to invalidate those cache lines before reading the data,
    > you'll get stale data back.


    Yeah, this could definitely be tricky. Would pci_dma_sync suffice for this?

    > You could run into some other less obvious
    > issues as well, as normally device memory regions are not mapped write-back.
    >
    > In general, especially if you need to read data back from the device,
    > implementing a DMA engine would be by far the better option. Most
    > chipsets seem not at all optimized for handling sequential reads from
    > PCI memory from the CPU. (Even in the DMA case, you have to be careful
    > with what type of memory read transaction you use when transferring from
    > host memory - some chipsets don't like to burst more than one cycle if
    > you use normal Memory Read instead of Memory Read Line or Memory Read
    > Multiple.)


    True enough... Fortunately my device allows me to set these...

    What I am trying to avoid is PCI read transactions in general. PCI
    reads are slow pretty much regardless of whether they originate from
    the device or from the host, because of the multitude of bridges they
    have to go through (I've seen 5 in some cases... sheesh). So
    ultimately I'd like everything going to the device to be written
    from the host, and everything going towards the host to be DMA'd into
    RAM by the device. At least then we can take advantage of PCI write
    posting and don't have to wait for each write to actually complete
    before we plod on. But this depends on at least getting good write
    burst performance from the host, so that the time for the host to
    write the data is less than the time it would take for the device to
    read the data out of RAM.
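Dan's criterion can be written as a back-of-the-envelope comparison. The sketch assumes that posted host writes cost only bus time: a fixed overhead per burst plus one clock per data phase. It also assumes that each device read request pays a fixed round-trip latency through the bridges before its data arrives. All parameters are hypothetical, as are the example figures below; real numbers depend on the chipset and bridge count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-transfer costs, in PCI clocks. */
struct xfer_params {
    size_t bus_bytes;      /* bytes per data phase (4 on 32-bit PCI) */
    size_t pio_burst;      /* bytes per host write burst */
    size_t write_overhead; /* address/turnaround clocks per write burst */
    size_t dma_burst;      /* bytes per device read burst */
    size_t read_latency;   /* clocks until first data per read request */
};

static size_t ceil_div(size_t a, size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Host pushes len bytes with posted writes: the CPU does not wait
 * for completion, so only bus occupancy counts. */
static size_t pio_write_clocks(const struct xfer_params *p, size_t len)
{
    return ceil_div(len, p->pio_burst) * p->write_overhead
         + ceil_div(len, p->bus_bytes);
}

/* Device fetches len bytes from host RAM: each read request pays the
 * full round-trip latency before its data arrives. */
static size_t dma_read_clocks(const struct xfer_params *p, size_t len)
{
    return ceil_div(len, p->dma_burst) * p->read_latency
         + ceil_div(len, p->bus_bytes);
}
```

Take 4-byte data phases, 8-byte host bursts with 2 clocks of overhead each, and 64-byte device reads with 40 clocks of latency. Then 4096 bytes take 2048 clocks by PIO and 3584 by DMA. Dropping the host to 4-byte writes with 4 clocks of overhead raises the PIO figure to 5120, and DMA wins.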

    thanks again for your help!
    dan