Hanging Problem in Networkdriver - VxWorks


Thread: Hanging Problem in Networkdriver

  1. Hanging Problem in Networkdriver

    Hi All,
    I am writing a network driver for VxWorks 5.4. My target hangs after
    some time. I think I am acknowledging all the interrupts. What could
    the problem be? What are the different reasons that can cause a hang?
    Please reply.
    Thanks,
    Martin


  2. Re: Hanging Problem in Networkdriver

    Possibly, there is some problem in your interrupt service routine.




  3. Re: Hanging Problem in Networkdriver

    martinpattara@gmail.com wrote:
    > My target hangs after some time. I think I am acknowledging all the
    > interrupts. What are the different reasons that can cause a hang?


    There are a lot of possible reasons; however, it would be a lot easier
    to answer your question had you bothered to mention which network chip
    your driver is for. Certain problems are influenced directly by the
    hardware design.

    In general, if your driver stops handling traffic, it's because your
    driver code is buggy. The END driver model is not very complicated,
    but there are a few gotchas that trip most people up. Common mistakes I
    see are:

    - Forgetting to acquire the END TX semaphore in the TX completion
    routine. After a 'TX done' interrupt fires, you must dispatch a TX
    completion routine to run in tNetTask to release any resources relating
    to this particular transmission and check for errors. By necessity, the
    TX done handler accesses some of the same driver state as the send
    routine. The send routine can run in any task that uses the network,
    while the TX done handler runs in tNetTask, so both routines must use
    the END TX semaphore to stay synchronized (see the TX completion
    sketch after this list).

    - Poorly tested RX and TX error recovery routines. Your driver must
    handle RX overrun and TX underrun events correctly (not to mention RX
    CRC errors and RX runt and giant packet errors). With some devices, an
    RX or TX error will stall the RX or TX DMA channel, and the driver is
    responsible for restarting the channel after cleaning up. (For TX
    underrun errors, it's a good idea to increase the TX start threshold a
    little, if the hardware allows it; once it's set high enough, the
    underruns should cease. See the underrun sketch after this list.)
    Deliberately and repeatedly triggering an RX or TX error can be hard,
    so the error recovery code often isn't thoroughly tested. This is a
    frequent source of problems: a bug in the error recovery path can hose
    the driver, yet the difficulty of testing that path means the bug can
    go unnoticed.

    - Forgetting to call muxTxRestart() after transmit completions. The
    VxWorks TCP/IP stack (at least in 5.4) has a TX output queue. When
    sending packets, the stack checks for a return status of END_ERR_BLOCK
    from the driver send routine (via muxSend()). END_ERR_BLOCK means the
    transmitter is currently busy and can't accept another packet just now
    (maybe all DMA descriptors are in use, maybe the TX memory is full,
    etc.). Rather than just dropping the packets, the stack puts them in
    the TX output queue so they can be sent once the currently pending
    transmissions complete. The problem is, the queue is only so large (50
    entries by default), and once it fills up, the stack begins dropping
    packets (it checks whether there's room in the queue before it even
    attempts to call muxSend(), and drops the packet if the queue is
    full). When the driver detects that the transmissions have completed
    (via the TX done interrupt), and once it has reclaimed enough TX
    resources to begin sending again, it should call muxTxRestart() to
    prod the stack into retrying the packets waiting in the TX output
    queue (the TX completion sketch after this list shows where the call
    fits). If your driver doesn't call muxTxRestart(), the TX path within
    the TCP/IP stack will stall forever.

    - Incorrect synchronization of hardware access between the RX and TX
    paths. A good way to illustrate this is with an example. Take the SMSC
    91C111 ethernet chip. This is a small, low cost device with a very
    small I/O footprint (just 16 bytes of register space). Packets are
    transferred between the chip and the host using programmed I/O via a
    16-bit I/O port register. The same I/O port is used for both RX and TX
    operations. This means you have to be very careful: if the RX path in
    tNetTask pre-empts the TX path running in an application task, there
    will be a collision: the TX code might be interrupted while it's using
    the I/O port to transfer a packet into the chip. To avoid this
    problem, the RX code needs to acquire the END TX semaphore while
    accessing the I/O port, in order to synchronize with the TX path (see
    the RX handler sketch after this list). The same issue can occur with
    devices that have multiple register banks (or pages), or which require
    indirect accesses to modify registers that aren't directly visible in
    the register space. Basically, any hardware access that is not atomic
    needs to be properly guarded so that one code path can't preempt
    another in the middle of a non-atomic access.

    - Race conditions between ISRs and task level code. Ideally, all your
    ISR should do is mask interrupts and call netJobAdd() to schedule
    additional processing in tNetTask. However, you need to be sure that
    you unmask interrupts once all processing is done, and re-enabling
    them too late is one thing that can go wrong here. For example, say
    your hardware is designed so that the RX DMA channel stalls once the
    RX DMA descriptor ring is full (the RX overrun condition). You should
    make sure interrupts are unmasked first, and only then unstall the
    channel (see the ISR sketch after this list). Doing it the other way
    around can trigger a race condition: if you restart the RX DMA channel
    first, and a burst of traffic arrives before you get a chance to
    unmask interrupts, the RX ring might fill up and stall the channel
    again, this time before interrupts have been re-enabled, and you might
    miss the RX overrun event.

    - Incorrect handling of shared interrupts. With PCI, it's very common
    for multiple devices to share an interrupt vector. When your device is
    sharing an interrupt, its ISR may be invoked even though it has no
    interrupt pending. Your ISR should test that an event really is
    pending, and that your task level event handler isn't already
    scheduled in tNetTask, before masking interrupts and calling
    netJobAdd() again (the ISR sketch after this list checks both).
    Failing to do this can lead to excessive, unneeded calls to
    netJobAdd(), and from there to missed events and netJob ring overflow
    errors.
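
    To make the TX completion points above concrete, here is a minimal
    sketch of a TX done handler, not code from any real driver:
    MY_DRV_CTRL, myDevTxReclaim() and the txBlocked/jobQueued/txThresh
    fields are hypothetical placeholders, and the send routine is assumed
    to set txBlocked whenever it returns END_ERR_BLOCK. Only the
    END_TX_SEM_TAKE()/END_TX_SEM_GIVE() macros, muxTxRestart() and
    END_ERR_BLOCK are the real end.h/muxLib.h facilities.

    #include <vxWorks.h>
    #include <end.h>        /* END_OBJ, END_TX_SEM_TAKE/GIVE */
    #include <muxLib.h>     /* muxTxRestart() */

    /* Hypothetical per-device control structure. */
    typedef struct my_drv_ctrl
        {
        END_OBJ       end;        /* END_OBJ must come first */
        volatile BOOL txBlocked;  /* send routine returned END_ERR_BLOCK */
        volatile BOOL jobQueued;  /* tNetTask job already scheduled */
        int           txThresh;   /* current TX start threshold */
        /* ... descriptor ring bookkeeping, register base, etc. ... */
        } MY_DRV_CTRL;

    /* Hypothetical helper: reclaims completed descriptors and mBlks. */
    LOCAL void myDevTxReclaim (MY_DRV_CTRL * pDrvCtrl);

    /* TX completion handler; runs in tNetTask (scheduled from the ISR). */
    LOCAL void myDevTxDoneHandle (MY_DRV_CTRL * pDrvCtrl)
        {
        BOOL restart = FALSE;

        /* The send routine (running in any network task) touches the same
         * ring state, so serialize with it via the END TX semaphore. */
        END_TX_SEM_TAKE (&pDrvCtrl->end, WAIT_FOREVER);

        myDevTxReclaim (pDrvCtrl);      /* free resources, note TX errors */

        if (pDrvCtrl->txBlocked)        /* we told the stack END_ERR_BLOCK */
            {
            pDrvCtrl->txBlocked = FALSE;
            restart = TRUE;
            }

        END_TX_SEM_GIVE (&pDrvCtrl->end);

        if (restart)                    /* stack retries its TX queue */
            muxTxRestart (&pDrvCtrl->end);

        /* re-enable TX interrupts at the device here */
        }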
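
    On the error recovery point, the exact registers are device specific,
    so this underrun sketch is only about the order of operations: clean
    up, raise the threshold, then restart the channel. MY_CSR_WRITE(),
    MY_TX_THRESH_REG, MY_TX_CTRL_REG, MY_TX_ENABLE, MY_TX_THRESH_MAX and
    MY_TX_THRESH_STEP are invented names standing in for whatever your
    chip provides; it reuses the hypothetical MY_DRV_CTRL from the TX
    completion sketch.

    /* TX underrun recovery; runs in tNetTask.  Register names invented. */
    LOCAL void myDevTxUnderrunRecover (MY_DRV_CTRL * pDrvCtrl)
        {
        END_TX_SEM_TAKE (&pDrvCtrl->end, WAIT_FOREVER);

        myDevTxReclaim (pDrvCtrl);          /* drop the half-sent frame */

        /* Raise the TX start threshold a notch so more of the frame is
         * in the FIFO before transmission begins; after a few bumps the
         * underruns should stop. */
        if (pDrvCtrl->txThresh < MY_TX_THRESH_MAX)
            {
            pDrvCtrl->txThresh += MY_TX_THRESH_STEP;
            MY_CSR_WRITE (pDrvCtrl, MY_TX_THRESH_REG, pDrvCtrl->txThresh);
            }

        /* Restart the stalled TX channel only after cleanup is done. */
        MY_CSR_WRITE (pDrvCtrl, MY_TX_CTRL_REG, MY_TX_ENABLE);

        END_TX_SEM_GIVE (&pDrvCtrl->end);
        }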
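
    For the 91C111-style shared data port, the RX side has to hold the
    END TX semaphore across the programmed I/O copy. Another sketch
    reusing the hypothetical MY_DRV_CTRL: myDevRxPending() and
    myDevRxCopyPacket() are invented helpers; END_RCV_RTN_CALL() is the
    real end.h macro for handing a received mBlk up to the MUX.

    /* RX handler; runs in tNetTask.  The TX path (any network task) uses
     * the same 16-bit data port, so the two must never interleave. */
    LOCAL void myDevRxHandle (MY_DRV_CTRL * pDrvCtrl)
        {
        M_BLK_ID pMblk;

        while (myDevRxPending (pDrvCtrl))       /* invented status check */
            {
            /* Block the send routine while we own the data port. */
            END_TX_SEM_TAKE (&pDrvCtrl->end, WAIT_FOREVER);
            pMblk = myDevRxCopyPacket (pDrvCtrl);   /* PIO copy off chip */
            END_TX_SEM_GIVE (&pDrvCtrl->end);

            if (pMblk != NULL)
                END_RCV_RTN_CALL (&pDrvCtrl->end, pMblk);  /* up to MUX */
            }
        }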
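
    Finally, the ISR/tNetTask interaction from the last two points.
    netJobAdd() is the real netLib call; myDevIntPending(),
    myDevIntMask()/myDevIntUnmask(), myDevRxUnstall() and the jobQueued
    flag are invented, and the actual RX/TX processing is elided.

    #include <netLib.h>     /* netJobAdd() */

    LOCAL void myDevIntHandle (MY_DRV_CTRL * pDrvCtrl);

    /* Interrupt handler.  On a shared PCI interrupt line this can be
     * invoked when the device has nothing pending, so check first. */
    LOCAL void myDevInt (MY_DRV_CTRL * pDrvCtrl)
        {
        if (!myDevIntPending (pDrvCtrl))    /* not our interrupt */
            return;

        if (pDrvCtrl->jobQueued)            /* handler already scheduled */
            return;
        pDrvCtrl->jobQueued = TRUE;

        myDevIntMask (pDrvCtrl);            /* mask at the device */

        (void) netJobAdd ((FUNCPTR) myDevIntHandle,
                          (int) pDrvCtrl, 0, 0, 0, 0);
        }

    /* Task level event handler; runs in tNetTask. */
    LOCAL void myDevIntHandle (MY_DRV_CTRL * pDrvCtrl)
        {
        /* ... RX/TX processing, refill the RX descriptor ring ... */

        pDrvCtrl->jobQueued = FALSE;

        /* Unmask interrupts BEFORE unstalling the RX DMA channel: if the
         * ring fills again the moment the channel restarts, the new
         * overrun interrupt must not be missed. */
        myDevIntUnmask (pDrvCtrl);
        myDevRxUnstall (pDrvCtrl);
        }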

    There are probably more possibilities that I can't think of right now.
    Really though, instead of looking for general reasons for driver
    failures, you should be using the debugger and a good traffic generator
    to figure out what's wrong with your specific code. If you can reliably
    reproduce the hang, then you're well on the road to fixing it: just
    keep instrumenting and testing the code, and eventually the problem
    will jump out at you. I strongly recommend torture testing the driver
    with something like a Smartbits before putting your device into
    production.

    And next time you ask a question, give more information about what
    target/CPU/NIC you're using.

    -Bill

