MSI problem since 2.6.21 for devices not providing a mask in their MSI capability - Kernel

  1. MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

    Hi,

    We observe a problem with MSI since kernel 2.6.21 where interrupts would
    randomly stop working. We have tracked it down to the new
    msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not
    providing a "native" MSI mask, it was a no-op before, and now it
    disables MSI in the MSI-ctl register which according to the PCI spec is
    interpreted as reverting the device to legacy interrupts. If such a
    device tries to generate a new interrupt during the "masked" window, the
    device will signal a legacy interrupt, which is generally
    ignored/never acked and causes interrupts to no longer work for the
    device/driver combination (even after the enable bit is restored).
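A minimal user-space model of the failure described above may help. Everything in it is hypothetical: `struct model_dev`, `model_mask()` and `model_raise()` are invented for illustration and are not kernel APIs. The rule that the device refuses to send a new interrupt until a pending INTx is acked is an assumption, modelled on the Myrinet behaviour described later in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one MSI-capable device; not kernel code. */
enum delivery { DELIVER_NONE, DELIVER_MSI, DELIVER_INTX };

struct model_dev {
    bool has_maskbit;   /* per-vector mask capability present */
    bool msi_enable;    /* MSI Enable bit in Message Control */
    bool mask;          /* per-vector mask bit (only with has_maskbit) */
    bool pending;       /* per-vector pending bit (only with has_maskbit) */
    bool intx_pending;  /* device signalled INTx and awaits an INTx ack */
};

/* 2.6.21-style masking: use the mask bit if there is one, otherwise
 * toggle MSI Enable, which is the behaviour reported in this thread. */
void model_mask(struct model_dev *d, bool masked)
{
    if (d->has_maskbit)
        d->mask = masked;
    else
        d->msi_enable = !masked;
}

/* The device has an event to report; returns how it was signalled. */
enum delivery model_raise(struct model_dev *d)
{
    /* Assumption: no new interrupt until a pending INTx is acked. */
    if (d->intx_pending)
        return DELIVER_NONE;
    if (!d->msi_enable) {
        /* MSI disabled: the device falls back to INTx signalling. */
        d->intx_pending = true;
        return DELIVER_INTX;
    }
    if (d->has_maskbit && d->mask) {
        d->pending = true;  /* sent later, when the vector is unmasked */
        return DELIVER_NONE;
    }
    return DELIVER_MSI;
}
```

With `has_maskbit = false`, one event goes out as a normal MSI, an event during the "masked" window goes out as INTx, and after MSI Enable is restored nothing more is delivered, because the device is still waiting for an INTx ack the driver will never send.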


    Is there anything apart from irq migration that strongly requires
    masking? Is it possible to do the irq migration without masking?



    Loic

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

    Loic Prylli writes:

    > Hi,
    >
    > We observe a problem with MSI since kernel 2.6.21 where interrupts would
    > randomly stop working. We have tracked it down to the new
    > msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not
    > providing a "native" MSI mask, it was a no-op before, and now it
    > disables MSI in the MSI-ctl register which according to the PCI spec is
    > interpreted as reverting the device to legacy interrupts. If such a
    > device tries to generate a new interrupt during the "masked" window, the
    > device will signal a legacy interrupt, which is generally
    > ignored/never acked and causes interrupts to no longer work for the
    > device/driver combination (even after the enable bit is restored).


    We should also be leaving the INTx irqs disabled. So no irq
    should be generated.

    If you have a mask bit implemented you are required to be
    able to refire it after the msi is enabled. I don't recall
    the requirements for when intx and msi irqs are both
    disabled. Intuitively I would expect no irq message to
    be generated, and at most the card would need to be polled
    manually to recognize a device event happened.
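For comparison, the well-defined case with a per-vector mask bit can be sketched as follows. This is a simplified model of the PCI 3.0 Mask/Pending semantics, assuming a single vector; `struct vec`, `vec_event()` and `vec_set_mask()` are hypothetical names. Note that several events while masked collapse into a single message on unmask, which is acceptable for edge-style MSI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one MSI vector with a per-vector mask bit. */
struct vec {
    bool masked;   /* Mask bit, written by software */
    bool pending;  /* Pending bit, set by the device */
    int sent;      /* messages actually written to the bus */
};

/* The device has an event to report. */
void vec_event(struct vec *v)
{
    if (v->masked)
        v->pending = true;  /* may not send while masked: remember it */
    else
        v->sent++;
}

/* Software writes the Mask bit. */
void vec_set_mask(struct vec *v, bool masked)
{
    v->masked = masked;
    if (!masked && v->pending) {
        /* Unmasked with an event pending: send it now. */
        v->pending = false;
        v->sent++;
    }
}
```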

    Certainly firing an irq and having it get completely lost is
    unfortunate, and a major pain if you are trying to use the
    card.

    As for the previous no-op behavior, that was a bug.

    > Is there anything apart from irq migration that strongly requires
    > masking? Is it possible to do the irq migration without masking?


    enable_irq/disable_irq. Although we can get away with a software
    emulation there and those are only needed if the driver calls them.

    The PCI spec requires disabling/masking the msi when reprogramming it.
    So as a general rule we cannot do better. Further, because we are
    writing to multiple pci config registers the only way we can safely
    reprogram the message is with the msi disabled/masked on the card in
    some fashion.
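To illustrate why a multi-register update is unsafe without masking, here is a sketch of a torn update. It assumes the message is written one dword at a time (address low, address high, data) and that the device may fire between any two writes; `update_with_fire()` and the example values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three dwords of a 64-bit MSI message, in write order. */
enum { MSI_ADDR_LO, MSI_ADDR_HI, MSI_DATA, MSI_WORDS };

/*
 * Copy next[] into cur[] one dword at a time, as separate config writes
 * would, and record in seen[] the message the device would send if it
 * fired after fire_after dwords had been written (0..MSI_WORDS).
 */
void update_with_fire(uint32_t cur[MSI_WORDS], const uint32_t next[MSI_WORDS],
                      int fire_after, uint32_t seen[MSI_WORDS])
{
    assert(fire_after >= 0 && fire_after <= MSI_WORDS);
    for (int i = 0; i <= MSI_WORDS; i++) {
        if (i == fire_after)
            memcpy(seen, cur, MSI_WORDS * sizeof(uint32_t));
        if (i < MSI_WORDS)
            cur[i] = next[i];
    }
}
```

Firing after the first write produces the new address combined with the old data, a message that neither the old nor the new configuration intended.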

    I suspect what needs to happen is a spec search to verify that the
    current linux behavior is at least reasonable within the spec.

    Once we have verified that the generic code cannot do better,
    we can look at workarounds. One possibility is for the generic
    code to provide some overrides for the methods for masking and
    reading/writing to a msi message.

    I don't want to break anyone's hardware, but at the same time I want us
    to be careful and in spec for the default case.

    Eric

  3. Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability


    > We should also be leaving the INTx irqs disabled. So no irq
    > should be generated.
    >
    > If you have a mask bit implemented you are required to be
    > able to refire it after the msi is enabled. I don't recall
    > the requirements for when intx and msi irqs are both
    > disabled. Intuitively I would expect no irq message to
    > be generated, and at most the card would need to be polled
    > manually to recognize a device event happened.
    >
    > Certainly firing an irq and having it get completely lost is
    > unfortunate, and a major pain if you are trying to use the
    > card.
    >
    > As for the previous no-op behavior, that was a bug.


    Well, yes and no ... A valid option here would be to use soft-masking,
    which is possible because MSIs are edge interrupts. That is, basically,
    when masked, just ignore them and set IRQ_PENDING, and when unmasked,
    replay (which can be done with a softirq if there is no HW mechanism for
    that). The genirq code contains all the necessary infrastructure for
    doing that stuff, it's fairly trivial, and it would probably avoid
    stepping into HW la-la land (how much do you bet HW generally gets that
    masking thing wrong?)
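The soft-masking idea can be sketched like this. It is a simplified, single-threaded model: `SOFT_DISABLED`, `SOFT_PENDING` and the `soft_irq_*` functions are hypothetical stand-ins for the genirq status bits (IRQ_DISABLED, IRQ_PENDING) and its resend path, and the replay here runs synchronously rather than from a softirq.

```c
#include <assert.h>

/* Hypothetical stand-ins for the genirq IRQ_DISABLED / IRQ_PENDING bits. */
#define SOFT_DISABLED 0x1u
#define SOFT_PENDING  0x2u

struct soft_irq {
    unsigned int status;
    int handled;  /* number of times the driver handler ran */
};

static void run_handler(struct soft_irq *s)
{
    s->handled++;
}

/* An edge-triggered MSI arrives. The hardware is never masked. */
void soft_irq_arrive(struct soft_irq *s)
{
    if (s->status & SOFT_DISABLED) {
        s->status |= SOFT_PENDING;  /* just remember it */
        return;
    }
    run_handler(s);
}

void soft_irq_disable(struct soft_irq *s)
{
    s->status |= SOFT_DISABLED;
}

/* Re-enable and replay an edge that arrived while disabled. */
void soft_irq_enable(struct soft_irq *s)
{
    s->status &= ~SOFT_DISABLED;
    if (s->status & SOFT_PENDING) {
        s->status &= ~SOFT_PENDING;
        run_handler(s);  /* a real kernel might defer this to a softirq */
    }
}
```

The device's MSI Enable bit is never touched, so there is no window in which it can fall back to INTx.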

    > The PCI spec requires disabling/masking the msi when reprogramming it.
    > So as a general rule we cannot do better. Further, because we are
    > writing to multiple pci config registers the only way we can safely
    > reprogram the message is with the msi disabled/masked on the card in
    > some fashion.


    Hrm... all right, that will be an issue, so migration needs real
    masking.

    > I suspect what needs to happen is a spec search to verify that the
    > current linux behavior is at least reasonable within the spec.
    >
    > Once we have verified that the generic code cannot do better,
    > we can look at workarounds. One possibility is for the generic
    > code to provide some overrides for the methods for masking and
    > reading/writing to a msi message.
    >
    > I don't want to break anyone's hardware, but at the same time I want us
    > to be careful and in spec for the default case.


    Ben.



  4. Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

    Benjamin Herrenschmidt writes:
    >
    > Well, yes and no ... A valid option here would be to use soft-masking,
    > which is possible because MSIs are edge interrupts. That is, basically,
    > when masked, just ignore them and set IRQ_PENDING, and when unmasked,
    > replay (which can be done with a softirq if there is no HW mechanism for
    > that). The genirq code contains all the necessary infrastructure for
    > doing that stuff, it's fairly trivial, and it would probably avoid
    > stepping into HW la-la land (how much do you bet HW generally gets that
    > masking thing wrong?)


    Well. If people actually use it I suspect it will work ok. The
    circuitry is quite simple so as long as people get their requirements
    straight we should be fine. Which is why I tried to get everything
    working as well as we could sooner rather than later. Of course
    drivers are free not to call anything that would cause the irq
    to be masked.

    That said the current disable_irq and enable_irq path is using the
    IRQ_PENDING infrastructure on x86. So the only time this comes up
    is for irq migration.

    Eric

  5. Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

    On 10/3/2007 5:49 PM, Eric W. Biederman wrote:
    > Loic Prylli writes:
    >
    >
    >> Hi,
    >>
    >> We observe a problem with MSI since kernel 2.6.21 where interrupts would
    >> randomly stop working. We have tracked it down to the new
    >> msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not
    >> providing a "native" MSI mask, it was a no-op before, and now it
    >> disables MSI in the MSI-ctl register which according to the PCI spec is
    >> interpreted as reverting the device to legacy interrupts. If such a
    >> device tries to generate a new interrupt during the "masked" window, the
    >> device will signal a legacy interrupt, which is generally
    >> ignored/never acked and causes interrupts to no longer work for the
    >> device/driver combination (even after the enable bit is restored).
    >>

    >
    > We should also be leaving the INTx irqs disabled. So no irq
    > should be generated.
    >




    Even if the INTx line is not raised, you cannot rely on the device to
    retain memory of an interrupt triggered while MSI is disabled, and
    expect it to fire it in MSI form later when MSI is re-enabled. The
    PCI spec does not provide any implicit or explicit guarantee about the
    MSI enable flag that would allow it to be used for temporary masking
    without running the risk of losing such interrupts. Moreover, even if
    you eventually call the interrupt handler to recover a lost interrupt,
    having switched the device to INTx mode (whether or not the INTx line
    was forced down with the corresponding PCI_COMMAND bit) without
    informing the driver can (and in our case will) break interrupt
    handshaking, because MSI and INTx interrupts are not acked in the same
    way (INTx requires an extra step that we don't do for MSI and that the
    device will still expect unless it goes through driver init again).


    > If you have a mask bit implemented you are required to be
    > able to refire it after the msi is enabled.




    Indeed the masking case is well-defined by the spec (including the
    operation of the pending bits). And my subject was definitely restricted
    to devices without that masking capability.



    > I don't recall
    > the requirements for when intx and msi irqs are both
    > disabled. Intuitively I would expect no irq message to
    > be generated, and at most the card would need to be polled
    > manually to recognize a device event happened.
    >
    > Certainly firing an irq and having it get completely lost is
    > unfortunate, and a major pain if you are trying to use the
    > card.
    >
    > As for the previous no-op behavior, that was a bug.
    >





    OK, the no-op was a bug, but using the enable bit for temporary masking
    purposes still feels like a bug. I am afraid the only safe solution
    might be to prohibit any operation that absolutely requires masking if
    real masking is not available. Maybe the set_affinity method should
    simply be disabled for devices that do not support masking (unless there
    is an option of doing it without masking, for instance by guaranteeing
    that only one word of the MSI capability is changed).
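The policy suggested above could look roughly like this. `can_retarget()` is a hypothetical helper, not an existing kernel function, and it assumes that a single config-space write is seen atomically by the device, so that changing only one register cannot produce a torn message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The registers of an MSI message as programmed into the capability. */
struct msi_words {
    uint32_t address_lo;
    uint32_t address_hi;
    uint32_t data;
};

/* Number of config registers that an update from cur to next must touch. */
static int changed_dwords(const struct msi_words *cur,
                          const struct msi_words *next)
{
    return (cur->address_lo != next->address_lo) +
           (cur->address_hi != next->address_hi) +
           (cur->data != next->data);
}

/*
 * Hypothetical policy: reprogram the message only if the device can
 * really mask, or if the update is a single register write (assumed
 * to be seen atomically by the device).
 */
bool can_retarget(bool has_maskbit, const struct msi_words *cur,
                  const struct msi_words *next)
{
    if (has_maskbit)
        return true;
    return changed_dwords(cur, next) <= 1;
}
```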



    >
    >> Is there anything apart from irq migration that strongly requires
    >> masking? Is it possible to do the irq migration without masking?
    >>

    >
    > enable_irq/disable_irq. Although we can get away with a software
    > emulation there and those are only needed if the driver calls them.
    >




    I don't think there is a problem here: no sane driver would depend on
    receiving edge interrupts triggered while irqs were explicitly disabled.



    > The PCI spec requires disabling/masking the msi when reprogramming it.
    > So as a general rule we cannot do better.




    Do you have a reference for that requirement? The spec only vaguely
    associates MSI programming with "configuration", but I haven't found any
    explicit indication that it should not work.



    > Further because we are
    > writing to multiple pci config registers the only way we can safely
    > reprogram the message is with the msi disabled/masked on the card in
    > some fashion.
    >



    That's indeed a show-stopper.


    > I suspect what needs to happen is a spec search to verify that the
    > current linux behavior is at least reasonable within the spec.
    >




    I don't see how you can disable MSI through the control bit (which is
    equivalent to switching the device to INTx whether or not the INTx
    disable bit is set in PCI_COMMAND) in the middle of operations, not tell
    the driver, and not risk losing interrupts (unless you rely on much
    more than the spec).


    > I don't want to break anyone's hardware, but at the same time I want us
    > to be careful and in spec for the default case.
    >
    >



    An interrupt arriving during the set_affinity masking would certainly
    cause a problem for the device we use (MSI-enable switches between INTx
    and MSI mode, and the two kinds of interrupt are not acked the same way,
    assuming they would even be delivered to the driver), but I have some
    new data: upon further examination, the lost interrupts we have seen
    seem in fact to be caused at a different time:
    - the problem is the mask_ack_irq() done in handle_edge_irq() when a
    new interrupt arrives before the IRQ_INPROGRESS bit is cleared at the end
    of the function.

    Again here, switching MSI-off during hot operation breaks the interrupt
    accounting and handshaking between our driver and device. At least this
    case might be easier to handle, it seems safe to not mask there (when
    some proven masking is not available).
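The sequence can be modelled in a few lines. This is a hypothetical, simplified model, not the kernel's handle_edge_irq(): it assumes a device without a mask bit (so "masking" clears MSI Enable) and that all extra edges arrive during the first handler run. It counts how many edges go out as INTx with and without masking on reentry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one MSI vector on a device without a mask bit. */
struct reentry_model {
    bool msi_enable;    /* cleared when the core "masks" the vector */
    bool in_progress;   /* handler currently running */
    bool pending;       /* an edge arrived while in_progress */
    int handled;        /* driver handler invocations */
    int intx_signalled; /* edges sent as INTx while MSI was disabled */
};

/* An edge arrives while the previous one is still being handled. */
static void reentrant_edge(struct reentry_model *m, bool mask_on_reentry)
{
    assert(m->in_progress);
    if (!m->msi_enable) {
        m->intx_signalled++;   /* goes out as INTx; nobody handles it */
        return;
    }
    m->pending = true;
    if (mask_on_reentry)
        m->msi_enable = false; /* the mask_ack_irq() step */
}

/* Handle one edge while `extra` more edges arrive during the first run. */
void run_burst(struct reentry_model *m, int extra, bool mask_on_reentry)
{
    m->in_progress = true;
    do {
        m->pending = false;
        m->handled++;
        for (; extra > 0; extra--)
            reentrant_edge(m, mask_on_reentry);
        if (m->pending)
            m->msi_enable = true;  /* unmask before replaying */
    } while (m->pending);
    m->in_progress = false;
}
```

In both runs the handler executes twice, since edges coalesce. The difference is that with masking on reentry one edge was signalled as INTx, which is exactly what breaks the handshake on a device that then expects an INTx ack.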


    Loic




  6. Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

    Loic Prylli writes:

    > Even if the INTx line is not raised, you cannot rely on the device to
    > retain memory of an interrupt triggered while MSI is disabled, and
    > expect it to fire it in MSI form later when MSI is re-enabled.


    Sure. My expectation is if we happened to hit such a narrow window
    the irq would simply be dropped.

    >
    >> If you have a mask bit implemented you are required to be
    >> able to refire it after the msi is enabled.

    >
    >
    >
    > Indeed the masking case is well-defined by the spec (including the
    > operation of the pending bits). And my subject was definitely restricted
    > to devices without that masking capability.


    Right. And INTx has such a pending bit as well. I guess I figured
    if MSI was enabled transferring it over would be the obvious thing to
    do.

    > OK, the no-op was a bug, but using the enable bit for temporary masking
    > purposes still feels like a bug. I am afraid the only safe solution
    > might be to prohibit any operation that absolutely requires masking if
    > real masking is not available. Maybe the set_affinity method should
    > simply be disabled for devices that do not support masking (unless there
    > is an option of doing it without masking, for instance by guaranteeing
    > that only one word of the MSI capability is changed).


    It's worth looking at; I think that happens in the common case.

    Of course it might even make sense simply to refuse to enable MSI
    if there is not a masking capability present.
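Checking for the mask capability is a matter of decoding the MSI Message Control register. The bit values below match the PCI_MSI_FLAGS_* constants in include/linux/pci_regs.h; `decode_msi_control()` and `msi_allowed_strict()` are hypothetical helpers, and the second one implements the "refuse MSI without masking" idea only as an illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Message Control bits (same values as PCI_MSI_FLAGS_* in Linux). */
#define MSI_FLAGS_ENABLE  0x0001u  /* MSI Enable */
#define MSI_FLAGS_QMASK   0x000eu  /* Multiple Message Capable */
#define MSI_FLAGS_QSIZE   0x0070u  /* Multiple Message Enable */
#define MSI_FLAGS_64BIT   0x0080u  /* 64-bit address capable */
#define MSI_FLAGS_MASKBIT 0x0100u  /* per-vector masking (PCI 3.0) */

struct msi_caps {
    bool is_64bit;
    bool has_maskbit;
    unsigned int max_vectors;  /* 1, 2, 4, 8, 16 or 32 */
};

struct msi_caps decode_msi_control(uint16_t control)
{
    struct msi_caps c;

    c.is_64bit = control & MSI_FLAGS_64BIT;
    c.has_maskbit = control & MSI_FLAGS_MASKBIT;
    /* Encoded as log2 of the count; reserved encodings are not checked. */
    c.max_vectors = 1u << ((control & MSI_FLAGS_QMASK) >> 1);
    return c;
}

/* Hypothetical strict policy: only allow MSI when real masking exists. */
bool msi_allowed_strict(uint16_t control)
{
    return decode_msi_control(control).has_maskbit;
}
```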

    >> The PCI spec requires disabling/masking the msi when reprogramming it.
    >> So as a general rule we cannot do better.

    >
    >
    >
    > Do you have a reference for that requirement? The spec only vaguely
    > associates MSI programming with "configuration", but I haven't found any
    > explicit indication that it should not work.


    I would have to look it up again, but when I looked a couple of months
    ago it said that the result is only defined when it is
    disabled/masked.

    >> I suspect what needs to happen is a spec search to verify that the
    >> current linux behavior is at least reasonable within the spec.
    >>

    >
    >
    > I don't see how you can disable MSI through the control bit (which is
    > equivalent to switching the device to INTx whether or not the INTx
    > disable bit is set in PCI_COMMAND) in the middle of operations, not tell
    > the driver, and not risk losing interrupts (unless you rely on much
    > more than the spec).


    I will look again. My impression is that the bit is defined as MSI
    enable, not as a mode switch. Although Myrinet has clearly implemented
    it as a mode switch.

    >> I don't want to break anyone's hardware, but at the same time I want us
    >> to be careful and in spec for the default case.
    >>
    >>

    >
    >
    > An interrupt arriving during the set_affinity masking would certainly
    > cause a problem for the device we use (MSI-enable switches between INTx
    > and MSI mode, and the two kinds of interrupt are not acked the same way,
    > assuming they would even be delivered to the driver), but I have some
    > new data: upon further examination, the lost interrupts we have seen
    > seem in fact to be caused at a different time:
    > - the problem is the mask_ack_irq() done in handle_edge_irq() when a
    > new interrupt arrives before the IRQ_INPROGRESS bit is cleared at the end
    > of the function.
    >
    > Again here, switching MSI-off during hot operation breaks the interrupt
    > accounting and handshaking between our driver and device. At least this
    > case might be easier to handle, it seems safe to not mask there (when
    > some proven masking is not available).


    Interesting. So an irq fires before the driver has finished
    processing the last instance of the irq. This is very close to a
    screaming irq and something we may actually want to deal with.

    Eric

  7. Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

    On 10/3/2007 11:58 PM, Eric W. Biederman wrote:
    > Right. And INTx has such a pending bit as well. I guess I figured
    > if MSI was enabled transferring it over would be the obvious thing to
    > do.
    >




    The INTx pending and disable bits were only added starting with PCI 2.3,
    so in PCI 2.2 and PCI-X 1.0{,a,b} those bits don't exist at all, and
    there is still a significant number of such devices in use or on the market.
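Since bit 10 of the Command register is reserved on pre-2.3 devices and reserved bits read back as zero, one way to tell whether INTx Disable is implemented is to write it and read it back. The sketch below does that against a fake config space; `struct fake_cfg`, its `writable_mask` field and the accessor functions are hypothetical stand-ins for real config accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Interrupt Disable: bit 10 of the Command register (PCI 2.3 and later);
 * same value as PCI_COMMAND_INTX_DISABLE in include/linux/pci_regs.h. */
#define CMD_INTX_DISABLE 0x0400u

/* Hypothetical fake config space: only bits in writable_mask are
 * implemented; reserved bits read back as zero. */
struct fake_cfg {
    uint16_t command;
    uint16_t writable_mask;
};

static void cfg_write_command(struct fake_cfg *d, uint16_t v)
{
    d->command = v & d->writable_mask;
}

static uint16_t cfg_read_command(const struct fake_cfg *d)
{
    return d->command;
}

/* Set the bit, see whether it sticks, then restore the old value. */
bool intx_disable_supported(struct fake_cfg *d)
{
    uint16_t orig = cfg_read_command(d);
    bool ok;

    cfg_write_command(d, (uint16_t)(orig | CMD_INTX_DISABLE));
    ok = (cfg_read_command(d) & CMD_INTX_DISABLE) != 0;
    cfg_write_command(d, orig);
    return ok;
}
```

On a device where the probe fails there is nothing to keep INTx quiet while MSI Enable is cleared, which is the spurious-interrupt risk raised in this thread.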

    I agree it would look natural (and that probably happens for many or
    most devices) to transfer the interrupt state from INTx to MSI, but I
    don't think you can rely on it without making some assumptions about
    device interrupt management that are outside the scope of the spec.



    > Of course it might even make sense simply to refuse to enable MSI
    > if there is not a masking capability present.
    >




    The possibility of masking for MSI was only specified (and then only as
    an optional feature) starting with PCI 3.0. PCI Express 1.0 and 1.0a are
    based on the older PCI 2.3, and corresponding devices are very unlikely
    to have it. So a majority of devices in the field, across the different
    PCI categories (conventional PCI, PCI-X, PCI Express), might still have
    no MSI masking capability.


    >
    >>> The PCI spec requires disabling/masking the msi when reprogramming it.
    >>> So as a general rule we cannot do better.
    >>>
    >>>

    >> Do you have a reference for that requirement? The spec only vaguely
    >> associates MSI programming with "configuration", but I haven't found any
    >> explicit indication that it should not work.
    >>

    > I would have to look it up again, but when I looked a couple of months
    > ago it said that the result is only defined when it is
    > disabled/masked.
    >
    >



    I found this quote in PCI-3.0/6.8.3.5:
    "For MSI-X, a function is permitted to cache Address and Data values
    from unmasked MSI-X Table entries. However, anytime software unmasks a
    currently masked MSI-X Table entry either by clearing its Mask bit or
    by clearing the Function Mask bit, the function must update any Address
    or Data values that it cached from that entry. If software changes the
    Address or Data value of an entry while the entry is unmasked, the
    result is undefined."


    I haven't seen a caching possibility mentioned for the MSI case, so
    apart from the problem with multi-word changes, maybe changing the
    Address or Data can be done at anytime for MSI.
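For reference, the MSI capability keeps the address and data in separate config registers, at offsets that depend on the 64-bit and per-vector-masking capability bits. The sketch below computes them; the values match PCI_MSI_ADDRESS_LO/HI, PCI_MSI_DATA_32/64 and PCI_MSI_MASK_32/64 in include/linux/pci_regs.h, while `msi_layout_for()` itself is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Byte offsets of the MSI registers from the start of the capability. */
struct msi_layout {
    int address_lo;
    int address_hi;  /* -1 when the function is 32-bit only */
    int data;        /* 16-bit Message Data register */
    int mask;        /* -1 when per-vector masking is absent */
    int pending;     /* -1 when per-vector masking is absent */
};

struct msi_layout msi_layout_for(bool is_64bit, bool has_maskbit)
{
    struct msi_layout l;

    l.address_lo = 0x04;
    l.address_hi = is_64bit ? 0x08 : -1;
    l.data = is_64bit ? 0x0c : 0x08;
    /* Mask Bits and Pending Bits follow the data dword. */
    l.mask = has_maskbit ? l.data + 4 : -1;
    l.pending = has_maskbit ? l.data + 8 : -1;
    return l;
}
```

A change confined to one of these registers is a single config write; a change to both address and data is at least two.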




    >> I don't see how you can disable MSI through the control bit (which is
    >> equivalent to switching the device to INTx whether or not the INTx
    >> disable bit is set in PCI_COMMAND) in the middle of operations, not tell
    >> the driver, and not risk losing interrupts (unless you rely on much
    >> more than the spec).
    >>

    >
    > I will look again. My impression is that the bit is defined as MSI
    > enable, not as a mode switch. Although Myrinet has clearly implemented
    > it as a mode switch.
    >




    It is indeed defined as MSI enable, but that's not in contradiction with
    being equivalent to a "mode switch between INTx and MSI" (ignoring MSI-X
    in that context). The spec seems to define the following "modes":

    MSI-enable = 1, INTx-disable = x : MSI mode
    MSI-enable = 0, INTx-disable = 0 : INTx mode, with INTx signal == INTx pending
    MSI-enable = 0, INTx-disable = 1 : INTx "polling/diag" mode, using the
                                       INTx pending bit
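The table can be written as a small function. `enum irq_mode` and `irq_mode_of()` are hypothetical names that just encode the three rows above; the point is that clearing MSI Enable never yields a "masked MSI" state, only one of the two INTx modes.

```c
#include <assert.h>
#include <stdbool.h>

enum irq_mode { MODE_MSI, MODE_INTX, MODE_INTX_POLL };

/* Mode implied by MSI Enable and the PCI 2.3 INTx Disable bit (MSI-X ignored). */
enum irq_mode irq_mode_of(bool msi_enable, bool intx_disable)
{
    if (msi_enable)
        return MODE_MSI;        /* INTx Disable is "don't care" */
    if (!intx_disable)
        return MODE_INTX;       /* INTx signal follows the pending bit */
    return MODE_INTX_POLL;      /* only the pending bit reports events */
}
```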

    The only specificity of Myrinet is having relatively independent logic
    for the two modes, while at the same time requiring any pending INTx to
    be acked before starting any kind of new interrupt.


    > Interesting. So an irq fires before the driver has finished
    > processing the last instance of the irq. This is very close to a
    > screaming irq and something we may actually want to deal with.
    >
    >




    In our case it is true that the device can fire a bounded number of MSIs
    without acks (but not an infinite number: there is a limited number of
    interrupt tokens, and the interrupt rate is limited by a configurable
    minimum time between interrupts, which defaults to ~10us). I suspect a
    race with other interrupts was involved, because otherwise that minimum
    inter-interrupt delay would prevent entering that code path.

    I think even a more ordinary interrupt scheme (with an explicit ack
    required for each interrupt before generating a new one) can also
    exercise that code path, since between the return from the handler and
    the clearing of IRQ_INPROGRESS there is an opportunity for the next
    interrupt to happen.

    To detect a crazy device generating storms of edge interrupts, I guess
    note_interrupt() could be called during this "reentrant detection" if
    masking was made conditional.
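A storm check along those lines could reuse the note_interrupt() heuristic. The sketch below is a hypothetical stand-alone version: the thresholds mirror those in kernel/irq/spurious.c of this era (evaluate every 100000 interrupts, trip if more than 99900 were unhandled), and reentrant edges that were only marked pending are counted as unhandled.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative thresholds, mirroring the "nobody cared" check. */
#define STORM_WINDOW    100000u
#define STORM_UNHANDLED  99900u

struct storm_stats {
    unsigned int count;      /* events seen in the current window */
    unsigned int unhandled;  /* of which were not handled */
    bool disabled;           /* set once the source is declared screaming */
};

/* Account one event; handled == false for reentrant edges that were
 * only marked pending. */
void storm_note(struct storm_stats *s, bool handled)
{
    if (!handled)
        s->unhandled++;
    if (++s->count < STORM_WINDOW)
        return;
    if (s->unhandled > STORM_UNHANDLED)
        s->disabled = true;  /* the kernel would disable the irq here */
    s->count = 0;
    s->unhandled = 0;
}
```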


    Loic


    P.S.: just a little more context: in all Myrinet hardware, enough of the
    interrupt functionality is implemented in firmware that we can avoid
    losing interrupts whenever MSI-enable is toggled, and we have already
    started distributing a firmware-based software update for users running
    Linux >= 2.6.21 and using MSI. So for Myrinet the problem is more or
    less already closed.

    The only motivation for starting the thread was that it seemed possible
    that other non-Myrinet devices could be affected by that "use MSI-enable
    as a masking function" approach:
    - the first problem being a possible spurious INTx interrupt during the
    "MSI-disabled" window (and for most PCI-X 133MHz or earlier devices
    there might not be an INTx disable bit to avoid that).
    - it does not seem far-fetched that other devices could also lose an
    interrupt during that toggling; at best this is a grey area of the spec.
    - the race to trigger any of those potential problems is small, so they
    would be hard to reproduce.


  8. Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

    Loic Prylli writes:

    I am still looking through my copy of the PCI specs, so I will reply to
    that part in a bit.

    > To detect a crazy device generating storms of edge interrupts, I guess
    > note_interrupt() could be called during this "reentrant detection" if
    > masking was made conditional.


    Hmm. Something like this?

    Only mask it if the irq is disabled, and only disable it if
    the user requests it or if note_interrupt decides we are screaming?

    void handle_edge_irq(unsigned int irq, struct irq_desc *desc)
    {
        const unsigned int cpu = smp_processor_id();

        spin_lock(&desc->lock);

        desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);

        /*
         * If the IRQ is disabled (or has no handler), we shouldn't
         * process it. Mark it pending, mask it and get out.
         */
        if (unlikely((desc->status & IRQ_DISABLED) || !desc->action)) {
            desc->status |= (IRQ_PENDING | IRQ_MASKED);
            mask_ack_irq(desc, irq);
            goto out_unlock;
        }

        /*
         * If we're already running this IRQ, only mark it pending and
         * ack it; do NOT mask. The loop below replays it. Report it as
         * unhandled so that a screaming source still trips note_interrupt().
         */
        if (unlikely(desc->status & IRQ_INPROGRESS)) {
            desc->status |= IRQ_PENDING;
            desc->chip->ack(irq);
            if (!noirqdebug)
                note_interrupt(irq, desc, IRQ_NONE /* IRQ_DUP? */);
            goto out_unlock;
        }

        kstat_cpu(cpu).irqs[irq]++;

        /* Start handling the irq */
        desc->chip->ack(irq);

        /* Mark the IRQ currently in progress. */
        desc->status |= IRQ_INPROGRESS;

        do {
            struct irqaction *action = desc->action;
            irqreturn_t action_ret;

            /* Handler was removed while we were running: mask and leave. */
            if (unlikely(!action)) {
                desc->chip->mask(irq);
                goto out_unlock;
            }

            desc->status &= ~IRQ_PENDING;
            spin_unlock(&desc->lock);
            action_ret = handle_IRQ_event(irq, action);
            if (!noirqdebug)
                note_interrupt(irq, desc, action_ret);
            spin_lock(&desc->lock);

            /* Loop while new edges arrived and the IRQ is still enabled. */
        } while ((desc->status & (IRQ_PENDING | IRQ_DISABLED)) == IRQ_PENDING);

        desc->status &= ~IRQ_INPROGRESS;
    out_unlock:
        spin_unlock(&desc->lock);
    }

    Eric