Re: [RFC][PATCH] netconsole: avoid deadlock on printk from driver code - Kernel

This is a discussion on Re: [RFC][PATCH] netconsole: avoid deadlock on printk from driver code - Kernel ; On Wed, Aug 13, 2008 at 11:53:24AM +0200, Vegard Nossum wrote: > I encountered a hard-to-debug deadlock when I pulled out the plug of my > RealTek 8139 which was also running netconsole: The driver wants to print > a ...

+ Reply to Thread
Results 1 to 4 of 4

Thread: Re: [RFC][PATCH] netconsole: avoid deadlock on printk from driver code

  1. Re: [RFC][PATCH] netconsole: avoid deadlock on printk from driver code

    On Wed, Aug 13, 2008 at 11:53:24AM +0200, Vegard Nossum wrote:
    > I encountered a hard-to-debug deadlock when I pulled out the plug of my
    > RealTek 8139 which was also running netconsole: The driver wants to print
    > a "link down" message. However, this triggers netconsole, which wants to
    > print the message using the same device. Here is a backtrace:
    >
    > [] _spin_lock_irqsave+0x76/0x90
    > [] rtl8139_start_xmit+0x65/0x130 <-- spin_lock(&tp->lock)
    > [] netpoll_send_skb+0x158/0x1a0
    > [] netpoll_send_udp+0x1db/0x1f0
    > [] write_msg+0x8c/0xc0
    > [] __call_console_drivers+0x53/0x60
    > [] _call_console_drivers+0x4b/0x90
    > [] release_console_sem+0xc5/0x1f0
    > [] vprintk+0x1ab/0x3e0
    > [] printk+0x1b/0x20
    > [] mii_check_media+0x196/0x1e0
    > [] rtl_check_media+0x24/0x30
    > [] rtl8139_interrupt+0x42a/0x4a0 <-- spin_lock(&tp->lock)
    > [] handle_IRQ_event+0x28/0x70
    > [] handle_fasteoi_irq+0x6b/0xe0
    > [] do_IRQ+0x48/0xa0
    >
    > The least invasive fix is to detect that we're trying to re-enter the
    > driver code. We provide a netdev_busy() function which can be used to
    > determine whether a deadlock can occur if we try to transmit another
    > packet.
    >
    > Note that this may lead to lost messages if the driver is active on
    > another CPU while we try to use the same device for netconsole.


    This sucks.

    > It would probably be best to set a "lost messages" flag in this case and
    > add it to the stream when the device becomes ready again.
    >
    > The only extra overhead in non-netconsole code paths is the fact that we
    > need another callback in struct net_device. However, all drivers must be
    > checked for the possibility of a deadlock and implement the ->busy()
    > callback as necessary.


    > --- a/drivers/net/8139too.c
    > +++ b/drivers/net/8139too.c
    > @@ -979,6 +980,7 @@ static int __devinit rtl8139_init_one (struct pci_dev *pdev,
    > /* The Rtl8139-specific entries in the device structure. */
    > dev->open = rtl8139_open;
    > dev->hard_start_xmit = rtl8139_start_xmit;
    > + dev->busy = rtl8139_busy;
    > netif_napi_add(dev, &tp->napi, rtl8139_poll, 64);
    > dev->stop = rtl8139_close;
    > dev->get_stats = rtl8139_get_stats;
    > @@ -1741,6 +1743,11 @@ static int rtl8139_start_xmit (struct sk_buff *skb, struct net_device *dev)
    > return 0;
    > }
    >
    > +static bool rtl8139_busy (struct net_device *dev)
    > +{
    > + struct rtl8139_private *tp = netdev_priv(dev);
    > + return spin_is_locked(&tp->lock);
    > +}


    How do I know if my driver is suspectible to this sort of deadlock?

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [RFC][PATCH] netconsole: avoid deadlock on printk from driver code

    From: Alexey Dobriyan
    Date: Wed, 13 Aug 2008 13:59:43 +0400

    > On Wed, Aug 13, 2008 at 11:53:24AM +0200, Vegard Nossum wrote:
    > > I encountered a hard-to-debug deadlock when I pulled out the plug of my
    > > RealTek 8139 which was also running netconsole: The driver wants to print
    > > a "link down" message. However, this triggers netconsole, which wants to
    > > print the message using the same device. Here is a backtrace:
    > >
    > > [] _spin_lock_irqsave+0x76/0x90
    > > [] rtl8139_start_xmit+0x65/0x130 <-- spin_lock(&tp->lock)
    > > [] netpoll_send_skb+0x158/0x1a0
    > > [] netpoll_send_udp+0x1db/0x1f0
    > > [] write_msg+0x8c/0xc0
    > > [] __call_console_drivers+0x53/0x60
    > > [] _call_console_drivers+0x4b/0x90
    > > [] release_console_sem+0xc5/0x1f0
    > > [] vprintk+0x1ab/0x3e0
    > > [] printk+0x1b/0x20
    > > [] mii_check_media+0x196/0x1e0
    > > [] rtl_check_media+0x24/0x30
    > > [] rtl8139_interrupt+0x42a/0x4a0 <-- spin_lock(&tp->lock)
    > > [] handle_IRQ_event+0x28/0x70
    > > [] handle_fasteoi_irq+0x6b/0xe0
    > > [] do_IRQ+0x48/0xa0
    > >
    > > The least invasive fix is to detect that we're trying to re-enter the
    > > driver code. We provide a netdev_busy() function which can be used to
    > > determine whether a deadlock can occur if we try to transmit another
    > > packet.
    > >
    > > Note that this may lead to lost messages if the driver is active on
    > > another CPU while we try to use the same device for netconsole.

    >
    > This sucks.


    It's also the wrong fix.

    As a quicker and more palatable solution, print your link status
    message in some kind of deferred context where you can have the
    lock not held or similar.
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [RFC][PATCH] netconsole: avoid deadlock on printk from driver code

    (Hm, did my original e-mail not make it to the lists?)

    >> The least invasive fix is to detect that we're trying to re-enter the
    >> driver code. We provide a netdev_busy() function which can be used to
    >> determine whether a deadlock can occur if we try to transmit another
    >> packet.
    >>
    >> Note that this may lead to lost messages if the driver is active on
    >> another CPU while we try to use the same device for netconsole.

    >
    > This sucks.


    Indeed. What can be done about it?

    >> It would probably be best to set a "lost messages" flag in this case and
    >> add it to the stream when the device becomes ready again.


    With such a flag, we can at least avoid silent loss of messages. But
    what is the likelihood of this happening in the first place? I have no
    idea.

    I still think that this is better than a possible deadlock, which
    gives NO CLUE as to what happened (because it jams the other
    consoles). Hopefully the interface will not be used for other traffic.

    But I agree. This possibly penalizes all the other cases where people
    *don't* pull out their netconsole-device cables.

    We could simply not print out the "link up/down" message if the device
    is running netconsole. But it feels like papering over the design
    error. Because there are other functions in there which may cause
    additional messages to be printed, not just the very visible printk().

    Can we detect if the lock was taken on the same CPU or not?

    Maybe it is possible to postpone the packet instead of discarding it?

    >> +static bool rtl8139_busy (struct net_device *dev)
    >> +{
    >> + struct rtl8139_private *tp = netdev_priv(dev);
    >> + return spin_is_locked(&tp->lock);
    >> +}

    >
    > How do I know if my driver is suspectible to this sort of deadlock?


    if the ->hard_start_xmit() takes any locks which can be taken from any
    code that calls (also indirectly!) printk(), in this case the
    interrupt handler that is invoked when I replace the cable. My guess
    is that _all_ drivers need to take some sort of lock in
    hard_start_xmit().

    It's a bit fragile, I agree. But the completely silent deadlock is nasty too!


    Vegard

    --
    "The animistic metaphor of the bug that maliciously sneaked in while
    the programmer was not looking is intellectually dishonest as it
    disguises that the error is the programmer's own creation."
    -- E. W. Dijkstra, EWD1036
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [RFC][PATCH] netconsole: avoid deadlock on printk from driver code

    From: "Vegard Nossum"
    Date: Wed, 13 Aug 2008 12:44:46 +0200

    > There must also be no BUG()s, WARN()s, other debugging facilities
    > (spinlock debugging, lockdep, irqtrace, etc.) triggering which may
    > call printk() inside the protected section. Can we really ensure this?
    > For all drivers supporting netconsole?


    For all drivers supporting netconsole? Well yes, there will be some
    bugs somewhere in some driver wrt. netconsole. But that's life
    and why we will constantly have something to fix in the kernel isn't
    it?

    But one thing is for sure, your deliriously dirty ->busy thing is
    emphatically not the answer.

    BTW, it seems the main reason this driver is susceptible to this
    problem is that it doesn't handle link status events in it's
    NAPI poll handler. If it did that, you would have never seen
    this deadlock.

    The netpoll layer prevents recursion into the NAPI ->poll() handler.

    Most drivers manage link state either in their NAPI ->poll() handler
    or a periodic timer that samples the link state. Both schemes
    avoid this very issue.

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread