poll() blocked / packets not received ? - Kernel


Thread: poll() blocked / packets not received ?

  1. poll() blocked / packets not received ?

    Hello,

    We have an application that uses pthreads and (blocking) sockets.

    When the application runs with one single thread in separate processes
    (using fork()) we don't get any problem.

    However when it's multithreaded, we sometimes get stuck while poll()ing
    a socket (with events set to POLLIN). Even after the other side of the
    connection has closed its side of the connection, we are still stuck
    here. Adding a timeout only makes the poll() exit with 0, so we loop.

    In case we don't loop the next operation is a recv() which will block as
    well (which is consistent).

    It seems that nothing is received on the socket any longer, but this is
    difficult to verify with tcpdump since our server outputs something like
    15MB at peak time with 150 hits per second.

    We have Shorewall installed and enabled, but what seems strange is that
    the problem depends on multithreading. It also occurs much more often on
    the 4 core machines than on the 2 core ones (both with Hyperthreading
    activated). We're using kernel 2.6.20-15-server (#2 SMP) provided by Ubuntu.

    Any tip on how we could fix that or investigate further would be
    appreciated. After one month of debugging we're really out of ideas now.

    Best,
    Nicolas
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: poll() blocked / packets not received ?

    On Mon, Oct 20, 2008 at 10:25:10AM +0200, Nicolas Cannasse wrote:
    > Hello,
    >
    > We have an application that uses pthreads and (blocking) sockets.
    >
    > When the application runs with one single thread in separate processes
    > (using fork()) we don't get any problem.
    >
    > However when it's multithreaded, we sometimes get stuck while poll()ing
    > a socket (with events set to POLLIN). Even after the other side of the
    > connection has closed its side of the connection, we are still stuck
    > here. Adding a timeout only makes the poll() exit with 0, so we loop.
    >
    > In case we don't loop the next operation is a recv() which will block as
    > well (which is consistent).
    >
    > It seems that nothing is received on the socket any longer, but this is
    > difficult to verify with tcpdump since our server outputs something like
    > 15MB at peak time with 150 hits per second.
    >
    > We have Shorewall installed and enabled, but what seems strange is that
    > the problem depends on multithreading. It also occurs much more often on
    > the 4 core machines than on the 2 core ones (both with Hyperthreading
    > activated). We're using kernel 2.6.20-15-server (#2 SMP) provided by Ubuntu.
    >
    > Any tip on how we could fix that or investigate further would be
    > appreciated. After one month of debugging we're really out of ideas now.
    >
    > Best,
    > Nicolas


    Your usage pattern is a very common one, I highly doubt you are experiencing
    a kernel bug here or many people (including myself) would be complaining.

    Shorewall sounds like it might be suspect, are FIN's not coming in when the
    remote closes? You can look in the output of netstat to see what state the
    TCP is in, still ESTABLISHED?

    Have you tried just disabling the firewall to see if the problem
    disappears?

    Regards,
    Vito Caputo

  3. Re: poll() blocked / packets not received ?

    >> We have Shorewall installed and enabled, but what seems strange is that
    >> the problem depends on multithreading. It also occurs much more often on
    >> the 4 core machines than on the 2 core ones (both with Hyperthreading
    >> activated). We're using kernel 2.6.20-15-server (#2 SMP) provided by Ubuntu.
    >>
    >> Any tip on how we could fix that or investigate further would be
    >> appreciated. After one month of debugging we're really out of ideas now.
    >>
    >> Best,
    >> Nicolas

    >
    > Your usage pattern is a very common one, I highly doubt you are experiencing
    > a kernel bug here or many people (including myself) would be complaining.
    >
    > Shorewall sounds like it might be suspect, are FIN's not coming in when the
    > remote closes? You can look in the output of netstat to see what state the
    > TCP is in, still ESTABLISHED?


    Yes, it's still ESTABLISHED, but we can't see the corresponding
    connection on the other machine while running netstat. I'm not a TCP
    expert, so I'm not sure in which case this can occur.

    I agree with your comment in general, except that we have been running
    the same application in single-thread environment for years without
    running into this very specific problem.

    The only logs we get in the dmesg are the following :

    Either (a few every day) :

    [10742708.006350] TCP: Treason uncloaked! Peer 213.209.177.218:32924/80
    shrinks window 4049064122:4049064123. Repaired.

    Or (more often) :

    [10755036.856217] Shorewall:net2allROP:IN=eth0 OUT=
    MAC=00:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:XX:00 SRC=60.238.83.204
    DST=XX.XX.XX.43 LEN=404 TOS=0x00 PREC=0x00 TTL=114 ID=12366 PROTO=UDP
    SPT=1057 DPT=1434 LEN=384

    Neither the SRC nor the DST IPs correspond to the stalled connections,
    which all occur on the local network.

    Best,
    Nicolas

  4. Re: poll() blocked / packets not received ?

    On Mon, Oct 20, 2008 at 12:46:56PM +0200, Nicolas Cannasse wrote:
    > >>We have Shorewall installed and enabled, but what seems strange is that
    > >>the problem depends on multithreading. It also occurs much more often on
    > >>the 4 core machines than on the 2 core ones (both with Hyperthreading
    > >>activated). We're using kernel 2.6.20-15-server (#2 SMP) provided by
    > >>Ubuntu.
    > >>
    > >>Any tip on how we could fix that or investigate further would be
    > >>appreciated. After one month of debugging we're really out of ideas
    > >>now.
    > >>
    > >>Best,
    > >>Nicolas

    > >
    > >Your usage pattern is a very common one, I highly doubt you are
    > >experiencing
    > >a kernel bug here or many people (including myself) would be complaining.
    > >
    > >Shorewall sounds like it might be suspect, are FIN's not coming in when the
    > >remote closes? You can look in the output of netstat to see what state the
    > >TCP is in, still ESTABLISHED?

    >
    > Yes, it's still ESTABLISHED, but we can't see the corresponding
    > connection on the other machine while running netstat. I'm not a TCP
    > expert, so I'm not sure in which case this can occur.


    If the end that's blocking still has the TCP in ESTABLISHED state, and
    the other end doesn't have the TCP at all... you've already identified
    why the one end is still ESTABLISHED. ESTABLISHED state won't be left
    until the FIN is received from the other end, then entering CLOSE_WAIT
    state.

    When the other end of the TCP is _gone_ that leads me to believe a FIN
    will not be coming, hence the indefinite ESTABLISHED state. Why it's
    gone is a different question, maybe your problem is at the other end?
    The end initiating a shutdown has to enter FIN_WAIT_1 then FIN_WAIT_2,
    these transitions require the other side to leave ESTABLISHED (receive a
    FIN then ACK) at the very least to proceed.

    >
    > I agree with your comment in general, except that we have been running
    > the same application in single-thread environment for years without
    > running into this very specific problem.
    >


    Perhaps when you run in multicore/threaded you are stressing the network
    stacks at both ends more, including everything in-between? The
    threading vs. single process relationship is probably not causal, but
    just coincidental.

    What is the protocol? Are there any timeouts to take care of these
    situations? Do you schedule an alarm or use SO_RCVTIMEO to shutdown
    dead connections and free up consumed threads?

    TCP being reliable can block indefinitely, you can employ TCP keepalive
    to change indefinite to quite a long time.
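
    As a rough sketch of the keepalive suggestion above: the snippet below
    turns on SO_KEEPALIVE and tunes the Linux-specific probe timers. The
    function name enable_keepalive() and the timer values in the usage note
    are only illustrative assumptions, not anything taken from the
    application being discussed.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on fd and tune the Linux-specific probe timers.
 *   idle_s:  seconds of silence before the first probe is sent
 *   intvl_s: seconds between unanswered probes
 *   count:   unanswered probes before the kernel drops the connection
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;

    assert(fd >= 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof intvl_s) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof count) < 0)
        return -1;
    return 0;
}
```

    With, say, enable_keepalive(fd, 60, 10, 5), a peer that has silently
    vanished should be noticed after roughly 60 + 10 * 5 seconds of silence,
    at which point a blocked poll() or recv() returns with an error instead
    of waiting forever.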

    Regards,
    Vito Caputo

  5. Re: poll() blocked / packets not received ?

    swivel@shells.gnugeneration.com wrote:
    > When the other end of the TCP is _gone_ that leads me to believe a FIN
    > will not be coming, hence the indefinite ESTABLISHED state. Why it's
    > gone is a different question, maybe your problem is at the other end?
    > The end initiating a shutdown has to enter FIN_WAIT_1 then FIN_WAIT_2,
    > these transitions require the other side to leave ESTABLISHED (receive a
    > FIN then ACK) at the very least to proceed.
    >
    >> I agree with your comment in general, except that we have been running
    >> the same application in single-thread environment for years without
    >> running into this very specific problem.
    >>

    >
    > Perhaps when you run in multicore/threaded you are stressing the network
    > stacks at both ends more, including everything in-between? The
    > threading vs. single process relationship is probably not causal, but
    > just coincidental.


    Not sure why this should happen, since these are the same servers. The only
    thing that changes is the part of the software that we use to handle our
    server requests. It's either embedded in Apache 1.3 with fork() or a
    standalone multithreaded server which acts as an Apache backend.

    So the only difference for networking is that we have additional
    Apache<->MT-Server communications, but they should be on 127.0.0.1 so I
    think they are purely software and not hardware-related.

    > What is the protocol? Are there any timeouts to take care of these
    > situations? Do you schedule an alarm or use SO_RCVTIMEO to shutdown
    > dead connections and free up consumed threads?


    The protocol is MySQL. Since we had the problem with libmysqlclient, we
    reimplemented the protocol from scratch to rule out a bug in the client
    library.

    What happens at the protocol-level is the following :

    a) we connect to the server
    b) we make several requests and get answers back
    c) at some (random+rare) point - always after making a request - we're
    stuck while waiting for the answer.

    Sadly, this can happen inside a transaction while we hold the lock on
    some shared resource. This will lock the whole website until we run out
    of file descriptors due to pending accept()ed connections. In that case we
    get an exception and the server (the multithreaded one, not MySQL)
    restarts, which releases the lock.

    In some other cases when we don't hold a lock, the thread remains
    blocked in poll() as I described it. After a timeout (I think it's 28800
    seconds) the MySQL server closes the connection. The client - which is
    waiting in poll() - does not have any timeout activated (it's relying on
    the mysql server). But it doesn't notice that the socket has been closed
    either.

    We looked into signals at length, since poll() can also be interrupted
    by garbage collector and child process signals, but we correctly handle
    EINTR everywhere it's needed. So unless there's a possibility that
    interrupting poll() with a signal might somehow consume the data, this
    is not the problem here.
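
    For reference, an EINTR-safe wait of the kind described above might look
    like the following sketch. It is not our actual code: wait_readable() and
    now_ms() are hypothetical names. The point is that a signal only restarts
    poll() with the time that is left, so it can neither shorten nor extend
    the overall wait, and it does not touch any data queued on the socket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Milliseconds from a monotonic clock, unaffected by wall-clock changes. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Wait until fd is readable or timeout_ms has elapsed.
 * Returns 1 if poll() reported an event (data, hang-up or error, so a
 * following recv() will not block), 0 on timeout, -1 on a real failure. */
int wait_readable(int fd, int timeout_ms)
{
    long long deadline;

    assert(timeout_ms >= 0);
    deadline = now_ms() + timeout_ms;
    for (;;) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        long long left = deadline - now_ms();
        int rc;

        if (left < 0)
            left = 0;
        rc = poll(&p, 1, (int)left);
        if (rc > 0)
            return 1;
        if (rc == 0)
            return 0;
        if (errno != EINTR)
            return -1;
        /* EINTR: a signal arrived; poll again with the remaining time. */
    }
}
```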

    > TCP being reliable can block indefinitely, you can employ TCP keepalive
    > to change indefinite to quite a long time.


    Sure. We could also use a client timeout, but we don't want to hold the
    lock longer than required, and we can't tell the difference between a
    request that takes too long to complete and a lost
    connection.

    Hope we can somehow understand what's going on.
    Thanks for the answers so far,

    Best,
    Nicolas

  6. Re: poll() blocked / packets not received ?

    > TCP being reliable can block indefinitely, you can employ TCP keepalive
    > to change indefinite to quite a long time.


    Ok, funny thing is that we just found what is occurring...

    We had a process that was on a regular basis doing the following :

    conntrack -F

    This was done in order to prevent the table from growing too big, because we
    were reaching the maximum size as reported by :

    /proc/sys/net/ipv4/netfilter/ip_conntrack_max
    and
    /proc/sys/net/ipv4/netfilter/ip_conntrack_count

    Seems like when there are active connections, this will break netfilter
    and stop delivering packets to the socket.

    At least I will have a nice sleep tonight.

    Best,
    Nicolas

  7. RE: poll() blocked / packets not received ?


    Nick Cannasse wrote:

    > Ok, funny thing is that we just found what is occurring...
    >
    > We had a process that was on a regular basis doing the following :
    >
    > conntrack -F
    >
    > This was done in order to prevent the table from growing too big, because we
    > were reaching the maximum size as reported by :
    >
    > /proc/sys/net/ipv4/netfilter/ip_conntrack_max
    > and
    > /proc/sys/net/ipv4/netfilter/ip_conntrack_count
    >
    > Seems like when there are active connections, this will break netfilter
    > and stop delivering packets to the socket.
    >
    > At least I will have a nice sleep tonight.


    Note that this solved your symptom, not your problem. You actually have two
    problems:

    1) You rely on TCP to detect a lost connection even by a side that will
    never transmit any data. TCP simply does not do this. If you are not trying
    to send data, you are not assured that a lost connection will be detected.
    (You either need a timeout, or you need to send or dribble some data,
    depending on the protocol.)

    2) You hold a lock on a shared resource while you wait for a reply over a
    network. If this is a low-level "block and wait indefinitely" lock, this
    will cause many threads to line up behind a slow/stuck thread. The right fix
    depends on your circumstances, but you need to use a synchronization
    primitive that is suitable. (You need to be able to use multiple connections
    or defer operations without holding a thread.)

    With both of these bugs, you are vulnerable to precisely the scenario you
    observed. The TCP connection close packets were lost (in this case due to
    premature expiration of the connection tracking, but other things can do
    it, such as the server rebooting), TCP could not detect the lost connection
    because you never sent any data, so one thread blocked forever, and other
    threads got in line behind it.

    DS



  8. Re: poll() blocked / packets not received ?

    David Schwartz wrote:
    >> At least I will have a nice sleep tonight.

    >
    > Note that this solved your symptom, not your problem. You actually have two
    > problems:
    >
    > 1) You rely on TCP to detect a lost connection even by a side that will
    > never transmit any data. TCP simply does not do this. If you are not trying
    > to send data, you are not assured that a lost connection will be detected.
    > (You either need a timeout, or you need to send or dribble some data,
    > depending on the protocol.)
    >
    > 2) You hold a lock on a shared resource while you wait for a reply over a
    > network. If this is a low-level "block and wait indefinitely" lock, this
    > will cause many threads to line up behind a slow/stuck thread. The right fix
    > depends on your circumstances, but you need to use a synchronization
    > primitive that is suitable. (You need to be able to use multiple connections
    > or defer operations without holding a thread.)


    I agree with both points, but I can't modify the MySQL protocol to
    implement that.

    For (1) I can't add the timeout since I have no way to differentiate
    between a lost connection and a request that takes time to execute. I'll
    maybe check if the protocol allows pings while waiting for the request
    result, but I'm not sure it does.

    For (2) the shared resource is on the database side, not on the server
    side. It's the transaction that has some rows locked. I have no
    solution for that.

    Best,
    Nicolas

  9. RE: poll() blocked / packets not received ?


    > I agree with both points, but I can't modify the MySQL protocol to
    > implement that.


    > For (1) I can't add the timeout since I have no way to differentiate
    > between a lost connection and a request that takes time to execute. I'll
    > maybe check if the protocol allows pings while waiting for the request
    > result, but I'm not sure it does.


    Sure you can. For example, you can run a proxy on both the server and the
    client, with the two proxies speaking a protocol that carries the MySQL
    protocol. The protocol between the server and the client can include two
    types of messages, one being regular data (which the proxies pass to the
    server and client software) and one being a ping (which the proxies use
    internally to decide when to drop their connections). Each proxy can 'ping'
    the other as often as required and drop both connections if the ping fails
    to go through. This will ensure that your program detects a connection loss
    rapidly.
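
    To make the two message types concrete, here is one possible framing for
    the proxy-to-proxy link, offered only as a sketch. The 5-byte header (one
    type byte plus a big-endian 32-bit payload length) and the names
    frame_encode() and frame_decode_header() are assumptions made for this
    example; nothing here is part of the MySQL protocol, which is simply
    carried as opaque payload inside FRAME_DATA frames.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum frame_type { FRAME_DATA = 0, FRAME_PING = 1 };
#define FRAME_HDR_LEN 5u  /* 1 type byte + 4 length bytes */

/* Write header + payload into out. Returns the number of bytes written,
 * or 0 if out is too small to hold the whole frame. */
size_t frame_encode(uint8_t *out, size_t cap, enum frame_type type,
                    const void *payload, uint32_t len)
{
    assert(out != NULL);
    if (cap < FRAME_HDR_LEN || cap - FRAME_HDR_LEN < len)
        return 0;
    out[0] = (uint8_t)type;
    out[1] = (uint8_t)(len >> 24);   /* length, most significant byte first */
    out[2] = (uint8_t)(len >> 16);
    out[3] = (uint8_t)(len >> 8);
    out[4] = (uint8_t)len;
    if (len > 0)
        memcpy(out + FRAME_HDR_LEN, payload, len);
    return FRAME_HDR_LEN + (size_t)len;
}

/* Parse a frame header. Returns 0 on success, -1 if fewer than
 * FRAME_HDR_LEN bytes are available or the type byte is unknown. */
int frame_decode_header(const uint8_t *in, size_t in_len,
                        enum frame_type *type, uint32_t *len)
{
    if (in_len < FRAME_HDR_LEN || in[0] > FRAME_PING)
        return -1;
    *type = (enum frame_type)in[0];
    *len = ((uint32_t)in[1] << 24) | ((uint32_t)in[2] << 16) |
           ((uint32_t)in[3] << 8) | (uint32_t)in[4];
    return 0;
}
```

    A proxy would forward the payload of every FRAME_DATA frame to its local
    MySQL client or server, answer each FRAME_PING itself, and close both of
    its connections when no frame of either type has arrived within whatever
    interval it chooses.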

    There are many other possible solutions.

    > For (2) the shared resource is on the database side, not on the server
    > side. It's the transaction that has some rows locked. I have no
    > solution for that.


    That doesn't fit your problem description. Presumably the server detected
    the loss of the connection and so would have released any resources it was
    holding that were associated with it. The problem in this case was that the
    client couldn't detect the loss of the connection.

    > Best,
    > Nicolas


    Good luck.

    DS



  10. Re: poll() blocked / packets not received ?

    On Mon, Oct 20, 2008 at 07:24:14PM +0200, Nicolas Cannasse wrote:
    > For (1) I can't add the timeout since I have no way to differentiate
    > between a lost connection and a request that takes time to execute.


    Not only can you, but you *must*. Any service assuming an infinite timeout
    is doomed to fail. If you know that one request can take as long as one
    minute for instance, then use a 2-minute timeout. The day a firewall
    between client and server fails, all stuck requests will be cleaned up
    automatically, and you'll be happy not to have to go in and restart
    the service to flush them.

    There's a huge difference between using a very large timeout and none at
    all.

    Willy

