Dear Net-SNMP Coders -

Just a heads-up that we are still having the issue in Net-SNMP 5.4. It
is verified not to occur when the code is compiled as 32-bit, so we
believe it is specific to 64-bit builds (we see it on AMD/Opteron).

The issue presents, on 64-bit, in the following ways:

Common Presentation -
Packets seem to "disappear" once queued for sending, and we
never get a call back to the asynch handler: neither a
response nor a timeout callback arrives. The packets just
"disappear", eventually leading to full session and packet
queues.

Uncommon Presentation -
A free() failure inside _sess_process_packet, as per the call
stack below. The memory heap of the process appears to be
corrupted by this point.

> #32 0x0000000000529636 in Signal_Handler ()
> #33
> #34 0x00000031eda2e21d in raise () from /lib64/tls/libc.so.6
> #35 0x00000031eda2fa1e in abort () from /lib64/tls/libc.so.6
> #36 0x00000031eda63291 in __libc_message () from /lib64/tls/libc.so.6
> #37 0x00000031eda68eae in _int_free () from /lib64/tls/libc.so.6
> #38 0x00000031eda691f6 in free () from /lib64/tls/libc.so.6
> #39 0x0000000000959e9d in _sess_process_packet ()
> #40 0x000000000095b4c9 in _sess_read ()
> #41 0x000000000095bc39 in snmp_sess_read ()


This issue occurs only after multiple days of running the library under
fairly heavy load. The issues always occur after a similar amount of
run-time, and always within net-snmp loops.

To put it as simply as possible: either packets never come back to the
callback (they just appear to "disappear"), or, infrequently, we get a
bad free that trips the signal handler.

I will list what we are doing with the library here so that people who
work on it can spot any red flags.

- We have multiple threads acquiring SNMP data, each thread will have
typically 200-220 independent, thread-dedicated SNMP sessions to
various hosts. These 200-ish sessions are created and closed within
the same thread and managed within the same thread.

- We will 'power-load' one session to one IP with many requests.
We've found this to be substantially faster than single queue
waiting and it seems to work with no problems on 32-bit.

- PDU construction for these (and a lot of non-net-snmp-related stuff)
is, however, done in a support thread. The PDU construction code,
from a cursory look through net-snmp, appeared to be reasonably
thread-safe; is this the case? Specifically, calls to snmp_pdu_create
and snmp_add_null_var.

- We send SNMP packets, all within the same session-specific thread,
but at two points: outside of the normal select loop, and also inside
the asynch_response callback. Is it okay to do this? We have found
some performance benefits in doing so.

- We've pretty much assumed that per-session access across threads is
unsafe, so we guarantee that a session is only ever touched by the
thread that created it (there were lots of problems when
using sessions across threads).
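
The thread-confinement rule in the last bullet can be checked at run
time. Below is a minimal sketch, not taken from the application or from
Net-SNMP: struct owned_session and both helper functions are
hypothetical names, and sessp simply stands in for the opaque pointer
that snmp_sess_open() returns.

```c
#include <pthread.h>

/* Hypothetical wrapper: pairs an opaque session handle with its owner. */
struct owned_session {
    void     *sessp;   /* e.g. the pointer returned by snmp_sess_open() */
    pthread_t owner;   /* thread that created the session */
};

/* Record the calling thread as the owner of this session. */
static void owned_session_init(struct owned_session *s, void *sessp)
{
    s->sessp = sessp;
    s->owner = pthread_self();
}

/* Returns nonzero when the calling thread is the one that created s. */
static int owned_by_caller(const struct owned_session *s)
{
    return pthread_equal(s->owner, pthread_self()) != 0;
}
```

Asserting owned_by_caller() before every send or read on a session
could turn an accidental cross-thread use into an immediate, debuggable
failure rather than silent corruption days later.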

As the next test step, we are going to reduce to a single SNMP thread
and a single active PDU per session.

However, all of these improvements do actually work very well on 32-bit
Net-SNMP.

It is only on 64-bit that this causes an issue. Early testing a year or
two ago under older versions of Net-SNMP showed it to be a good bit
faster to have many session-locked threads going; however, is it
considered safe to do this?

All of the above behaviors work reliably on 32-bit and have been tested
extensively; it is only 64-bit compiles that have problems.

We are doing a bunch more tests/debugging and should have more info, and
we will try to fix the issue ourselves. I wanted to make the group aware
of it, though, because it presents as a critical system failure /
process heap corruption: if you aren't doing adequate signal handling
you might never know, and your application could crash anywhere.

One thing that caught my eye was this:

long
snmp_get_next_msgid(void)
{
    long            retVal;
    snmp_res_lock(MT_LIBRARY_ID, MT_LIB_MESSAGEID);
    retVal = 1 + Msgid;         /*MTCRITICAL_RESOURCE */
    if (!retVal)
        retVal = 2;
    Msgid = retVal;
    snmp_res_unlock(MT_LIBRARY_ID, MT_LIB_MESSAGEID);
    if (netsnmp_ds_get_boolean(NETSNMP_DS_LIBRARY_ID,
                               NETSNMP_DS_LIB_16BIT_IDS))
        return (retVal & 0x7fff);       /* mask to 15 bits */
    else
        return retVal;
}

And specifically this:

    return (retVal & 0x7fff);       /* mask to 15 bits */

It appears there is a masking to 15 bits when using 16-bit IDs, but the
function returns a long, and a long is 8 bytes in a 64-bit build (as
opposed to 4 bytes in 32-bit), so I wasn't sure whether all these IDs
would still be valid.

However, I masked them to 32 bits and it didn't change anything, so we
reverted to your implementation.
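
As a sanity check on the width question, the following standalone
sketch (hypothetical helper names, no Net-SNMP calls) applies the same
0x7fff mask to a long and tests whether a value would fit in 32 bits.
It assumes only standard C; on 64-bit Linux a long is 8 bytes, but the
masked result is the same whatever the width of long.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: the same mask the library applies for 16-bit IDs.
 * Only the low 15 bits survive, so the result is always 0..0x7fff. */
static long mask_msgid_15bit(long id)
{
    return id & 0x7fff;
}

/* Hypothetical helper: would this ID survive being stored in 32 bits? */
static int fits_in_int32(long id)
{
    return id >= INT32_MIN && id <= INT32_MAX;
}
```

For example, mask_msgid_15bit(0x12345L) yields 0x2345 on both 32-bit
and 64-bit builds, and any masked value passes fits_in_int32(), which
is consistent with the extra masking making no difference in our test.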

This issue may be specific to our usage scenario which is very heavily
loading the library, but if there are thoughts/hints that might shorten
the time to isolate this issue, they would be much appreciated.

We will simplify to 1 SNMP thread and 1 PDU firing per session in the
next step. Should we also only call snmp_sess_async_send OUTSIDE of the
callback?

Sending from inside the callback seems to be allowed, judging from other
examples, and definitely makes things faster in long SNMP walks. Could
this also be the issue?
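
One way to structure the "send only outside the callback" test is to
have the callback record what it wants sent and let the select loop do
the actual sending. The sketch below is an assumption about how that
could look, not Net-SNMP code: struct pending_queue, defer_send and
drain_pending are hypothetical names, and the int tokens are stand-ins
for whatever identifies the next request (for example an index into a
table of prepared PDUs).

```c
#include <stddef.h>
#include <string.h>

#define PENDING_MAX 64

/* Hypothetical per-thread queue of requests the callback wants sent. */
struct pending_queue {
    int    items[PENDING_MAX];  /* request tokens, stand-ins for PDUs */
    size_t count;
};

/* Called from inside the response callback: record, do not send.
 * Returns 0 on success, -1 if the queue is full. */
static int defer_send(struct pending_queue *q, int token)
{
    if (q->count >= PENDING_MAX)
        return -1;
    q->items[q->count++] = token;
    return 0;
}

/* Called from the select loop, outside any callback. Moves up to
 * out_cap tokens into out (oldest first) and returns how many were
 * moved; the caller then performs the real sends for those tokens. */
static size_t drain_pending(struct pending_queue *q, int *out, size_t out_cap)
{
    size_t n = q->count < out_cap ? q->count : out_cap;

    memcpy(out, q->items, n * sizeof q->items[0]);
    memmove(q->items, q->items + n, (q->count - n) * sizeof q->items[0]);
    q->count -= n;
    return n;
}
```

Because the queue is owned by the session thread, it needs no locking.
Comparing this variant against sending directly from the callback
should show whether re-entering the send path from the callback plays
any part in the problem.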

Again, all of this code works great in 32-bit and is impressively fast.
It's only 64-bit we are having issues with.

If we find any useful fixes, I'll forward them. Thanks in advance for
any help / hints / ideas as to what might cause this!

Kyle



_______________________________________________
Net-snmp-coders mailing list
Net-snmp-coders@lists.sourceforge.net
https://lists.sourceforge.net/lists/...et-snmp-coders