distributed measurement problem - Networking

Thread: distributed measurement problem

  1. distributed measurement problem

    Hi,

    I am working on a distributed measurement project with a centralized
    data collection node (the server) and 28 clients, each with a
    different number of interfaces (1-4).

    I've written C code that captures packets on all the interfaces of
    the node it runs on, computes statistics (pps, Mbps, etc. for
    different subsets of traffic), and sends them to the server every
    second. The server creates a file for each interface on each client
    and writes these statistics into the respective file.
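
    For reference, this is roughly the shape of the per-second send on
    the client side (a sketch only, not the actual code; the record
    format and the names iface/pps/mbps are just illustrative):

        /* Sketch: one newline-terminated text record per interface per
         * second, pushed over an already-connected TCP socket. */
        #include <stdio.h>
        #include <unistd.h>

        static int send_record(int sockfd, const char *iface,
                               double pps, double mbps, long second)
        {
            char buf[256];
            int len = snprintf(buf, sizeof buf, "%ld %s %.1f %.3f\n",
                               second, iface, pps, mbps);
            ssize_t off = 0;
            while (off < len) {            /* write() may be partial */
                ssize_t n = write(sockfd, buf + off, len - off);
                if (n < 0)
                    return -1;             /* caller handles the error */
                off += n;
            }
            return 0;
        }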

    I've used Python to automate and synchronize the runs; it launches
    the C program in the background for each of the interfaces.

    The problem is:
    If I initiate the client program to run for, say, 200 seconds, the
    clients run for the entire period sending statistics per second to the
    server. However, files corresponding to some interfaces do not show
    the entire 200 seconds even though the client finishes execution and
    the server closes the file after the client has finished execution.

    I don't think this is an issue with the server being flooded with
    data (it's multithreaded, and the example below was run one node at
    a time) or with packets being dropped (that doesn't make sense for
    this problem; ifconfig doesn't show dropped packets, and I am using
    TCP sockets anyway). I am not sure whether there is a bug in my
    code, since it's essentially the same client code on all systems.

    Here is the wc -l output for three nodes, each run one at a time
    for 200 seconds:

    > wc -l *.log

    44 core1.10.1.11.2.log
    200 core1.10.1.3.2.log
    49 core1.10.1.32.3.log
    200 core1.10.1.9.2.log
    49 core2.10.1.13.2.log
    49 core2.10.1.15.2.log
    200 core2.10.1.3.3.log
    200 core2.10.1.5.2.log
    49 core3.10.1.17.2.log
    200 core3.10.1.18.2.log
    200 core3.10.1.30.3.log
    200 core3.10.1.5.3.log
    1640 total

    Each node has 4 interfaces, and although the experiment ran for 200
    seconds, some files show only about 44 or 49 lines. ifconfig on the
    server shows no dropped packets.

    Does anyone have pointers on this?
    Sorry for the long post,

    Thanks,
    Shashank

  2. Re: distributed measurement problem

    On Nov 3, 3:44 pm, Shashank wrote:

    > The problem is:
    > If I initiate the client program to run for, say 200 seconds, the
    > clients run for the entire period sending statistics per second to the
    > server. However, files corresponding to some interfaces do not show
    > the entire 200 seconds even though the client finishes execution and
    > the server closes the file after the client has finished execution.


    This doesn't fit the pattern for any "typical mistake" that I'm
    familiar with. I'd suggest trying to localize the problem bit by bit.

    For example, first modify the client software to checkpoint how many
    reports it has sent to the server. Have a client log file, and have it
    write a 'checkpoint' after every ten messages. Open the log file in
    append mode, assemble the checkpoint message in a buffer, and send it
    with a single call to 'write'. If the checkpoints don't show the 200
    messages, then you know the client is the issue.
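
    Something along these lines, roughly (just a sketch, error handling
    omitted, names invented):

        /* Open the checkpoint log once with O_APPEND so each write()
         * lands atomically at the end of the file, then emit one line
         * for every ten messages sent. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        static int  ckpt_fd;     /* opened once at startup */
        static long msgs_sent;   /* bumped after every successful send */

        void ckpt_open(const char *path)
        {
            ckpt_fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        }

        void ckpt_maybe_write(void)
        {
            if (msgs_sent == 0 || msgs_sent % 10 != 0)
                return;
            char buf[64];
            int len = snprintf(buf, sizeof buf,
                               "checkpoint: %ld messages sent\n", msgs_sent);
            write(ckpt_fd, buf, len);   /* one message, one write() */
        }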

    Then add similar checkpointing in the software that talks to the
    client. Make sure the server software sees 200 messages. If not, then
    you know something is screwy in that piece of software. (Perhaps the
    client isn't really sending the messages? Perhaps the server is
    dropping some of them?)
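
    One thing to watch when counting on the server side: TCP is a byte
    stream, so a single recv() can return several of the per-second
    records, or a fraction of one. Count newline-terminated records
    rather than recv() calls. A sketch (not your server code):

        #include <sys/types.h>
        #include <sys/socket.h>

        static long count_records(int connfd)
        {
            char buf[4096];
            long records = 0;
            ssize_t n;
            while ((n = recv(connfd, buf, sizeof buf, 0)) > 0) {
                for (ssize_t i = 0; i < n; i++)
                    if (buf[i] == '\n')
                        records++;      /* one record per line */
            }
            return records;             /* should come out to 200 */
        }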

    Keep going until you localize the problem.

    DS

  3. Re: distributed measurement problem

    David Schwartz wrote:
    > On Nov 3, 3:44 pm, Shashank wrote:
    >
    >> The problem is:
    >> If I initiate the client program to run for, say 200 seconds, the
    >> clients run for the entire period sending statistics per second to the
    >> server. However, files corresponding to some interfaces do not show
    >> the entire 200 seconds even though the client finishes execution and
    >> the server closes the file after the client has finished execution.

    >
    > This doesn't fit the pattern for any "typical mistake" that I'm
    > familiar with. I'd suggest trying to localize the problem bit by bit.
    >
    > For example, first modify the client software to checkpoint how many
    > reports it has sent to the server. Have a client log file, and have it
    > write a 'checkpoint' after every ten messages. Open the log file in
    > append mode, assemble the checkpoint message in a buffer, and send it
    > with a single call to 'write'. If the checkpoints don't show the 200
    > messages, then you know the client is the issue.
    >
    > Then add similar checkpointing in the software that talks to the
    > client. Make sure the server software sees 200 messages. If not, then
    > you know something is screwy in that piece of software. (Perhaps the
    > client isn't really sending the messages? Perhaps the server is
    > dropping some of them?)
    >
    > Keep going until you localize the problem.
    >
    > DS


    Also timestamp your messages and look to see which ones are missing.
    That may give you a clue of where to look for the problem.
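
    For example, prepend a sequence number and a wall-clock timestamp to
    each record (sketch only; the field layout is just an example):

        #include <stdio.h>
        #include <time.h>

        /* Stamp each per-second record so the server-side files show
         * exactly which seconds are missing. */
        int format_record(char *buf, size_t size, long seq,
                          const char *iface, double pps, double mbps)
        {
            return snprintf(buf, size, "%ld %ld %s %.1f %.3f\n",
                            (long)time(NULL), seq, iface, pps, mbps);
        }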

  4. Re: distributed measurement problem

    On Nov 4, 1:21 pm, Joe Beanfish wrote:
    > Also timestamp your messages and look to see which ones are missing.
    > That may give you a clue of where to look for the problem.


    Hello,

    Thanks to both of you for the suggestions.
    The problem was actually in one of the anomaly detection algorithms
    I was using; I have sorted it out now.
    Thanks,
    Shashank
