Thread: multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc. - Networking

  1. multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc.

    I'm still a bit uncertain whether I've set up my Linux box and the
    switches correctly in my quest for Ethernet channel bonding.

    Goal: to bond eth0 and eth1 on each blade and thus attain close to
    2 Gbps transmit and receive, i.e. I want load balancing / bandwidth
    aggregation. I do *not* care at all about fault tolerance.

    Equipment: 1 server (3 eth ports), 23 blades, 2 Dell 6248 switches. Each
    switch has 48 Gbit ports.

    I set up bond0 on each blade, using mode=6 (adaptive load balancing).
    From what I've read this seems most suitable (correct me if I am wrong,
    please!) since it supports both transmit- and receive-side balancing.
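
    For reference, the whole setup is just the stock bonding driver; a
    minimal sketch of my configs (RHEL-style syntax, assuming the eth0/eth1
    names above -- adjust for your distro):

        # /etc/modprobe.conf -- load the bonding driver in mode 6 (balance-alb)
        alias bond0 bonding
        options bonding mode=6 miimon=100

        # /etc/sysconfig/network-scripts/ifcfg-eth0 (eth1 is the same idea)
        DEVICE=eth0
        MASTER=bond0
        SLAVE=yes
        ONBOOT=yes
        BOOTPROTO=none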

    Now come the confusing parts:

    1. Do I need Link Aggregation Groups (LAGs) on the switch or not for my
    blade-to-switch connections? I've seen multiple conflicting views on
    this online. http://www.linuxfoundation.org/en/Net:Bonding says it
    "does not require any special switch support", and so do many other
    tutorials, which don't mention any switch-side configuration being
    required at all!

    Others say it still needs LAGs. My common sense says I ought to somehow
    tell the switch that two of its ports are going to the same blade.

    2. Will the switch see two MAC addresses or just a single one for the
    bond0 device if I examine its address tables? For alb it ought to be
    both, right? But I tried examining the ARP tables on the server, and
    there, for each blade IP, only the bond0 MAC is listed. Is that a sign
    something is wrong, or just my misinformed, ignorant paranoia? I read
    the specs on all six bonding algorithms (some for load balancing,
    others for fault tolerance) and saw that some seem to transmit both MAC
    addresses and others just a single one. True?
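
    (In case it helps whoever answers: this is how I've been looking; a
    sketch using the stock tools, assuming the bond0 name above.)

        # the kernel's own view of the bond: mode, slave states, MACs in play
        cat /proc/net/bonding/bond0

        # the server's ARP table: which MAC each blade IP resolves to
        arp -n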

    3. How about the switch-to-switch connections? If I want to connect 8
    eth cables switch-to-switch (to aggregate bandwidth again), do I need a
    LAG here or not? (8 is the magic number because that's the max number
    of ports my switch will allow me to aggregate.)
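
    (For concreteness, I would be creating that switch-to-switch LAG with
    something like the following on the 6248. The exact CLI here is from
    memory and may well be wrong -- treat it as a hypothetical sketch and
    check the PowerConnect CLI guide:)

        ! hypothetical syntax -- verify against the 6248 CLI reference
        console# configure
        console(config)# interface range ethernet 1/g41-1/g48
        console(config-if)# channel-group 1 mode auto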

    4. Each LAG group has an LACP option. Enable it or not? The core Linux
    docs seem to have no mention of LACP; only vendor-specific info seems
    to exist (Cisco, Dell, etc.). I guess LACP is related to IEEE 802.3ad?
    Is it only a workaround to avoid having to manually aggregate ports
    into LAG groups? Or does it have an advantage as a load-balancing
    protocol over my chosen "adaptive load balancing"?
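
    (The closest thing I can find on the Linux side is the bonding driver's
    own 802.3ad mode; a sketch, assuming the switch ports are grouped into
    an LACP-enabled LAG:)

        # /etc/modprobe.conf -- mode 4 (802.3ad) negotiates the aggregate via LACP
        alias bond0 bonding
        options bonding mode=4 miimon=100 lacp_rate=slow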

    I guess it boils down to two questions: (1) to LAG or not to LAG (same
    for LACP)? (2) Is my mode=6 (adaptive load balancing) the appropriate
    mode?


    --
    Rahul

  2. Re: multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc.

    In comp.os.linux.networking Rahul wrote:
    > I'm still a bit uncertain about the way that I've set up my Linux
    > box and the switches correctly in my quest for Ethernet channel
    > bonding.


    > Goal: to bond eth0 and eth1 on each blade and thus attain close to
    > 2Gbps transmit and receive. i.e. desire Load balancing / bandwidth
    > aggregation. Do *not* care at all about fault-tolerance.


    Do you expect that 2Gbps over a _single_ connection/flow?

    > Equipment: 1 server (3 eth ports), 23 blades, 2 Dell 6248
    > switches. Each switch has 48 Gbit ports.


    You mentioned blades - I cannot recall from earlier which blades these
    were, but are they connecting to the outside world through a _switch_
    module in the blade chassis or a pass-through module?

    > I setup bond0 on each blade. Used mode=6. Adaptive load
    > balancing. From what I read this seems most suitable (correct me if
    > I am wrong please!) since supports both transmit and receive side
    > balancing.


    > Now comes the confusing parts:


    Cannot really help much there.

    > 4. Each LAG group has a LACP option. Enable or not? The core Linux
    > specs. seem to have no mention of LACP; only company specific info
    > seems to exist! (Cisco, Dell etc.) I guess LACP is related to IEEE
    > 802.3ad?


    IIRC they are one and the same.

    rick jones
    --
    denial, anger, bargaining, depression, acceptance, rebirth...
    where do you want to be today?
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  3. Re: multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc.

    Rick Jones wrote in news:g9sitk$cs7$2@usenet01.boi.hp.com:

    > Do you expect that 2Gbps over a _single_ connection/flow?


    Thanks again for your comments, Rick. Yes. Am I wrong in expecting that?

    > You mentioned blades - I cannot recall from earlier which blades these
    > were, but are they connecting to the outside world through a _switch_
    > module in the blade chassis or a pass-through module?


    Dell Power Edge 1435 ("nodes"). They have twin eth ports each. They connect
    to a switch. The switch connects to a server. Server to world. I am *not*
    interested in node-to-world performance. Mostly node-to-node and
    node-to-server.

    > IIRC they are one and the same


    Could very well be! But then why does Dell have a separate toggle for
    LACP? That implies I can have a LAG but not LACP. Maybe it's just a
    Dell error. Could switch users from other vendors comment on their
    configs, so we can see whether this is a Dell-specific quirk?

    --
    Rahul

  4. Re: multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc.

    Rahul wrote:
    > Rick Jones wrote in news:g9sitk$cs7$2
    > @usenet01.boi.hp.com:
    > > Do you expect that 2Gbps over a _single_ connection/flow?


    > Thanks again for your comments Rick. Yes. Am I wrong in expecting
    > that?


    I think so but my point of view may not be shared by others. IIRC the
    only mode that will spread the _outbound_ traffic of a single
    connection/flow across multiple links in the bond/trunk/aggregate is
    mode-rr aka round-robin.

    I've never been terribly fond of that mode because it leads to
    out-of-order TCP segments, a resulting increase in ACKs, and, depending
    on the number of links in the bond/trunk/aggregate, spurious TCP
    retransmissions.

    I am not familiar with any switch with a similar round-robin mode for
    the inbound traffic. Doesn't mean they don't exist mind you...
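
    If you want to experiment with round-robin anyway, the Linux side of it
    is a one-liner; a sketch (and note the bonding documentation says
    balance-rr also wants the switch ports grouped into a static, non-LACP
    trunk):

        # /etc/modprobe.conf -- mode 0 (balance-rr) stripes packets across slaves
        alias bond0 bonding
        options bonding mode=0 miimon=100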

    Those adaptive modes which are doing clever things with MAC addresses
    are (probably) doing them for different destinations (IP addresses).
    It would be necessary to _constantly_ be sending ARP refreshes (as in
    an ARP frame for virtually every frame carrying a TCP segment) to get
    traffic between a single pair of IPs to spread across different MAC
    addresses.

    IMO the best-if-not-only way to get > 1Gbit/s for a single TCP
    connection is to use a 10G link.

    > > You mentioned blades - I cannot recall from earlier which blades
    > > these were, but are they connecting to the outside world through a
    > > _switch_ module in the blade chassis or a pass-through module?


    > Dell Power Edge 1435 ("nodes"). They have twin eth ports each. They
    > connect to a switch. The switch connects to a server. Server to
    > world. I am *not* interested in node-to-world performance. Mostly
    > node-to-node and node-to- server.


    The "nodes" connect directly to an external switch and not some switch
    internal to the blade chassis? I'm not familiar with Dell blades, but
    for HP C-Class blades, there are I/O modules which plug into the back
    of the blade chassis to connect the eth ports on the blades themselves
    with the outside world. Those can either be pass-through modules or
    they can be actual switches. That is why I was asking about what was
    in the blade chassis along with the blades themselves. If you have
    switch modules you would need to bond/trunk/aggregate to _that_ switch
    module, and then have another bond/trunk/aggregate between the
    "chassis switch" and the external switch to which the server is
    connected.

    rick jones
    --
    The computing industry isn't as much a game of "Follow The Leader" as
    it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
    - Rick Jones
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  5. Re: multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc.

    Rick Jones wrote in news:ga4fl7$ndc$1@usenet01.boi.hp.com:

    > The "nodes" connect directly to an external switch and not some switch
    > internal to the blade chassis? I'm not familiar with Dell blades, but
    > for HP C-Class blades, there are I/O modules which plug into the back
    > of the blade chassis to connect the eth ports on the blades themselves
    > with the outside world. Those can either be pass-through modules or
    > they can be actual switches. That is why I was asking about what was
    > in the blade chassis along with the blades themselves. If you have
    > switch modules you would need to bond/trunk/aggregate to _that_ switch
    > module, and then have another bond/trunk/aggregate between the
    > "chassis switch" and the external switch to which the server is
    > connected.
    >


    Rick, my bad. Maybe I confused you with my misleading use of the term
    "blades". These are Dell Power Edge 1435 rack-mount servers:
    http://www.dell.com/content/products.../pedge_sc1435?c=us&cs=555&l=en&s=biz

    The backplane has twin eth ports. We connected these using ordinary
    Cat5e cables to ports on a Dell switch. The switch is a Dell Power
    Connect 6248 with 48 Gbit ports.

    Does that clarify the situation?

    --
    Rahul

  6. Re: multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc.

    Rick Jones wrote in news:ga4fl7$ndc$1@usenet01.boi.hp.com:

    > I think so but my point of view may not be shared by others. IIRC the
    > only mode that will spread the _outbound_ traffic of a single
    > connection/flow across multiple links in the bond/trunk/aggregate is
    > mode-rr aka round-robin.
    > I've never been terribly fond of that mode because it leads to
    > out-of-order TCP segments and a resulting increase in ACKs and
    > depending on the number of links in the bond/trunk/aggregate spurious
    > TCP retransmissions.


    Interesting. Any downsides to mode-rr? Is it transmit-side load
    balancing only? Also, why do you think some of the other "smarter"
    modes (alb / 802.3ad) do not achieve a bandwidth multiplier, can I ask?
    Just a personal preference, or is there anything fundamentally iffy
    about those modes?


    > I am not familiar with any switch with a similar round-robin mode for
    > the inbound traffic. Doesn't mean they don't exist mind you...


    I thought a LAG was the same idea. If a switch cannot distinguish
    between two similar links and clubs them together, doesn't that achieve
    the same effect? Maybe I am wrong.

    > Those adaptive modes which are doing clever things with MAC addresses
    > are (probably) doing them for different destinations (IP addresses).
    > It would be necessary to _constantly_ be sending ARP refreshes (as in
    > an ARP frame for virtually every frame carrying a TCP segment) to get
    > traffic between a single pair of IPs to spread across different MAC
    > addresses.


    Right. Which is why mode=6 (alb) will only (IMO) give a bandwidth
    multiplier when speaking to *at least* two different peers. When talking
    to a single peer (single IP) no advantage.

    > IMO the best-if-not-only way to get > 1Gbit/s for a single TCP
    > connection is to use a 10G link.


    Too expensive for a university-research cluster!

    --
    Rahul

  7. Re: multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc.

    Rahul wrote:
    > Rick, my bad. Maybe I confused you with my misleading usage of the
    > term "blades"? These are Dell Power Edge 1435 Rack Mount servers.
    > http://www.dell.com/content/products.../pedge_sc1435?
    > c=us&cs=555&l=en&s=biz


    > The backplane has twin eth ports. We connected these using ordinary
    > CAT5e cables to ports on a Dell switch. Switch is also a Dell Power
    > Connect 6248 with 48 Gbit ports.


    > Does that clarify the situation better?


    Yes. Standalone systems. Understood. My end conclusion about
    single-stream, aggregation and 10Gig still stands, though.

    rick jones
    --
    denial, anger, bargaining, depression, acceptance, rebirth...
    where do you want to be today?
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

  8. Re: multiple link aggregation questions: LAG /LACP /IEEE 802.3ad/ etc.

    Rahul wrote:
    > Rick Jones wrote in news:ga4fl7$ndc$1@usenet01.boi.hp.com:


    > > I think so but my point of view may not be shared by others. IIRC the
    > > only mode that will spread the _outbound_ traffic of a single
    > > connection/flow across multiple links in the bond/trunk/aggregate is
    > > mode-rr aka round-robin.
    > > I've never been terribly fond of that mode because it leads to
    > > out-of-order TCP segments and a resulting increase in ACKs and
    > > depending on the number of links in the bond/trunk/aggregate spurious
    > > TCP retransmissions.


    > Interesting. Any downsides to mode-rr?


    It leads to out-of-order TCP segments, which leads to an increase in
    the number of ACKs, which will increase CPU utilization per KB
    transferred (service demand in netperf-speak), and, at the larger link
    counts in a single aggregate, spurious TCP retransmissions, which waste
    bandwidth and suppress the congestion window.

    > Is it transmit-side load balancing only?


    Yes.

    > Also, why do you think that some of the other "smarter" modes (alb /
    > 802.3ad) do not achieve a bandwidth multiplier, can I ask? Just a
    > personal preference or anything fundamentally iffy about those
    > modes?


    Unless I've really misunderstood what is going on, the modes playing
    tricks with ARP cannot, on first principles, affect a single flow. They
    get traffic to flow over different links by handing out different MAC
    addresses in response to queries for their one local IP. Even if we
    assume that every segment sent on a TCP connection does an ARP cache
    lookup, the only way it could get a new MAC address each time would be
    if there were an ARP update between every TCP segment. I cannot imagine
    any of the modes in Linux bonding doing something so terribly
    inefficient. It would make mode-rr look positively pristine in
    comparison.

    The point of link aggregation was to increase aggregate throughput and
    provide a modicum of HA. Increasing the speed of a single flow was
    not part of the design center.

    > > I am not familiar with any switch with a similar round-robin mode for
    > > the inbound traffic. Doesn't mean they don't exist mind you...


    > I thought a LAG was the same idea. If a switch cannot distinguish
    > between two similar links and clubs them together, doesn't that
    > achieve the same effect? Maybe I am wrong.


    It all depends on what the switch does. My experience with other
    switches (non-Dell) has been that, when presented with an aggregate,
    the switch will hash on some of the addressing in the frame to pick the
    link on which it will place the frame. Sometimes this is simply the
    MAC, sometimes it may include the IP. I've heard unconfirmed rumours
    that some switches may even go so far as to look at TCP/UDP port
    numbers. However, none of that would result in traffic for a single
    flow flowing over multiple links in parallel.
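
    The Linux bonding driver exposes the same sort of choice on transmit
    for its XOR and 802.3ad modes; a sketch of the knob (parameter names
    per the bonding documentation):

        # /etc/modprobe.conf -- transmit hash for mode 2 (balance-xor) or
        # mode 4 (802.3ad):
        #   layer2   : XOR of src/dst MAC         -> one link per MAC pair
        #   layer3+4 : adds IPs and TCP/UDP ports -> one link per flow
        # either way, a single flow still rides a single link
        options bonding mode=4 xmit_hash_policy=layer3+4 miimon=100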

    > > Those adaptive modes which are doing clever things with MAC
    > > addresses are (probably) doing them for different destinations (IP
    > > addresses). It would be necessary to _constantly_ be sending ARP
    > > refreshes (as in an ARP frame for virtually every frame carrying a
    > > TCP segment) to get traffic between a single pair of IPs to spread
    > > across different MAC addresses.


    > Right. Which is why mode=6 (alb) will only (IMO) give a bandwidth
    > multiplier when speaking to *at least* two different peers. When
    > talking to a single peer (single IP) no advantage.


    Right, and you said you needed an increase for comms to a single peer,
    right?

    > > IMO the best-if-not-only way to get > 1Gbit/s for a single TCP
    > > connection is to use a 10G link.


    > Too expensive for a university-research cluster!


    How did the line go in "The Right Stuff"? "No bucks, no Buck Rogers."


    rick jones
    --
    a wide gulf separates "what if" from "if only"
    these opinions are mine, all mine; HP might not want them anyway...
    feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
