Troubleshooting SANs - Storage


Thread: Troubleshooting SANs

  1. Troubleshooting SANs

    I work for a consulting firm and have begun troubleshooting small
    SANs, mostly based on the HP MSA1500cs.

    Often the problem the customer reports is a vague, intermittent
    slowness. In cases like this, my troubleshooting goes something
    like this:

    1. Check the switch logs for marginal ports or other errors (usually
    Brocade 4/24s or similar).
    2. Update to the latest firmware and driver levels on the HBAs, switch,
    MSA, etc.

    If the problem still exists, I call HP support, but more often than
    not they can't really help at that point. So the only approach that
    yields results is to start unplugging things until the problem
    disappears.

    In one recent instance, I had a customer shut blades off one by one
    until he found that one of them had an HBA that was mysteriously
    causing the intermittent slowness for the whole SAN. The HBA
    seemed to work, and there were no errors in the Windows event logs,
    the switch logs, SANsurfer, or anywhere else.

    There has got to be a better way to find this kind of thing. On an IP
    network, I would run Ethereal or some other packet analyzer to
    see what is talking on the network when the problem manifests. But
    I've never found anything like that for a Fibre Channel SAN.

    As I said, I'm pretty new to SANs, so any direction would be helpful.

    Thanks,
    Sean


  2. Re: Troubleshooting SANs


    Hi Sean,

    Check:
    http://www.finisar.com/index.php?fil...and%20Analysis

    Good luck,
    Piotr



  3. Re: Troubleshooting SANs


    Yeah, I found some of that stuff. The problem with everything I've found
    is that it requires taps. I haven't found anything equivalent to a
    "mirroring port" on a switch.

    Does such a thing exist?


  4. Re: Troubleshooting SANs



    Yes, it does, but not on every product. As far as I am aware, you can
    find it on Brocade 48000 directors and Brocade 5000 FC switches.

    There are good reasons for using taps in SAN monitoring and
    troubleshooting (the list below comes from a Finisar document covering
    this problem):
    1. Mirroring multiple ports to one port causes buffer overflow and
    dropped packets.
    2. Packets go through a buffer and are retimed, making accurate
    time-sensitive measurements such as jitter, packet gap analysis, or
    latency impossible.
    3. Most mirror ports filter out anomalies, making troubleshooting
    impossible.
    4. Turning on port mirroring puts a load on the switch's CPU/transfer
    logic, impacting the switch's operational performance.
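
    To put rough numbers on point 1, here is a minimal sketch of the
    worst-case arithmetic. It assumes every mirrored port runs at line
    rate; the function name and the 4 Gbit/s figures are illustrative
    assumptions, not taken from the Finisar document.

```python
def mirror_oversubscription(source_rates_gbps, mirror_rate_gbps, duplex=True):
    """Return offered load on the mirror port divided by its capacity.

    Assumes every mirrored port runs at line rate (worst case). With
    duplex=True both directions of each source port are copied, so each
    source contributes twice its nominal rate.
    """
    if mirror_rate_gbps <= 0:
        raise ValueError("mirror port rate must be positive")
    directions = 2 if duplex else 1
    offered = sum(source_rates_gbps) * directions
    return offered / mirror_rate_gbps


# Three hypothetical 4 Gbit/s ports mirrored (both directions) to one 4 Gbit/s port.
ratio = mirror_oversubscription([4, 4, 4], 4)
print(f"worst-case oversubscription: {ratio:.1f}x")
```

    Any ratio above 1 means the mirror port has to buffer or drop traffic
    whenever the mirrored ports are busy at the same time.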

    Piotr



  5. Re: Troubleshooting SANs

    Since when do 48Ks (or *any* Brocade switch) support port mirroring?
    One would think the Condors could handle it, but I've never seen it
    implemented in Brocade's product line. I'm not sure about Cisco.

    To the OP: the only way I know of is tapping the fabric. There are FC
    protocol analyzers, but they sit in-band.

    -Mark





  6. Re: Troubleshooting SANs



    You're correct. There is no such thing as port mirroring, and there is no
    software Fibre Channel analyzer comparable to Ethereal on Ethernet. Without
    an inline Fibre Channel analyzer (Finisar is the de facto standard), your
    best bet in this scenario is to use an application such as SCSI Utility
    For Windows to monitor the HBA port statistics and determine what errors
    may be happening.
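
    One way to make port statistics useful is to sample them twice and look
    only at what changed in between. The following is a minimal sketch,
    assuming the counters have already been exported into plain
    dictionaries; the counter names are placeholders, not the output of any
    particular utility.

```python
def counter_deltas(before, after):
    """Return the counters that increased between two snapshots.

    `before` and `after` map counter name -> cumulative count. A counter
    missing from `before` is treated as starting at zero. Counters that
    went down (for example after an HBA reset) are reported with their
    new value, since the old baseline no longer applies.
    """
    deltas = {}
    for name, new_value in after.items():
        old_value = before.get(name, 0)
        change = new_value - old_value if new_value >= old_value else new_value
        if change > 0:
            deltas[name] = change
    return deltas


# Placeholder counter names and values, for illustration only.
sample_1 = {"link_failures": 2, "crc_errors": 10, "loss_of_sync": 0}
sample_2 = {"link_failures": 2, "crc_errors": 47, "loss_of_sync": 3}
print(counter_deltas(sample_1, sample_2))
```

    Taking the two samples around a slow period, rather than reading the
    cumulative totals once, makes it easier to tell old errors from ones
    that line up with the problem.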

    The Moojit



  7. Re: Troubleshooting SANs

    Sean,

    I'm going to guess that this wasn't an FC problem. I'm more inclined to believe it was a SCSI problem. Specifically,
    I would guess that the blade you shut down was issuing Target Resets.

    If an initiator sends a target reset to a target that is providing LUNs to multiple initiators, all the
    outstanding I/Os for all the initiators get reset. The initiators time out and retry the I/O, which succeeds. The end
    result is that all the initiators slow down but no errors are displayed. Zoning won't help.

    You can narrow down the suspects by seeing which initiators are slowing down and which target they have in common.
    The HP box might provide a higher debug level that exposes target resets so you can track them down.
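
    The "which target do they have in common" step is a set intersection. Here is a minimal sketch, assuming you
    have already written down which targets each host can reach; the host and target names are hypothetical.

```python
def common_targets(targets_by_initiator, slow_initiators):
    """Return the targets shared by every slow initiator.

    `targets_by_initiator` maps an initiator name to the set of targets
    it can reach; `slow_initiators` lists the initiators seen slowing down.
    """
    target_sets = [set(targets_by_initiator[name]) for name in slow_initiators]
    if not target_sets:
        return set()
    return set.intersection(*target_sets)


# Hypothetical host and target names.
access = {
    "blade1": {"msa_a", "msa_b"},
    "blade2": {"msa_a"},
    "blade3": {"msa_a", "msa_c"},
    "blade4": {"msa_c"},
}
print(common_targets(access, ["blade1", "blade2", "blade3"]))
```

    Every initiator with access to the shared target is then a candidate for being the one sending the resets.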

    In my experience, the most likely culprit is a Windows 2003 SP1 cluster node (probably with an older storport driver).
    I suggest that whenever you see this problem, you just upgrade all the Windows clusters and all the storport drivers.

    Follow http://support.microsoft.com/default.aspx?scid=kb;EN-US;923830

    MSCS uses resets to decide quorum ownership, and when the nodes get in a pickle, they do too many resets. Too many resets show
    up as slow storage. Cluster nodes do log resets in the cluster log, although they don't call them resets; look for
    /arbitrat/ as in arbitration or something like that.
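
    Searching a log for /arbitrat/ can be sketched as below. This is only an illustration of the pattern match; the
    sample lines are made up and do not reproduce the real cluster log format.

```python
import re

# Case-insensitive match for "arbitrat" (arbitrate, arbitration, ...).
ARBITRATION = re.compile(r"arbitrat", re.IGNORECASE)


def arbitration_lines(lines):
    """Return (line_number, text) pairs for lines mentioning arbitration."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ARBITRATION.search(line)
    ]


# Made-up lines standing in for a log file; not real cluster log output.
sample = [
    "placeholder entry one\n",
    "placeholder entry: Arbitration started\n",
    "placeholder entry three\n",
    "placeholder entry: arbitrate retry\n",
]
for number, text in arbitration_lines(sample):
    print(number, text)
```

    With a real log, pass an open file object instead of the list. A burst of matches around the time of a slowdown
    would support the reset theory.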

    There is also the Emulex TPRLO command, which is an FC issue. You can research TPRLOs. If the offending blade had
    Emulex cards, see if TPRLO was enabled. (By default it shouldn't be, and if it is, you'll get the same problems.)









  8. Re: Troubleshooting SANs


    I work as a SAN consultant for HP, and I agree that embedding taps into
    environments is a very good idea. I have three Finisar analysers, and
    one of the biggest problems is getting the change approved to add or
    remove them; getting the customer to install taps removes this
    obstacle. The Cisco platform does have the SD port (mirroring)
    functionality, but you don't see the whole picture when using it. The last
    time I was involved with an escalation on MDS, Cisco themselves
    asked for a Finisar trace.

    Kind Regards

    Jason



