No SCSI status on fibrechannel - SGI



  1. No SCSI status on fibrechannel

    I have connected identical RAIDs to two Origin 300 series
    machines (one a 300, one a 350). These each have dual port
    4gig fibrechannel cards. Both machines show intermittent
    errors from these RAIDs like this:

    unix: ALERT: /hw/module/001c02/Ibrick/xtalk/14/pci/2a/scsi_ctlr/0/target/112/lun/2: No SCSI status, key=825
    unix: ALERT: /hw/module/001c02/Ibrick/xtalk/14/pci/2a/scsi_ctlr/0/target/112/lun/2: IO Terminated, key=825

    These error pairs vary over all the luns on the RAID and have different keys. After some dozen or two
    of these errors, I get a SCSI bus reset:

    unix: |$(3)<6>dksc9d112l2s7: SCSI driver error: Controller protocol error or SCSI bus reset


    A single file copy often fails at this point, but will succeed if retried.
    The identical failure on two different systems suggests a problem inherent in the
    setup, such as an incompatibility or configuration issue. I have changed one of the few
    parameters available to me on the RAID, the host communication speed, from 4 Gbit/sec
    to 2 Gbit/sec, and this has not solved the problem. I would appreciate any suggestions at
    this point. fx on the host side gives me a few other variables to tweak:


    ------------------------------------------------------------------------------

    Error correction enabled                 Disable data transfer on error
    Do report recovered errors               Do delay for error recovery
    Do transfer bad blocks                   Error retry attempts        79
    Do auto bad block reallocation (read)
    Do auto bad block reallocation (write)
    Drive readahead enabled                  Drive buffered writes enabled
    Drive disable prefetch     65535         Drive minimum prefetch       0
    Drive maximum prefetch     65535         Drive prefetch ceiling   65535
    Number of cache segments       1
    Read buffer ratio          2/256         Write buffer ratio       2/256
    Command Tag Queueing disabled

    ------------------------------------------------------------------------------

    It seems that a ham-fisted approach that might work would be to
    toggle the "Disable data transfer on error" setting. Just how ill-advised is this?

    Is command tag queueing likely to help?
    Anything else here that people normally tweak?

    Thanks for any suggestions.


  2. Re: No SCSI status on fibrechannel

    On Thu, 30 Mar 2006 18:33:26 +0000, Daniel Packman wrote:

    > I have connected identical RAIDs to two Origin 300 series
    > machines (one a 300, one a 350). These each have dual port
    > 4gig fibrechannel cards. Both machines show intermittent
    > errors from these RAIDs like this:
    >
    > unix: ALERT: /hw/module/001c02/Ibrick/xtalk/14/pci/2a/scsi_ctlr/0/target/112/lun/2: No SCSI status, key=825
    > unix: ALERT: /hw/module/001c02/Ibrick/xtalk/14/pci/2a/scsi_ctlr/0/target/112/lun/2: IO Terminated, key=825
    >
    > These error pairs vary over all the luns on the RAID and have different keys. After some dozen or two
    > of these errors, I get a SCSI bus reset:
    >
    > unix: |$(3)<6>dksc9d112l2s7: SCSI driver error: Controller protocol error or SCSI bus reset
    >



    I would look at swapping fibres, actually. The key part of the path is everything
    up to scsi_ctlr/0. That tells you which port on the FC card is being used. If it's
    always the same card and port, that's the best place to start swapping
    cables.

    Also, your RAID should have some facility for looking at link errors.
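
    To check whether the alerts really do cluster on one card and port, the
    log lines can be tallied by the path prefix ending in scsi_ctlr/N. What
    follows is only a rough sketch: the regular expression is written against
    the two ALERT lines quoted in this thread, and the function name
    count_alerts is made up for illustration. Other message formats may need
    a different pattern.

```python
import re
from collections import Counter

# Pattern based only on the ALERT lines quoted in this thread (an assumption;
# other driver messages may be laid out differently).
ALERT_RE = re.compile(
    r"(?P<ctlr>/hw/\S+?/scsi_ctlr/\d+)/target/(?P<target>\d+)/lun/(?P<lun>\d+):\s*"
    r"(?P<msg>[^,]+), key=(?P<key>\d+)"
)


def count_alerts(lines):
    """Count ALERT lines per controller path and per (controller, target, lun)."""
    by_ctlr = Counter()
    by_lun = Counter()
    for line in lines:
        m = ALERT_RE.search(line)
        if m is None:
            continue  # not an ALERT line in the expected format
        by_ctlr[m["ctlr"]] += 1
        by_lun[(m["ctlr"], int(m["target"]), int(m["lun"]))] += 1
    return by_ctlr, by_lun


# The three log lines from the original post.
sample = [
    "unix: ALERT: /hw/module/001c02/Ibrick/xtalk/14/pci/2a/scsi_ctlr/0/target/112/lun/2: No SCSI status, key=825",
    "unix: ALERT: /hw/module/001c02/Ibrick/xtalk/14/pci/2a/scsi_ctlr/0/target/112/lun/2: IO Terminated, key=825",
    "unix: |$(3)<6>dksc9d112l2s7: SCSI driver error: Controller protocol error or SCSI bus reset",
]

by_ctlr, by_lun = count_alerts(sample)
for ctlr, n in by_ctlr.most_common():
    print(n, ctlr)
```

    Run it over the full system log instead of the sample list. If one
    scsi_ctlr path accounts for nearly all the alerts, start with that port
    and its cable.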



    >
    > fx on the host side gives me a few other variables to tweak:
    >
    >
    > ------------------------------------------------------------------------------
    >
    > Error correction enabled                 Disable data transfer on error
    > Do report recovered errors               Do delay for error recovery
    > Do transfer bad blocks                   Error retry attempts        79
    > Do auto bad block reallocation (read)
    > Do auto bad block reallocation (write)
    > Drive readahead enabled                  Drive buffered writes enabled
    > Drive disable prefetch     65535         Drive minimum prefetch       0
    > Drive maximum prefetch     65535         Drive prefetch ceiling   65535
    > Number of cache segments       1
    > Read buffer ratio          2/256         Write buffer ratio       2/256
    > Command Tag Queueing disabled
    >
    > ------------------------------------------------------------------------------
    >
    > It seems that a ham-fisted approach that might work would be to
    > toggle the "Disable data transfer on error". Just how ill advised is this?


    Nope, don't do that. I believe it's intended for individual disk drives, and besides,
    you want to fix your underlying problem, which looks like one
    of the cables, the controller on the Origin side, or the controllers you plug
    into on the RAID side.



    >
    > Is command tag queueing likely to help?


    CTQ definitely helps performance, but it won't fix your problem. Once you
    have fixed your actual problem, you have to calculate how big the queue should
    be. It depends on a lot of factors (number of LUNs, number of hosts,
    etc.) and on your RAID. There's no one formula for this, so you'll have to
    find out what applies to your RAID before you start fiddling with it.

    I'm pretty sure you also have to reboot to activate the change.
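
    For a feel of how the number of LUNs and hosts feeds into the queue size,
    here is one rule of thumb often used as a starting point: divide the
    number of commands a RAID port can queue by the number of host/LUN pairs
    sharing it. This is an assumption for illustration, not the formula for
    any particular RAID. The function name and the figures 256, 2 and 8 are
    made up, and the vendor documentation takes precedence.

```python
def per_lun_queue_depth(port_queue_limit, hosts, luns_per_host):
    """Rule-of-thumb CTQ depth per LUN (illustrative assumption only).

    port_queue_limit: commands one RAID port can hold at once (from vendor docs)
    hosts:            number of hosts sharing that port
    luns_per_host:    LUNs each host addresses through that port
    """
    if port_queue_limit <= 0 or hosts <= 0 or luns_per_host <= 0:
        raise ValueError("all arguments must be positive")
    # Split the port's queue evenly over every host/LUN pair, never below 1.
    return max(1, port_queue_limit // (hosts * luns_per_host))


# Hypothetical figures: a port that queues 256 commands, 2 hosts, 8 LUNs each.
depth = per_lun_queue_depth(256, 2, 8)
print(depth)  # 16
```

    Treat the result as an upper bound to start from. A depth that is too
    high can overrun the RAID's queue when every host is busy at once.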



