Fixing the main programmer thinko with the device model - Kernel

This is a discussion on Fixing the main programmer thinko with the device model - Kernel ; Having just spent the weekend tracking two separate driver model problems through SCSI, I believe the biggest trap everyone falls into with the driver model (well, OK, at least with SCSI) is to try to defer a callback to the ...

+ Reply to Thread
Results 1 to 7 of 7

Thread: Fixing the main programmer thinko with the device model

  1. Fixing the main programmer thinko with the device model

    Having just spent the weekend tracking two separate driver model
    problems through SCSI, I believe the biggest trap everyone falls into
    with the driver model (well, OK, at least with SCSI) is to try to defer
    a callback to the device ->release routine without realising that
    somewhere along the callback path we're going to drop a reference to the
    device.

    You can do this very inadvertently: One developer didn't realise
    bsg_unregister_queue() released a ref, and another didn't realise that
    transport_destroy_device() held one.

    The real problem is that it's fantastically easy to do this ... it's not
    at all clear which of the cleanup routines actually release references
    unless you dig down into them and it's very difficult to detect because
    all that happens is that devices don't get released when they should,
    which isn't something we ever warn about.

    So, what I was wondering is: is there any way we can reliably detect
    and warn when someone does this. Could something like lockdep (although
    I can't really see how dynamic detection will work because the device
    ->release routine is never called) or a static code analysis tool like
    sparse be modified to detect the unreleaseable references?

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: Fixing the main programmer thinko with the device model

    On Mon, Mar 24, 2008 at 10:39:48AM -0500, James Bottomley wrote:
    > Having just spent the weekend tracking two separate driver model
    > problems through SCSI, I believe the biggest trap everyone falls into
    > with the driver model (well, OK, at least with SCSI) is to try to defer
    > a callback to the device ->release routine without realising that
    > somewhere along the callback path we're going to drop a reference to the
    > device.
    >
    > You can do this very inadvertently: One developer didn't realise
    > bsg_unregister_queue() released a ref, and another didn't realise that
    > transport_destroy_device() held one.
    >
    > The real problem is that it's fantastically easy to do this ... it's not
    > at all clear which of the cleanup routines actually release references
    > unless you dig down into them and it's very difficult to detect because
    > all that happens is that devices don't get released when they should,
    > which isn't something we ever warn about.


    Sounds like a documentation issue for how the scsi layer is using the
    driver model more than anything else. None of the other busses seem to
    have these kinds of issues that I can see, is it just because of your
    complex usage model?

    > So, what I was wondering is: is there any way we can reliably detect
    > and warn when someone does this.


    Warn that a device did not get released when the programmer thought it
    should yet they forgot to call the correct function to have that happen?
    That seems a bit difficult

    Also note that the scsi layer usage of multiple refcounted objects
    within the same structure might be causing some of these issues, that's
    a bug in how the scsi layer has implmented things much more so than how
    the driver core is implemented, right?

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: Fixing the main programmer thinko with the device model

    On Mon, 2008-03-24 at 10:58 -0700, Greg KH wrote:
    > On Mon, Mar 24, 2008 at 10:39:48AM -0500, James Bottomley wrote:
    > > Having just spent the weekend tracking two separate driver model
    > > problems through SCSI, I believe the biggest trap everyone falls into
    > > with the driver model (well, OK, at least with SCSI) is to try to defer
    > > a callback to the device ->release routine without realising that
    > > somewhere along the callback path we're going to drop a reference to the
    > > device.
    > >
    > > You can do this very inadvertently: One developer didn't realise
    > > bsg_unregister_queue() released a ref, and another didn't realise that
    > > transport_destroy_device() held one.
    > >
    > > The real problem is that it's fantastically easy to do this ... it's not
    > > at all clear which of the cleanup routines actually release references
    > > unless you dig down into them and it's very difficult to detect because
    > > all that happens is that devices don't get released when they should,
    > > which isn't something we ever warn about.

    >
    > Sounds like a documentation issue for how the scsi layer is using the
    > driver model more than anything else. None of the other busses seem to
    > have these kinds of issues that I can see, is it just because of your
    > complex usage model?


    Possibly ... although the bsg routines aren't actually SCSI ones,
    they're block. The transport class destroy function also actually
    picked up an implicit reference because of the class device rework you
    just did.

    > > So, what I was wondering is: is there any way we can reliably detect
    > > and warn when someone does this.

    >
    > Warn that a device did not get released when the programmer thought it
    > should yet they forgot to call the correct function to have that happen?
    > That seems a bit difficult


    Yes, that's why I think it has to be discovered via static analysis.

    > Also note that the scsi layer usage of multiple refcounted objects
    > within the same structure might be causing some of these issues, that's
    > a bug in how the scsi layer has implmented things much more so than how
    > the driver core is implemented, right?


    That's true, but irrelevant (and also soon to be untrue if we get rid of
    the scsi_device class as you and Kay keep requesting). The two calls
    release references on the actual embedded generic device, it's nothing
    to do with entangled lifetime rules.

    James


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: Fixing the main programmer thinko with the device model

    James Bottomley writes:
    >
    > That's true, but irrelevant (and also soon to be untrue if we get rid of
    > the scsi_device class as you and Kay keep requesting). The two calls
    > release references on the actual embedded generic device, it's nothing
    > to do with entangled lifetime rules.


    Has anybody ever considered just doing away with
    the problematic and bug prone and tricky reference counts for kobjects
    and switch to a simple garbage collector for them?

    -Andi
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: Fixing the main programmer thinko with the device model

    On Tue, Mar 25, 2008 at 10:57:32AM +0100, Andi Kleen wrote:
    > James Bottomley writes:
    > >
    > > That's true, but irrelevant (and also soon to be untrue if we get rid of
    > > the scsi_device class as you and Kay keep requesting). The two calls
    > > release references on the actual embedded generic device, it's nothing
    > > to do with entangled lifetime rules.

    >
    > Has anybody ever considered just doing away with
    > the problematic and bug prone and tricky reference counts for kobjects
    > and switch to a simple garbage collector for them?


    Sure, I have no objection to that. It's just that the reference count
    "issue" really doesn't seem to be one on sanely designed busses

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: Fixing the main programmer thinko with the device model

    On Tue, 2008-03-25 at 21:16 -0700, Greg KH wrote:
    > On Tue, Mar 25, 2008 at 10:57:32AM +0100, Andi Kleen wrote:
    > > James Bottomley writes:
    > > >
    > > > That's true, but irrelevant (and also soon to be untrue if we get rid of
    > > > the scsi_device class as you and Kay keep requesting). The two calls
    > > > release references on the actual embedded generic device, it's nothing
    > > > to do with entangled lifetime rules.

    > >
    > > Has anybody ever considered just doing away with
    > > the problematic and bug prone and tricky reference counts for kobjects
    > > and switch to a simple garbage collector for them?

    >
    > Sure, I have no objection to that. It's just that the reference count
    > "issue" really doesn't seem to be one on sanely designed busses


    Hmm, what is a "´╗┐simple garbage collector" here? How could one determine
    "reachability" of objects, means: at what point of time do objects
    actually become "garbage"? How could one trace in our current kernel
    code who still accesses an object, without doing refcounts?

    Kay

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: Fixing the main programmer thinko with the device model

    On Wed, Mar 26, 2008 at 06:07:49AM +0100, Kay Sievers wrote:
    > On Tue, 2008-03-25 at 21:16 -0700, Greg KH wrote:
    > > On Tue, Mar 25, 2008 at 10:57:32AM +0100, Andi Kleen wrote:
    > > > James Bottomley writes:
    > > > >
    > > > > That's true, but irrelevant (and also soon to be untrue if we get rid of
    > > > > the scsi_device class as you and Kay keep requesting). The two calls
    > > > > release references on the actual embedded generic device, it's nothing
    > > > > to do with entangled lifetime rules.
    > > >
    > > > Has anybody ever considered just doing away with
    > > > the problematic and bug prone and tricky reference counts for kobjects
    > > > and switch to a simple garbage collector for them?

    > >
    > > Sure, I have no objection to that. It's just that the reference count
    > > "issue" really doesn't seem to be one on sanely designed busses

    >
    > Hmm, what is a "???simple garbage collector" here? How could one determine
    > "reachability" of objects, means: at what point of time do objects
    > actually become "garbage"? How could one trace in our current kernel
    > code who still accesses an object, without doing refcounts?


    I think in the end, it would be the same thing, so we should stick with
    our current code, as that's exactly what the current kobject model now
    implements

    thanks,

    greg k-h
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread