On Sat, 2008-04-05 at 11:26 -0700, Roland McGrath wrote:
> > BTW I did look at allocating it in posix_cpu_timer_set() and
> > set_process_cpu_timer() but the first at least is doing stuff with locks
> > held. I'll keep looking at it, though.

> Yeah, it's a little sticky. It would be simple enough to arrange things to
> do the allocation when needed before taking locks, but would not make the
> code wonderfully self-contained. I wouldn't worry about it unless/until we
> conclude for other reasons that this really is the best way to go. We seem
> now to be leaning towards allocating at clone time anyway.

Well, if we allocate at clone time then some things do get a bit
simpler. (Hmm, for some reason I was thinking that we still allocate in
do_setitimer() as well, but it doesn't look like it.)

> > One little gotcha we just ran into, though: When checking
> > tsk->signal->(anything) in run_posix_cpu_timers(), we have to hold
> > tasklist_lock to avoid a race with release_task(). This is going to
> > make even the null case always cost more than before.

> This is reminiscent of something that came up once before. I think it
> was the same issue of what happens on a tick while the thread is in the
> middle of exiting and racing with release_task. See commit
> 72ab373a5688a78cbdaf3bf96012e597d5399bb7, commit
> 3de463c7d9d58f8cf3395268230cb20a4c15bffa and related history (some of
> the further history is pre-GIT).

Yeah, I checked that out. The one difference here is that that was a
race between do_exit() (actually release_task()/__exit_signal()) and
run_posix_cpu_timers(). While this race was the same in that respect,
there was also a race between all of the timer-tick routines that call
any of account_group_user_time(), account_group_system_time() or
account_group_exec_runtime() and __exit_signal(). This is because those
functions all dereference tsk->signal.

> > For the local environment, I solved the problem by moving the percpu
> > structure out of the signal structure entirely and by making it
> > refcounted.

> This is a big can of worms that we really don't need. Complicating the
> data structure handling this way is really not warranted at all just to
> address this race. You'll just create another version of the same race
> with a different pointer, and then solve it some simple way that you
> could have just used to solve the existing problem. If you don't have
> some independent (and very compelling) reasons to reorganize the data
> structures, nix nix nix.

Erm, well, this isn't reorganizing the data structures per se, since
these are new data structures. Regardless, the way I did this was by
making the refcounted structure follow the lifetime of the task
structure. It's instantiated when needed and at that point each task
structure in the thread group gets a reference, which is released when
that task structure is destroyed. The last release destroys the data

The upshot of this is that none of the timer routines dereference
tsk->signal, so the races go away, no locking needed. From my
perspective this was the simplest solution, since lock dependency
ordering is _really_ a can of worms.

The solution was really very simple, in fact.

> We can make posix_cpu_timers_exit() or earlier in the exit/reap path
> tweak any state we need to ensure that this problem won't come up.
> But, off hand, I don't think we need any new state.

That's certainly a possibility, but it'll still require more code in the
timer routines to check whatever state there is.

> Probably the right fix is to make the fast-path check do:
> rcu_read_lock();
> signal = rcu_dereference(current->signal);
> if (unlikely(!signal) || !fastpath_process_timer_check(signal)) {
> rcu_read_unlock();
> return;
> }
> sighand = lock_task_sighand(current, &flags);
> rcu_read_unlock();
> if (likely(sighand))
> slowpath_process_timer_work(signal);
> unlock_task_sighand(sighand, &flags);
> Another approach that is probably fine too is just to do:
> if (unlikely(current->exit_state))
> return;
> We can never get to the __exit_signal code that causes the race if we
> are not already late in exit. The former seems a little preferable
> because the added fast-path cost is the same (or perhaps even more
> cache-friendly), but it fires timers even at the very last tick during
> exit, and only loses the ideal behavior (of always firing if the timer
> expiration is ever crossed) at the last possible instant.

I'll have to look at the code some more and thing about this. For the
other timer routines I would think that the second approach would be
better, but I still think that best of all is to not have to check
anything at all (except of course the presence of the structure pointer

Regarding the second approach, without locking wouldn't that still be
racy? Couldn't exit_state change (and therefore __exit_signal() run)
between the check and the dereference?
Frank Mayhar frank@exit.com http://www.exit.com/
Exit Consulting http://www.gpsclock.com/
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/