Re: Seeing context switch storm with 10/13 snapshot of - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Seeing context switch storm with 10/13 snapshot of
Msg-id 1129928242.8300.960.camel@localhost.localdomain
In response to Re: Seeing context switch storm with 10/13 snapshot of  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Seeing context switch storm with 10/13 snapshot of
List pgsql-hackers
On Fri, 2005-10-21 at 09:52 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > It would be good right now to have a multi-process test harness that
> > would allow us to test out different spin lock code without the rest of
> > PostgreSQL getting in the way of testing. If we can isolate the issue
> > outside of PostgreSQL it will be much easier to fix.
> 
> Actually I disagree with that, on three grounds:
> 
> 1. Optimizing for an artificial environment may result in the wrong
> optimization.
> 
> 2. I don't think you'll be able to prove anything except that all SMP
> designs ultimately suck.  There is no hardware on the planet that can
> trade cache lines back and forth at instruction-dispatch rates.
> 
> 3. The problem right now is not lack of ability to reproduce the
> problem, it is lack of ideas how to fix it.  Building an artificial
> testbed is just directing effort into make-work rather than towards
> solving the problem.

If we think spinlocks are the problem, building a spinlock test harness
will prove that and also simplify the testing of a solution. Isolating
the spinlocks in that way is not artificial, but actually a very pure
test, so although I agree with (1) as a general statement, it does not
apply to the test harness proposal.

I was seeing the problem as likely to rear its head again over time. We
are entering a stage of increased importance of SMP code, since within a
few short years all CPUs will be dual core/HT or something similar. We
may fix it for one platform, but other similar problems may emerge or re-emerge.
The easiest way to test spinlock code on any platform is to get an
isolated test case that is runnable outside of the context of
PostgreSQL, yet using the pg spinlock code. That would allow us to bring
in other people and their many eyeballs to look at the issue; we on this
list are not experts at everything.
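For illustration, a standalone harness of the kind proposed here might look like the minimal C sketch below. It assumes a POSIX system and GCC/Clang __sync builtins, and it does NOT use PostgreSQL's own spinlock code (s_lock.h); tas_acquire, tas_release, run_harness and the 100-spin limit are hypothetical stand-ins. Worker processes share one lock word through an anonymous MAP_SHARED mapping, and the children's voluntary and involuntary context switches are read back with getrusage().

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One lock word and the counter it protects, placed in shared memory. */
typedef struct {
    volatile int lock;   /* 0 = free, 1 = held */
    long counter;        /* only touched while holding lock */
} shared_state;

/* Hypothetical test-and-set acquire: bounded spin, then sched_yield().
 * The limit of 100 spins is an arbitrary assumption, not PostgreSQL's
 * tuning; the yield is where context switches can pile up. */
static void tas_acquire(volatile int *lock)
{
    int spins = 0;
    while (__sync_lock_test_and_set(lock, 1)) {
        /* Spin on plain reads so waiters don't keep writing the cache line. */
        while (*lock) {
            if (++spins >= 100) {
                sched_yield();
                spins = 0;
            }
        }
    }
}

static void tas_release(volatile int *lock)
{
    __sync_lock_release(lock);
}

/* Fork nprocs workers that each take the lock iters times.
 * Returns the final counter (nprocs * iters if the lock is correct) or -1
 * on error. *vcsw / *ivcsw receive voluntary / involuntary context switches
 * from getrusage(RUSAGE_CHILDREN), which is cumulative over every child the
 * calling process has reaped so far. */
static long run_harness(int nprocs, long iters, long *vcsw, long *ivcsw)
{
    assert(nprocs > 0 && iters >= 0);
    shared_state *st = mmap(NULL, sizeof *st, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (st == MAP_FAILED)
        return -1;
    st->lock = 0;
    st->counter = 0;

    int failed = 0;
    for (int p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {          /* child: hammer the lock, then exit */
            for (long i = 0; i < iters; i++) {
                tas_acquire(&st->lock);
                st->counter++;
                tas_release(&st->lock);
            }
            _exit(0);
        }
    }

    /* Reap every child (retrying on EINTR) so its usage is counted. */
    for (;;) {
        pid_t r = wait(NULL);
        if (r > 0 || (r < 0 && errno == EINTR))
            continue;
        break;
    }

    struct rusage ru;
    long result = failed ? -1 : st->counter;
    if (getrusage(RUSAGE_CHILDREN, &ru) == 0) {
        *vcsw = ru.ru_nvcsw;
        *ivcsw = ru.ru_nivcsw;
    } else {
        result = -1;
    }
    munmap(st, sizeof *st);
    return result;
}
```

A driver could call run_harness(4, 1000000, &v, &iv) on each platform of interest and compare the context-switch counts; replacing the body of tas_acquire with the real TAS/S_LOCK code would then be the step from this toy lock to testing the pg spinlock code itself.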

Certainly there is a lack of ideas as to how to fix it, as you mention
in (3). This suggests to me that the solution lies in one of two areas: a)
the solution has not yet been considered, or b) the solution has already
been thought of and for whatever reason disregarded. You may be certain
that the solution lies in a), though I am not. Rejecting ideas quickly
may simply increase the chance that the answer turns out to be a b) case.

Forgive me, but (2) seems spurious. Nobody said anything about trading
cache lines at instruction-dispatch rates. The objective of the test
harness would be to check whether negative effects such as CS storms
exist on that platform. Actual optimization of the spinlock mechanisms
in the context of the PostgreSQL server should certainly be done within
the code.

In a more general sense: what is the best next action to make progress
on the CS issue that exists for certain CPUs?

Best Regards, Simon Riggs
