Thread: Seeing context switch storm with 10/13 snapshot of 8.1beta3

Seeing context switch storm with 10/13 snapshot of 8.1beta3

From
Robert Creager
Date:
I've been having this problem since trying to upgrade from 7.4.1 to 8.03, and
now 8.1.

It's a dual Xenon machine:
Linux annette.stortek.com 2.4.22-26mdkenterprise #1 SMP Wed Jan 7 07:10:39 MST
2004 i686 unknown unknown GNU/Linux

PostgreSQL version is:
-------------------------------------------------------------------------------
----------------------------PostgreSQL 8.1beta3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1
(Mandrake Linux 9.2 3.3.1-2mdk)

[creagrs@annette creagrs]$ /usr/local/pgsql810/bin/pg_config
BINDIR = /usr/local/pgsql810/bin
DOCDIR = /usr/local/pgsql810/doc
INCLUDEDIR = /usr/local/pgsql810/include
PKGINCLUDEDIR = /usr/local/pgsql810/include
INCLUDEDIR-SERVER = /usr/local/pgsql810/include/server
LIBDIR = /usr/local/pgsql810/lib
PKGLIBDIR = /usr/local/pgsql810/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql810/man
SHAREDIR = /usr/local/pgsql810/share
SYSCONFDIR = /usr/local/pgsql810/etc
PGXS = /usr/local/pgsql810/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--enable-syslog' '--prefix=/usr/local/pgsql810' '--enable-debug'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wendif-labels
-fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,/usr/local/pgsql810/lib
LDFLAGS_SL =
LIBS = -lpgport -lz -lreadline -lncurses -lcrypt -lresolv -lnsl -ldl -lm -lbsd
VERSION = PostgreSQL 8.1beta3

I can reproduce this problem quite quickly with 20 concurrent queries (< 20
minutes with 5 sets of my app at 4 queries a piece, probably quicker).  The
queries nominally take < .1 second, but when all 5 sets run, the queries are up
around 5-6 seconds.  CS hit over 100k.  Below is a vmstat over a few seconds
which show the problem, then it subsides, then it hits again.  Note that when my
application gets up to 5 sets of 8, it never comes down again, and the queries
are over 40 seconds to complete.

When run against 7.4.1, the CS problem does not occur (stays < 10k).

[creagrs@annette creagrs]$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----r  b   swpd   free   buff  cache   si
so   bi    bo   in    cs us sy id wa
 
10  0   1032  63120     48 1608472    0    0     5    18    3     6 92  8  0  08  0   1032  62584     48 1608480    0
0     0    24  173 97793 39 61  0  09  0   1032  62332     48 1608500    0    0     0    32  197 83089 43 57  0  04  0
1032  61216     48 1608508    0    0     0    16  158 79967 43 57  0  06  0   1032  65208     48 1608556    0    0
0   88  262 91266 52 48  0  05  0   1032  67368     48 1608556    0    0     0   292  235 14234 87 13  0  04  0   1032
65780    48 1608580    0    0     0    40  173   325 95  5  0  04  0   1032  64764     48 1608580    0    0     0     0
144    86 98  2  0  06  0   1032  65800     48 1608580    0    0     0    24  197   397 96  4  0  06  0   1032  66284
 48 1608604    0    0     0    64  234  9284 83 17  0  06  0   1032  66104     48 1608604    0    0     0   116  253
8765856 44  0  05  0   1032  66276     48 1608612    0    0     0    56  245 20648 81 19  0  06  0   1032  65736     48
1608644   0    0     0    88  250 19699 82 18  0  05  0   1032  65316     48 1608652    0    0     0    40  215 73709
7030  0  05  0   1032  67664     48 1608700    0    0     0    96  272 22222 81 19  0  04  0   1032  66700     48
1608700   0    0     0    84  188   351 93  7  0  04  0   1032  67696     48 1608732    0    0     0    64  210   505
91 9  0  04  0   1032  67696     48 1608732    0    0     0     0  139    86 99  1  0  04  0   1032  64612     48
1608732   0    0     0     0  134    74 99  1  0  05  0   1032  65344     48 1608756    0    0     0    48  206   421
94 6  0  04  0   1032  66636     48 1608764    0    0     0    16  171   232 96  4  0  04  0   1032  67404     48
1608764   0    0     0     0  136    71 100  0  0  07  0   1032  63544     48 1608780    0    0     0    56  239 67250
6535  0  06  0   1032  65168     48 1608796    0    0     0    56  230 87782 59 41  0  07  0   1032  66588     48
1608820   0    0     0    72  236 106591 42 58  0  08  0   1032  62544     48 1608836    0    0     0   184  244 129696
4852  0  04  0   1032  63504     48 1608860    0    0     0    72  245 110887 48 52  0  08  0   1032  63400     48
1608868   0    0     0    16  156 101638 35 65  0  08  0   1032  62744     48 1608868    0    0     0    24  193 91159
4060  0  08  0   1032  63700     48 1608868    0    0     0    16  172 85165 42 58  0  07  0   1032  62888     48
1608876   0    0     0    48  228 94902 42 58  0  04  0   1032  64184     48 1608908    0    0     0    72  234 117585
4654  0  07  0   1032  61832     48 1608948    0    0     0    72  209 104287 46 54  0  08  0   1032  64468     48
1608956   0    0     0    56  237 112181 44 56  0  07  0   1032  63032     48 1608972    0    0     0    56  236 114977
4753  0  08  0   1032  63204     48 1608980    0    0     0    40  205 107433 45 55  0  07  0   1032  63832     48
1608996   0    0     0    56  225 113536 40 60  0  0
 

Help?

Thanks,
Rob

-- 
Robert Creager
Advisory Software Engineer
Data Management Group
Sun Microsystems
Robert.Creager@Sun.com
303.673.2365 Office
888.912.4458 Pager


Re: Seeing context switch storm with 10/13 snapshot of 8.1beta3

From
Tom Lane
Date:
Robert Creager <Robert.Creager@Sun.com> writes:
> I've been having this problem since trying to upgrade from 7.4.1 to 8.03, and
> now 8.1.

Can you put together a test case that other people could use to
reproduce it?
        regards, tom lane


Re: Seeing context switch storm with 10/13 snapshot of

From
Robert Creager
Date:
When grilled further on (Thu, 13 Oct 2005 22:44:54 -0400),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

> Robert Creager <Robert.Creager@Sun.com> writes:
> > I've been having this problem since trying to upgrade from 7.4.1 to 8.03,
and
> > now 8.1.
>
> Can you put together a test case that other people could use to
> reproduce it?
>

I can try.  The data size isn't big, but the scripts that run against it are,
and are dependent on our development enviornment.  What I'll have to do is pull
out the db portion of the app and see if I can simplify it - might work.  The
app is a test system that runs against our big storage libraries.

Is there anything I might be able to do (without the test case) that would help
figure out what's happening?

Cheers,
Rob

-- 21:09:29 up 11 days, 12:44,  7 users,  load average: 3.87, 4.27, 5.04
Linux 2.6.5-02 #8 SMP Mon Jul 12 21:34:44 MDT 2004

Re: Seeing context switch storm with 10/13 snapshot of 8.1beta3

From
Tom Lane
Date:
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> Is there anything I might be able to do (without the test case) that
> would help figure out what's happening?

oprofile stats would be enlightening, perhaps.

I'm particularly interested in why 7.4 is behaving better than newer
versions --- that does not compute ;-).  Have you checked that the same
query plan is being used by 7.4 and the newer versions?  Could you get
oprofile results for both versions?
        regards, tom lane


Re: Seeing context switch storm with 10/13 snapshot of

From
Simon Riggs
Date:
On Thu, 2005-10-13 at 21:20 -0600, Robert Creager wrote:
> When grilled further on (Thu, 13 Oct 2005 22:44:54 -0400),
> Tom Lane <tgl@sss.pgh.pa.us> confessed:
> 
> > Robert Creager <Robert.Creager@Sun.com> writes:
> > > I've been having this problem since trying to upgrade from 7.4.1 to 8.03,
> and
> > > now 8.1.
> > 
> > Can you put together a test case that other people could use to
> > reproduce it?
> > 
> 
> I can try.  The data size isn't big, but the scripts that run against it are,
> and are dependent on our development enviornment.  What I'll have to do is pull
> out the db portion of the app and see if I can simplify it - might work.  The
> app is a test system that runs against our big storage libraries.
> 
> Is there anything I might be able to do (without the test case) that would help
> figure out what's happening?

Please try this patch and see if it reduces the CS storm:

http://archives.postgresql.org/pgsql-patches/2005-10/msg00091.php

Do you have access to another similar machine to do comparative testing?
Do you have access to another machine with different CPU arch? It would
be good to firmly isolate this to a CPU architecture interaction issue.

Are the Xeons particularly old models? How new is the server?

Thanks,

Best Regards, Simon Riggs



Re: Seeing context switch storm with 10/13 snapshot of

From
Robert Creager
Date:
On Tue, 18 Oct 2005 00:25:25 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:
> 
> Please try this patch and see if it reduces the CS storm:
> 
> http://archives.postgresql.org/pgsql-patches/2005-10/msg00091.php

Yes, I will.  I'd been trying to figure out what triggered it, as I was unable
to reproduce it for a while, but it gets there eventually.  I had just
re-started with auto vacuum off, as that might be the trigger?  I'll apply the
patch, re-build and re-start with auto vacuum back on, they way I'm running when
I know I'll see it.

> 
> Do you have access to another similar machine to do comparative testing?
> Do you have access to another machine with different CPU arch? It would
> be good to firmly isolate this to a CPU architecture interaction issue.

I do have access to a windows version of the DL-380 but it's only a single CPU. 
I've a dual AMD server at home, but I cannot run against the db there (reverse
VPN?).  I'd need that reproducibility Tom wanted that I have not gotten to yet.

> 
> Are the Xeons particularly old models? How new is the server?

It is a HP Proliant (post Compaq merger) DL-380.  A couple years old.

Thanks,
Rob



Re: Seeing context switch storm with 10/13 snapshot of

From
Robert Creager
Date:
On Tue, 18 Oct 2005 10:29:43 -0600
Robert Creager <Robert.Creager@sun.com> wrote:

> On Tue, 18 Oct 2005 00:25:25 +0100
> Simon Riggs <simon@2ndquadrant.com> wrote:
> > 
> > Please try this patch and see if it reduces the CS storm:
> > 

Sorry, didn't work.  Took about an hour, and now it's now at the CS storm
(averaging 94k).  I've now disabled auto vacuum, just to see if that is a
trigger, and am re-running.

Thanks,
Rob


Re: Seeing context switch storm with 10/13 snapshot of

From
Robert Creager
Date:
On Tue, 18 Oct 2005 12:24:03 -0600
Robert Creager <Robert.Creager@sun.com> wrote:

> On Tue, 18 Oct 2005 10:29:43 -0600
> Robert Creager <Robert.Creager@sun.com> wrote:
> 
> > On Tue, 18 Oct 2005 00:25:25 +0100
> > Simon Riggs <simon@2ndquadrant.com> wrote:
> > > 
> > > Please try this patch and see if it reduces the CS storm:
> > > 
> 
> Sorry, didn't work.  Took about an hour, and now it's now at the CS storm
> (averaging 94k).  I've now disabled auto vacuum, just to see if that is a
> trigger, and am re-running.
> 

Vacuum (cron or auto) has no impact on the problem for 8.1beta3.

Tom had mentioned running oprofile, but I'm unable to get either 0.9.1 or 0.9.0
versions of oprofile working, so looks like I'll have to get a reproducible test
case...

Anyone know of a script that can replay a PostgreSQL log file?  Then I could log
all queries, wait till the problem hits, and then replay to see if that
reproduces it...

Cheers,
Rob


Re: Seeing context switch storm with 10/13 snapshot of

From
Josh Berkus
Date:
Robert,

> Anyone know of a script that can replay a PostgreSQL log file?  Then I
> could log all queries, wait till the problem hits, and then replay to
> see if that reproduces it...

log_statement=true in your postgresql.conf file.  The trick is weeding out 
all the other non-query stuff.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


Re: Seeing context switch storm with 10/13 snapshot of

From
Simon Riggs
Date:
On Tue, 2005-10-18 at 12:24 -0600, Robert Creager wrote:
> Robert Creager <Robert.Creager@sun.com> wrote:
> > Simon Riggs <simon@2ndquadrant.com> wrote:
> > > Please try this patch and see if it reduces the CS storm:

> Sorry, didn't work.  Took about an hour, and now it's now at the CS storm
> (averaging 94k).  I've now disabled auto vacuum, just to see if that is a
> trigger, and am re-running.

OK, thanks.

There was a possibility that there was some issue with the spinlock code
inadvertently grabbing a cache line with another lock in it.

Try this to recreate the problem:
http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php

Best Regards, Simon Riggs




Re: Seeing context switch storm with 10/13 snapshot of

From
Robert Creager
Date:
On Thu, 20 Oct 2005 21:19:18 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

> Try this to recreate the problem:
> http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php
> 

Yup, that does it.  Three hits the level I see with my application ~100k.  Two
hits about 50k, one does nothing (< 1k).

Interesting.  7.4.1 is worse for this test, as two jump up to 130k.  But, my app
runs fine against 7.4.1...

Would it still be helpful to try and pull together a test case from my app
against 8.1beta3?  I have not worked on it, and don't know when I'd get time to
(since I have a work around).

Cheers,
Rob


Re: Seeing context switch storm with 10/13 snapshot of

From
Tom Lane
Date:
Robert Creager <Robert.Creager@Sun.com> writes:
> Interesting.  7.4.1 is worse for this test, as two jump up to 130k.  But, my app
> runs fine against 7.4.1...

> Would it still be helpful to try and pull together a test case from my app
> against 8.1beta3?

Yes, if you can show a case where 8.1 is much worse than 7.4 then we
need to know about it yesterday.
        regards, tom lane


Re: Seeing context switch storm with 10/13 snapshot of

From
Simon Riggs
Date:
On Thu, 2005-10-20 at 14:59 -0600, Robert Creager wrote:
> On Thu, 20 Oct 2005 21:19:18 +0100
> Simon Riggs <simon@2ndquadrant.com> wrote:
> 
> > Try this to recreate the problem:
> > http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php
> > 
> 
> Yup, that does it.  Three hits the level I see with my application ~100k.  Two
> hits about 50k, one does nothing (< 1k).

OK, good. IYKWIM

Can you try a slight modification?

Run 3 threads, but against 3 different otherwise identical test tables
created using a name-only mod of the test script. e.g. test_data1, 2 and
3.

This will hit a different pattern of lwlocks.

If the CS is the same, then it will tell us that the issue is not data
dependent. If the CS drops, it tells us that it is an activity performed
on the precise data blocks rather than the shared data structures which
is the issue. That would then account for why the effect appears to come
and go in your own application, because the effect is actually dependant
on the data distribution (which presumably varies in your tables).

Just trying to more tightly bracket the failure-case behaviour....

Best Regards, Simon Riggs




Re: Seeing context switch storm with 10/13 snapshot of

From
Robert Creager
Date:
On Thu, 20 Oct 2005 23:28:21 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

> On Thu, 2005-10-20 at 14:59 -0600, Robert Creager wrote:
> > On Thu, 20 Oct 2005 21:19:18 +0100
> > Simon Riggs <simon@2ndquadrant.com> wrote:
> > 
> > > Try this to recreate the problem:
> > > http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php
> > > 
> > 
> > Yup, that does it.  Three hits the level I see with my application ~100k. 
> > Two hits about 50k, one does nothing (< 1k).
> 
> OK, good. IYKWIM
> 
> Can you try a slight modification?
> 
> Run 3 threads, but against 3 different otherwise identical test tables
> created using a name-only mod of the test script. e.g. test_data1, 2 and
> 3.

I modified the setup to create the first table, then selected it into the second
two tables, all in the same database.  Created three unique indexes (when I
didn't, there was no CS issue).

> 
> This will hit a different pattern of lwlocks.
> 
> If the CS is the same, then it will tell us that the issue is not data
> dependent. If the CS drops, it tells us that it is an activity performed
> on the precise data blocks rather than the shared data structures which
> is the issue. That would then account for why the effect appears to come
> and go in your own application, because the effect is actually dependant
> on the data distribution (which presumably varies in your tables).

The CS is the same on both 7.4.1 and 8.1b3 (with 8.1b3 showing 100k and 7.4.1
showing 170k using three clients).  I double checked the scripts to make sure...

Cheers,
Rob


Re: Seeing context switch storm with 10/13 snapshot of

From
Simon Riggs
Date:
On Thu, 2005-10-20 at 16:54 -0600, Robert Creager wrote:
> On Thu, 20 Oct 2005 23:28:21 +0100
> Simon Riggs <simon@2ndquadrant.com> wrote:
>  
> > If the CS is the same, then it will tell us that the issue is not data
> > dependent. If the CS drops, it tells us that it is an activity performed
> > on the precise data blocks rather than the shared data structures which
> > is the issue. That would then account for why the effect appears to come
> > and go in your own application, because the effect is actually dependant
> > on the data distribution (which presumably varies in your tables).
> 
> The CS is the same on both 7.4.1 and 8.1b3 (with 8.1b3 showing 100k and 7.4.1
> showing 170k using three clients).  I double checked the scripts to make sure...

So that means it is either a generic issue or likely to be a lwlock
issue on the main named locks. The drop in CS between releases is
consistent with the overall drop in contention as a result of the
redesign of the buffer manager.

Does definitely look like spinlock issues on that platform. It would be
good to test this on a much newer system. Tuning for older hardware
doesn't seem that worthwhile (no shouting, please!).

It would be good right now to have a multi-process test harness that
would allow us to test out different spin lock code without the rest of
PostgreSQL getting in the way of testing. If we can isolate the issue
outside of PostgreSQL it will be much easier to fix. Any hackers out
there looking to set up a spinlock test harness?

Best Regards, Simon Riggs





Re: Seeing context switch storm with 10/13 snapshot of

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> It would be good right now to have a multi-process test harness that
> would allow us to test out different spin lock code without the rest of
> PostgreSQL getting in the way of testing. If we can isolate the issue
> outside of PostgreSQL it will be much easier to fix.

Actually I disagree with that, on three grounds:

1. Optimizing for an artificial environment may result in the wrong
optimization.

2. I don't think you'll be able to prove anything except that all SMP
designs ultimately suck.  There is no hardware on the planet that can
trade cache lines back and forth at instruction-dispatch rates.

3. The problem right now is not lack of ability to reproduce the
problem, it is lack of ideas how to fix it.  Building an artificial
testbed is just directing effort into make-work rather than towards
solving the problem.
        regards, tom lane


Re: Seeing context switch storm with 10/13 snapshot of

From
Simon Riggs
Date:
On Fri, 2005-10-21 at 09:52 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > It would be good right now to have a multi-process test harness that
> > would allow us to test out different spin lock code without the rest of
> > PostgreSQL getting in the way of testing. If we can isolate the issue
> > outside of PostgreSQL it will be much easier to fix.
> 
> Actually I disagree with that, on three grounds:
> 
> 1. Optimizing for an artificial environment may result in the wrong
> optimization.
> 
> 2. I don't think you'll be able to prove anything except that all SMP
> designs ultimately suck.  There is no hardware on the planet that can
> trade cache lines back and forth at instruction-dispatch rates.
> 
> 3. The problem right now is not lack of ability to reproduce the
> problem, it is lack of ideas how to fix it.  Building an artificial
> testbed is just directing effort into make-work rather than towards
> solving the problem.

If we think spinlocks are the problem, building a spinlock test harness
will prove that and also simplify the testing of a solution. Isolating
the spinlocks in that way is not artificial, but actually a very pure
test, so although I agree with (1) as a general statement, this does not
apply for the test harness proposal.

I was seeing the problem as likely to rear its head again over time. We
are entering a stage of increased importance of SMP code, since within a
few short years all CPUs will be dual core/HT or something similar. We
may fix it for one platform, but other similar problems may re/emerge.
The easiest way to test spinlock code on any platform is to get an
isolated test case that is runnable outside of the context of
PostgreSQL, yet using the pg spinlock code. That would allow us to bring
in other people and their many eyeballs to look at the issue; we on this
list are not experts at everything.

Certainly there is a lack of ideas as to how to fix it, as you mention
in (3). This shows to me that the solution lies in one of two areas: a)
the solution has not yet been considered or b) the solution has already
been thought of and for whatever reason disregarded. You may be certain
that the solution lies in a), though I am not. Rejecting ideas quickly
may simply increase the chances of finding the solution in a b) case.

Forgive me but (2) seems spurious. Nobody said anything about trading
cache lines at instruction-dispatch rates. The objective of the test
harness would be to check whether negative effects such as CS storms
exist on that platform. Actual optimization of the spinlock mechanisms
in the context of the PostgreSQL server should certainly be done within
the code.

In a more general sense: what is the best next action to make progress
on the CS issue that exists for certain CPUs?

Best Regards, Simon Riggs




Re: Seeing context switch storm with 10/13 snapshot of

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Certainly there is a lack of ideas as to how to fix it, as you mention
> in (3). This shows to me that the solution lies in one of two areas: a)
> the solution has not yet been considered or b) the solution has already
> been thought of and for whatever reason disregarded. You may be certain
> that the solution lies in a), though I am not. Rejecting ideas quickly
> may simply increase the chances of finding the solution in a b) case.

However, building a spinlock test harness presupposes that the solution
lies in the spinlock code itself, and not in (say) changing our usage
patterns of it.  So I'd throw the "rejecting ideas too quickly"
challenge right back at you.  What we need to optimize is the behavior
in the real context of Postgres, not in a test harness.
        regards, tom lane


Re: Seeing context switch storm with 10/13 snapshot of

From
Bruce Momjian
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Certainly there is a lack of ideas as to how to fix it, as you mention
> > in (3). This shows to me that the solution lies in one of two areas: a)
> > the solution has not yet been considered or b) the solution has already
> > been thought of and for whatever reason disregarded. You may be certain
> > that the solution lies in a), though I am not. Rejecting ideas quickly
> > may simply increase the chances of finding the solution in a b) case.
> 
> However, building a spinlock test harness presupposes that the solution
> lies in the spinlock code itself, and not in (say) changing our usage
> patterns of it.  So I'd throw the "rejecting ideas too quickly"
> challenge right back at you.  What we need to optimize is the behavior
> in the real context of Postgres, not in a test harness.

How do other databases deal with this?  I can't imagine we are the only
ones.  Are we doing something different than them?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Seeing context switch storm with 10/13 snapshot of

From
Simon Riggs
Date:
On Fri, 2005-10-21 at 17:17 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Certainly there is a lack of ideas as to how to fix it, as you mention
> > in (3). This shows to me that the solution lies in one of two areas: a)
> > the solution has not yet been considered or b) the solution has already
> > been thought of and for whatever reason disregarded. You may be certain
> > that the solution lies in a), though I am not. Rejecting ideas quickly
> > may simply increase the chances of finding the solution in a b) case.
> 
> However, building a spinlock test harness presupposes that the solution
> lies in the spinlock code itself

(I thought I had said) that was actually the main test, rather than a
presupposition. The solution may lie there or not and when it is built
we would be able to determine where it is. If the issue is with the
spinlock code itself, it will then help us find the solution. If it is
not, then we will know to look elsewhere.

It is important that we isolate the problem in such a way that others
can help fix it. If we do it only within the database, then the special
context will force many to say they cannot help us.

> patterns of it.  So I'd throw the "rejecting ideas too quickly"
> challenge right back at you.  

My words were inspired by a desire to leave the Doldrums only, not to
cause other issues. I apologise for any offence given.

> What we need to optimize is the behavior
> in the real context of Postgres, not in a test harness.

That is the end goal, yes.

My sub-goal on that path is problem isolation (only).

Best Regards, Simon Riggs



Re: Seeing context switch storm with 10/13 snapshot of

From
Tom Lane
Date:
Robert Creager <Robert.Creager@Sun.com> writes:
> I have a Perl script, a Perl module and a 1Mb database (from pg_dump
> -F c).  Are you interested at this time in receiving this?

Sure.  Please send it off-list, of course.

> An upgrade to RC2 might occur when RC2 comes out, unless there would
> be great benefit on running my tests against 8.1RC1.

There isn't going to be an RC2, barring catastrophic problems being
identified *very* shortly ...
        regards, tom lane


Re: Seeing context switch storm with 10/13 snapshot of

From
Robert Creager
Date:
On Thu, 20 Oct 2005 17:35:31 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Creager <Robert.Creager@Sun.com> writes:
> > Interesting.  7.4.1 is worse for this test, as two jump up to 130k.  But, my
> > app runs fine against 7.4.1...
> 
> > Would it still be helpful to try and pull together a test case from my app
> > against 8.1beta3?
> 
> Yes, if you can show a case where 8.1 is much worse than 7.4 then we
> need to know about it yesterday.
> 

Ok.  I finally have a stand alone test case that does not show that 8.1 is worse
than 7.4.  It shows the opposite quite nicely ;-).  In tests I'm running, 7.4.1
bogs down immediately and never recovers.  8.1 behaves itself for about 20
minutes before it goes out to lunch.  So, in those terms, it is worse as it
looks fine at the start, but then gets progressively worse, but overall, it's
much better.

I have a Perl script, a Perl module and a 1Mb database (from pg_dump -F c).  Are
you interested at this time in receiving this?  I plan on taking it home and
running against my AMD system to see what it does, but that will be against 8.0
if I remember correctly, maybe 8.03.  An upgrade to RC2 might occur when RC2
comes out, unless there would be great benefit on running my tests against
8.1RC1.

Cheers,
Rob