Spinlocks, yet again: a new test case - Mailing list pgsql-hackers

From Tom Lane
Subject Spinlocks, yet again: a new test case
Date
Msg-id 10234.1126475469@sss.pgh.pa.us
Whole thread Raw
Responses Re: Spinlocks, yet again: a new test case
Re: Spinlocks, yet again: a new test case
Re: Spinlocks, yet again: a new test case
List pgsql-hackers
I've recently been taking another look at the test case shown here:
http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php
This is the infamous "context swap storm" problem that we've hacking
at for so long.  After the 8.1 buffer manager redesign, the problem
of contention for the BufMgrLock is pretty much gone, and so the
test case no longer exhibits a lot of semop()s.  However ... performance
of CVS tip on a multiprocessor still pretty much sucks :-(

I started looking at this because of a discussion with a Red Hat
customer, who was saying that the "futex spinlock" patch shown here
http://archives.postgresql.org/pgsql-performance/2004-10/msg00185.php
had helped him materially with a context-swap-storm problem on 8.0.3.
That didn't make a lot of sense to me, because we had determined that
the futex patch didn't help much for this problem (which is one reason
why it never got applied).  So I started digging, and I've found a few
interesting things that we still need to fix.

First off, the RH customer had found that the futex patch improved
his query performance even though the context swap rate actually went
*up* materially.  So I realized that we shouldn't simply be looking
at the CS rate and saying "ugh, we have to get that down".  It's
important to find out how much work is really getting done, too.
Accordingly, I generated a modified version of the test case with
a shorter query that is actually meant to complete in a reasonable
amount of time, and I looked at how long it took to complete multiple
instances of this query running in parallel.  Ideally, if you have
N cpus you should be able to run up to N copies of the query in the
same amount of time it takes for one copy, and then, say, 2*N copies
ought to take about twice as long as N copies.  We do actually seem
to scale pretty linearly beyond N ... but below that, not.

The files for the updated test case are attached if anyone else wants
to try it.  They are:

    test_setup.sql    Run this to create the test tables

    test_run.sql    A version of the test query that will run
            pretty nearly forever (until you kill it)

    test_run_small.sql    A one-minute-or-so test query

    startn.sh    A shell script to run N copies of
            test_run_small.sql in parallel.

You need this one-time setup:

    createdb test
    psql -f test_setup.sql test

To perform testing, do "time ./startn.sh N" for N=1,2,... up to however
many CPUs you have, and maybe a run with more queries than that to make
sure performance doesn't drop off above that.  Note the total time to
complete the script for each N.  You can also watch the CPU utilization
and context swap rates using "vmstat 1" or equivalent, and/or strace
some of the backends to see what they're doing.

I did my testing on a 4-way Opteron box at Red Hat, using kernel
2.6.12-1.1398_FC4smp.  8.0.3 completely sucks in this test:

N, runtime:    1 45s    2 119s    4 507s    (didn't try 8)

With 4 processes running, it shows about 40K context swaps/sec
and barely 15% CPU utilization!  strace shows plenty of semop()s
as expected, but also a fair number of delaying select() calls
from s_lock().  I saw about one select for every hundred semops,
but that was still enough to have 30 or so selects per second
(per process).  Since each select sleeps 10msec, that's about a
third of the potential CPU usage being discarded in spinlock delays.

(Even more interesting, this test case behaves much worse on Opteron
than on other SMP boxes.  I had previously done similar tests on
a 4-way Xeon with similar-vintage kernel, and saw context swap rates
over 140K/sec with about 70% CPU load.  I don't have test_run_small
results from that session, but clearly the spinlock code is behaving
worse on the Opterons.)

CVS tip eliminates essentially all the semops, as we hoped, but
scalability is still not great:

N, runtime:    1 36s    2 61s    4 105s    8 198s

At 4 processes, CPU use is only about 30%; the rate of select() delays
has gone up to 50/sec or more, so even though we fixed the bufmgr itself,
we are still losing a lot of potential CPU cycles to spinlock waits.

(Note: I later realized that the CVS-tip code was accidentally built
with -O1, not -O2 as the 8.0.3 code had been, so these two sets of
numbers are not completely comparable: with -O2 the single-query time
drops to 31s in CVS tip.  But the main point is all the time that's
disappearing into spinlock delays.)

I have some proposals for fixing this stuff, which I'll put into a
separate message since this one is pretty long already.  What I wanted
to accomplish here is to document the test case and the problem, and
that's done.

Anyone have SMP boxes of other types that they can try this on?

            regards, tom lane

drop table test_data;

create table test_data(f1 int);

insert into test_data values (random() * 100);
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;

create index test_index on test_data(f1);

vacuum verbose analyze test_data;
checkpoint;
-- force nestloop indexscan plan
set enable_seqscan to 0;
set enable_bitmapscan to 0;
set enable_mergejoin to 0;
set enable_hashjoin to 0;

explain
select count(*) from test_data a, test_data b, test_data c
where a.f1 = b.f1 and b.f1 = c.f1;

select count(*) from test_data a, test_data b, test_data c
where a.f1 = b.f1 and b.f1 = c.f1;
-- force nestloop indexscan plan
set enable_seqscan to 0;
set enable_bitmapscan to 0;
set enable_mergejoin to 0;
set enable_hashjoin to 0;

select count(*) from test_data a, test_data b where a.f1 = b.f1;
#!/bin/sh

N=0

while [ $N -lt $1 ]
do
  psql -f test_run_small.sql test >/dev/null &
  N=`expr $N + 1`
done

wait

pgsql-hackers by date:

Previous
From: Michael Fuhr
Date:
Subject: Re: problem for o/p
Next
From: Tom Lane
Date:
Subject: Spinlocks, yet again: analysis and proposed patches