Thread: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile

Hi,

I've been running some tests on pg 9.2beta1 and in particular a set
of queries like

create table _tmp0 as select * from (
       select *, (select healpixid from idt_match as m where
                                       m.transitid=o.transitid)
               as x from idt_photoobservation as o offset 0
         ) as y where x%16=ZZZ order by x;
 

(where ZZZ is some integer number 0<=ZZZ<16)

With the following plan:
----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3228814504.96..3228815137.21 rows=252902 width=498)
   Sort Key: y.x
   ->  Subquery Scan on y  (cost=0.00..3228791809.25 rows=252902 width=498)
         Filter: ((y.x % 16::bigint) = 0)
         ->  Limit  (cost=0.00..3228033102.41 rows=50580456 width=490)
               ->  Seq Scan on idt_photoobservation o  (cost=0.00..3228033102.41 rows=50580456 width=490)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8)
                             Index Cond: (transitid = o.transitid)
 

The schemas of the tables are:

e2es2=> \d idt_match
     Table "public.idt_match"
  Column   |   Type   | Modifiers
-----------+----------+-----------
 sourceid  | bigint   |
 transitid | bigint   |
 healpixid | bigint   |
 flagsxm   | smallint |
Indexes:
    "idt_match_idx" btree (healpixid)
    "idt_match_transitid_idx" btree (transitid)

Table "public.idt_photoobservation"
  Column   |   Type   | Modifiers
-----------+----------+-----------
 transitid | bigint   |
 fluxbp0   | real     |
....
more columns
....
Indexes:
    "idt_photoobservation_idx" btree (transitid)

And I noticed that when I run queries like the one shown above in parallel 
(in multiple connections for ZZZ=0...8) the performance of each query 
drops significantly (by a factor of 2), despite the fact that during the 
execution of the query postgres is mostly CPU bound.

So I tried to oprofile it and strace it, and that's what I saw:

The strace -c of each process shows something like this:
#########
Process 18660 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 76.25    0.001342           0    268987           semop
 23.75    0.000418           0     61694           read
  0.00    0.000000           0       138           lseek
  0.00    0.000000           0       355           select
  0.00    0.000000           0         3           kill
------ ----------- ----------- --------- --------- ----------------
100.00    0.001760                331177           total
#######

And the oprofile shows this on top:
-------------------------------------------------------------------------------
2863981  25.7117  ReleasePredicateLocks
  2863981  100.000  ReleasePredicateLocks [self]
-------------------------------------------------------------------------------
1246629  11.1917  LocalBufferAlloc
  1246629  100.000  LocalBufferAlloc [self]
-------------------------------------------------------------------------------
1135393  10.1931  CheckForSerializableConflictIn
  1135393  100.000  CheckForSerializableConflictIn [self]
------------------------------------------------------------

So there is a lot of locking for some reason, and I was surprised to see 
anything related to SSI in the profile at all, because I'm not running 
serializable transactions (I was just running my queries from multiple 
psql sessions).

Is there a reasonable explanation for what I'm seeing?

Thanks in advance,
    Sergey

More info: there is no other concurrent activity on the machine.
shared_buffers is 10G, the machine has ~60G of RAM, 24 cores, and a proper 
RAID.

Another note is that the oprofile and strace were obtained during the 
stage when the idt_photo.. table was scanned and data was being stored in 
the pgsql_tmp (not during the last bit when the data in pgsql_tmp were
actually being sorted).


*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK


On Thu, May 24, 2012 at 8:24 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Hi,
>
> I've been running some tests on pg 9.2beta1 and in particular a set
> of queries like
>
> create table _tmp0 as select * from (
>        select *, (select healpixid from idt_match as m where
>                                        m.transitid=o.transitid)
>                as x from idt_photoobservation as o offset 0
>          ) as y where x%16=ZZZ order by x;
>
> (where ZZZ is some number integer number 0<=ZZZ<16)
>
> With the following plan:
>
----------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=3228814504.96..3228815137.21 rows=252902 width=498)
>   Sort Key: y.x
>   ->  Subquery Scan on y  (cost=0.00..3228791809.25 rows=252902 width=498)
>         Filter: ((y.x % 16::bigint) = 0)
>         ->  Limit  (cost=0.00..3228033102.41 rows=50580456 width=490)
>               ->  Seq Scan on idt_photoobservation o
>  (cost=0.00..3228033102.41 rows=50580456 width=490)
>                     SubPlan 1
>                       ->  Index Scan using idt_match_transitid_idx on
> idt_match m  (cost=0.00..63.74 rows=1 width=8)
>                             Index Cond: (transitid = o.transitid)
>
> The schema of the tables are:
>
> e2es2=> \d idt_match
>     Table "public.idt_match"
>  Column   |   Type   | Modifiers
> -----------+----------+-----------
>  sourceid  | bigint   |
>  transitid | bigint   |
>  healpixid | bigint   |
>  flagsxm   | smallint |
> Indexes:
>    "idt_match_idx" btree (healpixid)
>    "idt_match_transitid_idx" btree (transitid)
>
> Table "public.idt_photoobservation"
>  Column   |   Type   | Modifiers
> -----------+----------+-----------
>  transitid | bigint   |
>  fluxbp0   | real     |
> ....
> more columns
> ....
> Indexes:
>    "idt_photoobservation_idx" btree (transitid)
>
> And I noticed than when I run the query like the one shown above in parallel
> (in multiple connections for ZZZ=0...8) the performance of each query drops
> down significantly (factor of 2)(despite the fact that during the execution
> of the query postgres is mostly CPU bound).
>
> So I tried to oprofile it and strace it, And that's what I saw:
>
> The strace -c of each process  shows something like that
> #########
> Process 18660 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  76.25    0.001342           0    268987           semop
>  23.75    0.000418           0     61694           read
>  0.00    0.000000           0       138           lseek
>  0.00    0.000000           0       355           select
>  0.00    0.000000           0         3           kill
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.001760                331177           total
> #######
>
> And the oprofile shows this on top:
> -------------------------------------------------------------------------------
> 2863981  25.7117  ReleasePredicateLocks
>  2863981  100.000  ReleasePredicateLocks [self]
> -------------------------------------------------------------------------------
> 1246629  11.1917  LocalBufferAlloc
>  1246629  100.000  LocalBufferAlloc [self]
> -------------------------------------------------------------------------------
> 1135393  10.1931  CheckForSerializableConflictIn
>  1135393  100.000  CheckForSerializableConflictIn [self]
> ------------------------------------------------------------
>
> So there is a lot of locking for some reason, And I was surprised to see
> anything related to SSI in the profile at all, because I'm not running
> serializable transactions (I was just running my queries from multiple psql
> sessions).

Are you sure?  I looked at all the ReleasePredicateLocks calls and
they appear to be guarded by:
       /* Nothing to do if this is not a serializable transaction */
       if (MySerializableXact == InvalidSerializableXact)
               return false;

What's the default isolation mode set to?

merlin


On Thu, 24 May 2012, Merlin Moncure wrote:

> Are you sure?  I looked at all the ReleasePredicateLocks calls and
> they appear to be guarded by:
>
>        /* Nothing to do if this is not a serializable transaction */
>        if (MySerializableXact == InvalidSerializableXact)
>                return false;
> What's the default isolation mode set to?

Sorry, it was indeed my mistake. I was pointing opreport to the 
binary which was recompiled, while the old version was still running, so 
I guess oprofile was picking up the wrong function names.

Here is the correct oprofile when multiple queries are running:
samples  %        symbol name
-------------------------------------------------------------------------------
952457   13.8715  LWLockAcquire
  952457   100.000  LWLockAcquire [self]
-------------------------------------------------------------------------------
779077   11.3464  PinBuffer
  779077   100.000  PinBuffer [self]
-------------------------------------------------------------------------------
759898   11.0671  s_lock
  759898   100.000  s_lock [self]
-------------------------------------------------------------------------------
689753    6.7178  slot_deform_tuple
  689753   100.000  slot_deform_tuple [self]
-------------------------------------------------------------------------------
526002    5.1230  UnpinBuffer
  526002   100.000  UnpinBuffer [self]

When only one is running:
samples  %        symbol name
-------------------------------------------------------------------------------
163268   14.0343  slot_deform_tuple
  163268   100.000  slot_deform_tuple [self]
-------------------------------------------------------------------------------
126018   10.8324  _bt_compare
  126018   100.000  _bt_compare [self]
-------------------------------------------------------------------------------
113975    9.7972  ExecProject
  113975   100.000  ExecProject [self]
-------------------------------------------------------------------------------
49760     4.2773  FunctionCall2Coll
  49760    100.000  FunctionCall2Coll [self]
-------------------------------------------------------------------------------
49164     4.2261  LWLockAcquire
  49164    100.000  LWLockAcquire [self]
-------------------------------------------------------------------------------
43526     3.7414  hash_search_with_hash_value
  43526    100.000  hash_search_with_hash_value [self]

############

I guess there is nothing catastrophically wrong with that, but still I'm 
very surprised that you get severe locking problems (factor of two 
slow-down) when running parallel read-only transactions.
    Sergey

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK


On Thu, May 24, 2012 at 6:24 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Hi,
>
> I've been running some tests on pg 9.2beta1 and in particular a set
> of queries like
...
>
> And I noticed than when I run the query like the one shown above in parallel
> (in multiple connections for ZZZ=0...8) the performance of each query drops
> down significantly (factor of 2)(despite the fact that during the execution
> of the query postgres is mostly CPU bound).

Is this a regression from previous versions (9.0, 9.1), or has the
slow down always been there?

Cheers,

Jeff


On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> I guess there is nothing catastrophically wrong with that, but still I'm
> very suprised that you get severe locking problems (factor of two slow-down)
> when running parallel read-only transactions.

Me, too.  How many concurrent connections are you running, and does
your working set exceed shared_buffers?  Can you provide a
self-contained reproducible test case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, 24 May 2012, Robert Haas wrote:

> On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> I guess there is nothing catastrophically wrong with that, but still I'm
>> very suprised that you get severe locking problems (factor of two slow-down)
>> when running parallel read-only transactions.
>
> Me, too.  How many concurrent connections are you running, and does
> your working set exceed shared_buffers?  Can you provide a
> self-contained reproducible test case?

The last tests I've been doing were with 8 connections.
And the working set is roughly 30Gig, which is ~ 3x the shared buffers. 
(but ~ 50% of RAM).

Regarding the test-case, I'll try to see whether I can still observe the 
same slowdown if I chop the main table by a factor of a few, so I can 
put the data somewhere for download.
    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Thu, May 24, 2012 at 2:19 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Thu, 24 May 2012, Robert Haas wrote:
>> On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov <koposov@ast.cam.ac.uk>
>> wrote:
>>>
>>> I guess there is nothing catastrophically wrong with that, but still I'm
>>> very suprised that you get severe locking problems (factor of two
>>> slow-down)
>>> when running parallel read-only transactions.
>>
>> Me, too.  How many concurrent connections are you running, and does
>> your working set exceed shared_buffers?  Can you provide a
>> self-contained reproducible test case?
>
> The last tests I've been doing were with 8 connections.
> And the working set is roughly 30Gig, which is ~ 3x the shared buffers. (but
> ~ 50% of RAM).

Given that additional information, I would say these results are
expected.  Unfortunately, our BufFreelistLock is a serious contention
point, and I think that's what you're hitting.  See the graph here:

http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

As you can see, raw performance isn't much worse with the larger data
sets, but scalability at high connection counts is severely degraded
once the working set no longer fits in shared_buffers.
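
For anyone following along, the path in question is StrategyGetBuffer() in
src/backend/storage/buffer/freelist.c.  Roughly (paraphrased and heavily
trimmed, not a verbatim copy of the 9.2 source), once a backend's ring
strategy can't hand back a buffer, every victim-buffer selection funnels
through the one global BufFreelistLock:

    /* paraphrased sketch of 9.2's StrategyGetBuffer(), trimmed */
    if (strategy != NULL)
    {
        buf = GetBufferFromRing(strategy);  /* per-backend ring, if any */
        if (buf != NULL)
            return buf;
    }

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);   /* the single global lock */

    while (StrategyControl->firstFreeBuffer >= 0)
    {
        /* try to pop a usable buffer off the shared freelist ... */
    }

    /* ... otherwise run the clock sweep, still holding BufFreelistLock */
    for (;;)
    {
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        /* advance the sweep hand, decrement usage_count, pick a victim */
    }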

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, May 24, 2012 at 1:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 24, 2012 at 2:19 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> On Thu, 24 May 2012, Robert Haas wrote:
>>> On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov <koposov@ast.cam.ac.uk>
>>> wrote:
>>>>
>>>> I guess there is nothing catastrophically wrong with that, but still I'm
>>>> very suprised that you get severe locking problems (factor of two
>>>> slow-down)
>>>> when running parallel read-only transactions.
>>>
>>> Me, too.  How many concurrent connections are you running, and does
>>> your working set exceed shared_buffers?  Can you provide a
>>> self-contained reproducible test case?
>>
>> The last tests I've been doing were with 8 connections.
>> And the working set is roughly 30Gig, which is ~ 3x the shared buffers. (but
>> ~ 50% of RAM).
>
> Given that additional information, I would say these results are
> expected.  Unfortunately, our BufFreelistLock is a serious contention
> point, and I think that's what you're hitting.  See the graph here:
>
> http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html
>
> As you can see, raw performance isn't much worse with the larger data
> sets, but scalability at high connection counts is severely degraded
> once the working set no longer fits in shared_buffers.

Hm, wouldn't the BufFreelistLock issue be ameliorated if
StrategyGetBuffer could reserve multiple buffers so that you'd draw
down your local list and only then go back to the global pool? (easier
said than done obviously).

merlin


On Thu, May 24, 2012 at 2:24 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> As you can see, raw performance isn't much worse with the larger data
>> sets, but scalability at high connection counts is severely degraded
>> once the working set no longer fits in shared_buffers.
>
> Hm, wouldn't the BufFreelistLock issue be ameliorated if
> StrategyGetBuffer could reserve multiple buffers so that you'd draw
> down your local list and only then go back to the global pool? (easier
> said than done obviously).

hm, looking at the code some more, it looks like the whole point of
the strategy system is to do that.  ISTM bulk insert type queries
would be good candidates for a buffer strategy somehow?

merlin


On Thu, 24 May 2012, Robert Haas wrote:

> As you can see, raw performance isn't much worse with the larger data
> sets, but scalability at high connection counts is severely degraded
> once the working set no longer fits in shared_buffers.

Actually the problem persists even when I trim the dataset size to be within
the shared_buffers.

Here is the dump (0.5 gig in size, tested with shared_buffers=10G,
work_mem=500Mb):
http://www.ast.cam.ac.uk/~koposov/files/dump.gz
And I attach the script

For my toy dataset the performance of a single thread goes down 
from ~6.4 to 18 seconds (~3 times worse).

And actually, while running the script repeatedly on my main machine, for 
some reason I saw some variation in how much slower the threaded execution 
is than a single thread.

Now I see 25 seconds for a multi-threaded run vs the same ~6 seconds for a 
single thread.

The oprofile shows:
-------------------------------------------------------------------------------
782355   21.5269  s_lock
  782355   100.000  s_lock [self]
-------------------------------------------------------------------------------
709801   19.5305  PinBuffer
  709801   100.000  PinBuffer [self]
-------------------------------------------------------------------------------
326457    8.9826  LWLockAcquire
  326457   100.000  LWLockAcquire [self]
-------------------------------------------------------------------------------
309437    8.5143  UnpinBuffer
  309437   100.000  UnpinBuffer [self]
-------------------------------------------------------------------------------
252972    6.9606  ReadBuffer_common
  252972   100.000  ReadBuffer_common [self]
-------------------------------------------------------------------------------
201558    5.5460  LockBuffer
  201558   100.000  LockBuffer [self]
------------------------------------------------------------

It is interesting that on another machine with much smaller shared memory 
(3G), smaller RAM (12G), a smaller number of CPUs, and PG 9.1 running, I was 
getting consistently ~7.2 vs 4.5 sec (for multi vs single thread).

PS Just in case: the CPU on the main machine I'm testing on is a Xeon(R) 
E7-4807 (the total number of real cores is 24).





*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Thu, May 24, 2012 at 12:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, May 24, 2012 at 2:24 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> As you can see, raw performance isn't much worse with the larger data
>>> sets, but scalability at high connection counts is severely degraded
>>> once the working set no longer fits in shared_buffers.
>>
>> Hm, wouldn't the BufFreelistLock issue be ameliorated if
>> StrategyGetBuffer could reserve multiple buffers so that you'd draw
>> down your local list and only then go back to the global pool? (easier
>> said than done obviously).
>
> hm, looking at the code some more, it looks like the whole point of
> the strategy system is to do that.

I thought you were suggesting that StrategyGetBuffer would
pre-allocate multiple buffers to a backend under the cover of a single
BufFreelistLock.  If that is what you were suggesting, that is not
what the strategy system is currently for.  It is for locally reusing
buffers, not for gang-allocating them.

If a backend could somehow predict that the buffer it is about to read
in is likely going to be a cold buffer, perhaps it would make sense
for each backend to maintain a small ring of its own which it can
reuse for such cold buffers.
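
(For what it's worth, the existing per-backend ring in freelist.c is already
more or less that shape -- paraphrased from the 9.2 source, field names
approximate; a cold-buffer ring would presumably look much the same:

    typedef struct BufferAccessStrategyData
    {
        BufferAccessStrategyType btype; /* BAS_BULKREAD, BAS_BULKWRITE, ... */
        int         ring_size;          /* number of slots in buffers[] */
        int         current;            /* index of the "current" slot */
        bool        current_was_in_ring;    /* did last buffer come from the ring? */
        Buffer      buffers[1];         /* really ring_size entries */
    } BufferAccessStrategyData;

The hard part would be deciding, before the read, that the page is going to
be cold enough to be worth confining to such a ring.)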



> ISTM bulk insert type queries
> would be good candidates for a buffer strategy somehow?

Probably.  There is a code or README comment to that effect that I
stumbled upon just a couple hours ago, but can't immediately re-find
it.

Cheers,

Jeff


On Thu, May 24, 2012 at 3:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, May 24, 2012 at 2:24 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> As you can see, raw performance isn't much worse with the larger data
>>> sets, but scalability at high connection counts is severely degraded
>>> once the working set no longer fits in shared_buffers.
>>
>> Hm, wouldn't the BufFreelistLock issue be ameliorated if
>> StrategyGetBuffer could reserve multiple buffers so that you'd draw
>> down your local list and only then go back to the global pool? (easier
>> said than done obviously).
>
> hm, looking at the code some more, it looks like the whole point of
> the strategy system is to do that.  ISTM bulk insert type queries
> would be good candidates for a buffer strategy somehow?

Yep.

commit 85e2cedf985bfecaf43a18ca17433070f439fb0e
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Nov 6 20:51:15 2008 +0000
   Improve bulk-insert performance by keeping the current target buffer pinned
   (but not locked, as that would risk deadlocks).  Also, make it work in a sma
   ring of buffers to avoid having bulk inserts trash the whole buffer arena.

   Robert Haas, after an idea of Simon Riggs'.

But that doesn't help here, since pgbench is only looking up one row per query.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, May 24, 2012 at 3:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 24, 2012 at 3:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> hm, looking at the code some more, it looks like the whole point of
>> the strategy system is to do that.  ISTM bulk insert type queries
>> would be good candidates for a buffer strategy somehow?
>
> Yep.
>
> commit 85e2cedf985bfecaf43a18ca17433070f439fb0e
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Thu Nov 6 20:51:15 2008 +0000
>
>    Improve bulk-insert performance by keeping the current target buffer pinned
>    (but not locked, as that would risk deadlocks).  Also, make it work in a sma
>    ring of buffers to avoid having bulk inserts trash the whole buffer arena.
>
>    Robert Haas, after an idea of Simon Riggs'.
>
> But that doesn't help here, since pgbench is only looking up one row per query.

Wait -- the OP's gripe isn't regarding standard pgbench, but multiple
large concurrent 'insert into foo select...'.  I looked in the code
and it appears that the only operations using a bulk insert strategy are
copy, create table as select, and table rewrite operations. Concurrent
INSERT SELECT apparently doesn't get the benefit of a strategy and
should be fighting over the freelist once the pool exhausts.
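
For reference, the plumbing those paths go through is the BulkInsertState
API in heapam.c; a minimal sketch of how a bulk-loading path uses it (this
is an illustration, not actual backend code -- next_tuple_to_load() is a
stand-in for whatever produces the tuples):

    BulkInsertState bistate = GetBulkInsertState(); /* sets up a BAS_BULKWRITE ring */
    HeapTuple   tup;

    while ((tup = next_tuple_to_load()) != NULL)
        heap_insert(rel, tup, GetCurrentCommandId(true),
                    0 /* options, e.g. HEAP_INSERT_SKIP_FSM */, bistate);

    FreeBulkInsertState(bistate);

Plain INSERT ... SELECT in 9.2 never creates such a bistate, so it evicts
through the shared freelist instead.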

We don't get to skip wal of course, but we should be able to use a
bulk insert strategy, especially if there was some way of predicting
that a large number of tuples were going to be inserted.  I'm
wondering though whether contention on the free list is in fact the OP's
problem.

merlin


On Thu, May 24, 2012 at 4:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Wait -- OP's gripe this isn't regarding standard pgbench, but multiple
> large concurrent 'insert into foo select...'.  I looked in the code
> and it appears that the only bulk insert strategy using operations are
> copy, create table as select, and table rewrite operations. Concurrent
> INSERT SELECT apparently doesn't get the benefit of a strategy and
> should be fighting over the freelist once the pool exhausts.

I think you are right.

> We don't get to skip wal of course, but we should be able to use a
> bulk insert strategy, especially if there was some way of predicting
> that a large number of tuples were going to be inserted.  I'm
> wondering though of contention on the free list is in fact the OP's
> problem.

Not sure.  It might be some other LWLock, but it's hard to tell which
one from the information provided.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, May 24, 2012 at 4:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> We don't get to skip wal of course, but we should be able to use a
>> bulk insert strategy, especially if there was some way of predicting
>> that a large number of tuples were going to be inserted.  I'm
>> wondering though of contention on the free list is in fact the OP's
>> problem.

> Not sure.  It might be some other LWLock, but it's hard to tell which
> one from the information provided.

Yeah.  It seems quite plausible that Robert's select-only benchmark might
be mainly tripping over the freelist lock, but I'm less convinced about
something that's doing INSERT/SELECT, and therefore is also doing a lot
of WAL activity, index insertions, etc.  I'd want to see some
instrumentation results before assuming we know where the bottleneck is
there.
        regards, tom lane


Hi,

On Thu, 24 May 2012, Robert Haas wrote:
> Not sure.  It might be some other LWLock, but it's hard to tell which
> one from the information provided.

If you could tell me the best way to find out the info that you need, 
then I could run it reasonably quickly.
    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Thu, May 24, 2012 at 3:36 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Hi,
>
>
> On Thu, 24 May 2012, Robert Haas wrote:
>>
>> Not sure.  It might be some other LWLock, but it's hard to tell which
>> one from the information provided.
>
>
> If you could tell what's the best way to find out the info that you need,
> then I could run it reasonably quickly.

Add
#define LWLOCK_STATS
near the top of:
src/backend/storage/lmgr/lwlock.c

and recompile and run a reduced-size workload.  When the processes
exit, they will dump a lot of data about LWLock usage to the logfile.
Generally the LWLock with the most blocks on it will be the main
culprit.
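
For what it's worth, the lines those runs print come out of lwlock.c in
roughly this shape (paraphrased from memory of the 9.2 code, so treat the
details as approximate): shacq/exacq count shared/exclusive acquisitions of
each lock, and blk counts how many times the process had to sleep waiting
for it.

    /* approximate shape of the LWLOCK_STATS reporting in 9.2's lwlock.c */
    static int *sh_acquire_counts;  /* shared acquisitions, indexed by lock id */
    static int *ex_acquire_counts;  /* exclusive acquisitions */
    static int *block_counts;       /* times the backend slept waiting */

    /* dumped via on_shmem_exit() when each backend exits */
    for (i = 0; i < numLocks; i++)
    {
        if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i])
            fprintf(stderr, "PID %d lwlock %d: shacq %u exacq %u blk %u\n",
                    MyProcPid, i, sh_acquire_counts[i],
                    ex_acquire_counts[i], block_counts[i]);
    }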

Cheers,

Jeff


On Thu, 24 May 2012, Jeff Janes wrote:
> Add
> #define LWLOCK_STATS
> near the top of:
> src/backend/storage/lmgr/lwlock.c
>
> and recompile and run a reduced-size workload.  When the processes
> exits, they will dump a lot of data about LWLock usage to the logfile.
> Generally the LWLock with the most blocks on it will be the main
> culprit.

Here is the output from a multi-threaded run (8 threads, 22 seconds) 
sorted by blk. Not sure whether that's of much use or not:

PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
PID 7114 lwlock 0: shacq 0 exacq 14864 blk 219
PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
PID 7111 lwlock 0: shacq 0 exacq 13638 blk 205
PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
PID 7114 lwlock 7: shacq 0 exacq 547 blk 193
PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
PID 7110 lwlock 0: shacq 0 exacq 16862 blk 191
PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
PID 7113 lwlock 7: shacq 0 exacq 548 blk 178
PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
PID 7115 lwlock 7: shacq 0 exacq 549 blk 177
PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
PID 7112 lwlock 0: shacq 0 exacq 19538 blk 172
PID 7111 lwlock 7: shacq 0 exacq 549 blk 172
PID 7115 lwlock 17: shacq 0 exacq 9927 blk 167
PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
PID 7112 lwlock 7: shacq 0 exacq 548 blk 166
PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
PID 7110 lwlock 7: shacq 0 exacq 548 blk 162
PID 7109 lwlock 7: shacq 0 exacq 548 blk 161
PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
PID 7112 lwlock 45: shacq 130843 exacq 1201 blk 154
PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
PID 7111 lwlock 38: shacq 333306 exacq 882 blk 152
PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
PID 7108 lwlock 17: shacq 0 exacq 23165 blk 147
PID 7108 lwlock 7: shacq 0 exacq 549 blk 145
PID 7113 lwlock 0: shacq 0 exacq 13394 blk 142
PID 7112 lwlock 46: shacq 132972 exacq 1196 blk 142
PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
PID 7114 lwlock 45: shacq 130818 exacq 903 blk 139
PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
PID 7115 lwlock 0: shacq 0 exacq 3161 blk 125
PID 7110 lwlock 40: shacq 129577 exacq 1021 blk 122
PID 7113 lwlock 45: shacq 130833 exacq 814 blk 120
PID 7111 lwlock 17: shacq 0 exacq 4607 blk 119
PID 7109 lwlock 0: shacq 0 exacq 5711 blk 119
PID 7110 lwlock 45: shacq 130865 exacq 1001 blk 117
PID 7109 lwlock 38: shacq 333340 exacq 322 blk 115
PID 7114 lwlock 46: shacq 132997 exacq 860 blk 112
PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
PID 7112 lwlock 40: shacq 129528 exacq 1234 blk 111
PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
PID 7112 lwlock 41: shacq 133600 exacq 1136 blk 109
PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108
PID 7108 lwlock 0: shacq 0 exacq 6504 blk 108
PID 7115 lwlock 38: shacq 333512 exacq 581 blk 106
PID 7110 lwlock 36: shacq 204546 exacq 1014 blk 106
PID 7113 lwlock 37: shacq 127040 exacq 790 blk 105
PID 7111 lwlock 45: shacq 130893 exacq 863 blk 103
PID 7114 lwlock 35: shacq 135342 exacq 879 blk 102
PID 7113 lwlock 40: shacq 129560 exacq 853 blk 101
PID 7110 lwlock 41: shacq 133697 exacq 1063 blk 101
PID 7112 lwlock 35: shacq 135257 exacq 1229 blk 98
PID 7110 lwlock 46: shacq 132990 exacq 1059 blk 98
PID 7112 lwlock 37: shacq 127029 exacq 1229 blk 97
PID 7114 lwlock 40: shacq 129597 exacq 913 blk 96
PID 7110 lwlock 17: shacq 0 exacq 2655 blk 96
PID 7109 lwlock 46: shacq 133014 exacq 326 blk 96
PID 7111 lwlock 40: shacq 129720 exacq 890 blk 95
PID 7111 lwlock 46: shacq 133057 exacq 876 blk 94
PID 7114 lwlock 33: shacq 127651 exacq 921 blk 92
PID 7112 lwlock 39: shacq 127288 exacq 1176 blk 91
PID 7111 lwlock 39: shacq 127438 exacq 891 blk 91
PID 7114 lwlock 37: shacq 127038 exacq 894 blk 89
PID 7113 lwlock 33: shacq 127602 exacq 803 blk 89
PID 7110 lwlock 37: shacq 127038 exacq 1085 blk 89
PID 7110 lwlock 35: shacq 135333 exacq 1016 blk 88
PID 7113 lwlock 46: shacq 132954 exacq 833 blk 87
PID 7111 lwlock 41: shacq 133778 exacq 821 blk 87
PID 7115 lwlock 36: shacq 204845 exacq 535 blk 86
PID 7112 lwlock 33: shacq 127551 exacq 1149 blk 86
PID 7109 lwlock 17: shacq 0 exacq 3463 blk 86
PID 7110 lwlock 910: shacq 91 exacq 91 blk 84
PID 7110 lwlock 39: shacq 127396 exacq 1068 blk 84
PID 7109 lwlock 36: shacq 204581 exacq 363 blk 83
PID 7113 lwlock 41: shacq 133648 exacq 765 blk 82
PID 7114 lwlock 41: shacq 133757 exacq 897 blk 79
PID 7114 lwlock 39: shacq 127429 exacq 923 blk 79
    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Thu, May 24, 2012 at 6:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Thu, 24 May 2012, Jeff Janes wrote:
>>
>> Add
>> #define LWLOCK_STATS
>> near the top of:
>> src/backend/storage/lmgr/lwlock.c
>>
>> and recompile and run a reduced-size workload.  When the processes
>> exits, they will dump a lot of data about LWLock usage to the logfile.
>> Generally the LWLock with the most blocks on it will be the main
>> culprit.
>
>
> Here is the output from a multi-threaded run (8thtreads, 22 seconds) sorted
> by blk. Not sure whether that's of much use or not:
>
> PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
> PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
> PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
> PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
> PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
> PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
> PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
> PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
> PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
> PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
> PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
> PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
> PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
> PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
> PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
> PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
> PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
> PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
> PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
> PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
> PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
> PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
> PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
> PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
> PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
> PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
> PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
> PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
> PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
> PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
> PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
> PID 7114 lwlock 0: shacq 0 exacq 14864 blk 219
> PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
> PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
> PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
> PID 7111 lwlock 0: shacq 0 exacq 13638 blk 205
> PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
> PID 7114 lwlock 7: shacq 0 exacq 547 blk 193
> PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
> PID 7110 lwlock 0: shacq 0 exacq 16862 blk 191
> PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
> PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
> PID 7113 lwlock 7: shacq 0 exacq 548 blk 178
> PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
> PID 7115 lwlock 7: shacq 0 exacq 549 blk 177
> PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
> PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
> PID 7112 lwlock 0: shacq 0 exacq 19538 blk 172
> PID 7111 lwlock 7: shacq 0 exacq 549 blk 172
> PID 7115 lwlock 17: shacq 0 exacq 9927 blk 167
> PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
> PID 7112 lwlock 7: shacq 0 exacq 548 blk 166
> PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
> PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
> PID 7110 lwlock 7: shacq 0 exacq 548 blk 162
> PID 7109 lwlock 7: shacq 0 exacq 548 blk 161
> PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
> PID 7112 lwlock 45: shacq 130843 exacq 1201 blk 154
> PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
> PID 7111 lwlock 38: shacq 333306 exacq 882 blk 152
> PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
> PID 7108 lwlock 17: shacq 0 exacq 23165 blk 147
> PID 7108 lwlock 7: shacq 0 exacq 549 blk 145
> PID 7113 lwlock 0: shacq 0 exacq 13394 blk 142
> PID 7112 lwlock 46: shacq 132972 exacq 1196 blk 142
> PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
> PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
> PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
> PID 7114 lwlock 45: shacq 130818 exacq 903 blk 139
> PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
> PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
> PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
> PID 7115 lwlock 0: shacq 0 exacq 3161 blk 125
> PID 7110 lwlock 40: shacq 129577 exacq 1021 blk 122
> PID 7113 lwlock 45: shacq 130833 exacq 814 blk 120
> PID 7111 lwlock 17: shacq 0 exacq 4607 blk 119
> PID 7109 lwlock 0: shacq 0 exacq 5711 blk 119
> PID 7110 lwlock 45: shacq 130865 exacq 1001 blk 117
> PID 7109 lwlock 38: shacq 333340 exacq 322 blk 115
> PID 7114 lwlock 46: shacq 132997 exacq 860 blk 112
> PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
> PID 7112 lwlock 40: shacq 129528 exacq 1234 blk 111
> PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
> PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
> PID 7112 lwlock 41: shacq 133600 exacq 1136 blk 109
> PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108
> PID 7108 lwlock 0: shacq 0 exacq 6504 blk 108
> PID 7115 lwlock 38: shacq 333512 exacq 581 blk 106
> PID 7110 lwlock 36: shacq 204546 exacq 1014 blk 106
> PID 7113 lwlock 37: shacq 127040 exacq 790 blk 105
> PID 7111 lwlock 45: shacq 130893 exacq 863 blk 103
> PID 7114 lwlock 35: shacq 135342 exacq 879 blk 102
> PID 7113 lwlock 40: shacq 129560 exacq 853 blk 101
> PID 7110 lwlock 41: shacq 133697 exacq 1063 blk 101
> PID 7112 lwlock 35: shacq 135257 exacq 1229 blk 98
> PID 7110 lwlock 46: shacq 132990 exacq 1059 blk 98
> PID 7112 lwlock 37: shacq 127029 exacq 1229 blk 97
> PID 7114 lwlock 40: shacq 129597 exacq 913 blk 96
> PID 7110 lwlock 17: shacq 0 exacq 2655 blk 96
> PID 7109 lwlock 46: shacq 133014 exacq 326 blk 96
> PID 7111 lwlock 40: shacq 129720 exacq 890 blk 95
> PID 7111 lwlock 46: shacq 133057 exacq 876 blk 94
> PID 7114 lwlock 33: shacq 127651 exacq 921 blk 92
> PID 7112 lwlock 39: shacq 127288 exacq 1176 blk 91
> PID 7111 lwlock 39: shacq 127438 exacq 891 blk 91
> PID 7114 lwlock 37: shacq 127038 exacq 894 blk 89
> PID 7113 lwlock 33: shacq 127602 exacq 803 blk 89
> PID 7110 lwlock 37: shacq 127038 exacq 1085 blk 89
> PID 7110 lwlock 35: shacq 135333 exacq 1016 blk 88
> PID 7113 lwlock 46: shacq 132954 exacq 833 blk 87
> PID 7111 lwlock 41: shacq 133778 exacq 821 blk 87
> PID 7115 lwlock 36: shacq 204845 exacq 535 blk 86
> PID 7112 lwlock 33: shacq 127551 exacq 1149 blk 86
> PID 7109 lwlock 17: shacq 0 exacq 3463 blk 86
> PID 7110 lwlock 910: shacq 91 exacq 91 blk 84
> PID 7110 lwlock 39: shacq 127396 exacq 1068 blk 84
> PID 7109 lwlock 36: shacq 204581 exacq 363 blk 83
> PID 7113 lwlock 41: shacq 133648 exacq 765 blk 82
> PID 7114 lwlock 41: shacq 133757 exacq 897 blk 79
> PID 7114 lwlock 39: shacq 127429 exacq 923 blk 79

These are all on the buffer partition locks.  That makes sense...I was
wrong earlier: this case was in fact 'create table as', not 'insert
select', which rules out both the freelist lock and the wal insert lock,
because create table as gets to use both a bulk insert strategy and
wal-avoiding logic (assuming the wal log level is not 'archive' or higher).
So, why aren't the lock partitions helping here?

merlin


On Fri, 25 May 2012, Merlin Moncure wrote:

> These are all on the buffer partition locks.  That makes sense...I was
> wrong earlier: this case was in fact 'create table as', not 'insert
> select' which rules out both the freelist lock and the wal insert lock
> because create table as gets to use both a bulk insert strategy and
> wal avoiding logic (assuming wal log level is not 'archive or higher'.
> So, why aren't the lock partitions helping here?

FYI the WAL level is default, and here are the relevant changed params 
from the .conf (autovacuum was disabled but the data was vacuumed 
manually).

max_connections = 100                   # (change requires restart)
shared_buffers = 10GB                   # min 128kB
work_mem = 500MB                                # min 64kB
maintenance_work_mem = 500MB            # min 1MB
synchronous_commit = off                # synchronization level;
checkpoint_segments = 20                # in logfile segments, min 1, 16MB each
enable_hashjoin = off
enable_mergejoin = off
effective_cache_size = 10GB
autovacuum = off                        # Enable autovacuum subprocess?  'on'


Regards,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Fri, May 25, 2012 at 8:06 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, May 24, 2012 at 6:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> On Thu, 24 May 2012, Jeff Janes wrote:
>>>
>>> Add
>>> #define LWLOCK_STATS
>>> near the top of:
>>> src/backend/storage/lmgr/lwlock.c
>>>
>>> and recompile and run a reduced-size workload.  When the processes
>>> exits, they will dump a lot of data about LWLock usage to the logfile.
>>> Generally the LWLock with the most blocks on it will be the main
>>> culprit.
>>
>>
>> Here is the output from a multi-threaded run (8thtreads, 22 seconds) sorted
>> by blk. Not sure whether that's of much use or not:
>>
>> PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
>> PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
>> PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
>> PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
>> PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
>> PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
>> PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
>> PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
>> PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
>> PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
>> PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
>> PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
>> PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
>> PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
>> PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
>> PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
>> PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
>> PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
>> PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
>> PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
>> PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
>> PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
>> PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
>> PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
>> PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
>> PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
>> PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
>> PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
>> PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
>> PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
>> PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
>> PID 7114 lwlock 0: shacq 0 exacq 14864 blk 219
>> PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
>> PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
>> PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
>> PID 7111 lwlock 0: shacq 0 exacq 13638 blk 205
>> PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
>> PID 7114 lwlock 7: shacq 0 exacq 547 blk 193
>> PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
>> PID 7110 lwlock 0: shacq 0 exacq 16862 blk 191
>> PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
>> PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
>> PID 7113 lwlock 7: shacq 0 exacq 548 blk 178
>> PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
>> PID 7115 lwlock 7: shacq 0 exacq 549 blk 177
>> PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
>> PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
>> PID 7112 lwlock 0: shacq 0 exacq 19538 blk 172
>> PID 7111 lwlock 7: shacq 0 exacq 549 blk 172
>> PID 7115 lwlock 17: shacq 0 exacq 9927 blk 167
>> PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
>> PID 7112 lwlock 7: shacq 0 exacq 548 blk 166
>> PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
>> PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
>> PID 7110 lwlock 7: shacq 0 exacq 548 blk 162
>> PID 7109 lwlock 7: shacq 0 exacq 548 blk 161
>> PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
>> PID 7112 lwlock 45: shacq 130843 exacq 1201 blk 154
>> PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
>> PID 7111 lwlock 38: shacq 333306 exacq 882 blk 152
>> PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
>> PID 7108 lwlock 17: shacq 0 exacq 23165 blk 147
>> PID 7108 lwlock 7: shacq 0 exacq 549 blk 145
>> PID 7113 lwlock 0: shacq 0 exacq 13394 blk 142
>> PID 7112 lwlock 46: shacq 132972 exacq 1196 blk 142
>> PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
>> PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
>> PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
>> PID 7114 lwlock 45: shacq 130818 exacq 903 blk 139
>> PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
>> PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
>> PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
>> PID 7115 lwlock 0: shacq 0 exacq 3161 blk 125
>> PID 7110 lwlock 40: shacq 129577 exacq 1021 blk 122
>> PID 7113 lwlock 45: shacq 130833 exacq 814 blk 120
>> PID 7111 lwlock 17: shacq 0 exacq 4607 blk 119
>> PID 7109 lwlock 0: shacq 0 exacq 5711 blk 119
>> PID 7110 lwlock 45: shacq 130865 exacq 1001 blk 117
>> PID 7109 lwlock 38: shacq 333340 exacq 322 blk 115
>> PID 7114 lwlock 46: shacq 132997 exacq 860 blk 112
>> PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
>> PID 7112 lwlock 40: shacq 129528 exacq 1234 blk 111
>> PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
>> PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
>> PID 7112 lwlock 41: shacq 133600 exacq 1136 blk 109
>> PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108
>> PID 7108 lwlock 0: shacq 0 exacq 6504 blk 108
>> PID 7115 lwlock 38: shacq 333512 exacq 581 blk 106
>> PID 7110 lwlock 36: shacq 204546 exacq 1014 blk 106
>> PID 7113 lwlock 37: shacq 127040 exacq 790 blk 105
>> PID 7111 lwlock 45: shacq 130893 exacq 863 blk 103
>> PID 7114 lwlock 35: shacq 135342 exacq 879 blk 102
>> PID 7113 lwlock 40: shacq 129560 exacq 853 blk 101
>> PID 7110 lwlock 41: shacq 133697 exacq 1063 blk 101
>> PID 7112 lwlock 35: shacq 135257 exacq 1229 blk 98
>> PID 7110 lwlock 46: shacq 132990 exacq 1059 blk 98
>> PID 7112 lwlock 37: shacq 127029 exacq 1229 blk 97
>> PID 7114 lwlock 40: shacq 129597 exacq 913 blk 96
>> PID 7110 lwlock 17: shacq 0 exacq 2655 blk 96
>> PID 7109 lwlock 46: shacq 133014 exacq 326 blk 96
>> PID 7111 lwlock 40: shacq 129720 exacq 890 blk 95
>> PID 7111 lwlock 46: shacq 133057 exacq 876 blk 94
>> PID 7114 lwlock 33: shacq 127651 exacq 921 blk 92
>> PID 7112 lwlock 39: shacq 127288 exacq 1176 blk 91
>> PID 7111 lwlock 39: shacq 127438 exacq 891 blk 91
>> PID 7114 lwlock 37: shacq 127038 exacq 894 blk 89
>> PID 7113 lwlock 33: shacq 127602 exacq 803 blk 89
>> PID 7110 lwlock 37: shacq 127038 exacq 1085 blk 89
>> PID 7110 lwlock 35: shacq 135333 exacq 1016 blk 88
>> PID 7113 lwlock 46: shacq 132954 exacq 833 blk 87
>> PID 7111 lwlock 41: shacq 133778 exacq 821 blk 87
>> PID 7115 lwlock 36: shacq 204845 exacq 535 blk 86
>> PID 7112 lwlock 33: shacq 127551 exacq 1149 blk 86
>> PID 7109 lwlock 17: shacq 0 exacq 3463 blk 86
>> PID 7110 lwlock 910: shacq 91 exacq 91 blk 84
>> PID 7110 lwlock 39: shacq 127396 exacq 1068 blk 84
>> PID 7109 lwlock 36: shacq 204581 exacq 363 blk 83
>> PID 7113 lwlock 41: shacq 133648 exacq 765 blk 82
>> PID 7114 lwlock 41: shacq 133757 exacq 897 blk 79
>> PID 7114 lwlock 39: shacq 127429 exacq 923 blk 79
>
> These are all on the buffer partition locks.  That makes sense...I was
> wrong earlier: this case was in fact 'create table as', not 'insert
> select' which rules out both the freelist lock and the wal insert lock
> because create table as gets to use both a bulk insert strategy and
> wal avoiding logic (assuming wal log level is not 'archive or higher'.
>  So, why aren't the lock partitions helping here?

I think what's happening here is that the buffer partitions don't help
(in fact, they hurt) in the presence of multiple concurrent scans that
are operating on approximately the same data.  Sooner or later the
scans line up on each other and start binding when reassigning lock
tags (which take out up to two ordered exclusive lwlocks). This is on
the read side, so the buffer access strategy is zero help (I confirmed
this off list).  I was also curious to see what happened when
increasing the number of buffer partitions; it was counterproductive,
as expected.
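
To spell out the "two ordered exclusive lwlocks" part, this is roughly what
BufferAlloc() in 9.2's bufmgr.c does when it renames a victim buffer from
its old page to the new one (paraphrased and trimmed, not verbatim):

    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    oldHash = BufTableHashCode(&oldTag);
    oldPartitionLock = BufMappingPartitionLock(oldHash);

    /* lock both partitions, lower-numbered one first, to avoid deadlocks */
    if (oldPartitionLock < newPartitionLock)
    {
        LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
        LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    }
    else if (oldPartitionLock > newPartitionLock)
    {
        LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
        LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    }
    else
        LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);  /* same partition */

When several backends are evicting and re-reading the same stretch of pages,
they keep colliding on the same pairs of partitions.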

If this is the case, I have no clue what to do about it.  If the OP
could organize the queries so that they hit different physical regions of
the table, that would give confirmation if performance went up.

Hm, what if BufTableHashPartition() was pseudo randomized so that
different backends would not get the same buffer partition for a
particular tag?

merlin


Merlin Moncure <mmoncure@gmail.com> writes:
> Hm, what if BufTableHashPartition() was pseudo randomized so that
> different backends would not get the same buffer partition for a
> particular tag?

Huh?  You have to make sure that different backends will find the same
buffer for the same page, so I don't see how that can possibly work.
        regards, tom lane


On Fri, May 25, 2012 at 10:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> Hm, what if BufTableHashPartition() was pseudo randomized so that
>> different backends would not get the same buffer partition for a
>> particular tag?
>
> Huh?  You have to make sure that different backends will find the same
> buffer for the same page, so I don't see how that can possibly work.

Right -- duh.  Well, hm.  Is this worth fixing?  ISTM there's a bit of
'optimizing for pgbench-itis' in the buffer partitions -- they seem
optimized to lever the mostly random access behavior of pgbench.  But
how likely is it to see multiple simultaneous scans in the real world?
Interleaving scans like that is not a very effective optimization --
if it was me, it'd be trying to organize something around a
partitioned tid scan for parallel sequential access.

merlin


* Merlin Moncure (mmoncure@gmail.com) wrote:
> Right -- duh.  Well, hm.  Is this worth fixing?  ISTM there's a bit of
> 'optimizing for pgbench-itis' in the buffer partitions -- they seem
> optimized to lever the mostly random access behavior of pgbench.  But
> how likely is it to see multiple simultaneous scans in the real world?
>  Interleaving scans like that is not a very effective optimization --
> if it was me, it'd be trying to organize something around a
> partitioned tid scan for parallel sequential access.

Didn't we implement a system whereby this is exactly what we intend to
happen on the read side- that is, everyone doing a SeqScan gangs up on
one ring buffer and follows it, which we felt was going to dramatically
improve performance in some cases?

Or is this completely different from that..?
Thanks,
    Stephen

On Thu, May 24, 2012 at 4:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Thu, 24 May 2012, Jeff Janes wrote:
>>
>> Add
>> #define LWLOCK_STATS
>> near the top of:
>> src/backend/storage/lmgr/lwlock.c
>>
>> and recompile and run a reduced-size workload.  When the processes
>> exits, they will dump a lot of data about LWLock usage to the logfile.
>> Generally the LWLock with the most blocks on it will be the main
>> culprit.
>
>
> Here is the output from a multi-threaded run (8thtreads, 22 seconds) sorted
> by blk. Not sure whether that's of much use or not:
>
> PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
> PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
> PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
> PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
> PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
> PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
> PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
> PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
> PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
> PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
> PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
> PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
> PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
> PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466

That is not as informative as I thought it would be (except to show
WAL was not the issue).

I'm guessing that 44 and 48 are the buffer mapping partitions which
cover the root block of some highly used index.
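
For reference, the page-to-partition mapping is (paraphrased from 9.2's
buf_internals.h; NUM_BUFFER_PARTITIONS is 16, and if I'm counting the
lwlock.h enum right the partitions occupy lwlock ids 33..48 in 9.2, which
matches the locks that dominate the dump):

    #define BufTableHashPartition(hashcode) \
        ((hashcode) % NUM_BUFFER_PARTITIONS)
    #define BufMappingPartitionLock(hashcode) \
        ((LWLockId) (FirstBufMappingLock + BufTableHashPartition(hashcode)))

    /* a plain lookup then does roughly: */
    hashcode = BufTableHashCode(&tag);          /* hash of rel/fork/block */
    partitionLock = BufMappingPartitionLock(hashcode);
    LWLockAcquire(partitionLock, LW_SHARED);
    buf_id = BufTableLookup(&tag, hashcode);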

But just because those things are at the top of the list doesn't mean
they are a problem.  Something has to be at the top, and they don't
dominate the total amount of blocking the way I would expect them to
if they were truly a substantial bottleneck.

Cheers,

Jeff


On Fri, May 25, 2012 at 11:17 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Merlin Moncure (mmoncure@gmail.com) wrote:
>> Right -- duh.  Well, hm.  Is this worth fixing?  ISTM there's a bit of
>> 'optimizing for pgbench-itis' in the buffer partitions -- they seem
>> optimized to lever the mostly random access behavior of pgbench.  But
>> how likely is it to see multiple simultaneous scans in the real world?
>>  Interleaving scans like that is not a very effective optimization --
>> if it was me, it'd be trying to organize something around a
>> partitioned tid scan for parallel sequential access.
>
> Didn't we implement a system whereby this is exactly what we intend to
> happen on the read side- that is, everyone doing a SeqScan gangs up on
> one ring buffer and follows it, which we felt was going to dramatically
> improve performance in some cases?

yeah:
/*
 * If the table is large relative to NBuffers, use a bulk-read access
 * strategy and enable synchronized scanning (see syncscan.c).  Although
 * the thresholds for these features could be different, we make them the
 * same so that there are only two behaviors to tune rather than four.
 * (However, some callers need to be able to disable one or both of these
 * behaviors, independently of the size of the table; also there is a GUC
 * variable that can disable synchronized scanning.)
 *
 * During a rescan, don't make a new strategy object if we don't have to.
 */
if (!RelationUsesLocalBuffers(scan->rs_rd) &&
    scan->rs_nblocks > NBuffers / 4)
{
    allow_strat = scan->rs_allow_strat;
    allow_sync = scan->rs_allow_sync;
}
else
    allow_strat = allow_sync = false;

if (allow_strat)
{
    if (scan->rs_strategy == NULL)
        scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
}


I wonder if the logic here is just being too strict...

merlin


Merlin Moncure <mmoncure@gmail.com> writes:
> On Fri, May 25, 2012 at 11:17 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> Didn't we implement a system whereby this is exactly what we intend to
>> happen on the read side- that is, everyone doing a SeqScan gangs up on
>> one ring buffer and follows it, which we felt was going to dramatically
>> improve performance in some cases?

> yeah:
> ...
> I wonder if the logic here is just being too strict...

I don't recall how much evidence there is behind the NBuffers/4 threshold.
Maybe that needs some tuning?
        regards, tom lane


On Fri, 25 May 2012, Merlin Moncure wrote:
> how likely is it to see multiple simultaneous scans in the real world?
> Interleaving scans like that is not a very effective optimization --
> if it was me, it'd be trying to organize something around a
> partitioned tid scan for parallel sequential access.

Regarding the "real world use".  I can say for myself is that the reason I'm
running the queries like the ones I have in my test is that I have a large
inflow of data every day, tens of gb, which has to be processed and since PG
doesn't have any parallelization of queries built in, and running the
processing in a single thread will take too long, I'm trying to work around
that by manually running multiple threads in PG and trying to split the work
among them. So not having scalability here is going to hurt us (but I 
admit that it's probably not very popular use-case for PG).

Regards,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Thu, May 24, 2012 at 7:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
> by blk. Not sure whether that's of much use or not:

What are the top dozen or so entries if you sort by shacq?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, 25 May 2012, Robert Haas wrote:

> On Thu, May 24, 2012 at 7:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
>> by blk. Not sure whether that's of much use or not:
>
> What are the top dozen or so entries if you sort by shacq?

Here it is:

PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
PID 7115 lwlock 1020: shacq 1000000 exacq 0 blk 0
PID 7115 lwlock 1018: shacq 1000000 exacq 0 blk 0
PID 7114 lwlock 1020: shacq 1000000 exacq 0 blk 0
PID 7114 lwlock 1018: shacq 1000000 exacq 0 blk 0
PID 7113 lwlock 1020: shacq 1000000 exacq 0 blk 0
PID 7113 lwlock 1018: shacq 1000000 exacq 0 blk 0
PID 7112 lwlock 1020: shacq 1000000 exacq 0 blk 0
PID 7112 lwlock 1018: shacq 1000000 exacq 0 blk 0
PID 7111 lwlock 1020: shacq 1000000 exacq 0 blk 0
PID 7111 lwlock 1018: shacq 1000000 exacq 0 blk 0
PID 7110 lwlock 1020: shacq 1000000 exacq 0 blk 0
PID 7110 lwlock 1018: shacq 1000000 exacq 0 blk 0
PID 7109 lwlock 1020: shacq 1000000 exacq 0 blk 0
PID 7109 lwlock 1018: shacq 1000000 exacq 0 blk 0
PID 7108 lwlock 1020: shacq 1000000 exacq 0 blk 0
PID 7108 lwlock 1018: shacq 1000000 exacq 0 blk 0
PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
PID 7115 lwlock 38: shacq 333512 exacq 581 blk 106
PID 7109 lwlock 38: shacq 333340 exacq 322 blk 115
PID 7111 lwlock 38: shacq 333306 exacq 882 blk 152
PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
PID 7108 lwlock 36: shacq 205287 exacq 1592 blk 78
PID 7115 lwlock 36: shacq 204845 exacq 535 blk 86
PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108
PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
PID 7109 lwlock 36: shacq 204581 exacq 363 blk 83
PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
PID 7110 lwlock 36: shacq 204546 exacq 1014 blk 106
PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
PID 7108 lwlock 35: shacq 136067 exacq 1537 blk 56
PID 7115 lwlock 35: shacq 135532 exacq 494 blk 47
PID 7111 lwlock 35: shacq 135404 exacq 820 blk 76
PID 7109 lwlock 35: shacq 135362 exacq 326 blk 56
PID 7114 lwlock 35: shacq 135342 exacq 879 blk 102
PID 7110 lwlock 35: shacq 135333 exacq 1016 blk 88
PID 7113 lwlock 35: shacq 135318 exacq 829 blk 75
PID 7112 lwlock 35: shacq 135257 exacq 1229 blk 98
PID 7108 lwlock 41: shacq 134338 exacq 1540 blk 55
PID 7115 lwlock 41: shacq 133929 exacq 542 blk 50
PID 7109 lwlock 41: shacq 133781 exacq 360 blk 49
PID 7111 lwlock 41: shacq 133778 exacq 821 blk 87
PID 7114 lwlock 41: shacq 133757 exacq 897 blk 79
...

I've also put all the lwlock output here (2Mb)
http://www.ast.cam.ac.uk/~koposov/files/pg/lwlocks.gz
    S


*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Fri, May 25, 2012 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Fri, May 25, 2012 at 11:17 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>> Didn't we implement a system whereby this is exactly what we intend to
>>> happen on the read side- that is, everyone doing a SeqScan gangs up on
>>> one ring buffer and follows it, which we felt was going to dramatically
>>> improve performance in some cases?
>
>> yeah:
>> ...
>> I wonder if the logic here is just being too strict...
>
> I don't recall how much evidence there is behind the NBuffers/4 threshold.
> Maybe that needs some tuning?


probably.  Sergey is going to get some numbers back from a hacked
version that forces the scan to use a strategy.

If that turns out to be the problem, I bet it'd make sense to hook the
decision against ring size which is currently internal to freelist.
In other words, since freelist is making the call to determine the
size of the ring, it should also probably be making the call in terms
of a particular input size being useful to apply a strategy to.

Maybe something like
IsAccessStrategyUseful(BufferAccessStrategyType btype, BlockNumber nblocks) ...

merlin


On Fri, 25 May 2012, Merlin Moncure wrote:
> can you hack this in heapam.c and see if it helps?
> line 131-ish:
>
>     if (!RelationUsesLocalBuffers(scan->rs_rd) &&
>         scan->rs_nblocks > NBuffers / 4)
> becomes
>     if (!RelationUsesLocalBuffers(scan->rs_rd))
>
> (also you can set the partition count back).

The first few runs of my test with the suggested change made the
multithreaded queries even slower, by a factor of two:
e.g. 44 seconds for multithreaded vs ~7 sec single threaded
(with the default 9.2beta I get ~14-22 sec for a multithreaded run).
But now, after another few reruns of my test, I see this variability
in times for multithreaded runs again. It went down to 18-22 seconds, then
again to 44 sec. So overall the change you suggested either doesn't improve
the situation or makes it worse.

Regards,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Fri, May 25, 2012 at 10:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I think what's happening here is that the buffer partitions don't help
> (in fact, they hurt) in the presence of multiple concurrent scans that
> are operating on approximately the same data.  Sooner or later the
> scans line up on each other and start binding when reassigning lock
> tags (which take out up to two ordered exclusive lwlocks). This is on
> the read side, so the buffer access strategy is zero help (I confirmed
> this off list).

This theory is seeming fairly plausible - how can we test it?

How about trying it with synchronize_seqscans = off?  If the
synchronized-seqscan logic is causing contention on the buf mapping
locks and individual buffer locks, that should fix it.
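A minimal sketch of that test -- synchronize_seqscans is an ordinary
user-settable GUC, so a per-session SET in each of the concurrent connections
is enough:

    SET synchronize_seqscans = off;
    SHOW synchronize_seqscans;    -- should report "off" for this session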

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Sat, 26 May 2012, Robert Haas wrote:
> This theory is seeming fairly plausible - how can we test it?
>
> How about trying it with synchronize_seqscans = off?  If the
> synchronized-seqscan logic is causing contention on the buf mapping
> locks and individual buffer locks, that should fix it.

Turning off synch seq scans doesn't help either. 18 sec multithreaded run 
vs 7 sec single threaded.
    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


* Sergey Koposov (koposov@ast.cam.ac.uk) wrote:
> Turning off synch seq scans doesn't help either. 18 sec
> multithreaded run vs 7 sec single threaded.

Alright, can you just time 'cat' when they're started a few seconds or
whatever apart from each other?  I can't imagine it being affected in
the same way as these, but seems like it wouldn't hurt to check.
Thanks,
    Stephen

On Sat, 26 May 2012, Stephen Frost wrote:

> Alright, can you just time 'cat' when they're started a few seconds or
> whatever apart from each other?  I can't imagine it being affected in
> the same way as these, but seems like it wouldn't hurt to check.

I've tried cat'ting an 8 GB file, created in advance on the same filesystem.
During the multi-threaded run it takes ~4 sec, roughly the same as
when run separately.


*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


Hi,

I did another test using the same data and the same code, which I've
provided before, and the performance of the single thread seems to be
degrading quadratically with the number of threads.

Here are the results:
Nthreads Time_to_execute_one_thread
1 8.1
2 7.8
3 8.1
4 9.0
5 10.2
6 11.4
7 13.3
8 16.1
9 19.0
10 21.4
11 23.8
12 27.3
13 30.2
14 32.0
15 34.1
16 37.5

Regards,    Sergey



On Sat, 26 May 2012, Stephen Frost wrote:

> * Sergey Koposov (koposov@ast.cam.ac.uk) wrote:
>> Turning off synch seq scans doesn't help either. 18 sec
>> multithreaded run vs 7 sec single threaded.
>
> Alright, can you just time 'cat' when they're started a few seconds or
> whatever apart from each other?  I can't imagine it being affected in
> the same way as these, but seems like it wouldn't hurt to check.
>
>     Thanks,
>
>         Stephen
>

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Sun, May 27, 2012 at 1:45 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Hi,
>
> I did another test using the same data and the same code, which I've
> provided before and the performance of the single thread seems to be
> degrading quadratically with the number of threads.
>
> Here are the results:
> Nthreads Time_to_execute_one_thread
> 1 8.1
> 2 7.8
> 3 8.1
> 4 9.0
> 5 10.2
> 6 11.4
> 7 13.3
> 8 16.1
> 9 19.0
> 10 21.4
> 11 23.8
> 12 27.3
> 13 30.2
> 14 32.0
> 15 34.1
> 16 37.5

Ok, I double checked offline with Sergey to make sure the strategy
wasn't helping...it isn't, and we confirmed it was being forced on
after playing with the tunables a bit.  So what is happening here?  I
have a hunch that this particular query is defeating the ring buffer
strategy code.  Here's the query:

create table _tmp0 as select * from
   ( select *,
            (select healpixid from idt_match as m where m.transitid=o.transitid) as x
     from idt_photoobservation as o offset 0
   ) as y where x%16=ZZZ order by x;
 

(where ZZZ is some integer number 0<=ZZZ<16)

With the following plan:

----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3228814504.96..3228815137.21 rows=252902 width=498)
   Sort Key: y.x
   ->  Subquery Scan on y  (cost=0.00..3228791809.25 rows=252902 width=498)
         Filter: ((y.x % 16::bigint) = 0)
         ->  Limit  (cost=0.00..3228033102.41 rows=50580456 width=490)
               ->  Seq Scan on idt_photoobservation o  (cost=0.00..3228033102.41 rows=50580456 width=490)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8)
                             Index Cond: (transitid = o.transitid)

If I understand this properly, Sergey is scanning a large table and
looking up an integer value from a smaller table row by row and
explicitly forcing it as such (via 'offset 0'). That integer value is
then used to filter the resultant table down to size 1/N so that
multiple workers can process the data which is then sorted.  This
isn't a particularly great way to attack the problem, but it should
scale better than it does.  I think the problem is coming not from the
main seq scan, but from the dependent index lookup on idt_match.  Even
though the scan can be made to exhibit lock free behavior, the index
lookups will eventually start lining up and hit the same sequences of
pages in the same order -- bam.

merlin


Here is the actual explain analyze of the query on the smaller dataset
which I have been using for the recent testing.

test=# explain analyze create table _tmp0 as select * from
  ( select *,
        (select healpixid from idt_match as m where m.transitid=o.transitid)
                as x from idt_photoobservation_small as o offset 0
                  ) as y where x%16=0 order by x;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=8203.041..8252.216 rows=173696 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 182555kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=0.102..7602.947 rows=173696 loops=1)
         Filter: ((y.x % 16::bigint) = 0)
         Rows Removed by Filter: 826304
         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.041..7296.401 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.038..7094.555 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 8908.631 ms
 

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Wed, May 30, 2012 at 10:42 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Here is the actual explain analyze of the query on the smaller dataset
> which I have been using for the recent testing.
>
> test=# explain analyze create table _tmp0 as select * from
>
>  ( select *,
>        (select healpixid from idt_match as m where m.transitid=o.transitid)
>                as x from idt_photoobservation_small as o offset 0
>                  ) as y where x%16=0 order by x;
>
>  QUERY PLAN
>
>
p----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual
> time=8203.041..8252.216 rows=173696 loops=1)
>   Sort Key: y.x
>   Sort Method: quicksort  Memory: 182555kB
>   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498)
> (actual time=0.102..7602.947 rows=173696 loops=1)
>
>         Filter: ((y.x % 16::bigint) = 0)
>         Rows Removed by Filter: 826304
>         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
> time=0.041..7296.401 rows=1000000 loops=1)
>               ->  Seq Scan on idt_photoobservation_small o
>  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
> time=0.038..7094.555 rows=1000000 loops=1)
>                     SubPlan 1
>                       ->  Index Scan using idt_match_transitid_idx on
> idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004
> rows=1 loops=1000000)
>
>                             Index Cond: (transitid = o.transitid)
>  Total runtime: 8908.631 ms

Two things:

1. Can we see an explain analyze during a 'bogged' case?

2. Can we try to get 'index only scan' working over idt_match?  That's
should be a matter of adjusting the index so that it's:
create index on idt_match (transitid, healpixid);
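A sketch of what that would look like (the index name just follows the usual
naming convention and is otherwise arbitrary):

    CREATE INDEX idt_match_transitid_healpixid_idx
        ON idt_match (transitid, healpixid);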

merlin






On Wed, 30 May 2012, Merlin Moncure wrote:

>
> 1. Can we see an explain analyze during a 'bogged' case?

Here is the one to one comparison of the 'bogged' **********
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=18007.500..18007.500 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=18007.454..18007.454 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.047..17734.570 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.045..17543.902 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 18056.866 ms
(12 rows)

Time: 18067.929 ms
*************************

vs  non-bogged:

**************************
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=6635.133..6635.133 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=6635.091..6635.091 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.059..6344.683 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.056..6149.429 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 6669.215 ms
(12 rows)

Time: 6673.991 ms
**************************

>
> 2. Can we try to get 'index only scan' working over idt_match?  That's
> should be a matter of adjusting the index so that it's:
> create index on idt_match (transitid, healpixid);

I don't think I could do that. I created the index you asked for (although
it is unclear why it would help...), tried explain, and it still used a
simple index scan. After disabling the index scan it decided to use a bitmap scan
(which isn't actually faster in the multithreaded setup), and after disabling the
bitmap scan it switched to a seq scan.
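For reference, the sequence described above would look roughly like this in a
session; the transitid value is only a placeholder for some valid id:

    SET enable_indexscan = off;     -- as reported above, the planner then picked a bitmap scan
    EXPLAIN SELECT healpixid FROM idt_match WHERE transitid = 12345;
    SET enable_bitmapscan = off;    -- and with both off, a seq scan
    EXPLAIN SELECT healpixid FROM idt_match WHERE transitid = 12345;
    RESET enable_indexscan;
    RESET enable_bitmapscan;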

Cheers,    Sergey

PS Just for the reference the current indices on idt_match are
    "idt_match_idx" btree (healpixid)
    "idt_match_transitid_healpixid_idx" btree (transitid, healpixid)
    "idt_match_transitid_idx" btree (transitid)
 



*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Wed, May 30, 2012 at 12:58 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Here is the one to one comparison of the 'bogged' **********
>
>  QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual
> time=18007.500..18007.500 rows=0 loops=1)
>   Sort Key: y.x
>   Sort Method: quicksort  Memory: 25kB
>   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498)
> (actual time=18007.454..18007.454 rows=0 loops=1)
>         Filter: ((y.x % 16::bigint) = 7)
>         Rows Removed by Filter: 1000000
>         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
> time=0.047..17734.570 rows=1000000 loops=1)
>               ->  Seq Scan on idt_photoobservation_small o
>  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
> time=0.045..17543.902 rows=1000000 loops=1)
>                     SubPlan 1
>                       ->  Index Scan using idt_match_transitid_idx on
> idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.015..0.015
> rows=1 loops=1000000)
>
>                             Index Cond: (transitid = o.transitid)
>  Total runtime: 18056.866 ms
> (12 rows)
>
> Time: 18067.929 ms
> *************************
>
> vs  non-bogged:
>
> **************************
>
>  QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual
> time=6635.133..6635.133 rows=0 loops=1)
>   Sort Key: y.x
>   Sort Method: quicksort  Memory: 25kB
>   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498)
> (actual time=6635.091..6635.091 rows=0 loops=1)
>         Filter: ((y.x % 16::bigint) = 7)
>         Rows Removed by Filter: 1000000
>         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
> time=0.059..6344.683 rows=1000000 loops=1)
>               ->  Seq Scan on idt_photoobservation_small o
>  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
> time=0.056..6149.429 rows=1000000 loops=1)
>
>                     SubPlan 1
>                       ->  Index Scan using idt_match_transitid_idx on
> idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004
> rows=1 loops=1000000)
>                             Index Cond: (transitid = o.transitid)
>  Total runtime: 6669.215 ms
> (12 rows)
>
> Time: 6673.991 ms

Hmm.  So the total time across all loops is 3000-4000 ms in the
unbogged case, and 15000 ms in the bogged case.  The time for the
seqscan parent is 6344.683 in the unbogged case, so ~3s once you
subtract out the index scans, and 17543.902 in the bogged case, so ~3s
once you subtract out the index scans.  So this seems to support
Merlin's theory that the index scan is what's getting slow under heavy
concurrency.  Apparently, the index scan slows down 4-5x but the
seqscan is just as fast as ever.  Very interesting...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wed, May 30, 2012 at 12:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 30, 2012 at 12:58 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> Here is the one to one comparison of the 'bogged' **********
>>
>>  QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>  Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual
>> time=18007.500..18007.500 rows=0 loops=1)
>>   Sort Key: y.x
>>   Sort Method: quicksort  Memory: 25kB
>>   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498)
>> (actual time=18007.454..18007.454 rows=0 loops=1)
>>         Filter: ((y.x % 16::bigint) = 7)
>>         Rows Removed by Filter: 1000000
>>         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
>> time=0.047..17734.570 rows=1000000 loops=1)
>>               ->  Seq Scan on idt_photoobservation_small o
>>  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
>> time=0.045..17543.902 rows=1000000 loops=1)
>>                     SubPlan 1
>>                       ->  Index Scan using idt_match_transitid_idx on
>> idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.015..0.015
>> rows=1 loops=1000000)
>>
>>                             Index Cond: (transitid = o.transitid)
>>  Total runtime: 18056.866 ms
>> (12 rows)
>>
>> Time: 18067.929 ms
>> *************************
>>
>> vs  non-bogged:
>>
>> **************************
>>
>>  QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>  Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual
>> time=6635.133..6635.133 rows=0 loops=1)
>>   Sort Key: y.x
>>   Sort Method: quicksort  Memory: 25kB
>>   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498)
>> (actual time=6635.091..6635.091 rows=0 loops=1)
>>         Filter: ((y.x % 16::bigint) = 7)
>>         Rows Removed by Filter: 1000000
>>         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
>> time=0.059..6344.683 rows=1000000 loops=1)
>>               ->  Seq Scan on idt_photoobservation_small o
>>  (cost=0.00..63819894.51 rows=1000002 width=490) (actual
>> time=0.056..6149.429 rows=1000000 loops=1)
>>
>>                     SubPlan 1
>>                       ->  Index Scan using idt_match_transitid_idx on
>> idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004
>> rows=1 loops=1000000)
>>                             Index Cond: (transitid = o.transitid)
>>  Total runtime: 6669.215 ms
>> (12 rows)
>>
>> Time: 6673.991 ms
>
> Hmm.  So the total time across all loops is 3000-4000 ms in the
> unbogged case, and 15000 ms in the bogged case.  The time for the
> seqscan parent is 6344.683 in the unbogged case, so ~3s once you
> subtract out the index scans, and 17543.902 in the bogged case, so ~3s
> once you subtract out the index scans.  So this seems to support
> Merlin's theory that the index scan is what's getting slow under heavy
> concurrency.  Apparently, the index scan slows down 4-5x but the
> seqscan is just as fast as ever.  Very interesting...

Yeah, an 'index only scan' *would* fix the performance problem if my
theory is correct.  By skipping the heap lookup it would also skip the
buffer retagging which is binding up the multiple concurrent scans.
Sergey, just in case you don't know, an IOS works like this.  If all
the fields that are fetched from a table are present in the index and
the index points towards a tuple on a page that contains only visible
tuples according to the visibility map, you get to skip jumping back
into the heap for visibility checks.  In your case, I think that would
result in almost linear scaling.

Hm, why aren't we getting a IOS?  Just for kicks (assuming this is
test data), can we drop the index on just transitid, leaving the index
on transitid, healpixid?    Is enable_indexonlyscan on?  Has idt_match
been vacuumed?  What kind of plan do you get when do:

select healpixid from idt_match where transitid = <some valid transitid>;
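A sketch of those checks, leaving <some valid transitid> as the placeholder it is:

    SHOW enable_indexonlyscan;    -- on by default in 9.2
    VACUUM idt_match;             -- sets the visibility-map bits an index-only scan relies on
    EXPLAIN
    SELECT healpixid FROM idt_match WHERE transitid = <some valid transitid>;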

merlin


On Wed, 30 May 2012, Merlin Moncure wrote:
>
> Hm, why aren't we getting a IOS?  Just for kicks (assuming this is
> test data), can we drop the index on just transitid, leaving the index
> on transitid, healpixid?    Is enable_indexonlyscan on?  Has idt_match
> been vacuumed?  What kind of plan do you get when do:

Okay dropping the index on transitid solved the issue with indexonlyscan 
but didn't solve the original problem. Actually the indexonlyscan made the 
sequential queries faster but not the parallel ones.

Here is a side-by-side explain analyze again, for index-only scans.

Bogged:
************
                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=82262274.19..82262286.69 rows=5000 width=498) (actual time=19635.318..19635.318 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..82261967.00 rows=5000 width=498) (actual time=19635.282..19635.282 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..82246967.00 rows=1000000 width=490) (actual time=0.076..19373.922 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..82246967.00 rows=1000000 width=490) (actual time=0.074..19179.336 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Only Scan using idt_match_transitid_healpixid_idx on idt_match m  (cost=0.00..82.17 rows=1 width=8) (actual time=0.016..0.017 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
                             Heap Fetches: 0
 Total runtime: 19702.727 ms
(13 rows)
***********
vs non-bogged
***********
                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=82262274.19..82262286.69 rows=5000 width=498) (actual time=6198.334..6198.334 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..82261967.00 rows=5000 width=498) (actual time=6198.285..6198.285 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..82246967.00 rows=1000000 width=490) (actual time=0.116..5932.766 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..82246967.00 rows=1000000 width=490) (actual time=0.113..5730.012 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Only Scan using idt_match_transitid_healpixid_idx on idt_match m  (cost=0.00..82.17 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
                             Heap Fetches: 0
 Total runtime: 6239.512 ms
(13 rows)
******

Cheers,    S

PS I didn't know that an index on two columns, together with the index-only scan, can
help queries based on just one of those columns. Very helpful!

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Wed, May 30, 2012 at 1:45 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Wed, 30 May 2012, Merlin Moncure wrote:
>>
>>
>> Hm, why aren't we getting a IOS?  Just for kicks (assuming this is
>> test data), can we drop the index on just transitid, leaving the index
>> on transitid, healpixid?    Is enable_indexonlyscan on?  Has idt_match
>> been vacuumed?  What kind of plan do you get when do:
>
>
> Okay dropping the index on transitid solved the issue with indexonlyscan but
> didn't solve the original problem. Actually the indexonlyscan made the
> sequential queries faster but not the parallel ones.

hurk --  ISTM that since IOS is masking the heap lookups, there must
be contention on the index itself?  Does this working set fit in
shared memory?  If so, what happens when you do a database restart and
repeat the IOS test?

merlin


On Wed, 30 May 2012, Merlin Moncure wrote:

> hurk --  ISTM that since IOS is masking the heap lookups, there must
> be contention on the index itself?  Does this working set fit in
> shared memory?  If so, what happens when you do a database restart and
> repeat the IOS test?

The dataset fits well in shared buffers

Here are the data sizes:
 public.idt_photoobservation_small        | 521 MB  |  528756
 public.idt_match_idx                     | 1084 MB |  528762
 public.idt_match_transitid_healpixid_idx | 1522 MB | 1955140
 public.idt_match                         | 2906 MB |  528753
 

And shared buffers are 10G
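For reference, a listing like the one above can be produced with something along
these lines (a sketch, not necessarily the query that was actually used):

    SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
      FROM pg_class
     WHERE relname IN ('idt_photoobservation_small', 'idt_match',
                       'idt_match_idx', 'idt_match_transitid_healpixid_idx')
     ORDER BY pg_relation_size(oid) DESC;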

If I restart the db the timings do not change significantly. There is always some 
variation which I don't really understand, e.g. the parallel runs sometimes
take 18s, or 25 seconds, or 30 seconds per thread. So there is something else affecting
the runs -- I don't know, maybe that's related to which thread starts first,
or what is the starting point of the seq scan... (there is no other 
activity on the machine btw).

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Wed, May 30, 2012 at 11:45 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Wed, 30 May 2012, Merlin Moncure wrote:
>>
>>
>> Hm, why aren't we getting a IOS?  Just for kicks (assuming this is
>> test data), can we drop the index on just transitid, leaving the index
>> on transitid, healpixid?    Is enable_indexonlyscan on?  Has idt_match
>> been vacuumed?  What kind of plan do you get when do:
>
>
> Okay dropping the index on transitid solved the issue with indexonlyscan but
> didn't solve the original problem. Actually the indexonlyscan made the
> sequential queries faster but not the parallel ones.

How big is idt_match?  What if you drop all indexes on idt_match,
encouraging all the backends to do hash joins against it, which occur
in local memory and so don't have contention?
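A hedged sketch of what that could look like for one worker, rewriting the
correlated subselect as an explicit join so the planner is free to hash
idt_match (the 'offset 0' scalar-subselect form pins it to a per-row lookup
instead). It assumes transitid is unique in idt_match, which the original
scalar subselect effectively requires anyway:

    CREATE TABLE _tmp0 AS
    SELECT o.*, m.healpixid AS x
      FROM idt_photoobservation_small AS o
      LEFT JOIN idt_match AS m ON m.transitid = o.transitid
     WHERE m.healpixid % 16 = 0
     ORDER BY m.healpixid;

The filter on m.healpixid discards the NULL-x rows just as the original x%16
test did, and with no index to use each backend builds its hash in private
memory (batching to disk if it exceeds work_mem) rather than contending on
shared index pages.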

Cheers,

Jeff


On Wed, May 30, 2012 at 3:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, May 30, 2012 at 11:45 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> On Wed, 30 May 2012, Merlin Moncure wrote:
>>>
>>>
>>> Hm, why aren't we getting a IOS?  Just for kicks (assuming this is
>>> test data), can we drop the index on just transitid, leaving the index
>>> on transitid, healpixid?    Is enable_indexonlyscan on?  Has idt_match
>>> been vacuumed?  What kind of plan do you get when do:
>>
>>
>> Okay dropping the index on transitid solved the issue with indexonlyscan but
>> didn't solve the original problem. Actually the indexonlyscan made the
>> sequential queries faster but not the parallel ones.
>
> How big is idt_match?  What if you drop all indexes on idt_match,
> encouraging all the backends to do hash joins against it, which occur
> in local memory and so don't have contention?

You just missed his post -- it's only 3 GB.  Can you run your 'small'
working set against 48 GB shared buffers?

merlin


On Wed, 30 May 2012, Merlin Moncure wrote:

>> How big is idt_match?  What if you drop all indexes on idt_match,
>> encouraging all the backends to do hash joins against it, which occur
>> in local memory and so don't have contention?
>
> You just missed his post -- it's only 3G.   can you run your 'small'
> working set against 48gb shared buffers?

Just tried 3 times and it actually got much worse, ~70-80 seconds per
query in the parallel setup (i.e. >10 times slower than the single run).

The oprofile then looks like this:

CPU: Intel Architectural Perfmon, speed 1862 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        linenr info                 symbol name
-------------------------------------------------------------------------------
883523   46.3676  (no location information)   s_lock
  883523   100.000  (no location information)   s_lock [self]
-------------------------------------------------------------------------------
303984   15.9532  (no location information)   PinBuffer
  303984   100.000  (no location information)   PinBuffer [self]

The problem is that since there is that variability in times, I don't
really 100% know whether this trend of slow-down with increasing
shared memory is genuine or not.

I've also tried shared_buffers=1G, just in case, and it is still very slow
(50 sec).

After that I changed the shared buffers back to 10G and the timings got
back to 25 sec.

Weird...

I still wonder whether there is a problem with the way the locking is done
(as referenced in the recent "droping tables slowiness" thread).

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

On May 30, 2012, at 22:07, Sergey Koposov wrote:
> If I restart the db the timings do not change significantly. There is always some variation which I don't really
> understand, e.g. the parallel runs sometimes
> take 18s, or 25 seconds, or 30 seconds per thread. So there is something else affecting
> the runs -- I don't know, maybe that's related to which thread starts first,
> or what is the starting point of the seq scan... (there is no other activity on the machine btw).

I wonder if the huge variance could be caused by non-uniform synchronization costs across different cores. That's not
all that unlikely, because at least some cache levels (L2 and/or L3, I think) are usually shared between all cores on a
single die. Thus, a cache line bouncing between cores on the same die might very well be faster than it bouncing between
cores on different dies.

On linux, you can use the taskset command to explicitly assign processes to cores. The easiest way to check if that
makes a difference is to assign one core for each connection to the postmaster before launching your test. Assuming that
cpu assignments are inherited by child processes, that should then spread your backends out over exactly the cores you
specify.

best regards,
Florian Pflug



On Wed, 30 May 2012, Florian Pflug wrote:
>
> I wonder if the huge variance could be caused by non-uniform 
> synchronization costs across different cores. That's not all that 
> unlikely, because at least some cache levels (L2 and/or L3, I think) are 
> usually shared between all cores on a single die. Thus, a cache line 
> bouncing between cores on the same die might very well be faster than it 
> bouncing between cores on different dies.
>
> On linux, you can use the taskset command to explicitly assign processes 
> to cores. The easiest way to check if that makes a difference is to 
> assign one core for each connection to the postmaster before launching 
> your test. Assuming that cpu assignments are inherited by child 
> processes, that should then spread your backends out over exactly the 
> cores you specify.

Wow, thanks! This seems to be working to some extent. I've found that
distributing each thread x (0<x<7) to the cpu 1+3*x
(reminder: I have HT disabled and in total I have 4 cpus with 6
proper cores each) gives quite good results. And after a few runs, I seem
to be getting more or less stable results for the multiple threads,
with the performance of multithreaded runs going from 6 to 11 seconds for
various threads. (Another reminder is that 5-6 seconds is roughly the
timing of my queries running in a single thread.)

So to some extent one can say that the problem is partially solved (i.e.
it is probably understood).
But the question now is whether there is a *PG* problem here or
not, or is it Intel's or Linux's problem ?
Because still the slowdown was caused by locking. If there wouldn't be
locking there wouldn't be any problems (as demonstrated a while ago by
just cat'ting the files in multiple threads).

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On May 31, 2012, at 01:16, Sergey Koposov wrote:
> On Wed, 30 May 2012, Florian Pflug wrote:
>>
>> I wonder if the huge variance could be caused by non-uniform synchronization costs across different cores. That's
>> not all that unlikely, because at least some cache levels (L2 and/or L3, I think) are usually shared between all
>> cores on a single die. Thus, a cache line bouncing between cores on the same die might very well be faster than it
>> bouncing between cores on different dies.
>>
>> On linux, you can use the taskset command to explicitly assign processes to cores. The easiest way to check if that
>> makes a difference is to assign one core for each connection to the postmaster before launching your test. Assuming
>> that cpu assignments are inherited by child processes, that should then spread your backends out over exactly the
>> cores you specify.
>
> Wow, thanks! This seems to be working to some extent. I've found that distributing each thread x (0<x<7) to the cpu
> 1+3*x (reminder: I have HT disabled and in total I have 4 cpus with 6 proper cores each) gives quite good results. And
> after a few runs, I seem to be getting more or less stable results for the multiple threads, with the performance of
> multithreaded runs going from 6 to 11 seconds for various threads. (Another reminder is that 5-6 seconds is roughly
> the timing of my queries running in a single thread.)

Wait, so performance *increased* by spreading the backends out over as many dies as possible, not by using as few as
possible? That'd be exactly the opposite of what I'd have expected. (I'm assuming that cores on one die have ascending
ids on linux. If you could post the contents of /proc/cpuinfo, we could verify that)

> So to some extent one can say that the problem is partially solved (i.e. it is probably understood)

Not quite, I think. We still don't really know why there's that much spinlock contention AFAICS. But what we've learned
is that the actual spinning on a contested lock is only part of the problem. The cache-line bouncing caused by all those
lock acquisitions is the other part, and it's pretty expensive too - otherwise, moving the backends around wouldn't have
helped.

> But the question now is whether there is a *PG* problem here or not, or is it Intel's or Linux's problem ?

Neither Intel nor Linux can do much about this, I fear. Synchronization will always be expensive, and the more so the
larger the number of cores. Linux could maybe pick a better process-to-core assignment, but it probably won't be able to
pick the optimal one for every workload. So unfortunately, this is a postgres problem, I'd say.

> Because still the slowdown was caused by locking. If there wouldn't be locking there wouldn't be any problems (as
> demonstrated a while ago by just cat'ting the files in multiple threads).

Yup, we'll have to figure out a way to reduce the locking overhead. 9.2 already scales much better to a large number of
cores than previous versions did, but your test case shows that there's still room for improvement.

best regards,
Florian Pflug



On Thu, 31 May 2012, Florian Pflug wrote:

> Wait, so performance *increased* by spreading the backends out over as 
> many dies as possible, not by using as few as possible? That'd be 
> exactly the opposite of what I'd have expected. (I'm assuming that cores 
> on one die have ascending ids on linux. If you could post the contents 
> of /proc/cpuinfo, we could verify that)

Yes, you are correct. And I also can confirm that the cpus in the cpuinfo 
are ordered by "physical id" e.g. they go like
0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3

I did a specific test with just 6 threads (== number of cores per cpu)
and ran it on a single phys cpu, it took ~ 12 seconds for each thread,
and when I tried to spread it across 4 cpus it took 7-9 seconds per 
thread. But all these numbers are anyway significantly better than when I 
didn't use taskset. Which probably means without it the processes were 
jumping from core to core ? ...

>> Because still the slowdown was caused by locking. If there wouldn't be 
>> locking there wouldn't be any problems (as demonstrated a while ago by 
>> just cat'ting the files in multiple threads).
>
> Yup, we'll have to figure out a way to reduce the locking overhead. 9.2 
> already scales much better to a large number of cores than previous 
> versions did, but your test case shows that there's still room for 
> improvement.

Yes, and unfortunately these scaling problems show up in read-only CPU
bound queries, where naively I wouldn't expect any.

Thanks,
    Sergey

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Wed, May 30, 2012 at 4:16 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:

> But the question now is whether there is a *PG* problem here or not, or is
> it Intel's or Linux's problem ? Because still the slowdown was caused by
> locking. If there wouldn't be locking there wouldn't be any problems (as
> demonstrated a while ago by just cat'ting the files in multiple threads).

You cannot have a traditional RDBMS without locking.  From your
description of the problem, I probably wouldn't be using a traditional
database system at all for this, but rather flat files and Perl.  Or
at least, I would partition the data before loading it to the DB,
rather than trying to do it after.

But anyway, is idt_match a fairly static table?  If so, I'd partition
that into 16 tables, and then have each one of your tasks join against
a different one of those tables.  That should relieve the contention
on the index root block, and might have some other benefits as well.
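A sketch of that layout, assuming the split key is the same healpixid % 16
already used to divide the work (table names invented for illustration):

    CREATE TABLE idt_match_p0 AS
      SELECT * FROM idt_match WHERE healpixid % 16 = 0;
    CREATE INDEX ON idt_match_p0 (transitid, healpixid);
    -- ... and likewise idt_match_p1 through idt_match_p15 ...

Worker N would then point its subselect at idt_match_pN instead of idt_match,
so each backend hits the root page of its own, smaller index; rows whose
healpixid lands in another partition would have been thrown away by the
x % 16 filter anyway.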

Cheers,

Jeff


Sergey, all,

* Sergey Koposov (koposov@ast.cam.ac.uk) wrote:
> I did a specific test with just 6 threads (== number of cores per cpu)
> and ran it on a single phys cpu, it took ~ 12 seconds for each thread,
> and when I tried to spread it across 4 cpus it took 7-9 seconds per
> thread. But all these numbers are anyway significantly better than
> when I didn't use taskset. Which probably means without it the
> processes were jumping from core to core ? ...

It sounds like the issue here really is the PG processes bouncing
between the cores, which I do think is a Linux issue, though perhaps PG
could somehow "encourage" Linux to be better about this...

This isn't specific to Linux, however; under Windows there's something
similar, and they encourage the same kind of manual pinning of things
like IIS processes to CPUs...
Thanks,
    Stephen

* Sergey Koposov (koposov@ast.cam.ac.uk) wrote:
> I did a specific test with just 6 threads (== number of cores per cpu)
> and ran it on a single phys cpu, it took ~ 12 seconds for each thread,
> and when I tried to spread it across 4 cpus it took 7-9 seconds per
> thread. But all these numbers are anyway significantly better than
> when I didn't use taskset. Which probably means without it the
> processes were jumping from core to core ? ...

Oh, and wrt why 'cat' isn't really affected by this issue- that's
because 'cat' isn't *doing* anything, CPU wise, really.  The PG backends
are actually doing real work and therefore they have both CPU state and
memory accesses which are impacted when the process is moved from one
core to another.  If this system has NUMA (where memory is associated
with a set of cores), then that can make it more painful when threads
are being moved between cores also, because suddenly the memory you were
accessing (which was 'fast') is now taking longer because you're having
to go through another CPU to get to it.
Thanks,
    Stephen

On Wed, 30 May 2012, Jeff Janes wrote:

>> But the question now is whether there is a *PG* problem here or not, or is
>> it Intel's or Linux's problem ? Because still the slowdown was caused by
>> locking. If there wouldn't be locking there wouldn't be any problems (as
>> demonstrated a while ago by just cat'ting the files in multiple threads).
>
> You cannot have a traditional RDBMS without locking.  From your

I understand the need for significant locking when there are concurrent writes,
but not when there are only reads. But I'm not an RDBMS expert, so maybe
that's a misunderstanding on my side.

> description of the problem,  I probably wouldn't be using a traditional
> database system at all for this, but rather flat files and Perl.

Flat files and Perl for 25-50 TB of data over a few years is a bit extreme
;)

> Or
> at least, I would partition the data before loading it to the DB,
> rather than trying to do it after.

I intentionally did otherwise, because I thought that PG would be
much smarter than me at juggling the data I'm ingesting (~tens of
GB each day), joining the appropriate bits of data and then splitting by
partitions. Unfortunately I see that there are some scalability
issues along the way, which I didn't expect. Those aren't fatal, but slightly
disappointing.

> But anyway, is idt_match a fairly static table?  If so, I'd partition
> that into 16 tables, and then have each one of your tasks join against
> a different one of those tables.  That should relieve the contention
> on the index root block, and might have some other benefits as well.

No, idt_match is getting filled by multi-threaded COPY and then joined
with 4 other big tables like idt_phot. The result is then split into
partitions. And I was trying different approaches to fully utilize the
CPUs and/or I/O and somehow parallelize the queries. That's the
reasoning for the somewhat contrived queries in my test.

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> I understand the need for significant locking when there are concurrent writes,
> but not when there are only reads. But I'm not an RDBMS expert, so maybe
> that's a misunderstanding on my side.

If we knew in advance that no writes would come along during the
execution of a particular test case, then we could skip a lot of
locking on the reads.  But we don't, so we have to be prepared for the
possibility of writes at any time, which means doing things taking
share-locks on data while it's actively being read.

Also, even in an entirely read-only workload, buffers can get evicted
from the database buffer cache, so we have to lock to make sure a
buffer doesn't get evicted while we're reading it.

I'd really like to find out exactly where all those s_lock calls are
coming from.  Is there any way you can get oprofile to output a
partial stack backtrace?  If you have perf it's very easy, just 'perf
record -g -a <command to launch your test case>' and then 'perf report
-g'.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On May 31, 2012, at 02:26, Sergey Koposov wrote:
> On Thu, 31 May 2012, Florian Pflug wrote:
>> Wait, so performance *increased* by spreading the backends out over as many dies as possible, not by using as few as
>> possible? That'd be exactly the opposite of what I'd have expected. (I'm assuming that cores on one die have ascending
>> ids on linux. If you could post the contents of /proc/cpuinfo, we could verify that)
>
> Yes, you are correct. And I also can confirm that the cpus in the cpuinfo are ordered by "physical id" e.g. they go
> like
> 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
>
> I did a specific test with just 6 threads (== number of cores per cpu)
> and ran it on a single phys cpu, it took ~ 12 seconds for each thread,
> and when I tried to spread it across 4 cpus it took 7-9 seconds per thread. But all these numbers are anyway
> significantly better than when I didn't use taskset.

Hm. The only resource that is shared between different cores on a die is usually the last cache level (L2 or L3). So by
spreading backends out over more cores, you're increasing the total cache size. Maybe that could explain the behavior
you're seeing.

> Which probably means without it the processes were jumping from core to core ? …

Seems so. I completely fail to understand why they would, though. Since you've got far fewer runnable processes than
cores, why would the kernel ever see a reason to migrate a process from one core to another? I think that you can query
the core that a process is running on from one of the files in /proc/<pid>. You could try gathering samples every 50 ms
or so during the test run - maybe that could shed some additional light on this.

best regards,
Florian Pflug




Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> > I understand the need for significant locking when there are concurrent writes,
> > but not when there are only reads. But I'm not an RDBMS expert, so maybe
> > that's a misunderstanding on my side.
>
> If we knew in advance that no writes would come along during the
> execution of a particular test case, then we could skip a lot of
> locking on the reads.  But we don't, so we have to be prepared for the
> possibility of writes at any time, which means doing things taking
> share-locks on data while it's actively being read.

Uh, we have a read-only transaction mode, don't we?  Or does that not
help, because someone else, in another transaction, could take a
read-write lock?
Thanks,
    Stephen

On Wed, May 30, 2012 at 7:00 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Robert,
>
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> > I understand the need of significant locking when there concurrent writes,
>> > but not when there only reads. But I'm not a RDBMS expert, so that's maybe
>> > that's misunderstanding on my side.
>>
>> If we knew in advance that no writes would come along during the
>> execution of a particular test case, then we could skip a lot of
>> locking on the reads.  But we don't, so we have to be prepared for the
>> possibility of writes at any time, which means doing things taking
>> share-locks on data while it's actively being read.
>
> Uh, we have a read-only transaction mode, don't we?  Or does that not
> help, because someone else, in another transaction, could take a
> read-write lock?

Yes, exactly.  You would have to put the entire cluster into a read
only mode, and the entire code base almost from top to bottom would
have to be rewritten to detect and support such a mode.  Even then,
either you have a unified buffer cache (which needs locking even if
the queries are read only), or each process uses its own local memory.  In the latter case, how do you avoid
oversubscribing physical memory?  Also, there are certain house-keeping tasks that can take place
anytime, even in read-only transactions.  You would either need to
turn those off, or you would need to make a pass through the entire
system doing all the house-keeping up front during the transition from
read-write to read-only.

It would basically be like writing an entirely new database management
system, which just happens to share PG's on-disk format and PG's
flavor of SQL syntax and semantics.  And is read only.  (And since the
OP is trying to create tables based on his read-only query, such a
system wouldn't be very well suited.)

Cheers,

Jeff


Hi,

On Wed, 30 May 2012, Robert Haas wrote:

> I'd really like to find out exactly where all those s_lock calls are
> coming from.  Is there any way you can get oprofile to output a
> partial stack backtrace?  If you have perf it's very easy, just 'perf
> record -g -a <command to launch your test case>' and then 'perf report
> -g'.

I repeated my test with 8 threads (without tasksetting) and with
shared_buffers=48g (because that seemed to trigger particularly long run
times, ~80 seconds). And I attach the perf report.

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
On Thu, May 31, 2012 at 7:31 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Wed, 30 May 2012, Robert Haas wrote:
>
>> I'd really like to find out exactly where all those s_lock calls are
>> coming from.  Is there any way you can get oprofile to output a
>> partial stack backtrace?  If you have perf it's very easy, just 'perf
>> record -g -a <command to launch your test case>' and then 'perf report
>> -g'.
>
>
> I repeated my test with 8 threads (without tasksetting) and with
> sharedbuffers=48g (because that seemed to trigger in particular long times ~
> 80 seconds). And I attach the perf report.

Thanks.  How did you generate this perf report?  It's cool, because I
haven't figured out how to make perf generate a report that is easily
email-able, and it seems you have.

The only trouble is that there's no call stack information here for
s_lock or PinBuffer, which is what I really want.  It seems to have
spit out call stack information only for the kernel functions, and not
for user functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, 31 May 2012, Robert Haas wrote:
>
> Thanks.  How did you generate this perf report?  It's cool, because I
> haven't figured out how to make perf generate a report that is easily
> email-able, and it seems you have.

I did pretty much what you said, i.e.
attached it to the running process with
perf record -g -p PID
and then
perf report -g  > output

And postgresql was compiled with cflags=-g
>
> The only trouble is that there's no call stack information here for
> s_lock or PinBuffer, which is what I really want.  It seems to have
> spit out call stack information only for the kernel functions, and not
> for user functions.

Yes, I forgot to clean the old binaries when I recompiled with cflags=-g.
So now it is fixed. I attach the updated perf report (i.e. the first 10000
lines of it, to reduce the file size).

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
On Thu, May 31, 2012 at 11:23 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Thu, 31 May 2012, Robert Haas wrote:
>>
>> Thanks.  How did you generate this perf report?  It's cool, because I
>> haven't figured out how to make perf generate a report that is easily
>> email-able, and it seems you have.
>
> I did pretty much what you have said, e.g.
> attached it to running process by
> perf record -g -p PID
> and then
> perf report -g  > output

Ah, interesting.  I never tried sending the output to a file.

> And postgresql was compiled with cflags=-g
>>
>> The only trouble is that there's no call stack information here for
>> s_lock or PinBuffer, which is what I really want.  It seems to have
>> spit out call stack information only for the kernel functions, and not
>> for user functions.
>
> Yes, I forgot to clean the old binaries when recompiled with cflags=-g.
> So not it is fixed. I attach the updated perf report (i.e. the first 10000
> lines of it to reduce the  file size).

Oh, ho.  So from this we can see that the problem is that we're
getting huge amounts of spinlock contention when pinning and unpinning
index pages.

It would be nice to have a self-contained reproducible test case for
this, so that we could experiment with it on other systems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, May 31, 2012 at 10:23 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Thu, 31 May 2012, Robert Haas wrote:
>>
>>
>> Thanks.  How did you generate this perf report?  It's cool, because I
>> haven't figured out how to make perf generate a report that is easily
>> email-able, and it seems you have.
>
>
> I did pretty much what you have said, e.g.
> attached it to running process by
> perf record -g -p PID
> and then
> perf report -g  > output
>
> And postgresql was compiled with cflags=-g
>
>>
>> The only trouble is that there's no call stack information here for
>> s_lock or PinBuffer, which is what I really want.  It seems to have
>> spit out call stack information only for the kernel functions, and not
>> for user functions.
>
>
> Yes, I forgot to clean the old binaries when recompiled with cflags=-g.
> So not it is fixed. I attach the updated perf report (i.e. the first 10000
> lines of it to reduce the  file size).

That's basically what we needed.  The index scans driven by the
sequential scans are all simultaneously pointing at the same couple of
pages in the index.  They are constantly pinning and unpinning -- the
database is schizophrenically going back and forth between the key
pages in the index being and not being allowed to be candidates for
buffer eviction.  Raising shared buffers doesn't help because it's
the marking of the buffers as available for eviction, not the
eviction process itself, that is the problem.  IOS (index-only scans)
don't help because it's the index relation buffers, not the heap
buffers, we are binding up on (although if the btree were fixed it's
entirely possible the problem could head right back to the heap for
non-IOS scans).

merlin


On Thu, 31 May 2012, Robert Haas wrote:

> Oh, ho.  So from this we can see that the problem is that we're
> getting huge amounts of spinlock contention when pinning and unpinning
> index pages.
>
> It would be nice to have a self-contained reproducible test case for
> this, so that we could experiment with it on other systems.

I created it a few days ago:
http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php

It is still valid, and it is exactly what I'm using to test. The only thing
to change is to create a two-col index and drop the other index.
The scripts are precisely the ones I'm using now.

The problem is that in order to see a really big slowdown (10 times slower
than a single thread) I had to raise shared_buffers to 48g, but it was
slow for smaller shared buffer settings as well.

But I'm not sure how sensitive the test is to the hardware.

Cheers,    S

*****************************************************
Sergey E. Koposov,  PhD, Research Associate
Institute of Astronomy, University of Cambridge 
Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551
Web: http://www.ast.cam.ac.uk/~koposov/


On Thu, May 31, 2012 at 11:54 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Thu, 31 May 2012, Robert Haas wrote:
>
>> Oh, ho.  So from this we can see that the problem is that we're
>> getting huge amounts of spinlock contention when pinning and unpinning
>> index pages.
>>
>> It would be nice to have a self-contained reproducible test case for
>> this, so that we could experiment with it on other systems.
>
>
> I have created it a few days ago:
> http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php
>
> It is still valid. And I'm using exactly it to test. The only thing to
> change is to create a two-col index and drop another index.
> The scripts are precisely the ones I'm using now.
>
> The problem is that in order to see a really big slowdown (10 times slower
> than a single thread) I've had to raise the buffers to 48g but it was slow
> for smaller shared buffer settings as well.
>
> But I'm not sure how sensitive the test is to the hardware.

It's not: high contention on spinlocks is going to suck no matter what
hardware you have.   I think the problem is pretty obvious now: any
case where multiple backends are scanning the same sequence of buffers
in a very tight loop is going to display this behavior.  It doesn't
come up that often: it takes a pretty unusual sequence of events to
get a bunch of backends hitting the same buffer like that.

Hm, I wonder if you could alleviate the symptoms by making the
Pin/UnpinBuffer smarter so that frequently pinned buffers could stay
pinned longer -- kinda as if your private ref count was hacked to be
higher in that case.   It would be a complex fix for a narrow issue
though.

merlin


On Thu, May 31, 2012 at 2:03 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, May 31, 2012 at 11:54 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> On Thu, 31 May 2012, Robert Haas wrote:
>>
>>> Oh, ho.  So from this we can see that the problem is that we're
>>> getting huge amounts of spinlock contention when pinning and unpinning
>>> index pages.
>>>
>>> It would be nice to have a self-contained reproducible test case for
>>> this, so that we could experiment with it on other systems.
>>
>>
>> I have created it a few days ago:
>> http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php
>>
>> It is still valid. And I'm using exactly it to test. The only thing to
>> change is to create a two-col index and drop another index.
>> The scripts are precisely the ones I'm using now.
>>
>> The problem is that in order to see a really big slowdown (10 times slower
>> than a single thread) I've had to raise the buffers to 48g but it was slow
>> for smaller shared buffer settings as well.
>>
>> But I'm not sure how sensitive the test is to the hardware.
>
> It's not: high contention on spinlocks is going to suck no matter what
> hardware you have.   I think the problem is pretty obvious now: any
> case where multiple backends are scanning the same sequence of buffers
> in a very tight loop is going to display this behavior.  It doesn't
> come up that often: it takes a pretty unusual sequence of events to
> get a bunch of backends hitting the same buffer like that.
>
> Hm, I wonder if you could alleviate the symptoms by making making the
> Pin/UnpinBuffer smarter so that frequently pinned buffers could stay
> pinned longer -- kinda as if your private ref count was hacked to be
> higher in that case.   It would be a complex fix for a narrow issue
> though.

This test case is unusual because it hits a whole series of buffers
very hard.  However, there are other cases where this happens on a
single buffer that is just very, very hot, like the root block of a
btree index, where the pin/unpin overhead hurts us.  I've been
thinking about this problem for a while, but it hasn't made it up to
the top of my priority list, because workloads where pin/unpin is the
dominant cost are still relatively uncommon.  I expect them to get
more common as we fix other problems.

Anyhow, I do have some vague thoughts on how to fix this.  Buffer pins
are a lot like weak relation locks, in that they are a type of lock
that is taken frequently, but rarely conflicts.  And the fast-path
locking in 9.2 provides a demonstration of how to handle this kind of
problem efficiently: making the weak, rarely-conflicting locks
cheaper, at the cost of some additional expense when a conflicting
lock (in this case, a buffer cleanup lock) is taken.  In particular,
each backend has its own area to record weak relation locks, and a
strong relation lock must scan all of those areas and migrate any
locks found there to the main lock table.  I don't think it would be
feasible to adopt exactly this solution for buffer pins, because page
eviction and buffer cleanup locks, while not exactly common, are
common enough that we can't require a scan of N per-backend areas
every time one of those operations occurs.

But, maybe we could have a system of this type that only applies to
the very hottest buffers.  Suppose we introduce two new buffer flags,
BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
the buffer can't be evicted until that flag is removed, and backends
are permitted to record pins in a per-backend area protected by a
per-backend spinlock or lwlock, rather than in the buffer header.
When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.  At that
point, it's no longer permissible to record new pins in the
per-backend areas, but old ones may still exist.  So then we scan all
the per-backend areas and transfer the pins to the buffer header, or
else just wait until no more exist; then, we clear both BUF_NAILED and
BUF_NAIL_REMOVAL.

So the pin algorithm looks like this:

read buffer header flags (unlocked)
if (flags & (BUF_NAILED|BUF_NAIL_REMOVAL) != BUF_NAILED)
{
    take buffer header spinlock
    record pin in buffer header
    release buffer header spinlock
}
else
{
    take per-backend lwlock
    record pin in per-backend area
    release per-backend lwlock
    read buffer header flags (unlocked)
    if (flags & (BUF_NAILED|BUF_NAIL_REMOVAL) != BUF_NAILED)
    {
        take per-backend lwlock
        forget pin in per-backend area
        release per-backend lwlock
        take buffer header spinlock
        record pin in buffer header
        release buffer header spinlock
    }
}

Due to memory ordering effects, we might see the buffer as nailed when
in fact nail removal has already begun (or even, completed).  We can
prevent that if (1) the nail removal code sets the nail removal flag
before checking the per-backend areas and (2) the pin code checks the
nail removal flag AFTER checking the per-backend areas.  Since
LWLockRelease is a sequencing point, the above algorithm is consistent
with that scheme; the initial unlocked test of the buffer header flags
is merely a heuristic to avoid extra work in the common case where the
buffer isn't nailed.
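
To make that a little more concrete, the fast path could look roughly like
this in C.  This is purely a sketch -- the flags, the per-backend area, and
MAX_FASTPATH_PINS are all invented names; none of this exists today:

#define BUF_NAILED          0x0001      /* hypothetical flag bits */
#define BUF_NAIL_REMOVAL    0x0002
#define MAX_FASTPATH_PINS   64          /* arbitrary */

typedef struct FastPathPinArea          /* one per backend, in shared memory */
{
    slock_t     lock;                   /* per-backend spinlock (or lwlock) */
    int         npins;
    int         pinned_buf[MAX_FASTPATH_PINS];
} FastPathPinArea;

static void
pin_buffer_sketch(volatile BufferDesc *buf, FastPathPinArea *my_area)
{
    int     flags = buf->flags;         /* unlocked read; heuristic only */

    if ((flags & (BUF_NAILED | BUF_NAIL_REMOVAL)) != BUF_NAILED)
    {
        LockBufHdr(buf);                /* normal path */
        buf->refcount++;
        UnlockBufHdr(buf);
        return;
    }

    /* nailed path: record the pin locally ... */
    SpinLockAcquire(&my_area->lock);
    my_area->pinned_buf[my_area->npins++] = buf->buf_id;
    SpinLockRelease(&my_area->lock);

    /* ... then recheck the flags, AFTER updating the per-backend area */
    if ((buf->flags & (BUF_NAILED | BUF_NAIL_REMOVAL)) != BUF_NAILED)
    {
        SpinLockAcquire(&my_area->lock);
        my_area->npins--;               /* forget the pin we just recorded */
        SpinLockRelease(&my_area->lock);

        LockBufHdr(buf);                /* fall back to the normal path */
        buf->refcount++;
        UnlockBufHdr(buf);
    }
}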

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Thu, May 31, 2012 at 1:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 31, 2012 at 2:03 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, May 31, 2012 at 11:54 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>>> On Thu, 31 May 2012, Robert Haas wrote:
>>>
>>>> Oh, ho.  So from this we can see that the problem is that we're
>>>> getting huge amounts of spinlock contention when pinning and unpinning
>>>> index pages.
>>>>
>>>> It would be nice to have a self-contained reproducible test case for
>>>> this, so that we could experiment with it on other systems.
>>>
>>>
>>> I have created it a few days ago:
>>> http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php
>>>
>>> It is still valid. And I'm using exactly it to test. The only thing to
>>> change is to create a two-col index and drop another index.
>>> The scripts are precisely the ones I'm using now.
>>>
>>> The problem is that in order to see a really big slowdown (10 times slower
>>> than a single thread) I've had to raise the buffers to 48g but it was slow
>>> for smaller shared buffer settings as well.
>>>
>>> But I'm not sure how sensitive the test is to the hardware.
>>
>> It's not: high contention on spinlocks is going to suck no matter what
>> hardware you have.   I think the problem is pretty obvious now: any
>> case where multiple backends are scanning the same sequence of buffers
>> in a very tight loop is going to display this behavior.  It doesn't
>> come up that often: it takes a pretty unusual sequence of events to
>> get a bunch of backends hitting the same buffer like that.
>>
>> Hm, I wonder if you could alleviate the symptoms by making making the
>> Pin/UnpinBuffer smarter so that frequently pinned buffers could stay
>> pinned longer -- kinda as if your private ref count was hacked to be
>> higher in that case.   It would be a complex fix for a narrow issue
>> though.
>
> This test case is unusual because it hits a whole series of buffers
> very hard.  However, there are other cases where this happens on a
> single buffer that is just very, very hot, like the root block of a
> btree index, where the pin/unpin overhead hurts us.  I've been
> thinking about this problem for a while, but it hasn't made it up to
> the top of my priority list, because workloads where pin/unpin is the
> dominant cost are still relatively uncommon.  I expect them to get
> more common as we fix other problems.
>
> Anyhow, I do have some vague thoughts on how to fix this.  Buffer pins
> are a lot like weak relation locks, in that they are a type of lock
> that is taken frequently, but rarely conflicts.  And the fast-path
> locking in 9.2 provides a demonstration of how to handle this kind of
> problem efficiently: making the weak, rarely-conflicting locks
> cheaper, at the cost of some additional expense when a conflicting
> lock (in this case, a buffer cleanup lock) is taken.  In particular,
> each backend has its own area to record weak relation locks, and a
> strong relation lock must scan all of those areas and migrate any
> locks found there to the main lock table.  I don't think it would be
> feasible to adopt exactly this solution for buffer pins, because page
> eviction and buffer cleanup locks, while not exactly common, are
> common enough that we can't require a scan of N per-backend areas
> every time one of those operations occurs.
>
> But, maybe we could have a system of this type that only applies to
> the very hottest buffers.  Suppose we introduce two new buffer flags,
> BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
> on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
> the buffer can't be evicted until that flag is removed, and backends
> are permitted to record pins in a per-backend area protected by a
> per-backend spinlock or lwlock, rather than in the buffer header.
> When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.

Hm, couple questions: how do you determine if/when to un-nail a
buffer, and who makes that decision (bgwriter?) Is there a limit to
how many buffers you are allowed to nail?  It seems like a much
stronger idea, but one downside I see vs the 'pin for longer' idea I
was kicking around is how to deal with stale nailed buffers and keep
them from growing uncontrollably, so that you have to either stop
nailing or forcibly evict them.

merlin


On Thu, May 31, 2012 at 3:25 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Hm, couple questions: how do you determine if/when to un-nail a
> buffer, and who makes that decision (bgwriter?)

Well, I think some experimentation might be required, but my first
thought is to tie it into buffer eviction.  If we record a pin in some
side array in lieu of acquiring the buffer header spinlock, then we're
not bumping the usage count.  So perhaps what we might do is - when we
first nail the buffer, we set its usage count to something much
higher than what's normally allowed, like say 50.  When it gets down
to 0 and somebody tries to evict it, we scan the per-backend arrays
and count the number of pins we find there, and set the usage count to
5 * that number.

The details could be completely wrong here.  The point is that I think
every so often we can just reconsider whether the buffer is still
worthy of being nailed.  Exactly how often we should reconsider it
seems like an empirical question.
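
In pseudo-C, the eviction-time reconsideration might look something like
this (illustrative only -- CountFastPathPins() and BeginNailRemoval() are
invented, and so are the constants):

static bool
reconsider_nailing(volatile BufferDesc *buf)
{
    int     fastpath_pins = 0;
    int     i;

    /* count pins recorded in every backend's fast-path area (hypothetical) */
    for (i = 0; i < MaxBackends; i++)
        fastpath_pins += CountFastPathPins(i, buf->buf_id);

    if (fastpath_pins > 0)
    {
        LockBufHdr(buf);
        buf->usage_count = 5 * fastpath_pins;   /* new lease on life */
        UnlockBufHdr(buf);
        return true;                    /* keep it nailed */
    }

    BeginNailRemoval(buf);              /* nobody cares any more */
    return false;                       /* candidate for eviction */
}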

> Is there a limit to
> how many buffers you are allowed to nail?

Probably not.  The system had better be designed so that the threshold
for nailing a buffer is pretty high.  It should be exceedingly
unlikely for more than a small percentage of buffers to meet the
criteria; if it happens, then there's probably a good reason.

> It seems like a much
> stronger idea, but one downside I see vs the 'pin for longer idea' i
> was kicking around was how to deal stale nailed buffers and keeping
> them from uncontrollably growing so that you have to either stop
> nailing or forcibly evicting them.

I think we're really talking about variants on the same theme.  My
idea is basically "pin forever", which is taking "longer" to its
logical extreme, but even if it were something short of forever I
think you'd still need some way to flush the stale pins.  Otherwise,
vacuum can get stuck, etc.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Sun, May 27, 2012 at 11:45 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Hi,
>
> I did another test using the same data and the same code, which I've
> provided before and the performance of the single thread seems to be
> degrading quadratically with the number of threads.
>
> Here are the results:
> Nthreads Time_to_execute_one_thread
> 1 8.1
> 2 7.8
> 3 8.1
> 4 9.0
> 5 10.2
> 6 11.4
> 7 13.3
> 8 16.1
> 9 19.0
> 10 21.4
> 11 23.8
> 12 27.3
> 13 30.2
> 14 32.0
> 15 34.1
> 16 37.5

This looks more skew hyperbolic than quadratic.  In the higher region,
it doesn't seem to be increasing at an increasing rate, but rather
increasing at a constant rate of about 2.5 sec per additional backend,
i.e. converging to an asymptote with a slope of 2.5.  A pedantic
point, perhaps, but if you are trying to infer the nature of the
bottleneck from the shape of the curve, they lead to rather different
sets of hypotheses.

Cheers,

Jeff


On Thu, May 31, 2012 at 9:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Oh, ho.  So from this we can see that the problem is that we're
> getting huge amounts of spinlock contention when pinning and unpinning
> index pages.
>
> It would be nice to have a self-contained reproducible test case for
> this, so that we could experiment with it on other systems.

I just posted a patch under subject "pgbench--new transaction type"
that introduces a pgbench -P option.

I think that that would do a good job of simulating unique-key
look-ups on the inner side of a nested loop (which is basically what
we have here) and so creating contention on index pages.  Right now I
don't have anything with more than 2 CPUs and 2 is not high enough to
get much contention so I can't post any meaningful numbers.  (pgbench
-P might also be of interest in hash index investigation)

Cheers,

Jeff


On Thu, May 31, 2012 at 11:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> This test case is unusual because it hits a whole series of buffers
> very hard.  However, there are other cases where this happens on a
> single buffer that is just very, very hot, like the root block of a
> btree index, where the pin/unpin overhead hurts us.

I think that one very, very hot page is also the problem here, not a whole
sequence of hot pages.  Most of his buffer content share lwlocks are on
just two buffers, and most of his blocked buffer mapping lwlocks
are on just two partitions.  So I am guessing that almost all of his
spin-lock contention from Pin and Unpin are also coming from those
same two buffers.  Why there are two buffers when there is only one
index root block involved, I don't know.

Cheers,

Jeff


On Wed, May 30, 2012 at 6:10 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Wed, 30 May 2012, Jeff Janes wrote:
>
>> But anyway, is idt_match a fairly static table?  If so, I'd partition
>> that into 16 tables, and then have each one of your tasks join against
>> a different one of those tables.  That should relieve the contention
>> on the index root block, and might have some other benefits as well.
>
>
> No, idt_match is getting filled by multi-threaded copy() and then joined
> with 4 other big tables like idt_phot. The result is then split into
> partitions.

That does make things more complicated.  But could you partition
it at that level and then do the joins partition-wise?

I don't have much experience at data partitioning (well, I do, but the
experience is with partitioning in Perl with terabytes of flat files,
not in PG :) ) but I think that once you have your partitioning keys
you want to apply them the same way up and down the data set.

Cheers,

Jeff


On Thu, 31 May 2012, Jeff Janes wrote:

>>
>> No, idt_match is getting filled by multi-threaded copy() and then joined
>> with 4 other big tables like idt_phot. The result is then split into
>> partitions.
>
> That does make things more complicated.  But you could you partition
> it at that level and then do the joins partition-wise?

Unfortunately the information needed to determine which partition the data 
needs to be put in is only available from the idt_match table. So I have 
to do at least one join with idt_match. But I will experiment further 
with ways to perform queries, I just stopped doing that because I saw 
these problems with scalability.

> I don't have much experience at data partitioning (well, I do, but the
> experience is with partitioning in Perl with terabytes of flat files,
> not in PG :) ) but I think that once you have your partitioning keys
> you want to apply them the same way up and down the data set.

I'm not sure what you mean by "the same way up and down the data set".

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On May31, 2012, at 20:50 , Robert Haas wrote:
> Suppose we introduce two new buffer flags,
> BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
> on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
> the buffer can't be evicted until that flag is removed, and backends
> are permitted to record pins in a per-backend area protected by a
> per-backend spinlock or lwlock, rather than in the buffer header.
> When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.  At that
> point, it's no longer permissible to record new pins in the
> per-backend areas, but old ones may still exist.  So then we scan all
> the per-backend areas and transfer the pins to the buffer header, or
> else just wait until no more exist; then, we clear both BUF_NAILED and
> BUF_NAIL_REMOVAL.

A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
by queueing UnpinBuffer() requests for a while before actually updating
shared state.

I'm imagining having a small unpin queue with, say, 32 entries in
backend-local memory. When we unpin a buffer, we add the buffer at the
front of the queue. If the queue overflows, we dequeue a buffer from the
back of the queue and actually call UnpinBuffer(). If PinBuffer() is called
for a queued buffer, we simply remove the buffer from the queue.

We'd drain the unpin queue whenever we don't expect a PinBuffer() request
to happen for a while. Returning to the main loop is an obvious such place,
but there might be others. We could, for example, drain the queue every time
we block on a lock or signal, and maybe also before we go do I/O. Or, we
could have one such queue per resource owner, and drain it when we release
the resource owner.

We already avoid calling PinBuffer() multiple times for multiple overlapping
pins of a single buffer by a single backend. The strategy above would extend
that to not-quite-overlapping pins.
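
A rough sketch of the mechanism, at the ReleaseBuffer() level to keep it
short (obviously none of this code exists, and the names and numbers are
made up):

#define UNPIN_QUEUE_SIZE 32

static Buffer unpin_queue[UNPIN_QUEUE_SIZE];    /* backend-local FIFO */
static int    unpin_queue_len = 0;

/* called where we would otherwise unpin immediately */
static void
DeferredUnpin(Buffer buf)
{
    if (unpin_queue_len == UNPIN_QUEUE_SIZE)
    {
        ReleaseBuffer(unpin_queue[0]);          /* really unpin the oldest */
        memmove(&unpin_queue[0], &unpin_queue[1],
                (UNPIN_QUEUE_SIZE - 1) * sizeof(Buffer));
        unpin_queue_len--;
    }
    unpin_queue[unpin_queue_len++] = buf;
}

/* called from the pin path: reuse a deferred unpin if there is one */
static bool
ReuseQueuedPin(Buffer buf)
{
    int     i;

    for (i = unpin_queue_len - 1; i >= 0; i--)
    {
        if (unpin_queue[i] == buf)
        {
            memmove(&unpin_queue[i], &unpin_queue[i + 1],
                    (unpin_queue_len - 1 - i) * sizeof(Buffer));
            unpin_queue_len--;
            return true;        /* the pin we never released is reused */
        }
    }
    return false;               /* caller pins normally */
}

/* called when returning to the main loop, blocking on a lock, etc. */
static void
DrainUnpinQueue(void)
{
    while (unpin_queue_len > 0)
        ReleaseBuffer(unpin_queue[--unpin_queue_len]);
}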

best regards,
Florian Pflug



On Fri, Jun 1, 2012 at 7:47 AM, Florian Pflug <fgp@phlo.org> wrote:
> On May31, 2012, at 20:50 , Robert Haas wrote:
>> Suppose we introduce two new buffer flags,
>> BUF_NAILED and BUF_NAIL_REMOVAL.  When we detect excessive contention
>> on the buffer header spinlock, we set BUF_NAILED.  Once we do that,
>> the buffer can't be evicted until that flag is removed, and backends
>> are permitted to record pins in a per-backend area protected by a
>> per-backend spinlock or lwlock, rather than in the buffer header.
>> When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL.  At that
>> point, it's no longer permissible to record new pins in the
>> per-backend areas, but old ones may still exist.  So then we scan all
>> the per-backend areas and transfer the pins to the buffer header, or
>> else just wait until no more exist; then, we clear both BUF_NAILED and
>> BUF_NAIL_REMOVAL.
>
> A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
> by queing UnpinBuffer() requests for a while before actually updating
> shared state.
>
> I'm imagining having a small unpin queue with, say, 32 entries in
> backend-local memory. When we unpin a buffer, we add the buffer at the
> front of the queue. If the queue overflows, we dequeue a buffer from the
> back of the queue and actually call UnpinBuffer(). If PinBuffer() is called
> for a queued buffer, we simply remove the buffer from the queue.
>
> We'd drain the unpin queue whenever we don't expect a PinBuffer() request
> to happen for a while. Returning to the main loop is an obvious such place,
> but there might be others. We could, for example, drain the queue every time
> we block on a lock or signal, and maybe also before we go do I/O. Or, we
> could have one such queue per resource owner, and drain it when we release
> the resource owner.
>
> We already avoid calling PinBuffer() multiple times for multiple overlapping
> pins of a single buffer by a single backend. The strategy above would extend
> that to not-quite-overlapping pins.

A potential issue with this line of thinking is that your pin delay
queue could get highly pressured by outer portions of the query (as in
the OP's case)  that will get little or no benefit from the delayed
pin.  But choosing a sufficiently sized drain queue would work for
most reasonable cases assuming 32 isn't enough?  Why not something
much larger, for example the lesser of 1024, (NBuffers * .25) /
max_connections?  In other words, for you to get much benefit, you
have to pin the buffer sufficiently more than 1/N times among all
buffers.

Or, maybe you only put contended buffers in the delay queue by
watching for the delay count that is going to be returned from s_lock.
That forces a lookup on each pin though.

merlin


Merlin Moncure <mmoncure@gmail.com> writes:
> A potential issue with this line of thinking is that your pin delay
> queue could get highly pressured by outer portions of the query (as in
> the OP's case)  that will get little or no benefit from the delayed
> pin.  But choosing a sufficiently sized drain queue would work for
> most reasonable cases assuming 32 isn't enough?  Why not something
> much larger, for example the lesser of 1024, (NBuffers * .25) /
> max_connections?  In other words, for you to get much benefit, you
> have to pin the buffer sufficiently more than 1/N times among all
> buffers.

Allowing each backend to pin a large fraction of shared buffers sounds
like a seriously bad idea to me.  That's just going to increase
thrashing of what remains.

More generally, I don't believe that we have any way to know which
buffers would be good candidates to keep pinned for a long time.
Typically, we don't drop the pin in the first place if we know we're
likely to touch that buffer again soon.  btree root pages might be an
exception, but I'm not even convinced of that one.
        regards, tom lane


On Jun1, 2012, at 15:45 , Tom Lane wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> A potential issue with this line of thinking is that your pin delay
>> queue could get highly pressured by outer portions of the query (as in
>> the OP's case)  that will get little or no benefit from the delayed
>> pin.  But choosing a sufficiently sized drain queue would work for
>> most reasonable cases assuming 32 isn't enough?  Why not something
>> much larger, for example the lesser of 1024, (NBuffers * .25) /
>> max_connections?  In other words, for you to get much benefit, you
>> have to pin the buffer sufficiently more than 1/N times among all
>> buffers.
>
> Allowing each backend to pin a large fraction of shared buffers sounds
> like a seriously bad idea to me.  That's just going to increase
> thrashing of what remains.

Right, that was one of the motivations for suggesting the small queue.
At least that way, the number of buffers optimistically pinned by each
backend is limited.

The other was that once the outer portions plough through more than
a few pages per iteration of the sub-plan, the cost of doing that should
dominate the cost of pinning and unpinning.

> More generally, I don't believe that we have any way to know which
> buffers would be good candidates to keep pinned for a long time.

I'd think that pinning a buffer which we've only recently unpinned
is a pretty good indication that the same thing will happen again.

My proposed algorithm could be made to use exactly that criterion
by tracking a little bit more state. We'd have to tag queue entries
with a flag indicating whether they are
  Unpinned (COLD)
  Pinned, and unpinning should be delayed (HOT)
  Waiting to be unpinned (LUKEWARM)

UnpinBuffer() would check if the buffer is HOT, and if so add it to
the queue with flag LUKEWARM. Otherwise, it'd get immediately
unpinned and flagged as COLD (adding it to the queue if necessary).
PinBuffer() would pin the buffer and mark it as HOT if it was COLD,
and just mark it as HOT if it was LUKEWARM. If the buffer isn't on
the queue already, PinBuffer() would simply pin it and be done.

This would give the following behaviour for a buffer that is pinned
repeatedly

  PinBuffer():   <not on queue> -> <not on queue>  (refcount incremented)
  UnpinBuffer(): <not on queue> -> COLD            (refcount decremented)
  ...
  PinBuffer():   COLD -> HOT                       (refcount incremented)
  UnpinBuffer(): HOT -> LUKEWARM                   (refcount *not* decremented)
  ...
  PinBuffer():   LUKEWARM -> HOT                   (refcount *not* incremented)
  UnpinBuffer(): HOT -> LUKEWARM                   (refcount *not* decremented)
  ...
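
In code the bookkeeping would be no more than a tiny state machine per
queue entry, roughly like this (again just a sketch with invented names):

typedef enum
{
    ENTRY_COLD,         /* unpinned, but seen recently */
    ENTRY_HOT,          /* pinned; its eventual unpin should be delayed */
    ENTRY_LUKEWARM      /* logically unpinned, but we still hold the pin */
} PinQueueState;

/* on pin: returns true if the caller must really increment the refcount */
static bool
queue_note_pin(PinQueueState *state)
{
    bool    need_real_pin = (*state != ENTRY_LUKEWARM);

    *state = ENTRY_HOT;
    return need_real_pin;
}

/* on unpin: returns true if the caller must really decrement the refcount */
static bool
queue_note_unpin(PinQueueState *state)
{
    if (*state == ENTRY_HOT)
    {
        *state = ENTRY_LUKEWARM;        /* keep the refcount; defer the unpin */
        return false;
    }
    *state = ENTRY_COLD;
    return true;
}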

> Typically, we don't drop the pin in the first place if we know we're
> likely to touch that buffer again soon. btree root pages might be an
> exception, but I'm not even convinced of that one.

But Sergey's use-case pretty convincingly shows that, more generally,
inner sides of a nested loop join are also an exception, no? At least
if the inner side is either an index scan, or a seqscan of a really
small table.

best regards,
Florian Pflug



On Fri, Jun 1, 2012 at 8:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> A potential issue with this line of thinking is that your pin delay
>> queue could get highly pressured by outer portions of the query (as in
>> the OP's case)  that will get little or no benefit from the delayed
>> pin.  But choosing a sufficiently sized drain queue would work for
>> most reasonable cases assuming 32 isn't enough?  Why not something
>> much larger, for example the lesser of 1024, (NBuffers * .25) /
>> max_connections?  In other words, for you to get much benefit, you
>> have to pin the buffer sufficiently more than 1/N times among all
>> buffers.
>
> Allowing each backend to pin a large fraction of shared buffers sounds
> like a seriously bad idea to me.  That's just going to increase
> thrashing of what remains.

By 'large fraction', you mean 25%?  You could always set it lower, say
5%.   But if you can be smarter about which buffers to put in I agree:
a smaller queue is better.

> More generally, I don't believe that we have any way to know which
> buffers would be good candidates to keep pinned for a long time.
> Typically, we don't drop the pin in the first place if we know we're
> likely to touch that buffer again soon.  btree root pages might be an
> exception, but I'm not even convinced of that one.

Why not (unless Florian's warming concept is a better bet) hook it to
spinlock contention? That's what we're trying to avoid after all.
s_lock can be modified to return if it had to delay.  PinBuffer could
watch for that and stick it in the queue.   Both Florian's idea (AIUI)
and an s_lock based implementation require you to search your local
queue on every pin/unpin, which I think is the real cost.  Robert's
doesn't, although it is a more complicated approach.

merlin


On Fri, Jun 1, 2012 at 5:57 PM, Florian Pflug <fgp@phlo.org> wrote:
> My proposed algorithm could be made to use exactly that criterion
> by tracking a little bit more state. We'd have to tag queue entries
> with a flag indicating whether they are
>
>  Unpinned (COLD)
>
>  Pinned, and unpinning should be delayed (HOT)
>
>  Waiting to be unpinned (LUKEWARM)

This sounds awfully similar to the LIRS/CLOCK-Pro cache replacement
algorithms. They manage buffers by reuse distance based on last two
accesses. Because both algorithms demonstrate very good performance
over a wide range of cache sizes and workloads, it might be worth
exploring how they could be applicable here.

On the other hand, I agree with Merlin that searching the local queue
for every pin could be too heavy. Robert's approach amounts to
categorizing some buffers as so hot that we basically lift them out
of the regular replacement algorithm management and don't even bother
tracking their usage, on the grounds that it will be cheaper to figure
out their usage state after the fact. This might have some interesting
interactions with replacement algorithms that naturally keep separate
tiers of buffers. I think I'm going to try to see where that train of
thought takes me.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote:
> A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
> by queing UnpinBuffer() requests for a while before actually updating
> shared state.

So, what happens when somebody wants a cleanup lock on the buffer
you've decided to keep pinned?  We have this problem already; I'm wary
of making it worse.

> We'd drain the unpin queue whenever we don't expect a PinBuffer() request
> to happen for a while. Returning to the main loop is an obvious such place,
> but there might be others.

However, on a workload like pgbench -S, dropping the pin when you
return to the main loop would render the optimization useless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, Jun 1, 2012 at 10:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote:
>
>> We'd drain the unpin queue whenever we don't expect a PinBuffer() request
>> to happen for a while. Returning to the main loop is an obvious such place,
>> but there might be others.
>
> However, on a workload like pgbench -S, dropping the pin when you
> return to the main loop would render the optimization useless.

But do we need the optimization on a workload like pgbench -S?  You
did some pretty massive scalability tests on that front, and I don't
think this problem was identified in them.

Cheers,

Jeff


On Jun1, 2012, at 19:51 , Robert Haas wrote:
> On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote:
>> A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
>> by queing UnpinBuffer() requests for a while before actually updating
>> shared state.
>
> So, what happens when somebody wants a cleanup lock on the buffer
> you've decided to keep pinned?  We have this problem already; I'm wary
> of making it worse.

She has to wait, or skip the cleanup. We already keep buffers pinned for
potentially a long time - even over multiple commands if cursors enter
the game, I think. And even without cursors, a nestloop plan can keep
buffers used by the outer sub-plan pinned while the inner plan is
executing. So giving the inner plan the chance to do the same seems not
unreasonable.

Plus, the more refined version of that idea from my reply to Tom only
delays the unpinning after the buffer was pinned twice within a short
period of time. After that, it seems to me that the probability of it
getting pinned again is high enough to warrant slightly penalizing a
potential cleanup.

>> We'd drain the unpin queue whenever we don't expect a PinBuffer() request
>> to happen for a while. Returning to the main loop is an obvious such place,
>> but there might be others.
>
> However, on a workload like pgbench -S, dropping the pin when you
> return to the main loop would render the optimization useless.

Well, we could drain the queue only if the read() from the socket blocks.
But does pgbench -S really manage to cause significant spinlock contention
due to buffer pinning/unpinning? I'd expect other things to dominate there..

best regards,
Florian Pflug




On Fri, Jun 1, 2012 at 2:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jun 1, 2012 at 10:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote:
>>
>>> We'd drain the unpin queue whenever we don't expect a PinBuffer() request
>>> to happen for a while. Returning to the main loop is an obvious such place,
>>> but there might be others.
>>
>> However, on a workload like pgbench -S, dropping the pin when you
>> return to the main loop would render the optimization useless.
>
> But do we need the optimization on a workload like pgbench -S?  You
> did some pretty massive scalability tests on that front, and I don't
> think this problem was identified in them.

I don't know if I mentioned it on list, but I believe there is an
effect, and I believe it will get stronger as we continue to make
improvements in other areas, and as we continue to ramp up the number
of cores.   Even now, you can get s_lock taking >1% of the CPU time,
which is not trivial, and not all of that is from buffer-pinning, but
it's in the mix, so I'm kind of reluctant to accept a solution that
involves assuming that the problem always has to occur within a
transaction rather than across transactions.  I think it's a dead-end
anyway on other grounds: we ought to be trying to find ways to hold
buffer pins for less time, not more time.

Another thought is that if the problem is limited to the root index
block, then we could consider less general solutions, like allowing
backends to cache the root index block and sending some kind of
invalidation when it gets split.  But I don't know that the root index
block is the only place where we can have this problem, so I'm a
little nervous about that angle, too.  Still, it might be worth some
research.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, Jun 1, 2012 at 2:54 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Jun1, 2012, at 19:51 , Robert Haas wrote:
>> On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote:
>>> A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs
>>> by queing UnpinBuffer() requests for a while before actually updating
>>> shared state.
>>
>> So, what happens when somebody wants a cleanup lock on the buffer
>> you've decided to keep pinned?  We have this problem already; I'm wary
>> of making it worse.
>
> She has to wait, or skip the cleanup. We already keep buffers pinned for
> potentially a long time - if cursors enter the game even over multiple
> commands, I think. And even without cursors, a nestloop plan can keep
> buffers used by the outer sub-plan pinned while the inner plan is
> executing. So giving the inner plan the chance to do the same seems not
> unreasonable.

Maybe.  TBH, I'm unhappy about how easy it is to leave pins floating
around and would like to reduce that further, not increase it.

> Plus, the more refined version of that idea from my reply to Tom only
> delays the unpinning after the buffer was pinned twice within a short
> period of time. After that, it seems to me that the probability of it
> getting pinned again is high enough to warrant slightly penalizing a
> potential cleanup.
>
>>> We'd drain the unpin queue whenever we don't expect a PinBuffer() request
>>> to happen for a while. Returning to the main loop is an obvious such place,
>>> but there might be others.
>>
>> However, on a workload like pgbench -S, dropping the pin when you
>> return to the main loop would render the optimization useless.
>
> Well, we could drain the queue only if the read() from the socket blocks.

I believe that it's going to block frequently on this type of
workload.  pgbench isn't fast enough to get the next query over there
by the time postgres is ready for the next query.

> But does pgbench -S really manage to cause significant spinlock contention
> due to buffer pinning/unpinning? I'd expect other things to dominate there..

See previous note.  It may not be the biggest effect, but I think it's
in the mix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> Another thought is that if the problem is limited to the root index
> block, then we could consider less general solutions, like allowing
> backends to cache the root index block and sending some kind of
> invalidation when it gets split.

Possibly worth noting here that we already do exactly that for btree
metapages.  However, I believe that case is a bit different, in that
using slightly stale metapage data is nonfatal.  Also, anything like
this can be a win only if the cached data can be expected to have pretty
long lifetime; so you would not want to cache the root page until the
index had gotten large enough that root page changes were infrequent.
Not sure how that could be mechanized ...
        regards, tom lane


On Jun1, 2012, at 21:07 , Robert Haas wrote:
> On Fri, Jun 1, 2012 at 2:54 PM, Florian Pflug <fgp@phlo.org> wrote:
>> On Jun1, 2012, at 19:51 , Robert Haas wrote:
>>> On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote:
>>>> We'd drain the unpin queue whenever we don't expect a PinBuffer() request
>>>> to happen for a while. Returning to the main loop is an obvious such place,
>>>> but there might be others.
>>>
>>> However, on a workload like pgbench -S, dropping the pin when you
>>> return to the main loop would render the optimization useless.
>>
>> Well, we could drain the queue only if the read() from the socket blocks.
>
> I believe that it's going to block frequently on this type of
> workload.  pgbench isn't fast enough to get the next query over there
> by the time postgres is ready for the next query.
>
>> But does pgbench -S really manage to cause significant spinlock contention
>> due to buffer pinning/unpinning? I'd expect other things to dominate there..
>
> See previous note.  It may not be the biggest effect, but I think it's
> in the mix.

Ok, now you've lost me. If the read() blocks, how on earth can a few
additional pins/unpins ever account for any meaningful execution time?

It seems to me that once read() blocks we're talking about a delay in the
order of the scheduling granularity (i.e., milliseconds, in the best case),
while even in the worst case pinning a buffer shouldn't take more than
1000 cycles (for comparison, I think a cache miss across all layers costs
a few hundred cycles). So there are at the very least 3 orders of magnitude
between those two...

best regards,
Florian Pflug



On Fri, Jun 1, 2012 at 3:16 PM, Florian Pflug <fgp@phlo.org> wrote:
> Ok, now you've lost me. If the read() blocks, how on earth can a few
> additional pins/unpins ever account for any meaningful execution time?
>
> It seems to me that once read() blocks we're talking about a delay in the
> order of the scheduling granularity (i.e., milliseconds, in the best case),
> while even in the word case pinning a buffer shouldn't take more than
> 1000 cycles (for comparison, I think a cache miss across all layers costs
> a few hundred cycles). So there's at the very least 3 order of magnitude
> between those two...

I'm not sure what you want me to say here.  s_lock shows up in the
profile, and some of that is from PinBuffer.  I think any detectable
number of calls to s_lock is a sign of Bad Things (TM).  I can't
reproduce anything as severe as what the OP is seeing, but what does
that prove?  In a couple years we'll have systems with 128 cores
floating around, and things that are minor problems at 32 or even 64
cores will be crippling at 128 cores.  IME, spinlock contention has a
very sharp tipping point.  It's only a minor annoyance and then you
hit some threshold number of cores and, bam, you're spending 70-90% of
your time across all cores fighting over that one spinlock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, Jun 1, 2012 at 3:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jun 1, 2012 at 3:16 PM, Florian Pflug <fgp@phlo.org> wrote:
>> Ok, now you've lost me. If the read() blocks, how on earth can a few
>> additional pins/unpins ever account for any meaningful execution time?
>>
>> It seems to me that once read() blocks we're talking about a delay in the
>> order of the scheduling granularity (i.e., milliseconds, in the best case),
>> while even in the word case pinning a buffer shouldn't take more than
>> 1000 cycles (for comparison, I think a cache miss across all layers costs
>> a few hundred cycles). So there's at the very least 3 order of magnitude
>> between those two...
>
> I'm not sure what you want me to say here.  s_lock shows up in the
> profile, and some of that is from PinBuffer.  I think any detectable
> number of calls to s_lock is a sign of Bad Things (TM).  I can't
> reproduce anything as severe as what the OP is seeing, but what does
> that prove?  In a couple years we'll have systems with 128 cores
> floating around, and things that are minor problems at 32 or even 64
> cores will be crippling at 128 cores.  IME, spinlock contention has a
> very sharp tipping point.  It's only a minor annoyance and then you
> hit some threshold number of cores and, bam, you're spending 70-90% of
> your time across all cores fighting over that one spinlock.

I think your approach, nailing buffers, is really the way to go.  It
nails buffers based on detected contention which is very desirable --
uncontended spinlocks aren't broken and don't need to be fixed.  It
also doesn't add overhead in the general case whereas a side by side
backend queue does.

Another nice aspect is that you're not changing the lifetime of the
pin as the backend sees it but storing the important stuff (the
interplay with usage_count is a nice touch) on the buffer itself --
you want to keep as little as possible in the backend private side and
your patch does that; it's more amenable to 3rd party intervention
(flush your buffers right now!) than extended pins.  It exploits the
fact that pins can overlap and that the reference count is useless if
the buffer is always in memory anyways.  It immediately self corrects
when the first backend gripes whereas a per backend solution will
grind down as each backend independently determines it's got a problem
-- not pleasant if your case is 'walking' a set of buffers.

Buffer pins aren't a cache: with a cache you are trying to mask a slow
operation (like a disk i/o) with a faster one such that the number of slow
operations is minimized.  Buffer pins however are very different in
that we only care about contention on the reference count (the buffer
itself is not locked!) which makes me suspicious that caching type
algorithms are the wrong place to be looking.  I think it comes down to
picking between your relatively complex but general lock displacement
approach or a specific strategy based on known bottlenecks.

merlin


On Sat, Jun 2, 2012 at 1:48 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Buffer pins aren't a cache: with a cache you are trying to mask a slow
> operation (like a disk i/o) with a faster such that the amount of slow
> operations are minimized.  Buffer pins however are very different in
> that we only care about contention on the reference count (the buffer
> itself is not locked!) which makes me suspicious that caching type
> algorithms are the wrong place to be looking.  I think it comes to do
> picking between your relatively complex but general, lock displacement
> approach or a specific strategy based on known bottlenecks.

I agree that pins aren't like a cache. I mentioned the caching
algorithms because they work based on access frequency and highly
contended locks are likely to be accessed frequently even from a
single backend. However this only makes sense for the delayed
unpinning method, and I also have come to the conclusion that it's not
likely to work well. Besides delaying cleanup, the overhead for the
common case of uncontended access is just too much.

It seems to me that even the nailing approach will need a replacement
algorithm. The local pins still need to be published globally and
because shared memory size is fixed, the maximum amount of locally
pinned nailed buffers needs to be limited as well.

But anyway, I managed to completely misread the profile that Sergey
gave. Somehow I missed that the time went into the retry TAS in slock
instead of the inlined TAS. This shows that the issue isn't just
cacheline ping-pong but cacheline stealing. This could be somewhat
mitigated by making pinning lock-free. The Nb-GCLOCK paper that Robert
posted earlier in another thread describes an approach for this. I
have a WIP patch (attached) that makes the clock sweep lock-free in
the common case. This patch gave a 40% performance increase for an
extremely allocation heavy load running with 64 clients on a 4 core 1
socket system, lesser gains across the board. Pinning has a shorter
lock duration (and a different lock type) so the gain might be less,
or it might be a larger problem and post a higher gain. Either way, I
think the nailing approach should be explored further, cacheline
ping-pong could still be a problem with higher number of processors
and losing the spinlock also loses the ability to detect contention.
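
To give an idea of the shape of it, the heart of the lock-free sweep is a
CAS loop along these lines (a simplified sketch, not the actual patch; it
assumes the usage count lives in a plain int and uses the GCC __sync
builtin directly rather than a portability wrapper):

/* decrement usage_count without taking the buffer header spinlock */
static bool
sweep_decrement_usage(volatile int *usage_count)
{
    int     old;

    do
    {
        old = *usage_count;
        if (old == 0)
            return false;       /* usage already zero: eviction candidate */
    } while (__sync_val_compare_and_swap(usage_count, old, old - 1) != old);

    return true;                /* was in use; keep sweeping */
}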


Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

On Fri, Jun 1, 2012 at 9:55 PM, Ants Aasma <ants@cybertec.at> wrote:
> On Sat, Jun 2, 2012 at 1:48 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Buffer pins aren't a cache: with a cache you are trying to mask a slow
>> operation (like a disk i/o) with a faster such that the amount of slow
>> operations are minimized.  Buffer pins however are very different in
>> that we only care about contention on the reference count (the buffer
>> itself is not locked!) which makes me suspicious that caching type
>> algorithms are the wrong place to be looking.  I think it comes to do
>> picking between your relatively complex but general, lock displacement
>> approach or a specific strategy based on known bottlenecks.
>
> I agree that pins aren't like a cache. I mentioned the caching
> algorithms because they work based on access frequency and highly
> contended locks are likely to be accessed frequently even from a
> single backend. However this only makes sense for the delayed
> unpinning method, and I also have come to the conclusion that it's not
> likely to work well. Besides delaying cleanup, the overhead for the
> common case of uncontended access is just too much.
>
> It seems to me that even the nailing approach will need a replacement
> algorithm. The local pins still need to be published globally, and
> because shared memory size is fixed, the maximum number of locally
> pinned nailed buffers needs to be limited as well.
>
> But anyway, I managed to completely misread the profile that Sergey
> gave. Somehow I missed that the time went into the retry TAS in slock
> instead of the inlined TAS. This shows that the issue isn't just
> cacheline ping-pong but cacheline stealing. This could be somewhat
> mitigated by making pinning lock-free. The Nb-GCLOCK paper that Robert
> posted earlier in another thread describes an approach for this. I
> have a WIP patch (attached) that makes the clock sweep lock-free in
> the common case. This patch gave a 40% performance increase for an
> extremely allocation-heavy load running with 64 clients on a 4-core,
> 1-socket system, with lesser gains across the board. Pinning has a shorter
> lock duration (and a different lock type), so the gain might be less,
> or it might be a larger problem and show a higher gain. Either way, I
> think the nailing approach should be explored further: cacheline
> ping-pong could still be a problem with a higher number of processors,
> and losing the spinlock also loses the ability to detect contention.

Not sure about the rest of this patch, but this part is definitely bogus:

+#if !defined(pg_atomic_fetch_and_set)
+#define pg_atomic_fetch_and_set(dst, src, value) \
+       do { S_LOCK(&dummy_spinlock); \
+       dst = src; \
+       src = value; \
+       S_UNLOCK(&dummy_spinlock); } while (0)
+#endif

Locking a dummy backend-local spinlock doesn't provide atomicity
across multiple processes.
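
To be atomic across backends, the fallback would have to do the
exchange on the shared variable itself, something along these lines (a
sketch only, assuming GCC-style __sync builtins; note that on some
platforms __sync_lock_test_and_set is restricted to storing 1, so a
real fallback would need per-platform care):

/*
 * Sketch only: __sync_lock_test_and_set atomically stores 'value'
 * into 'src' and returns the previous contents, so no backend-local
 * lock is involved and the swap is visible to all processes mapping
 * the shared memory.
 */
#define pg_atomic_fetch_and_set(dst, src, value) \
        do { (dst) = __sync_lock_test_and_set(&(src), (value)); } while (0)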

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Mon, Jun 4, 2012 at 5:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Not sure about the rest of this patch, but this part is definitely bogus:
>
> +#if !defined(pg_atomic_fetch_and_set)
> +#define pg_atomic_fetch_and_set(dst, src, value) \
> +       do { S_LOCK(&dummy_spinlock); \
> +       dst = src; \
> +       src = value; \
> +       S_UNLOCK(&dummy_spinlock); } while (0)
> +#endif
>
> Locking a dummy backend-local spinlock doesn't provide atomicity
> across multiple processes.

Right, of course. I don't know why I assumed that dummy_spinlock would
be global. In any case, this is very WIP and doesn't even aspire to be
portable yet. The main point was to see if there's any significant
performance to be gained by this method.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


On Mon, Jun 4, 2012 at 9:20 AM, Ants Aasma <ants@cybertec.at> wrote:
> On Mon, Jun 4, 2012 at 5:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Not sure about the rest of this patch, but this part is definitely bogus:
>>
>> +#if !defined(pg_atomic_fetch_and_set)
>> +#define pg_atomic_fetch_and_set(dst, src, value) \
>> +       do { S_LOCK(&dummy_spinlock); \
>> +       dst = src; \
>> +       src = value; \
>> +       S_UNLOCK(&dummy_spinlock); } while (0)
>> +#endif
>>
>> Locking a dummy backend-local spinlock doesn't provide atomicity
>> across multiple processes.
>
> Right, of course. I don't know why I assumed that dummy_spinlock would
> be global. In any case, this is very WIP and doesn't even aspire to be
> portable yet. The main point was to see if there's any significant
> performance to be gained by this method.

yeah -- those are fallback routines in case the compiler primitives
don't exist.  I think I understand what Ants is doing here: he's
reducing the coverage of the free list lock to only cover actually
popping a buffer off the free list; it no longer covers the clock
sweep.  That's a massive win if it works.  In order to get away with
that he had to decompose all of the clock sweep's manipulations of the
Strategy into thread-safe atomic operations.
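
Roughly, I imagine the sweep ends up shaped something like this (a
sketch with made-up names, assuming GCC-style builtins; not the patch's
actual code): the clock hand becomes a shared counter advanced with an
atomic fetch-and-add, and only the victim-candidacy check still happens
under the buffer header lock.

/*
 * Sketch only: advance the clock hand with an atomic fetch-and-add so
 * concurrent backends can run the sweep without holding the freelist
 * lock.  Whether the chosen buffer is actually usable as a victim
 * (refcount == 0, usage_count == 0) still has to be checked under the
 * buffer header lock, as today.
 */
static volatile unsigned int clock_hand = 0;    /* would live in shared memory */

static int
clock_sweep_next(int nbuffers)
{
    unsigned int tick = __sync_fetch_and_add(&clock_hand, 1);

    return (int) (tick % nbuffers);
}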

What happens (in the very unlikely, but possible, case) if another
backend races to the buffer you've pointed at with 'victim'?  It looks
like multiple backends share the clock sweep now, but don't you need
an extra test to ensure it's still a candidate victim buffer?

merlin


On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> What happens (in the very unlikely, but possible, case) if another
> backend races to the buffer you've pointed at with 'victim'?  It looks
> like multiple backends share the clock sweep now, but don't you need
> an extra test to ensure it's still a candidate victim buffer?

Actually, I don't think you do: the existing check on refcount is
probably good enough.  Hm, why did you get rid of
BufferStrategyControl.lastFreeBuffer?

merlin


On Mon, Jun 4, 2012 at 6:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> What happens (in the very unlikely, but possible, case) if another
>> backend races to the buffer you've pointed at with 'victim'?  It looks
>> like multiple backends share the clock sweep now, but don't you need
>> an extra test to ensure it's still a candidate victim buffer?
>
> Actually, I don't think you do: the existing check on refcount is
> probably good enough.  Hm, why did you get rid of
> BufferStrategyControl.lastFreeBuffer?

It was dead code as far as I could tell. That change isn't actually
relevant for this patch because free-list management is still
protected by a lock (except for the initial unlocked test, which is
double-checked under the lock), and so doesn't need any adjustment.
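
For illustration, the double-checked access is roughly this shape (a
self-contained sketch with made-up names, using a pthread mutex to
stand in for BufFreelistLock; this is not the patch's code):

#include <pthread.h>

/*
 * Sketch only: the free list is read without the lock first as a cheap
 * hint, and the test is repeated under the lock before popping, since
 * another backend may have emptied the list in the meantime.
 */
typedef struct FreeListControl
{
    pthread_mutex_t lock;           /* stands in for BufFreelistLock */
    int             firstFreeBuffer;/* -1 means the list is empty */
    int             next[1024];     /* next-free links, indexed by buffer id */
} FreeListControl;

static int
pop_free_buffer(FreeListControl *ctl)
{
    int         buf = -1;

    if (ctl->firstFreeBuffer >= 0)          /* unlocked hint */
    {
        pthread_mutex_lock(&ctl->lock);
        if (ctl->firstFreeBuffer >= 0)      /* re-check under the lock */
        {
            buf = ctl->firstFreeBuffer;
            ctl->firstFreeBuffer = ctl->next[buf];
        }
        pthread_mutex_unlock(&ctl->lock);
    }
    return buf;
}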

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


On Mon, Jun 4, 2012 at 10:42 AM, Ants Aasma <ants@cybertec.at> wrote:
> On Mon, Jun 4, 2012 at 6:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> What happens (in the very unlikely, but possible, case) if another
>>> backend races to the buffer you've pointed at with 'victim'?  It looks
>>> like multiple backends share the clock sweep now, but don't you need
>>> an extra test to ensure it's still a candidate victim buffer?
>>
>> Actually, I don't think you do: the existing check on refcount is
>> probably good enough.  Hm, why did you get rid of
>> BufferStrategyControl.lastFreeBuffer?
>
> It was dead code as far as I could tell. That change isn't actually
> relevant for this patch because free-list management is still
> protected by a lock (except for the initial unlocked test, which is
> double-checked under the lock), and so doesn't need any adjustment.

I have to admit -- this is pretty cool.  The changes look pretty clean,
but I think it's going to need some benchmarking to prove it's
actually faster under all workloads, and some more review. If
there's heavy contention on getting victim buffers, you're going to be
making a lot of atomic operations -- each one of which will lock down
the cache line.  A yielding lwlock is obviously going to have very
different behavior under contention, so there needs to be a
performance test that really burns up the freelist.

If it's Well And Truly faster, you have to immediately start wondering
whether the freelist lock can't be completely scrapped: buffer invalidation
only occurs in fairly rare events -- but when it does, ISTM it's
fairly counterproductive.  In trying to avoid the sweep you end up
pushing everyone towards a lock-heavy free list.  I don't know about you,
but I'd prefer to have more consistent behavior (as in, not "drop big
table, concurrency drops 50%").

Noted, this doesn't really address the OP's problem :-).  I bet it'd
supplement the nailing strategy nicely, however.

merlin


Hi,

On Mon, 4 Jun 2012, Ants Aasma wrote:
> On Mon, Jun 4, 2012 at 7:44 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I tried to keep it simple at first to find an answer to the question
> of whether it's even worth trying before expending large effort on it. If
> anyone with a multisocket machine would chip in, I'd love to know how
> this patch handles on larger machines. I think the most interesting
> workloads are read-only loads with heavy buffer thrashing but inside OS
> memory. Select-only pgbench with 32MB shared_buffers was within error
> margin, although slightly faster on my machine (Intel i2500K). The
> workload that I used to demonstrate gain was an aggregated index scan
> to minimise other overheads.

I've quickly tested your lockfree-getbuffer.patch with
the test case you provided and I barely see any improvement (2% at most):
https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U
tested with 24 cores (48 HT cores, Xeon E7-4807).
Although the tps vs number of threads looks weird....

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> I've quickly tested your lockfree-getbuffer.patch with the test case
> you provided and I barely see any improvement (2% at most):
> https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U
> tested with 24 cores (48 HT cores, Xeon E7-4807).
> Although the tps vs number of threads looks weird....

Was this the range scan on the test table? (sorry about the error in
the query, the x should really be id) In that case the results look
really suspicious. My machine (4 cores, no ht, @ 4GHz, newer arch)
peaked at 90tps with the stated configuration. Even when upping the
shared_buffers and enabling indexonlyscan I didn't see more than about
540tps per thread. The test is designed to exercise buffer eviction,
doing about 9800 buffer reads per transaction with 32MB of buffers.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


On Wed, 6 Jun 2012, Ants Aasma wrote:

> On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> I've quickly tested your lockfree-getbuffer.patch with the test case
>> you provided and I barely see any improvement (2% at most):
>> https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U
>> tested with 24 cores (48 HT cores, Xeon E7-4807).
>> Although the tps vs number of threads looks weird....
>
> Was this the range scan on the test table? (sorry about the error in
> the query, the x should really be id) In that case the results look
> really suspicious.

Yes, partially my fault, because without much thought I've put "value"
instead of "x" in the script. Now, after replacing it with "id", the tps are
much smaller.

Here is the tps vs nthreads; I tested up to 10 threads on my 24-CPU
system (I disabled HT, though):
https://docs.google.com/open?id=0B7koR68V2nM1Nk9OcWNJOTRrYVE

Your patch clearly improves the situation (the peak tps is ~10% higher),
but the general picture is the same: a flattening of tps vs nthreads.

Cheers,    S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Wed, Jun 6, 2012 at 2:53 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Wed, 6 Jun 2012, Ants Aasma wrote:
>
>> On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov <koposov@ast.cam.ac.uk>
>> wrote:
>>>
>>> I've quickly tested your lockfree-getbuffer.patch with the test
>>> case
>>> you provided and I barely see any improvement (2% at most):
>>> https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U
>>> tested with 24 cores (48 HT cores, Xeon E7-4807).
>>> Although the tps vs number of threads looks weird....
>>
>>
>> Was this the range scan on the test table? (sorry about the error in
>> the query, the x should really be id) In that case the results look
>> really suspicious.
>
>
> Yes, partially my fault, because without much thought I've put "value"
> instead of "x" in the script. Now, after replacing it with "id", the tps are
> much smaller.
>
> Here is the tps vs nthreads; I tested up to 10 threads on my 24-CPU system
> (I disabled HT, though):
> https://docs.google.com/open?id=0B7koR68V2nM1Nk9OcWNJOTRrYVE
>
> Your patch clearly improves the situation (the peak tps is ~10% higher), but
> the general picture is the same: a flattening of tps vs nthreads.

I think this is the expected result.  In the single-user case the
spinlock never spins and only has to issue the cache-line-locking CPU
instructions once.  Can we see results @ 24 threads?

merlin


On Wed, 6 Jun 2012, Merlin Moncure wrote:
> I think this is the expected result.  In the single-user case the
> spinlock never spins and only has to issue the cache-line-locking CPU
> instructions once.  Can we see results @ 24 threads?

Here 
https://docs.google.com/open?id=0B7koR68V2nM1NDJHLUhNSS0zbUk

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


On Wed, Jun 6, 2012 at 11:42 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Wed, 6 Jun 2012, Merlin Moncure wrote:
>>
>> I think this is the expected result.  In the single-user case the
>> spinlock never spins and only has to issue the cache-line-locking CPU
>> instructions once.  Can we see results @ 24 threads?
>
>
> Here https://docs.google.com/open?id=0B7koR68V2nM1NDJHLUhNSS0zbUk

Thank you for testing, really interesting results. It looks like this
workload really isn't bound by BufFreelistLock, despite my best
efforts. I did some perf analysis on my machine.

The first thing I noticed was that the profile was dominated by copying
buffers into PostgreSQL shared_buffers. Seems obvious after the fact. To
see BufFreelistLock contention there needs to be some noticeable work
under that lock besides it being frequently acquired. I bumped
shared_buffers to 512M and the table size to 30M rows to compensate,
making the dataset 4.7G, with 640M of index.

Looking at instruction-level annotation of StrategyGetBuffer I saw
about 1/3 of the time taken up by the numBufferAllocs and bgwriterLatch
atomic changes. So I added a lock-free test for bgwriterLatch, and used
a local var for allocation counting, updating the shared var every 16
allocations, making the common case do only one locked instruction. I
added padding to separate frequently and less frequently modified vars
onto different cache lines. This should be good enough for performance
testing purposes; a proper solution should at the very least handle
flushing the buffer allocation count on backend termination.
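
To illustrate the batched counting (hypothetical names, assuming
GCC-style builtins; a sketch of the idea rather than the attached
patch):

/*
 * Sketch only: count buffer allocations in a backend-local variable
 * and fold them into the shared counter every 16th allocation, so the
 * common path touches no shared cache line.  A real version would also
 * flush the local remainder at backend exit.
 */
#define ALLOC_FLUSH_EVERY 16

static volatile unsigned int sharedNumBufferAllocs = 0; /* would live in shared memory */
static int      localNumBufferAllocs = 0;               /* backend-local */

static void
count_buffer_alloc(void)
{
    if (++localNumBufferAllocs >= ALLOC_FLUSH_EVERY)
    {
        __sync_fetch_and_add(&sharedNumBufferAllocs, localNumBufferAllocs);
        localNumBufferAllocs = 0;
    }
}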

Attached are perf reports for master and the patched version running
the enlarged workload at -c 4. BufFreelistLock doesn't appear under
LWLockAcquire anymore, but all the overhead shifted to
StrategyGetBuffer. Basically it's the locked op overhead that is being
moved around. Some other observations:
* While there isn't any noticeable contention, the locking overhead is
still huge; looking at instruction-level timings, it seems to be just
the cost of getting the lock cache line in exclusive mode.
* After patching, StrategyGetBuffer is 59% buffer header lock acquisition.
* All of the top 11 items in the profile are dominated by one or two
cache misses. Specifically, navigating collision chains for
hash_search_with_hash_value, the lock for LWLockAcquire,
LWLockRelease, buffer header for StrategyGetBuffer, ReadBuffer_common,
PinBuffer, UnpinBuffer, TerminateBufferIO, PrivateRefCount for
PinBuffer, PinBuffer_locked, UnpinBuffer, heap tuple for
heap_hot_search_buffer.

Based on this information, it looks like modifying anything in the
buffer header can be a pretty heavy operation. Making heavily accessed
buffers read-mostly (e.g. by nailing them) could give a healthy boost to
performance. Atomic ops probably won't help significantly to alleviate
the cost because the cache lines need to bounce around anyway, except
maybe in extreme contention cases where spinning causes stolen cache
lines.

I also attached a patch with the most recent changes; it should at least
fix the regression for the single-client case and maybe improve scaling. This
patch doesn't seem useful until the other sources of contention are
alleviated. I won't pursue getting this into a committable state
unless someone can show a workload where the clock sweep is an actual
bottleneck.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
