Thread: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
Hi,

I've been running some tests on pg 9.2beta1 and in particular a set of queries like

create table _tmp0 as select * from (
        select *, (select healpixid from idt_match as m where m.transitid=o.transitid) as x
        from idt_photoobservation as o offset 0
) as y where x%16=ZZZ order by x;

(where ZZZ is some integer number, 0<=ZZZ<16)

With the following plan:
----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3228814504.96..3228815137.21 rows=252902 width=498)
   Sort Key: y.x
   ->  Subquery Scan on y  (cost=0.00..3228791809.25 rows=252902 width=498)
         Filter: ((y.x % 16::bigint) = 0)
         ->  Limit  (cost=0.00..3228033102.41 rows=50580456 width=490)
               ->  Seq Scan on idt_photoobservation o  (cost=0.00..3228033102.41 rows=50580456 width=490)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8)
                             Index Cond: (transitid = o.transitid)

The schema of the tables are:

e2es2=> \d idt_match
     Table "public.idt_match"
  Column   |   Type   | Modifiers
-----------+----------+-----------
 sourceid  | bigint   |
 transitid | bigint   |
 healpixid | bigint   |
 flagsxm   | smallint |
Indexes:
    "idt_match_idx" btree (healpixid)
    "idt_match_transitid_idx" btree (transitid)

     Table "public.idt_photoobservation"
  Column   |   Type   | Modifiers
-----------+----------+-----------
 transitid | bigint   |
 fluxbp0   | real     |
 .... more columns ....
Indexes:
    "idt_photoobservation_idx" btree (transitid)

And I noticed that when I run the query like the one shown above in parallel (in multiple connections for ZZZ=0...8) the performance of each query drops down significantly (a factor of 2), despite the fact that during the execution of the query postgres is mostly CPU bound.

So I tried to oprofile it and strace it, and that's what I saw:

The strace -c of each process shows something like this:
#########
Process 18660 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 76.25    0.001342           0    268987           semop
 23.75    0.000418           0     61694           read
  0.00    0.000000           0       138           lseek
  0.00    0.000000           0       355           select
  0.00    0.000000           0         3           kill
------ ----------- ----------- --------- --------- ----------------
100.00    0.001760                331177           total
#######

And the oprofile shows this on top:
-------------------------------------------------------------------------------
2863981  25.7117  ReleasePredicateLocks
  2863981  100.000  ReleasePredicateLocks [self]
-------------------------------------------------------------------------------
1246629  11.1917  LocalBufferAlloc
  1246629  100.000  LocalBufferAlloc [self]
-------------------------------------------------------------------------------
1135393  10.1931  CheckForSerializableConflictIn
  1135393  100.000  CheckForSerializableConflictIn [self]
------------------------------------------------------------

So there is a lot of locking for some reason, and I was surprised to see anything related to SSI in the profile at all, because I'm not running serializable transactions (I was just running my queries from multiple psql sessions).

Is there a reasonable explanation for what I'm seeing?

Thanks in advance,
Sergey

More info: there is no other concurrent activity on the machine. shared_buffers is 10G, and the machine has ~ 60G of RAM, 24 cores and proper RAID. Another note is that the oprofile and strace were obtained during the stage when the idt_photo.. table was scanned and data was being stored in pgsql_tmp (not during the last bit, when the data in pgsql_tmp were actually being sorted).
*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Thu, May 24, 2012 at 8:24 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> [... query, plan, schemas and profiles snipped -- see the original message above ...]
>
> So there is a lot of locking for some reason, and I was surprised to see
> anything related to SSI in the profile at all, because I'm not running
> serializable transactions (I was just running my queries from multiple psql
> sessions).

Are you sure?
I looked at all the ReleasePredicateLocks calls and they appear to be guarded by:

    /* Nothing to do if this is not a serializable transaction */
    if (MySerializableXact == InvalidSerializableXact)
        return false;

What's the default isolation mode set to?

merlin
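A quick way to check that setting, as a minimal sketch:

    SHOW default_transaction_isolation;
    -- or, equivalently:
    SELECT current_setting('default_transaction_isolation');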
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Thu, 24 May 2012, Merlin Moncure wrote:
> Are you sure? I looked at all the ReleasePredicateLocks calls and
> they appear to be guarded by:
>
>     /* Nothing to do if this is not a serializable transaction */
>     if (MySerializableXact == InvalidSerializableXact)
>         return false;
>
> What's the default isolation mode set to?

Sorry, it was indeed my mistake. I was pointing opreport at a binary which had been recompiled while the old version was still running, so I guess oprofile was picking up the wrong function names.

Here is the correct oprofile, when multiple queries are running:

samples  %        symbol name
-------------------------------------------------------------------------------
952457   13.8715  LWLockAcquire
  952457   100.000  LWLockAcquire [self]
-------------------------------------------------------------------------------
779077   11.3464  PinBuffer
  779077   100.000  PinBuffer [self]
-------------------------------------------------------------------------------
759898   11.0671  s_lock
  759898   100.000  s_lock [self]
-------------------------------------------------------------------------------
689753    6.7178  slot_deform_tuple
  689753   100.000  slot_deform_tuple [self]
-------------------------------------------------------------------------------
526002    5.1230  UnpinBuffer
  526002   100.000  UnpinBuffer [self]

When only one is running:

samples  %        symbol name
-------------------------------------------------------------------------------
163268   14.0343  slot_deform_tuple
  163268   100.000  slot_deform_tuple [self]
-------------------------------------------------------------------------------
126018   10.8324  _bt_compare
  126018   100.000  _bt_compare [self]
-------------------------------------------------------------------------------
113975    9.7972  ExecProject
  113975   100.000  ExecProject [self]
-------------------------------------------------------------------------------
49760     4.2773  FunctionCall2Coll
  49760    100.000  FunctionCall2Coll [self]
-------------------------------------------------------------------------------
49164     4.2261  LWLockAcquire
  49164    100.000  LWLockAcquire [self]
-------------------------------------------------------------------------------
43526     3.7414  hash_search_with_hash_value
  43526    100.000  hash_search_with_hash_value [self]
############

I guess there is nothing catastrophically wrong with that, but still I'm very surprised that you get severe locking problems (a factor-of-two slow-down) when running parallel read-only transactions.

Sergey

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Thu, May 24, 2012 at 6:24 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Hi,
>
> I've been running some tests on pg 9.2beta1 and in particular a set
> of queries like
...
> And I noticed that when I run the query like the one shown above in parallel
> (in multiple connections for ZZZ=0...8) the performance of each query drops
> down significantly (a factor of 2), despite the fact that during the execution
> of the query postgres is mostly CPU bound.

Is this a regression from previous versions (9.0, 9.1), or has the slowdown always been there?

Cheers,

Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 24, 2012 at 1:42 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> I guess there is nothing catastrophically wrong with that, but still I'm
> very surprised that you get severe locking problems (a factor-of-two
> slow-down) when running parallel read-only transactions.

Me, too. How many concurrent connections are you running, and does your working set exceed shared_buffers? Can you provide a self-contained reproducible test case?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Thu, 24 May 2012, Robert Haas wrote:
> Me, too. How many concurrent connections are you running, and does
> your working set exceed shared_buffers? Can you provide a
> self-contained reproducible test case?

The last tests I've been doing were with 8 connections. And the working set is roughly 30 Gig, which is ~ 3x the shared buffers (but ~ 50% of RAM).

Regarding the test case, I'll try to see whether I can still observe the same slowing down if I chop the main table down by a factor of a few, so I can put the data somewhere for download.

S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 24, 2012 at 2:19 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> The last tests I've been doing were with 8 connections.
> And the working set is roughly 30 Gig, which is ~ 3x the shared buffers
> (but ~ 50% of RAM).

Given that additional information, I would say these results are expected. Unfortunately, our BufFreelistLock is a serious contention point, and I think that's what you're hitting. See the graph here:

http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

As you can see, raw performance isn't much worse with the larger data sets, but scalability at high connection counts is severely degraded once the working set no longer fits in shared_buffers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
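A rough way to compare the working set against shared_buffers, as a sketch only (the relation names are the ones from the schema earlier in the thread):

    SELECT pg_size_pretty(pg_total_relation_size('idt_photoobservation')) AS photoobs_size,
           pg_size_pretty(pg_total_relation_size('idt_match'))            AS match_size,
           current_setting('shared_buffers')                              AS shared_buffers;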
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Thu, May 24, 2012 at 1:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Given that additional information, I would say these results are
> expected. Unfortunately, our BufFreelistLock is a serious contention
> point, and I think that's what you're hitting. See the graph here:
>
> http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html
>
> As you can see, raw performance isn't much worse with the larger data
> sets, but scalability at high connection counts is severely degraded
> once the working set no longer fits in shared_buffers.

Hm, wouldn't the BufFreelistLock issue be ameliorated if StrategyGetBuffer could reserve multiple buffers, so that you'd draw down your local list and only then go back to the global pool? (Easier said than done, obviously.)

merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Thu, May 24, 2012 at 2:24 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Hm, wouldn't the BufFreelistLock issue be ameliorated if
> StrategyGetBuffer could reserve multiple buffers, so that you'd draw
> down your local list and only then go back to the global pool? (Easier
> said than done, obviously.)

Hm, looking at the code some more, it looks like the whole point of the strategy system is to do that. ISTM bulk insert type queries would be good candidates for a buffer strategy somehow?

merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Thu, 24 May 2012, Robert Haas wrote:
> As you can see, raw performance isn't much worse with the larger data
> sets, but scalability at high connection counts is severely degraded
> once the working set no longer fits in shared_buffers.

Actually the problem persists even when I trim the dataset size to fit within shared_buffers. Here is the dump (0.5 gig in size, tested with shared_buffers=10G, work_mem=500MB):

http://www.ast.cam.ac.uk/~koposov/files/dump.gz

And I attach the script.

For my toy dataset the run time of a single thread goes from ~6.4 seconds to 18 seconds (~ 3 times worse) when the queries run in parallel. And actually, while running the script repeatedly on my main machine, for some reason I saw some variation in how much slower the threaded execution is than a single thread; now I see 25 seconds for the multi-threaded run vs the same ~ 6 seconds for a single thread.

The oprofile shows:

782355   21.5269  s_lock
  782355   100.000  s_lock [self]
-------------------------------------------------------------------------------
709801   19.5305  PinBuffer
  709801   100.000  PinBuffer [self]
-------------------------------------------------------------------------------
326457    8.9826  LWLockAcquire
  326457   100.000  LWLockAcquire [self]
-------------------------------------------------------------------------------
309437    8.5143  UnpinBuffer
  309437   100.000  UnpinBuffer [self]
-------------------------------------------------------------------------------
252972    6.9606  ReadBuffer_common
  252972   100.000  ReadBuffer_common [self]
-------------------------------------------------------------------------------
201558    5.5460  LockBuffer
  201558   100.000  LockBuffer [self]
------------------------------------------------------------

It is interesting that on another machine with much smaller shared memory (3G), smaller RAM (12G), a smaller number of CPUs, and PG 9.1 running, I was getting consistently ~ 7.2 vs 4.5 sec (for multi vs single thread).

PS: Just in case, the CPU on the main machine I'm testing is a Xeon(R) CPU E7-4807 (the total number of real cores is 24).

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Thu, May 24, 2012 at 12:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> hm, looking at the code some more, it looks like the whole point of
> the strategy system is to do that.

I thought you were suggesting that StrategyGetBuffer would pre-allocate multiple buffers to a backend under the cover of a single BufFreelistLock. If that is what you were suggesting, that is not what the strategy system is currently for. It is for locally reusing buffers, not for gang-allocating them.

If a backend could somehow predict that the buffer it is about to read in is likely going to be a cold buffer, perhaps it would make sense for each backend to maintain a small ring of its own which it can reuse for such cold buffers.

> ISTM bulk insert type queries
> would be good candidates for a buffer strategy somehow?

Probably. There is a code or README comment to that effect that I stumbled upon just a couple hours ago, but can't immediately re-find it.

Cheers,

Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 24, 2012 at 3:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> hm, looking at the code some more, it looks like the whole point of
> the strategy system is to do that. ISTM bulk insert type queries
> would be good candidates for a buffer strategy somehow?

Yep.

commit 85e2cedf985bfecaf43a18ca17433070f439fb0e
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Nov 6 20:51:15 2008 +0000

    Improve bulk-insert performance by keeping the current target buffer pinned
    (but not locked, as that would risk deadlocks).  Also, make it work in a small
    ring of buffers to avoid having bulk inserts trash the whole buffer arena.

    Robert Haas, after an idea of Simon Riggs'.

But that doesn't help here, since pgbench is only looking up one row per query.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Thu, May 24, 2012 at 3:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> But that doesn't help here, since pgbench is only looking up one row per query.

Wait -- the OP's gripe isn't regarding standard pgbench, but multiple large concurrent 'insert into foo select...'. I looked in the code and it appears that the only operations using a bulk insert strategy are copy, create table as select, and table rewrite operations. Concurrent INSERT ... SELECT apparently doesn't get the benefit of a strategy and should be fighting over the freelist once the pool exhausts.

We don't get to skip WAL of course, but we should be able to use a bulk insert strategy, especially if there were some way of predicting that a large number of tuples were going to be inserted. I'm wondering, though, if contention on the freelist is in fact the OP's problem.

merlin
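To make the distinction concrete, a minimal sketch (the _tmp target names are illustrative; per the discussion above, in 9.2 only the first form gets a bulk-insert buffer strategy):

    -- CREATE TABLE AS uses a bulk-insert (ring buffer) strategy:
    CREATE TABLE _tmp_ctas AS
        SELECT * FROM idt_photoobservation;

    -- A plain INSERT ... SELECT does not, so each backend competes for
    -- buffers through the shared freelist:
    CREATE TABLE _tmp_ins (LIKE idt_photoobservation);
    INSERT INTO _tmp_ins
        SELECT * FROM idt_photoobservation;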
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 24, 2012 at 4:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Wait -- the OP's gripe isn't regarding standard pgbench, but multiple
> large concurrent 'insert into foo select...'. I looked in the code
> and it appears that the only operations using a bulk insert strategy are
> copy, create table as select, and table rewrite operations. Concurrent
> INSERT ... SELECT apparently doesn't get the benefit of a strategy and
> should be fighting over the freelist once the pool exhausts.

I think you are right.

> We don't get to skip WAL of course, but we should be able to use a
> bulk insert strategy, especially if there were some way of predicting
> that a large number of tuples were going to be inserted. I'm
> wondering, though, if contention on the freelist is in fact the OP's
> problem.

Not sure. It might be some other LWLock, but it's hard to tell which one from the information provided.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, May 24, 2012 at 4:46 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> We don't get to skip WAL of course, but we should be able to use a
>> bulk insert strategy, especially if there were some way of predicting
>> that a large number of tuples were going to be inserted. I'm
>> wondering, though, if contention on the freelist is in fact the OP's
>> problem.

> Not sure. It might be some other LWLock, but it's hard to tell which
> one from the information provided.

Yeah. It seems quite plausible that Robert's select-only benchmark might be mainly tripping over the freelist lock, but I'm less convinced about something that's doing INSERT/SELECT, and therefore is also doing a lot of WAL activity, index insertions, etc. I'd want to see some instrumentation results before assuming we know where the bottleneck is there.

regards, tom lane
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
Hi,

On Thu, 24 May 2012, Robert Haas wrote:
> Not sure. It might be some other LWLock, but it's hard to tell which
> one from the information provided.

If you could tell me the best way to find out the info that you need, then I could run it reasonably quickly.

S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Thu, May 24, 2012 at 3:36 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> If you could tell me the best way to find out the info that you need,
> then I could run it reasonably quickly.

Add

    #define LWLOCK_STATS

near the top of:

    src/backend/storage/lmgr/lwlock.c

and recompile and run a reduced-size workload. When the processes exit, they will dump a lot of data about LWLock usage to the logfile. Generally the LWLock with the most blocks on it will be the main culprit.

Cheers,

Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Thu, 24 May 2012, Jeff Janes wrote:
> Add
>     #define LWLOCK_STATS
> near the top of:
>     src/backend/storage/lmgr/lwlock.c
>
> and recompile and run a reduced-size workload. When the processes
> exit, they will dump a lot of data about LWLock usage to the logfile.
> Generally the LWLock with the most blocks on it will be the main
> culprit.

Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted by blk. Not sure whether that's of much use or not:

PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871
PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838
PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774
PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702
PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665
PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651
PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602
PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546
PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511
PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466
PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449
PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401
PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397
PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387
PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378
PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321
PID 7114 lwlock 42: shacq 232591 exacq 935 blk 318
PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306
PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271
PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262
PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255
PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240
PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237
PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231
PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230
PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226
PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226
PID 7114 lwlock 0: shacq 0 exacq 14864 blk 219
PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215
PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215
PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214
PID 7111 lwlock 0: shacq 0 exacq 13638 blk 205
PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200
PID 7114 lwlock 7: shacq 0 exacq 547 blk 193
PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192
PID 7110 lwlock 0: shacq 0 exacq 16862 blk 191
PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183
PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183
PID 7113 lwlock 7: shacq 0 exacq 548 blk 178
PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178
PID 7115 lwlock 7: shacq 0 exacq 549 blk 177
PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177
PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177
PID 7112 lwlock 0: shacq 0 exacq 19538 blk 172
PID 7111 lwlock 7: shacq 0 exacq 549 blk 172
PID 7115 lwlock 17: shacq 0 exacq 9927 blk 167
PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167
PID 7112 lwlock 7: shacq 0 exacq 548 blk 166
PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165
PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163
PID 7110 lwlock 7: shacq 0 exacq 548 blk 162
PID 7109 lwlock 7: shacq 0 exacq 548 blk 161
PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158
PID 7112 lwlock 45: shacq 130843 exacq 1201 blk 154
PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153
PID 7111 lwlock 38: shacq 333306 exacq 882 blk 152
PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151
PID 7108 lwlock 17: shacq 0 exacq 23165 blk 147
PID 7108 lwlock 7: shacq 0 exacq 549 blk 145
PID 7113 lwlock 0: shacq 0 exacq 13394 blk 142
PID 7112 lwlock 46: shacq 132972 exacq 1196 blk 142
PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140
PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140
PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139
PID 7114 lwlock 45: shacq 130818 exacq 903 blk 139
PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138
PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137
PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126
PID 7115 lwlock 0: shacq 0 exacq 3161 blk 125
PID 7110 lwlock 40: shacq 129577 exacq 1021 blk 122
PID 7113 lwlock 45: shacq 130833 exacq 814 blk 120
PID 7111 lwlock 17: shacq 0 exacq 4607 blk 119
PID 7109 lwlock 0: shacq 0 exacq 5711 blk 119
PID 7110 lwlock 45: shacq 130865 exacq 1001 blk 117
PID 7109 lwlock 38: shacq 333340 exacq 322 blk 115
PID 7114 lwlock 46: shacq 132997 exacq 860 blk 112
PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112
PID 7112 lwlock 40: shacq 129528 exacq 1234 blk 111
PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109
PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109
PID 7112 lwlock 41: shacq 133600 exacq 1136 blk 109
PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108
PID 7108 lwlock 0: shacq 0 exacq 6504 blk 108
PID 7115 lwlock 38: shacq 333512 exacq 581 blk 106
PID 7110 lwlock 36: shacq 204546 exacq 1014 blk 106
PID 7113 lwlock 37: shacq 127040 exacq 790 blk 105
PID 7111 lwlock 45: shacq 130893 exacq 863 blk 103
PID 7114 lwlock 35: shacq 135342 exacq 879 blk 102
PID 7113 lwlock 40: shacq 129560 exacq 853 blk 101
PID 7110 lwlock 41: shacq 133697 exacq 1063 blk 101
PID 7112 lwlock 35: shacq 135257 exacq 1229 blk 98
PID 7110 lwlock 46: shacq 132990 exacq 1059 blk 98
PID 7112 lwlock 37: shacq 127029 exacq 1229 blk 97
PID 7114 lwlock 40: shacq 129597 exacq 913 blk 96
PID 7110 lwlock 17: shacq 0 exacq 2655 blk 96
PID 7109 lwlock 46: shacq 133014 exacq 326 blk 96
PID 7111 lwlock 40: shacq 129720 exacq 890 blk 95
PID 7111 lwlock 46: shacq 133057 exacq 876 blk 94
PID 7114 lwlock 33: shacq 127651 exacq 921 blk 92
PID 7112 lwlock 39: shacq 127288 exacq 1176 blk 91
PID 7111 lwlock 39: shacq 127438 exacq 891 blk 91
PID 7114 lwlock 37: shacq 127038 exacq 894 blk 89
PID 7113 lwlock 33: shacq 127602 exacq 803 blk 89
PID 7110 lwlock 37: shacq 127038 exacq 1085 blk 89
PID 7110 lwlock 35: shacq 135333 exacq 1016 blk 88
PID 7113 lwlock 46: shacq 132954 exacq 833 blk 87
PID 7111 lwlock 41: shacq 133778 exacq 821 blk 87
PID 7115 lwlock 36: shacq 204845 exacq 535 blk 86
PID 7112 lwlock 33: shacq 127551 exacq 1149 blk 86
PID 7109 lwlock 17: shacq 0 exacq 3463 blk 86
PID 7110 lwlock 910: shacq 91 exacq 91 blk 84
PID 7110 lwlock 39: shacq 127396 exacq 1068 blk 84
PID 7109 lwlock 36: shacq 204581 exacq 363 blk 83
PID 7113 lwlock 41: shacq 133648 exacq 765 blk 82
PID 7114 lwlock 41: shacq 133757 exacq 897 blk 79
PID 7114 lwlock 39: shacq 127429 exacq 923 blk 79

S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Thu, May 24, 2012 at 6:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> On Thu, 24 May 2012, Jeff Janes wrote:
>> Add
>>     #define LWLOCK_STATS
>> near the top of:
>>     src/backend/storage/lmgr/lwlock.c
>
> Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
> by blk. Not sure whether that's of much use or not:
>
> [LWLock stats snipped -- see above]

These are all on the buffer partition locks. That makes sense... I was wrong earlier: this case was in fact 'create table as', not 'insert select', which rules out both the freelist lock and the wal insert lock, because create table as gets to use both a bulk insert strategy and wal-avoiding logic (assuming the wal level is not 'archive' or higher). So, why aren't the lock partitions helping here?

merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Fri, 25 May 2012, Merlin Moncure wrote:
> These are all on the buffer partition locks. That makes sense... I was
> wrong earlier: this case was in fact 'create table as', not 'insert
> select', which rules out both the freelist lock and the wal insert lock,
> because create table as gets to use both a bulk insert strategy and
> wal-avoiding logic (assuming the wal level is not 'archive' or higher).
> So, why aren't the lock partitions helping here?

FYI the WAL level is default, and here are the relevant changed params from the .conf (autovacuum was disabled but the data was vacuumed manually):

max_connections = 100            # (change requires restart)
shared_buffers = 10GB            # min 128kB
work_mem = 500MB                 # min 64kB
maintenance_work_mem = 500MB     # min 1MB
synchronous_commit = off         # synchronization level;
checkpoint_segments = 20         # in logfile segments, min 1, 16MB each
enable_hashjoin = off
enable_mergejoin = off
effective_cache_size = 10GB
autovacuum = off                 # Enable autovacuum subprocess?  'on'

Regards,
S

*****************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Fri, May 25, 2012 at 8:06 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, May 24, 2012 at 6:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
>> Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
>> by blk. Not sure whether that's of much use or not:
>>
>> [LWLock stats snipped -- see above]
>
> These are all on the buffer partition locks. That makes sense... I was
> wrong earlier: this case was in fact 'create table as', not 'insert
> select', which rules out both the freelist lock and the wal insert lock,
> because create table as gets to use both a bulk insert strategy and
> wal-avoiding logic (assuming the wal level is not 'archive' or higher).
> So, why aren't the lock partitions helping here?

I think what's happening here is that the buffer partitions don't help (in fact, they hurt) in the presence of multiple concurrent scans that are operating on approximately the same data. Sooner or later the scans line up on each other and start binding when reassigning lock tags (which takes out up to two ordered exclusive lwlocks). This is on the read side, so the buffer access strategy is zero help (I confirmed this off list). I also was curious to see what happened when increasing the number of buffer partitions; it was counterproductive, as expected.

If this is the case, I have no clue what to do about it. If the OP could organize the scans so that they hit different physical regions of the table, that would give confirmation if performance went up.

Hm, what if BufTableHashPartition() was pseudo-randomized so that different backends would not get the same buffer partition for a particular tag?

merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> Hm, what if BufTableHashPartition() was pseudo-randomized so that
> different backends would not get the same buffer partition for a
> particular tag?

Huh? You have to make sure that different backends will find the same buffer for the same page, so I don't see how that can possibly work.

regards, tom lane
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Fri, May 25, 2012 at 10:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Huh? You have to make sure that different backends will find the same
> buffer for the same page, so I don't see how that can possibly work.

Right -- duh. Well, hm. Is this worth fixing? ISTM there's a bit of 'optimizing for pgbench-itis' in the buffer partitions -- they seem optimized to lever the mostly random access behavior of pgbench. But how likely is it to see multiple simultaneous scans in the real world? Interleaving scans like that is not a very effective optimization -- if it was me, I'd be trying to organize something around a partitioned tid scan for parallel sequential access.

merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Stephen Frost
Date:
* Merlin Moncure (mmoncure@gmail.com) wrote:
> Right -- duh. Well, hm. Is this worth fixing? ISTM there's a bit of
> 'optimizing for pgbench-itis' in the buffer partitions -- they seem
> optimized to lever the mostly random access behavior of pgbench. But
> how likely is it to see multiple simultaneous scans in the real world?
> Interleaving scans like that is not a very effective optimization --
> if it was me, I'd be trying to organize something around a
> partitioned tid scan for parallel sequential access.

Didn't we implement a system whereby this is exactly what we intend to happen on the read side -- that is, everyone doing a SeqScan gangs up on one ring buffer and follows it, which we felt was going to dramatically improve performance in some cases? Or is this completely different from that..?

Thanks,

Stephen
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Thu, May 24, 2012 at 4:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote:
> Here is the output from a multi-threaded run (8 threads, 22 seconds) sorted
> by blk. Not sure whether that's of much use or not:
>
> PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373
> PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110
> PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976
> PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955
> [...]

That is not as informative as I thought it would be (except to show WAL was not the issue). I'm guessing that 44 and 48 are the buffer mapping partitions which cover the root block of some highly used index.

But just because those things are at the top of the list doesn't mean they are a problem. Something has to be at the top, and they don't dominate the total amount of blocking the way I would expect them to if they were truly a substantial bottleneck.

Cheers,

Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Fri, May 25, 2012 at 11:17 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Didn't we implement a system whereby this is exactly what we intend to
> happen on the read side -- that is, everyone doing a SeqScan gangs up on
> one ring buffer and follows it, which we felt was going to dramatically
> improve performance in some cases?

yeah:

    /*
     * If the table is large relative to NBuffers, use a bulk-read access
     * strategy and enable synchronized scanning (see syncscan.c).  Although
     * the thresholds for these features could be different, we make them the
     * same so that there are only two behaviors to tune rather than four.
     * (However, some callers need to be able to disable one or both of these
     * behaviors, independently of the size of the table; also there is a GUC
     * variable that can disable synchronized scanning.)
     *
     * During a rescan, don't make a new strategy object if we don't have to.
     */
    if (!RelationUsesLocalBuffers(scan->rs_rd) &&
        scan->rs_nblocks > NBuffers / 4)
    {
        allow_strat = scan->rs_allow_strat;
        allow_sync = scan->rs_allow_sync;
    }
    else
        allow_strat = allow_sync = false;

    if (allow_strat)
    {
        if (scan->rs_strategy == NULL)
            scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
    }

I wonder if the logic here is just being too strict...

merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes: > On Fri, May 25, 2012 at 11:17 AM, Stephen Frost <sfrost@snowman.net> wrote: >> Didn't we implement a system whereby this is exactly what we intend to >> happen on the read side- that is, everyone doing a SeqScan gangs up on >> one ring buffer and follows it, which we felt was going to dramatically >> improve performance in some cases? > yeah: > ... > I wonder if the logic here is just being too strict... I don't recall how much evidence there is behind the NBuffers/4 threshold. Maybe that needs some tuning? regards, tom lane
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Fri, 25 May 2012, Merlin Moncure wrote: > how likely is it to see multiple simultaneous scans in the real world? > Interleaving scans like that is not a very effective optimization -- > if it was me, it'd be trying to organize something around a > partitioned tid scan for parallel sequential access. Regarding the "real world use": speaking for myself, the reason I'm running queries like the ones in my test is that I have a large inflow of data every day, tens of GB, which has to be processed, and since PG doesn't have any parallelization of queries built in, running the processing in a single thread would take too long. So I'm trying to work around that by manually running multiple threads in PG and splitting the work among them. So not having scalability here is going to hurt us (but I admit that it's probably not a very popular use-case for PG). Regards, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 24, 2012 at 7:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > Here is the output from a multi-threaded run (8thtreads, 22 seconds) sorted > by blk. Not sure whether that's of much use or not: What are the top dozen or so entries if you sort by shacq? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Fri, 25 May 2012, Robert Haas wrote: > On Thu, May 24, 2012 at 7:26 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: >> Here is the output from a multi-threaded run (8thtreads, 22 seconds) sorted >> by blk. Not sure whether that's of much use or not: > > What are the top dozen or so entries if you sort by shacq? Here it is: PID 7108 lwlock 44: shacq 1127902 exacq 1548 blk 511 PID 7115 lwlock 44: shacq 1127495 exacq 633 blk 401 PID 7114 lwlock 44: shacq 1127418 exacq 1024 blk 702 PID 7111 lwlock 44: shacq 1127324 exacq 957 blk 651 PID 7109 lwlock 44: shacq 1127261 exacq 388 blk 466 PID 7110 lwlock 44: shacq 1127256 exacq 1132 blk 774 PID 7113 lwlock 44: shacq 1127179 exacq 920 blk 665 PID 7112 lwlock 44: shacq 1127148 exacq 1323 blk 838 PID 7108 lwlock 48: shacq 1125039 exacq 1592 blk 546 PID 7115 lwlock 48: shacq 1124666 exacq 559 blk 397 PID 7111 lwlock 48: shacq 1124523 exacq 1009 blk 955 PID 7114 lwlock 48: shacq 1124502 exacq 1041 blk 976 PID 7110 lwlock 48: shacq 1124460 exacq 1128 blk 1110 PID 7109 lwlock 48: shacq 1124402 exacq 384 blk 602 PID 7112 lwlock 48: shacq 1124394 exacq 1350 blk 1373 PID 7113 lwlock 48: shacq 1124383 exacq 868 blk 871 PID 7115 lwlock 1020: shacq 1000000 exacq 0 blk 0 PID 7115 lwlock 1018: shacq 1000000 exacq 0 blk 0 PID 7114 lwlock 1020: shacq 1000000 exacq 0 blk 0 PID 7114 lwlock 1018: shacq 1000000 exacq 0 blk 0 PID 7113 lwlock 1020: shacq 1000000 exacq 0 blk 0 PID 7113 lwlock 1018: shacq 1000000 exacq 0 blk 0 PID 7112 lwlock 1020: shacq 1000000 exacq 0 blk 0 PID 7112 lwlock 1018: shacq 1000000 exacq 0 blk 0 PID 7111 lwlock 1020: shacq 1000000 exacq 0 blk 0 PID 7111 lwlock 1018: shacq 1000000 exacq 0 blk 0 PID 7110 lwlock 1020: shacq 1000000 exacq 0 blk 0 PID 7110 lwlock 1018: shacq 1000000 exacq 0 blk 0 PID 7109 lwlock 1020: shacq 1000000 exacq 0 blk 0 PID 7109 lwlock 1018: shacq 1000000 exacq 0 blk 0 PID 7108 lwlock 1020: shacq 1000000 exacq 0 blk 0 PID 7108 lwlock 1018: shacq 1000000 exacq 0 blk 0 PID 7108 lwlock 43: shacq 439454 exacq 1614 blk 151 PID 7115 lwlock 43: shacq 438993 exacq 558 blk 126 PID 7111 lwlock 43: shacq 438817 exacq 899 blk 200 PID 7114 lwlock 43: shacq 438795 exacq 898 blk 230 PID 7109 lwlock 43: shacq 438787 exacq 314 blk 153 PID 7110 lwlock 43: shacq 438768 exacq 1026 blk 231 PID 7113 lwlock 43: shacq 438720 exacq 833 blk 178 PID 7112 lwlock 43: shacq 438691 exacq 1123 blk 215 PID 7108 lwlock 34: shacq 343982 exacq 1595 blk 177 PID 7115 lwlock 34: shacq 343582 exacq 600 blk 139 PID 7109 lwlock 34: shacq 343410 exacq 348 blk 167 PID 7111 lwlock 34: shacq 343398 exacq 926 blk 255 PID 7114 lwlock 34: shacq 343362 exacq 939 blk 214 PID 7110 lwlock 34: shacq 343338 exacq 1074 blk 226 PID 7113 lwlock 34: shacq 343283 exacq 832 blk 226 PID 7112 lwlock 34: shacq 343247 exacq 1255 blk 271 PID 7108 lwlock 38: shacq 333989 exacq 1704 blk 137 PID 7115 lwlock 38: shacq 333512 exacq 581 blk 106 PID 7109 lwlock 38: shacq 333340 exacq 322 blk 115 PID 7111 lwlock 38: shacq 333306 exacq 882 blk 152 PID 7114 lwlock 38: shacq 333266 exacq 977 blk 183 PID 7110 lwlock 38: shacq 333249 exacq 1013 blk 165 PID 7113 lwlock 38: shacq 333226 exacq 828 blk 163 PID 7112 lwlock 38: shacq 333175 exacq 1246 blk 183 PID 7108 lwlock 42: shacq 233215 exacq 1539 blk 237 PID 7115 lwlock 42: shacq 232846 exacq 517 blk 138 PID 7111 lwlock 42: shacq 232645 exacq 885 blk 215 PID 7109 lwlock 42: shacq 232611 exacq 363 blk 140 PID 7113 lwlock 42: shacq 232600 exacq 860 blk 192 PID 7110 lwlock 42: shacq 232599 exacq 1049 blk 262 PID 7114 lwlock 42: shacq 232591 exacq 
935 blk 318 PID 7112 lwlock 42: shacq 232505 exacq 1262 blk 240 PID 7108 lwlock 47: shacq 228682 exacq 1495 blk 177 PID 7115 lwlock 47: shacq 228397 exacq 516 blk 109 PID 7110 lwlock 47: shacq 228218 exacq 1082 blk 378 PID 7109 lwlock 47: shacq 228097 exacq 386 blk 158 PID 7111 lwlock 47: shacq 228093 exacq 907 blk 321 PID 7113 lwlock 47: shacq 228085 exacq 909 blk 306 PID 7112 lwlock 47: shacq 227993 exacq 1248 blk 387 PID 7114 lwlock 47: shacq 227986 exacq 929 blk 449 PID 7108 lwlock 36: shacq 205287 exacq 1592 blk 78 PID 7115 lwlock 36: shacq 204845 exacq 535 blk 86 PID 7111 lwlock 36: shacq 204675 exacq 830 blk 108 PID 7114 lwlock 36: shacq 204634 exacq 930 blk 109 PID 7109 lwlock 36: shacq 204581 exacq 363 blk 83 PID 7113 lwlock 36: shacq 204566 exacq 833 blk 112 PID 7110 lwlock 36: shacq 204546 exacq 1014 blk 106 PID 7112 lwlock 36: shacq 204508 exacq 1180 blk 140 PID 7108 lwlock 35: shacq 136067 exacq 1537 blk 56 PID 7115 lwlock 35: shacq 135532 exacq 494 blk 47 PID 7111 lwlock 35: shacq 135404 exacq 820 blk 76 PID 7109 lwlock 35: shacq 135362 exacq 326 blk 56 PID 7114 lwlock 35: shacq 135342 exacq 879 blk 102 PID 7110 lwlock 35: shacq 135333 exacq 1016 blk 88 PID 7113 lwlock 35: shacq 135318 exacq 829 blk 75 PID 7112 lwlock 35: shacq 135257 exacq 1229 blk 98 PID 7108 lwlock 41: shacq 134338 exacq 1540 blk 55 PID 7115 lwlock 41: shacq 133929 exacq 542 blk 50 PID 7109 lwlock 41: shacq 133781 exacq 360 blk 49 PID 7111 lwlock 41: shacq 133778 exacq 821 blk 87 PID 7114 lwlock 41: shacq 133757 exacq 897 blk 79 ... I've also put all the lwlock output here (2Mb) http://www.ast.cam.ac.uk/~koposov/files/pg/lwlocks.gz S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Fri, May 25, 2012 at 11:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> On Fri, May 25, 2012 at 11:17 AM, Stephen Frost <sfrost@snowman.net> wrote: >>> Didn't we implement a system whereby this is exactly what we intend to >>> happen on the read side- that is, everyone doing a SeqScan gangs up on >>> one ring buffer and follows it, which we felt was going to dramatically >>> improve performance in some cases? > >> yeah: >> ... >> I wonder if the logic here is just being too strict... > > I don't recall how much evidence there is behind the NBuffers/4 threshold. > Maybe that needs some tuning? probably. Sergey is going to get some numbers back from a hacked version that forces the scan to use a strategy. If that turns out to be the problem, I bet it'd make sense to hook the decision against ring size which is currently internal to freelist. In other words, since freelist is making the call to determine the size of the ring, it should also probably be making the call in terms of a particular input size being useful to apply a strategy to. Maybe something like IsAccessStrategyUseful(BufferAccessStrategyType btype, BlockNumber nblocks) ... merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Fri, 25 May 2012, Merlin Moncure wrote: > can you hack this in heapam.c and see if it helps? > line 131-ish: > > if (!RelationUsesLocalBuffers(scan->rs_rd) && > scan->rs_nblocks > NBuffers / 4) > becomes > if (!RelationUsesLocalBuffers(scan->rs_rd)) > > (also you can set the partition count back). The first few runs of my test with the suggested change made the multithreaded queries even slower by a factor of two, e.g. 44 seconds multithreaded vs ~7 sec single-threaded (with the default 9.2beta I get ~14-22 sec for the multithreaded run). But now, after another few reruns of my test, I see again this variability in times for multithreaded runs. It went down to 18-22 seconds, then again up to 44 sec. So overall the change you suggested either doesn't improve the situation or makes it worse. Regards, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Fri, May 25, 2012 at 10:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > I think what's happening here is that the buffer partitions don't help > (in fact, they hurt) in the presence of multiple concurrent scans that > are operating on approximately the same data. Sooner or later the > scans line up on each other and start binding when reassigning lock > tags (which take out up to two ordered exclusive lwlocks). This is on > the read side, so the buffer access strategy is zero help (I confirmed > this off list). This theory is seeming fairly plausible - how can we test it? How about trying it with synchronize_seqscans = off? If the synchronized-seqscan logic is causing contention on the buf mapping locks and individual buffer locks, that should fix it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
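Robert's suggestion is a plain session-level GUC change; a minimal sketch, assuming each of the parallel sessions applies it before running the test query:

    -- in each parallel psql session, before the test query:
    SET synchronize_seqscans = off;
    SHOW synchronize_seqscans;   -- confirm the setting took effect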
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Sat, 26 May 2012, Robert Haas wrote: > This theory is seeming fairly plausible - how can we test it? > > How about trying it with synchronize_seqscans = off? If the > synchronized-seqscan logic is causing contention on the buf mapping > locks and individual buffer locks, that should fix it. Turning off synch seq scans doesn't help either. 18 sec multithreaded run vs 7 sec single threaded. S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Stephen Frost
Date:
* Sergey Koposov (koposov@ast.cam.ac.uk) wrote: > Turning off synch seq scans doesn't help either. 18 sec > multithreaded run vs 7 sec single threaded. Alright, can you just time 'cat' when they're started a few seconds or whatever apart from each other? I can't imagine it being affected in the same way as these, but seems like it wouldn't hurt to check. Thanks, Stephen
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Sat, 26 May 2012, Stephen Frost wrote: > Alright, can you just time 'cat' when they're started a few seconds or > whatever apart from each other? I can't imagine it being affected in > the same way as these, but seems like it wouldn't hurt to check. I've tried cat'ting an 8 GB file (created in advance) on the same filesystem, and during the multi-threaded run it takes ~4 sec, roughly the same as when run separately. ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
Hi, I did another test using the same data and the same code, which I've provided before, and the performance of the single thread seems to be degrading quadratically with the number of threads. Here are the results:
Nthreads  Time_to_execute_one_thread
 1    8.1
 2    7.8
 3    8.1
 4    9.0
 5   10.2
 6   11.4
 7   13.3
 8   16.1
 9   19.0
10   21.4
11   23.8
12   27.3
13   30.2
14   32.0
15   34.1
16   37.5
Regards, Sergey On Sat, 26 May 2012, Stephen Frost wrote: > * Sergey Koposov (koposov@ast.cam.ac.uk) wrote: >> Turning off synch seq scans doesn't help either. 18 sec >> multithreaded run vs 7 sec single threaded. > > Alright, can you just time 'cat' when they're started a few seconds or > whatever apart from each other? I can't imagine it being affected in > the same way as these, but seems like it wouldn't hurt to check. > > Thanks, > > Stephen > ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Sun, May 27, 2012 at 1:45 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > Hi, > > I did another test using the same data and the same code, which I've > provided before and the performance of the single thread seems to be > degrading quadratically with the number of threads. > > Here are the results: > Nthreads Time_to_execute_one_thread > 1 8.1 > 2 7.8 > 3 8.1 > 4 9.0 > 5 10.2 > 6 11.4 > 7 13.3 > 8 16.1 > 9 19.0 > 10 21.4 > 11 23.8 > 12 27.3 > 13 30.2 > 14 32.0 > 15 34.1 > 16 37.5 Ok, I double checked offline with Sergey to make sure the strategy wasn't helping...it isn't, and we confirmed it was being forced on after playing with the tunables a bit. So what is happening here? I have a hunch that this particular query is defeating the ring buffer strategy code. Here's the query:

create table _tmp0 as select * from
( select *,
  (select healpixid from idt_match as m where m.transitid=o.transitid) as x
  from idt_photoobservation as o offset 0
) as y where x%16=ZZZ order by x;

(where ZZZ is some integer number 0<=ZZZ<16) With the following plan:

 Sort  (cost=3228814504.96..3228815137.21 rows=252902 width=498)
   Sort Key: y.x
   ->  Subquery Scan on y  (cost=0.00..3228791809.25 rows=252902 width=498)
         Filter: ((y.x % 16::bigint) = 0)
         ->  Limit  (cost=0.00..3228033102.41 rows=50580456 width=490)
               ->  Seq Scan on idt_photoobservation o  (cost=0.00..3228033102.41 rows=50580456 width=490)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8)
                             Index Cond: (transitid = o.transitid)

If I understand this properly, Sergey is scanning a large table and looking up an integer value from a smaller table row by row, and explicitly forcing it as such (via 'offset 0'). That integer value is then used to filter the resultant table down to size 1/N so that multiple workers can process the data, which is then sorted. This isn't a particularly great way to attack the problem, but it should scale better than it does. I think the problem is coming not from the main seq scan, but from the dependent index lookup on idt_match. Even though the scan can be made to exhibit lock-free behavior, the index lookups will eventually start lining up and hit the same sequences of pages in the same order -- bam. merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
Here is the actual explain analyze of the query on the smaller dataset which I have been using for the recent testing.

test=# explain analyze create table _tmp0 as select * from
( select *,
  (select healpixid from idt_match as m where m.transitid=o.transitid) as x
  from idt_photoobservation_small as o offset 0
) as y where x%16=0 order by x;

                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=8203.041..8252.216 rows=173696 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 182555kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=0.102..7602.947 rows=173696 loops=1)
         Filter: ((y.x % 16::bigint) = 0)
         Rows Removed by Filter: 826304
         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.041..7296.401 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.038..7094.555 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 8908.631 ms

***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Wed, May 30, 2012 at 10:42 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > Here is the actual explain analyze of the query on the smaller dataset > which I have been using for the recent testing. > > test=# explain analyze create table _tmp0 as select * from > ( select *, > (select healpixid from idt_match as m where m.transitid=o.transitid) > as x from idt_photoobservation_small as o offset 0 > ) as y where x%16=0 order by x; > > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Sort (cost=63835201.73..63835214.23 rows=5000 width=498) (actual > time=8203.041..8252.216 rows=173696 loops=1) > Sort Key: y.x > Sort Method: quicksort Memory: 182555kB > -> Subquery Scan on y (cost=0.00..63834894.54 rows=5000 width=498) > (actual time=0.102..7602.947 rows=173696 loops=1) > Filter: ((y.x % 16::bigint) = 0) > Rows Removed by Filter: 826304 > -> Limit (cost=0.00..63819894.51 rows=1000002 width=490) (actual > time=0.041..7296.401 rows=1000000 loops=1) > -> Seq Scan on idt_photoobservation_small o > (cost=0.00..63819894.51 rows=1000002 width=490) (actual > time=0.038..7094.555 rows=1000000 loops=1) > SubPlan 1 > -> Index Scan using idt_match_transitid_idx on > idt_match m (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 > rows=1 loops=1000000) > Index Cond: (transitid = o.transitid) > Total runtime: 8908.631 ms Two things: 1. Can we see an explain analyze during a 'bogged' case? 2. Can we try to get 'index only scan' working over idt_match? That should be a matter of adjusting the index so that it's: create index on idt_match (transitid, healpixid); merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Wed, 30 May 2012, Merlin Moncure wrote: > 1. Can we see an explain analyze during a 'bogged' case? Here is the one-to-one comparison. The 'bogged' case:

 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=18007.500..18007.500 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=18007.454..18007.454 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.047..17734.570 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.045..17543.902 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 18056.866 ms
(12 rows)

Time: 18067.929 ms

vs non-bogged:

 Sort  (cost=63835201.73..63835214.23 rows=5000 width=498) (actual time=6635.133..6635.133 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..63834894.54 rows=5000 width=498) (actual time=6635.091..6635.091 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.059..6344.683 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..63819894.51 rows=1000002 width=490) (actual time=0.056..6149.429 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Scan using idt_match_transitid_idx on idt_match m  (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
 Total runtime: 6669.215 ms
(12 rows)

Time: 6673.991 ms

> 2. Can we try to get 'index only scan' working over idt_match? That > should be a matter of adjusting the index so that it's: > create index on idt_match (transitid, healpixid); I couldn't get that to work. I created the index you asked for (although it is unclear why it would help...), tried explain, and it still used a simple index scan. After disabling the index scan it decided to use a bitmap scan (which isn't actually faster in the multithreaded setup), and after disabling the bitmap scan it switched to a seq scan. Cheers, Sergey

PS Just for reference, the current indices on idt_match are:
    "idt_match_idx" btree (healpixid)
    "idt_match_transitid_healpixid_idx" btree (transitid, healpixid)
    "idt_match_transitid_idx" btree (transitid)

***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Wed, May 30, 2012 at 12:58 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > Here is the one to one comparison of the 'bogged' ********** > > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Sort (cost=63835201.73..63835214.23 rows=5000 width=498) (actual > time=18007.500..18007.500 rows=0 loops=1) > Sort Key: y.x > Sort Method: quicksort Memory: 25kB > -> Subquery Scan on y (cost=0.00..63834894.54 rows=5000 width=498) > (actual time=18007.454..18007.454 rows=0 loops=1) > Filter: ((y.x % 16::bigint) = 7) > Rows Removed by Filter: 1000000 > -> Limit (cost=0.00..63819894.51 rows=1000002 width=490) (actual > time=0.047..17734.570 rows=1000000 loops=1) > -> Seq Scan on idt_photoobservation_small o > (cost=0.00..63819894.51 rows=1000002 width=490) (actual > time=0.045..17543.902 rows=1000000 loops=1) > SubPlan 1 > -> Index Scan using idt_match_transitid_idx on > idt_match m (cost=0.00..63.74 rows=1 width=8) (actual time=0.015..0.015 > rows=1 loops=1000000) > > Index Cond: (transitid = o.transitid) > Total runtime: 18056.866 ms > (12 rows) > > Time: 18067.929 ms > ************************* > > vs non-bogged: > > ************************** > > QUERY PLAN > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Sort (cost=63835201.73..63835214.23 rows=5000 width=498) (actual > time=6635.133..6635.133 rows=0 loops=1) > Sort Key: y.x > Sort Method: quicksort Memory: 25kB > -> Subquery Scan on y (cost=0.00..63834894.54 rows=5000 width=498) > (actual time=6635.091..6635.091 rows=0 loops=1) > Filter: ((y.x % 16::bigint) = 7) > Rows Removed by Filter: 1000000 > -> Limit (cost=0.00..63819894.51 rows=1000002 width=490) (actual > time=0.059..6344.683 rows=1000000 loops=1) > -> Seq Scan on idt_photoobservation_small o > (cost=0.00..63819894.51 rows=1000002 width=490) (actual > time=0.056..6149.429 rows=1000000 loops=1) > > SubPlan 1 > -> Index Scan using idt_match_transitid_idx on > idt_match m (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 > rows=1 loops=1000000) > Index Cond: (transitid = o.transitid) > Total runtime: 6669.215 ms > (12 rows) > > Time: 6673.991 ms Hmm. So the total time across all loops is 3000-4000 ms in the unbogged case, and 15000 ms in the bogged case. The time for the seqscan parent is 6344.683 in the unbogged case, so ~3s once you subtract out the index scans, and 17543.902 in the bogged case, so ~3s once you subtract out the index scans. So this seems to support Merlin's theory that the index scan is what's getting slow under heavy concurrency. Apparently, the index scan slows down 4-5x but the seqscan is just as fast as ever. Very interesting... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Wed, May 30, 2012 at 12:11 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 30, 2012 at 12:58 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: >> Here is the one to one comparison of the 'bogged' ********** >> >> QUERY PLAN >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> Sort (cost=63835201.73..63835214.23 rows=5000 width=498) (actual >> time=18007.500..18007.500 rows=0 loops=1) >> Sort Key: y.x >> Sort Method: quicksort Memory: 25kB >> -> Subquery Scan on y (cost=0.00..63834894.54 rows=5000 width=498) >> (actual time=18007.454..18007.454 rows=0 loops=1) >> Filter: ((y.x % 16::bigint) = 7) >> Rows Removed by Filter: 1000000 >> -> Limit (cost=0.00..63819894.51 rows=1000002 width=490) (actual >> time=0.047..17734.570 rows=1000000 loops=1) >> -> Seq Scan on idt_photoobservation_small o >> (cost=0.00..63819894.51 rows=1000002 width=490) (actual >> time=0.045..17543.902 rows=1000000 loops=1) >> SubPlan 1 >> -> Index Scan using idt_match_transitid_idx on >> idt_match m (cost=0.00..63.74 rows=1 width=8) (actual time=0.015..0.015 >> rows=1 loops=1000000) >> >> Index Cond: (transitid = o.transitid) >> Total runtime: 18056.866 ms >> (12 rows) >> >> Time: 18067.929 ms >> ************************* >> >> vs non-bogged: >> >> ************************** >> >> QUERY PLAN >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> Sort (cost=63835201.73..63835214.23 rows=5000 width=498) (actual >> time=6635.133..6635.133 rows=0 loops=1) >> Sort Key: y.x >> Sort Method: quicksort Memory: 25kB >> -> Subquery Scan on y (cost=0.00..63834894.54 rows=5000 width=498) >> (actual time=6635.091..6635.091 rows=0 loops=1) >> Filter: ((y.x % 16::bigint) = 7) >> Rows Removed by Filter: 1000000 >> -> Limit (cost=0.00..63819894.51 rows=1000002 width=490) (actual >> time=0.059..6344.683 rows=1000000 loops=1) >> -> Seq Scan on idt_photoobservation_small o >> (cost=0.00..63819894.51 rows=1000002 width=490) (actual >> time=0.056..6149.429 rows=1000000 loops=1) >> >> SubPlan 1 >> -> Index Scan using idt_match_transitid_idx on >> idt_match m (cost=0.00..63.74 rows=1 width=8) (actual time=0.003..0.004 >> rows=1 loops=1000000) >> Index Cond: (transitid = o.transitid) >> Total runtime: 6669.215 ms >> (12 rows) >> >> Time: 6673.991 ms > > Hmm. So the total time across all loops is 3000-4000 ms in the > unbogged case, and 15000 ms in the bogged case. The time for the > seqscan parent is 6344.683 in the unbogged case, so ~3s once you > subtract out the index scans, and 17543.902 in the bogged case, so ~3s > once you subtract out the index scans. So this seems to support > Merlin's theory that the index scan is what's getting slow under heavy > concurrency. Apparently, the index scan slows down 4-5x but the > seqscan is just as fast as ever. Very interesting... Yeah, an 'index only scan' *would* fix the performance problem if my theory is correct. By skipping the heap lookup it would also skip the buffer retagging which is binding up the multiple concurrent scans. Sergey, just in case you don't know, an IOS works like this. If all the fields that are fetched from a table are present in the index and the index points towards a tuple on a page that contains only visible tuples according to the visibility map, you get to skip jumping back into the heap for visibility checks. 
In your case, I think that would result in almost linear scaling. Hm, why aren't we getting an IOS? Just for kicks (assuming this is test data), can we drop the index on just transitid, leaving the index on transitid, healpixid? Is enable_indexonlyscan on? Has idt_match been vacuumed? What kind of plan do you get when you do: select healpixid from idt_match where transitid = <some valid transitid>; merlin
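The prerequisites Merlin lists can be checked with something along these lines (a minimal sketch; the literal transitid value is just a placeholder):

    -- covering index suggested upthread: create index on idt_match (transitid, healpixid);
    VACUUM ANALYZE idt_match;           -- refreshes the visibility map, which index-only scans rely on
    SHOW enable_indexonlyscan;          -- should be "on"
    EXPLAIN ANALYZE
    SELECT healpixid FROM idt_match WHERE transitid = 12345;   -- placeholder value
    -- an index-only scan appears as "Index Only Scan using ..." with "Heap Fetches: 0"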
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Wed, 30 May 2012, Merlin Moncure wrote: > Hm, why aren't we getting an IOS? Just for kicks (assuming this is > test data), can we drop the index on just transitid, leaving the index > on transitid, healpixid? Is enable_indexonlyscan on? Has idt_match > been vacuumed? What kind of plan do you get when you do: Okay, dropping the index on just transitid solved the issue with the index-only scan, but didn't solve the original problem. Actually the index-only scan made the sequential queries faster but not the parallel ones. Here is a side-by-side explain analyze again for index-only scans. Bogged:

 Sort  (cost=82262274.19..82262286.69 rows=5000 width=498) (actual time=19635.318..19635.318 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..82261967.00 rows=5000 width=498) (actual time=19635.282..19635.282 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..82246967.00 rows=1000000 width=490) (actual time=0.076..19373.922 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..82246967.00 rows=1000000 width=490) (actual time=0.074..19179.336 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Only Scan using idt_match_transitid_healpixid_idx on idt_match m  (cost=0.00..82.17 rows=1 width=8) (actual time=0.016..0.017 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
                             Heap Fetches: 0
 Total runtime: 19702.727 ms
(13 rows)

vs non-bogged:

 Sort  (cost=82262274.19..82262286.69 rows=5000 width=498) (actual time=6198.334..6198.334 rows=0 loops=1)
   Sort Key: y.x
   Sort Method: quicksort  Memory: 25kB
   ->  Subquery Scan on y  (cost=0.00..82261967.00 rows=5000 width=498) (actual time=6198.285..6198.285 rows=0 loops=1)
         Filter: ((y.x % 16::bigint) = 7)
         Rows Removed by Filter: 1000000
         ->  Limit  (cost=0.00..82246967.00 rows=1000000 width=490) (actual time=0.116..5932.766 rows=1000000 loops=1)
               ->  Seq Scan on idt_photoobservation_small o  (cost=0.00..82246967.00 rows=1000000 width=490) (actual time=0.113..5730.012 rows=1000000 loops=1)
                     SubPlan 1
                       ->  Index Only Scan using idt_match_transitid_healpixid_idx on idt_match m  (cost=0.00..82.17 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1000000)
                             Index Cond: (transitid = o.transitid)
                             Heap Fetches: 0
 Total runtime: 6239.512 ms
(13 rows)

Cheers, S PS I didn't know that an index on two columns, together with the index-only scan, can help queries based on just one of those columns. Very helpful! ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Wed, May 30, 2012 at 1:45 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Wed, 30 May 2012, Merlin Moncure wrote: >> >> Hm, why aren't we getting an IOS? Just for kicks (assuming this is >> test data), can we drop the index on just transitid, leaving the index >> on transitid, healpixid? Is enable_indexonlyscan on? Has idt_match >> been vacuumed? What kind of plan do you get when you do: > > Okay dropping the index on transitid solved the issue with indexonlyscan but > didn't solve the original problem. Actually the indexonlyscan made the > sequential queries faster but not the parallel ones. hurk -- ISTM that since IOS is masking the heap lookups, there must be contention on the index itself? Does this working set fit in shared memory? If so, what happens when you do a database restart and repeat the IOS test? merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Wed, 30 May 2012, Merlin Moncure wrote: > hurk -- ISTM that since IOS is masking the heap lookups, there must > be contention on the index itself? Does this working set fit in > shared memory? If so, what happens when you do a database restart and > repeat the IOS test? The dataset fits well in shared buffers. Here are the data sizes:
 public.idt_photoobservation_small        |  521 MB |  528756
 public.idt_match_idx                     | 1084 MB |  528762
 public.idt_match_transitid_healpixid_idx | 1522 MB | 1955140
 public.idt_match                         | 2906 MB |  528753
And shared buffers are 10G. If I restart the db the timings do not change significantly. There is always some variation which I don't really understand, e.g. the parallel runs sometimes take 18s, or 25 seconds, or 30 seconds per thread. So there is something else affecting the runs -- I don't know, maybe that's related to which thread starts first, or what is the starting point of the seq scan... (there is no other activity on the machine btw). Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
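Listings like the one above can be produced with a catalog query roughly like the following (a generic sketch; the exact query used here is not shown in the thread, and the third column is assumed to be the relfilenode):

    SELECT relname,
           pg_size_pretty(pg_relation_size(oid)) AS size,
           relfilenode
    FROM pg_class
    WHERE relname LIKE 'idt_%'
    ORDER BY pg_relation_size(oid) DESC;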
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Wed, May 30, 2012 at 11:45 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Wed, 30 May 2012, Merlin Moncure wrote: >> >> >> Hm, why aren't we getting a IOS? Just for kicks (assuming this is >> test data), can we drop the index on just transitid, leaving the index >> on transitid, healpixid? Is enable_indexonlyscan on? Has idt_match >> been vacuumed? What kind of plan do you get when do: > > > Okay dropping the index on transitid solved the issue with indexonlyscan but > didn't solve the original problem. Actually the indexonlyscan made the > sequential queries faster but not the parallel ones. How big is idt_match? What if you drop all indexes on idt_match, encouraging all the backends to do hash joins against it, which occur in local memory and so don't have contention? Cheers, Jeff
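Jeff's hash-join idea only applies if the lookup is written as a join the planner can rearrange, rather than as the correlated subplan forced by 'offset 0'; a rough sketch of that rewrite (untested, and assuming one idt_match row per transitid, as the original subquery already requires):

    -- join form of the per-row lookup; with no usable index on idt_match the
    -- planner is free to build a private in-memory hash table in each backend
    create table _tmp0 as
    select o.*, m.healpixid as x
    from idt_photoobservation_small o
    left join idt_match m on m.transitid = o.transitid
    where m.healpixid % 16 = 0
    order by m.healpixid;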
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Wed, May 30, 2012 at 3:15 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Wed, May 30, 2012 at 11:45 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: >> On Wed, 30 May 2012, Merlin Moncure wrote: >>> >>> >>> Hm, why aren't we getting a IOS? Just for kicks (assuming this is >>> test data), can we drop the index on just transitid, leaving the index >>> on transitid, healpixid? Is enable_indexonlyscan on? Has idt_match >>> been vacuumed? What kind of plan do you get when do: >> >> >> Okay dropping the index on transitid solved the issue with indexonlyscan but >> didn't solve the original problem. Actually the indexonlyscan made the >> sequential queries faster but not the parallel ones. > > How big is idt_match? What if you drop all indexes on idt_match, > encouraging all the backends to do hash joins against it, which occur > in local memory and so don't have contention? You just missed his post -- it's only 3G. can you run your 'small' working set against 48gb shared buffers? merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Wed, 30 May 2012, Merlin Moncure wrote: >> How big is idt_match? What if you drop all indexes on idt_match, >> encouraging all the backends to do hash joins against it, which occur >> in local memory and so don't have contention? > > You just missed his post -- it's only 3G. can you run your 'small' > working set against 48gb shared buffers? Just tried 3 times, and it actually got much worse: ~70-80 seconds per query in the parallel setup (i.e. >10 times slower than the single run). The oprofile then looks like this:

CPU: Intel Architectural Perfmon, speed 1862 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        linenr info                  symbol name
-------------------------------------------------------------------------------
883523   46.3676  (no location information)    s_lock
  883523  100.000  (no location information)    s_lock [self]
-------------------------------------------------------------------------------
303984   15.9532  (no location information)    PinBuffer
  303984  100.000  (no location information)    PinBuffer [self]

The problem is that since there is that variability in times, I don't really 100% know whether this trend of slow-down with increasing shared memory is genuine or not. I've also tried, just in case, shared_buffers=1G, and it is still very slow (50 sec). After that I changed the shared buffers back to 10G and the timings got back to 25 sec. Weird... I still wonder whether there is a problem with the way the locking is done (as referenced in the recent "dropping tables slowness" thread). Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Florian Pflug
Date:
On May 30, 2012, at 22:07, Sergey Koposov wrote: > If I restart the db the timings do not change significantly. There is always some variation which I don't really understand, e.g. the parallel runs sometimes > take 18s, or 25 seconds, or 30 seconds per thread. So there is something else affecting > the runs -- I don't know, maybe that's related to which thread starts first, > or what is the starting point of the seq scan... (there is no other activity on the machine btw). I wonder if the huge variance could be caused by non-uniform synchronization costs across different cores. That's not all that unlikely, because at least some cache levels (L2 and/or L3, I think) are usually shared between all cores on a single die. Thus, a cache line bouncing between cores on the same die might very well be faster than one bouncing between cores on different dies. On linux, you can use the taskset command to explicitly assign processes to cores. The easiest way to check if that makes a difference is to assign one core for each connection to the postmaster before launching your test. Assuming that cpu assignments are inherited by child processes, that should then spread your backends out over exactly the cores you specify. best regards, Florian Pflug
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Wed, 30 May 2012, Florian Pflug wrote: > > I wonder if the huge variance could be caused by non-uniform > synchronization costs across different cores. That's not all that > unlikely, because at least some cache levels (L2 and/or L3, I think) are > usually shared between all cores on a single die. Thus, a cache line > bouncing between cores on the same die might very well be faster than one > bouncing between cores on different dies. > > On linux, you can use the taskset command to explicitly assign processes > to cores. The easiest way to check if that makes a difference is to > assign one core for each connection to the postmaster before launching > your test. Assuming that cpu assignments are inherited by child > processes, that should then spread your backends out over exactly the > cores you specify. Wow, thanks! This seems to be working to some extent. I've found that distributing each thread x (0<x<7) to the cpu 1+3*x (reminder: I have HT disabled and in total I have 4 cpus with 6 proper cores each) gives quite good results. And after a few runs, I seem to be getting more or less stable results for the multiple threads, with the performance of multithreaded runs going from 6 to 11 seconds for various threads. (Another reminder is that 5-6 seconds is roughly the timing of my queries running in a single thread.) So to some extent one can say that the problem is partially solved (i.e. it is probably understood). But the question now is whether there is a *PG* problem here or not, or is it Intel's or Linux's problem? Because still the slowdown was caused by locking. If there wouldn't be locking there wouldn't be any problems (as demonstrated a while ago by just cat'ting the files in multiple threads). Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Florian Pflug
Date:
On May 31, 2012, at 01:16, Sergey Koposov wrote: > On Wed, 30 May 2012, Florian Pflug wrote: >> >> I wonder if the huge variance could be caused by non-uniform synchronization costs across different cores. That's not all that unlikely, because at least some cache levels (L2 and/or L3, I think) are usually shared between all cores on a single die. Thus, a cache line bouncing between cores on the same die might very well be faster than one bouncing between cores on different dies. >> >> On linux, you can use the taskset command to explicitly assign processes to cores. The easiest way to check if that makes a difference is to assign one core for each connection to the postmaster before launching your test. Assuming that cpu assignments are inherited by child processes, that should then spread your backends out over exactly the cores you specify. > > Wow, thanks! This seems to be working to some extent. I've found that distributing each thread x (0<x<7) to the cpu 1+3*x > (reminder: I have HT disabled and in total I have 4 cpus with 6 proper cores each) gives quite good results. And after a few runs, I seem to be getting more or less stable results for the multiple threads, with the performance of multithreaded runs going from 6 to 11 seconds for various threads. (Another reminder is that 5-6 seconds is roughly the timing of my queries running in a single thread.) Wait, so performance *increased* by spreading the backends out over as many dies as possible, not by using as few as possible? That'd be exactly the opposite of what I'd have expected. (I'm assuming that cores on one die have ascending ids on linux. If you could post the contents of /proc/cpuinfo, we could verify that.) > So to some extent one can say that the problem is partially solved (i.e. it is probably understood) Not quite, I think. We still don't really know why there's that much spinlock contention AFAICS. But what we've learned is that the actual spinning on a contested lock is only part of the problem. The cache-line bouncing caused by all those lock acquisitions is the other part, and it's pretty expensive too - otherwise, moving the backends around wouldn't have helped. > But the question now is whether there is a *PG* problem here or not, or is it Intel's or Linux's problem ? Neither Intel nor Linux can do much about this, I fear. Synchronization will always be expensive, and the more so the larger the number of cores. Linux could maybe pick a better process-to-core assignment, but it probably won't be able to pick the optimal one for every workload. So unfortunately, this is a postgres problem I'd say. > Because still the slowdown was caused by locking. If there wouldn't be locking there wouldn't be any problems (as demonstrated a while ago by just cat'ting the files in multiple threads). Yup, we'll have to figure out a way to reduce the locking overhead. 9.2 already scales much better to a large number of cores than previous versions did, but your test case shows that there's still room for improvement. best regards, Florian Pflug
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Thu, 31 May 2012, Florian Pflug wrote: > Wait, so performance *increased* by spreading the backends out over as > many dies as possible, not by using as few as possible? That'd be > exactly the opposite of what I'd have expected. (I'm assuming that cores > on one die have ascending ids on linux. If you could post the contents > of /proc/cpuinfo, we could verify that) Yes, you are correct. And I also can confirm that the cpus in the cpuinfo are ordered by "physical id", e.g. they go like 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 I did a specific test with just 6 threads (== number of cores per cpu) and ran it on a single phys cpu; it took ~12 seconds for each thread. When I tried to spread it across 4 cpus it took 7-9 seconds per thread. But all these numbers are anyway significantly better than when I didn't use taskset. Which probably means that without it the processes were jumping from core to core? ... >> Because still the slowdown was caused by locking. If there wouldn't be >> locking there wouldn't be any problems (as demonstrated a while ago by >> just cat'ting the files in multiple threads). > > Yup, we'll have to figure out a way to reduce the locking overhead. 9.2 > already scales much better to a large number of cores than previous > versions did, but your test case shows that there's still room for > improvement. Yes, and unfortunately these scaling problems show up in read-only CPU-bound queries, where naively I wouldn't expect any. Thanks, Sergey ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Wed, May 30, 2012 at 4:16 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > But the question now is whether there is a *PG* problem here or not, or is > it Intel's or Linux's problem ? Because still the slowdown was caused by > locking. If there wouldn't be locking there wouldn't be any problems (as > demonstrated a while ago by just cat'ting the files in multiple threads). You cannot have a traditional RDBMS without locking. From your description of the problem, I probably wouldn't be using a traditional database system at all for this, but rather flat files and Perl. Or at least, I would partition the data before loading it to the DB, rather than trying to do it after. But anyway, is idt_match a fairly static table? If so, I'd partition that into 16 tables, and then have each one of your tasks join against a different one of those tables. That should relieve the contention on the index root block, and might have some other benefits as well. Cheers, Jeff
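A rough sketch of the manual split Jeff describes, with hypothetical table names and assuming the per-task split key is the same healpixid % 16 used in the original queries (only the first two of the 16 pieces are shown):

    -- one static copy of idt_match per task; each parallel worker joins only
    -- against "its" piece, so the backends stop sharing the same index root pages
    create table idt_match_p00 as select * from idt_match where healpixid % 16 = 0;
    create index on idt_match_p00 (transitid, healpixid);
    analyze idt_match_p00;
    create table idt_match_p01 as select * from idt_match where healpixid % 16 = 1;
    create index on idt_match_p01 (transitid, healpixid);
    analyze idt_match_p01;
    -- ... and so on up to idt_match_p15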
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Stephen Frost
Date:
Sergey, all, * Sergey Koposov (koposov@ast.cam.ac.uk) wrote: > I did a specific test with just 6 threads (== number of cores per cpu) > and ran it on a single phys cpu, it took ~ 12 seconds for each thread, > and when I tried to spread it across 4 cpus it took 7-9 seconds per > thread. But all these numbers are anyway significantly better than > when I didn't use taskset. Which probably means without it the > processes were jumping from core to core ? ... It sounds like the issue here really is the PG processes bouncing between the cores, which I do think is a Linux issue, though perhaps PG could "encourage" Linux to be better about this somehow. This isn't specific to Linux, however; under Windows there's something similar, and they encourage the same kind of manual assignment of things like IIS processes to CPUs. Thanks, Stephen
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Stephen Frost
Date:
* Sergey Koposov (koposov@ast.cam.ac.uk) wrote: > I did a specific test with just 6 threads (== number of cores per cpu) > and ran it on a single phys cpu, it took ~ 12 seconds for each thread, > and when I tried to spread it across 4 cpus it took 7-9 seconds per > thread. But all these numbers are anyway significantly better than > when I didn't use taskset. Which probably means without it the > processes were jumping from core to core ? ... Oh, and wrt why 'cat' isn't really affected by this issue - that's because 'cat' isn't *doing* anything, CPU-wise, really. The PG backends are actually doing real work and therefore they have both CPU state and memory accesses which are impacted when the process is moved from one core to another. If this system has NUMA (where memory is associated with a set of cores), then that can make it more painful when threads are being moved between cores also, because suddenly the memory you were accessing (which was 'fast') is now taking longer, since you have to go through another CPU to get to it. Thanks, Stephen
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Wed, 30 May 2012, Jeff Janes wrote: >> But the question now is whether there is a *PG* problem here or not, or is >> it Intel's or Linux's problem ? Because still the slowdown was caused by >> locking. If there wouldn't be locking there wouldn't be any problems (as >> demonstrated a while ago by just cat'ting the files in multiple threads). > > You cannot have a traditional RDBMS without locking. From your I understand the need for significant locking when there are concurrent writes, but not when there are only reads. But I'm not an RDBMS expert, so maybe that's a misunderstanding on my side. > description of the problem, I probably wouldn't be using a traditional > database system at all for this, but rather flat files and Perl. Flat files and perl for 25-50 TB of data over a few years is a bit extreme ;) > Or > at least, I would partition the data before loading it to the DB, > rather than trying to do it after. I intentionally did otherwise, because I thought that PG would be much smarter than me in juggling the data I'm ingesting (~ tens of gig each day), joining the appropriate bits of data and then splitting by partitions. Unfortunately I see that there are some scalability issues on the way, which I didn't expect. Those aren't fatal, but slightly disappointing. > But anyway, is idt_match a fairly static table? If so, I'd partition > that into 16 tables, and then have each one of your tasks join against > a different one of those tables. That should relieve the contention > on the index root block, and might have some other benefits as well. No, idt_match is getting filled by multi-threaded copy() and then joined with 4 other big tables like idt_phot. The result is then split into partitions. And I was trying different approaches to fully utilize the CPUs and/or I/O and somehow parallelize the queries. That's the reasoning for the somewhat contrived queries in my test. Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > I understand the need for significant locking when there are concurrent writes, > but not when there are only reads. But I'm not an RDBMS expert, so maybe > that's a misunderstanding on my side. If we knew in advance that no writes would come along during the execution of a particular test case, then we could skip a lot of locking on the reads. But we don't, so we have to be prepared for the possibility of writes at any time, which means doing things like taking share-locks on data while it's actively being read. Also, even in an entirely read-only workload, buffers can get evicted from the database buffer cache, so we have to lock to make sure a buffer doesn't get evicted while we're reading it. I'd really like to find out exactly where all those s_lock calls are coming from. Is there any way you can get oprofile to output a partial stack backtrace? If you have perf it's very easy, just 'perf record -g -a <command to launch your test case>' and then 'perf report -g'. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Florian Pflug
Date:
On May 31, 2012, at 02:26, Sergey Koposov wrote: > On Thu, 31 May 2012, Florian Pflug wrote: >> Wait, so performance *increased* by spreading the backends out over as many dies as possible, not by using as few as possible? That'd be exactly the opposite of what I'd have expected. (I'm assuming that cores on one die have ascending ids on linux. If you could post the contents of /proc/cpuinfo, we could verify that) > > Yes, you are correct. And I also can confirm that the cpus in the cpuinfo are ordered by "physical id", e.g. they go like > 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 > > I did a specific test with just 6 threads (== number of cores per cpu) > and ran it on a single phys cpu, it took ~ 12 seconds for each thread, > and when I tried to spread it across 4 cpus it took 7-9 seconds per thread. But all these numbers are anyway significantly better than when I didn't use taskset. Hm. The only resource that is shared between different cores on a die is usually the last cache level (L2 or L3). So by spreading backends out over more cores, you're increasing the total cache size. Maybe that could explain the behavior you're seeing. > Which probably means without it the processes were jumping from core to core ? ... Seems so. I completely fail to understand why they would, though. Since you've got far fewer runnable processes than cores, why would the kernel ever see a reason to migrate a process from one core to another? I think that you can query the core that a process is running on from one of the files in /proc/<pid>. You could try gathering samples every 50 ms or so during the test run - maybe that could shed some additional light on this. best regards, Florian Pflug
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Stephen Frost
Date:
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > > I understand the need of significant locking when there concurrent writes, > > but not when there only reads. But I'm not a RDBMS expert, so that's maybe > > that's misunderstanding on my side. > > If we knew in advance that no writes would come along during the > execution of a particular test case, then we could skip a lot of > locking on the reads. But we don't, so we have to be prepared for the > possibility of writes at any time, which means doing things taking > share-locks on data while it's actively being read. Uh, we have a read-only transaction mode, don't we? Or does that not help, because someone else, in another transaction, could take a read-write lock? Thanks, Stephen
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Wed, May 30, 2012 at 7:00 PM, Stephen Frost <sfrost@snowman.net> wrote: > Robert, > > * Robert Haas (robertmhaas@gmail.com) wrote: >> On Wed, May 30, 2012 at 9:10 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: >> > I understand the need of significant locking when there concurrent writes, >> > but not when there only reads. But I'm not a RDBMS expert, so that's maybe >> > that's misunderstanding on my side. >> >> If we knew in advance that no writes would come along during the >> execution of a particular test case, then we could skip a lot of >> locking on the reads. But we don't, so we have to be prepared for the >> possibility of writes at any time, which means doing things taking >> share-locks on data while it's actively being read. > > Uh, we have a read-only transaction mode, don't we? Or does that not > help, because someone else, in another transaction, could take a > read-write lock? Yes, exactly. You would have to put the entire cluster into a read only mode, and the entire code base almost from top to bottom would have to be rewritten to detect and support such a mode. Even then, either you have a unified buffer cache (which needs locking even if the queries are read only), or each process uses its own local memory. In the latter case, how do you avoid oversubscribing physical memory? Also, there are certain house-keeping tasks that can take place anytime, even in read-only transactions. You would either need to turn those off, or you would need to make a pass through the entire system doing all the house-keeping up front during the transition from read-write to read-only. It would basically be like writing an entirely new database management system, which just happens to share PG's on-disk format and PG's flavor of SQL syntax and semantics. And is read only. (And since the OP is trying to create tables based on his read-only query, such a system wouldn't be very well suited.) Cheers, Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
Hi, On Wed, 30 May 2012, Robert Haas wrote: > I'd really like to find out exactly where all those s_lock calls are > coming from. Is there any way you can get oprofile to output a > partial stack backtrace? If you have perf it's very easy, just 'perf > record -g -a <command to launch your test case>' and then 'perf report > -g'. I repeated my test with 8 threads (without tasksetting) and with shared_buffers=48g (because that seemed to trigger particularly long times, ~ 80 seconds). And I attach the perf report. Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 31, 2012 at 7:31 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Wed, 30 May 2012, Robert Haas wrote: > >> I'd really like to find out exactly where all those s_lock calls are >> coming from. Is there any way you can get oprofile to output a >> partial stack backtrace? If you have perf it's very easy, just 'perf >> record -g -a <command to launch your test case>' and then 'perf report >> -g'. > > > I repeated my test with 8 threads (without tasksetting) and with > sharedbuffers=48g (because that seemed to trigger in particular long times ~ > 80 seconds). And I attach the perf report. Thanks. How did you generate this perf report? It's cool, because I haven't figured out how to make perf generate a report that is easily email-able, and it seems you have. The only trouble is that there's no call stack information here for s_lock or PinBuffer, which is what I really want. It seems to have spit out call stack information only for the kernel functions, and not for user functions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Thu, 31 May 2012, Robert Haas wrote: > > Thanks. How did you generate this perf report? It's cool, because I > haven't figured out how to make perf generate a report that is easily > email-able, and it seems you have. I did pretty much what you have said, e.g. attached it to the running process with perf record -g -p PID and then perf report -g > output And postgresql was compiled with cflags=-g > > The only trouble is that there's no call stack information here for > s_lock or PinBuffer, which is what I really want. It seems to have > spit out call stack information only for the kernel functions, and not > for user functions. Yes, I forgot to clean the old binaries when I recompiled with cflags=-g. So now it is fixed. I attach the updated perf report (i.e. the first 10000 lines of it to reduce the file size). Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 31, 2012 at 11:23 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Thu, 31 May 2012, Robert Haas wrote: >> >> Thanks. How did you generate this perf report? It's cool, because I >> haven't figured out how to make perf generate a report that is easily >> email-able, and it seems you have. > > I did pretty much what you have said, e.g. > attached it to running process by > perf record -g -p PID > and then > perf report -g > output Ah, interesting. I never tried sending the output to a file. > And postgresql was compiled with cflags=-g >> >> The only trouble is that there's no call stack information here for >> s_lock or PinBuffer, which is what I really want. It seems to have >> spit out call stack information only for the kernel functions, and not >> for user functions. > > Yes, I forgot to clean the old binaries when recompiled with cflags=-g. > So not it is fixed. I attach the updated perf report (i.e. the first 10000 > lines of it to reduce the file size). Oh, ho. So from this we can see that the problem is that we're getting huge amounts of spinlock contention when pinning and unpinning index pages. It would be nice to have a self-contained reproducible test case for this, so that we could experiment with it on other systems. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Thu, May 31, 2012 at 10:23 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Thu, 31 May 2012, Robert Haas wrote: >> >> >> Thanks. How did you generate this perf report? It's cool, because I >> haven't figured out how to make perf generate a report that is easily >> email-able, and it seems you have. > > > I did pretty much what you have said, e.g. > attached it to running process by > perf record -g -p PID > and then > perf report -g > output > > And postgresql was compiled with cflags=-g > >> >> The only trouble is that there's no call stack information here for >> s_lock or PinBuffer, which is what I really want. It seems to have >> spit out call stack information only for the kernel functions, and not >> for user functions. > > > Yes, I forgot to clean the old binaries when recompiled with cflags=-g. > So not it is fixed. I attach the updated perf report (i.e. the first 10000 > lines of it to reduce the file size). That's basically what we needed. The index scans driven by the sequential scans are all simultaneously pointing at the same couple of pages in the index. They are constantly pinning and unpinning -- the database is schizophrenically going back and forth between the key pages in the index being and not being allowed to be candidates for buffer eviction. Raising shared buffers doesn't help because it's the marking of the buffers as available for eviction, not the eviction process itself, that is the problem. IOS doesn't help because it's index relation buffers, not heap buffers, that we are getting bound up on (although if the btree were fixed it's entirely possible the problem could head right back to the heap for non-IOS scans). merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Thu, 31 May 2012, Robert Haas wrote: > Oh, ho. So from this we can see that the problem is that we're > getting huge amounts of spinlock contention when pinning and unpinning > index pages. > > It would be nice to have a self-contained reproducible test case for > this, so that we could experiment with it on other systems. I have created it a few days ago: http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php It is still valid. And I'm using exactly it to test. The only thing to change is to create a two-col index and drop another index. The scripts are precisely the ones I'm using now. The problem is that in order to see a really big slowdown (10 times slower than a single thread) I've had to raise the buffers to 48g but it was slow for smaller shared buffer settings as well. But I'm not sure how sensitive the test is to the hardware. Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Thu, May 31, 2012 at 11:54 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Thu, 31 May 2012, Robert Haas wrote: > >> Oh, ho. So from this we can see that the problem is that we're >> getting huge amounts of spinlock contention when pinning and unpinning >> index pages. >> >> It would be nice to have a self-contained reproducible test case for >> this, so that we could experiment with it on other systems. > > > I have created it a few days ago: > http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php > > It is still valid. And I'm using exactly it to test. The only thing to > change is to create a two-col index and drop another index. > The scripts are precisely the ones I'm using now. > > The problem is that in order to see a really big slowdown (10 times slower > than a single thread) I've had to raise the buffers to 48g but it was slow > for smaller shared buffer settings as well. > > But I'm not sure how sensitive the test is to the hardware. It's not: high contention on spinlocks is going to suck no matter what hardware you have. I think the problem is pretty obvious now: any case where multiple backends are scanning the same sequence of buffers in a very tight loop is going to display this behavior. It doesn't come up that often: it takes a pretty unusual sequence of events to get a bunch of backends hitting the same buffer like that. Hm, I wonder if you could alleviate the symptoms by making Pin/UnpinBuffer smarter so that frequently pinned buffers could stay pinned longer -- kinda as if your private ref count was hacked to be higher in that case. It would be a complex fix for a narrow issue though. merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 31, 2012 at 2:03 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Thu, May 31, 2012 at 11:54 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: >> On Thu, 31 May 2012, Robert Haas wrote: >> >>> Oh, ho. So from this we can see that the problem is that we're >>> getting huge amounts of spinlock contention when pinning and unpinning >>> index pages. >>> >>> It would be nice to have a self-contained reproducible test case for >>> this, so that we could experiment with it on other systems. >> >> >> I have created it a few days ago: >> http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php >> >> It is still valid. And I'm using exactly it to test. The only thing to >> change is to create a two-col index and drop another index. >> The scripts are precisely the ones I'm using now. >> >> The problem is that in order to see a really big slowdown (10 times slower >> than a single thread) I've had to raise the buffers to 48g but it was slow >> for smaller shared buffer settings as well. >> >> But I'm not sure how sensitive the test is to the hardware. > > It's not: high contention on spinlocks is going to suck no matter what > hardware you have. I think the problem is pretty obvious now: any > case where multiple backends are scanning the same sequence of buffers > in a very tight loop is going to display this behavior. It doesn't > come up that often: it takes a pretty unusual sequence of events to > get a bunch of backends hitting the same buffer like that. > > Hm, I wonder if you could alleviate the symptoms by making making the > Pin/UnpinBuffer smarter so that frequently pinned buffers could stay > pinned longer -- kinda as if your private ref count was hacked to be > higher in that case. It would be a complex fix for a narrow issue > though. This test case is unusual because it hits a whole series of buffers very hard. However, there are other cases where this happens on a single buffer that is just very, very hot, like the root block of a btree index, where the pin/unpin overhead hurts us. I've been thinking about this problem for a while, but it hasn't made it up to the top of my priority list, because workloads where pin/unpin is the dominant cost are still relatively uncommon. I expect them to get more common as we fix other problems. Anyhow, I do have some vague thoughts on how to fix this. Buffer pins are a lot like weak relation locks, in that they are a type of lock that is taken frequently, but rarely conflicts. And the fast-path locking in 9.2 provides a demonstration of how to handle this kind of problem efficiently: making the weak, rarely-conflicting locks cheaper, at the cost of some additional expense when a conflicting lock (in this case, a buffer cleanup lock) is taken. In particular, each backend has its own area to record weak relation locks, and a strong relation lock must scan all of those areas and migrate any locks found there to the main lock table. I don't think it would be feasible to adopt exactly this solution for buffer pins, because page eviction and buffer cleanup locks, while not exactly common, are common enough that we can't require a scan of N per-backend areas every time one of those operations occurs. But, maybe we could have a system of this type that only applies to the very hottest buffers. Suppose we introduce two new buffer flags, BUF_NAILED and BUF_NAIL_REMOVAL. When we detect excessive contention on the buffer header spinlock, we set BUF_NAILED. 
Once we do that, the buffer can't be evicted until that flag is removed, and backends are permitted to record pins in a per-backend area protected by a per-backend spinlock or lwlock, rather than in the buffer header. When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL. At that point, it's no longer permissible to record new pins in the per-backend areas, but old ones may still exist. So then we scan all the per-backend areas and transfer the pins to the buffer header, or else just wait until no more exist; then, we clear both BUF_NAILED and BUF_NAIL_REMOVAL. So the pin algorithm looks like this:

    read buffer header flags (unlocked)
    if ((flags & (BUF_NAILED|BUF_NAIL_REMOVAL)) != BUF_NAILED)
    {
        take buffer header spinlock
        record pin in buffer header
        release buffer header spinlock
    }
    else
    {
        take per-backend lwlock
        record pin in per-backend area
        release per-backend lwlock
        read buffer header flags (unlocked)
        if ((flags & (BUF_NAILED|BUF_NAIL_REMOVAL)) != BUF_NAILED)
        {
            take per-backend lwlock
            forget pin in per-backend area
            release per-backend lwlock
            take buffer header spinlock
            record pin in buffer header
            release buffer header spinlock
        }
    }

Due to memory ordering effects, we might see the buffer as nailed when in fact nail removal has already begun (or even, completed). We can prevent that if (1) the nail removal code sets the nail removal flag before checking the per-backend areas and (2) the pin code checks the nail removal flag AFTER checking the per-backend areas. Since LWLockRelease is a sequencing point, the above algorithm is consistent with that scheme; the initial unlocked test of the buffer header flags is merely a heuristic to avoid extra work in the common case where the buffer isn't nailed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
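The per-backend area itself is left unspecified above; one plausible shape, purely for illustration (the array size and field names are invented, not from any patch), would be:

    /* Purely illustrative layout for the per-backend pin area. */
    #define FASTPIN_SLOTS 64

    typedef struct FastPinSlot
    {
        int         buf_id;     /* which nailed buffer, -1 if slot free */
        int         pincount;   /* local pins recorded against it */
    } FastPinSlot;

    typedef struct FastPinArea
    {
        /* protected by a per-backend spinlock or LWLock, as described above */
        FastPinSlot slots[FASTPIN_SLOTS];
    } FastPinArea;

    /* One FastPinArea per backend would live in shared memory, so that the
     * un-nail path (and buffer eviction) can scan all of them and migrate
     * any pins it finds back into the buffer header. */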
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Thu, May 31, 2012 at 1:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, May 31, 2012 at 2:03 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Thu, May 31, 2012 at 11:54 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: >>> On Thu, 31 May 2012, Robert Haas wrote: >>> >>>> Oh, ho. So from this we can see that the problem is that we're >>>> getting huge amounts of spinlock contention when pinning and unpinning >>>> index pages. >>>> >>>> It would be nice to have a self-contained reproducible test case for >>>> this, so that we could experiment with it on other systems. >>> >>> >>> I have created it a few days ago: >>> http://archives.postgresql.org/pgsql-hackers/2012-05/msg01143.php >>> >>> It is still valid. And I'm using exactly it to test. The only thing to >>> change is to create a two-col index and drop another index. >>> The scripts are precisely the ones I'm using now. >>> >>> The problem is that in order to see a really big slowdown (10 times slower >>> than a single thread) I've had to raise the buffers to 48g but it was slow >>> for smaller shared buffer settings as well. >>> >>> But I'm not sure how sensitive the test is to the hardware. >> >> It's not: high contention on spinlocks is going to suck no matter what >> hardware you have. I think the problem is pretty obvious now: any >> case where multiple backends are scanning the same sequence of buffers >> in a very tight loop is going to display this behavior. It doesn't >> come up that often: it takes a pretty unusual sequence of events to >> get a bunch of backends hitting the same buffer like that. >> >> Hm, I wonder if you could alleviate the symptoms by making making the >> Pin/UnpinBuffer smarter so that frequently pinned buffers could stay >> pinned longer -- kinda as if your private ref count was hacked to be >> higher in that case. It would be a complex fix for a narrow issue >> though. > > This test case is unusual because it hits a whole series of buffers > very hard. However, there are other cases where this happens on a > single buffer that is just very, very hot, like the root block of a > btree index, where the pin/unpin overhead hurts us. I've been > thinking about this problem for a while, but it hasn't made it up to > the top of my priority list, because workloads where pin/unpin is the > dominant cost are still relatively uncommon. I expect them to get > more common as we fix other problems. > > Anyhow, I do have some vague thoughts on how to fix this. Buffer pins > are a lot like weak relation locks, in that they are a type of lock > that is taken frequently, but rarely conflicts. And the fast-path > locking in 9.2 provides a demonstration of how to handle this kind of > problem efficiently: making the weak, rarely-conflicting locks > cheaper, at the cost of some additional expense when a conflicting > lock (in this case, a buffer cleanup lock) is taken. In particular, > each backend has its own area to record weak relation locks, and a > strong relation lock must scan all of those areas and migrate any > locks found there to the main lock table. I don't think it would be > feasible to adopt exactly this solution for buffer pins, because page > eviction and buffer cleanup locks, while not exactly common, are > common enough that we can't require a scan of N per-backend areas > every time one of those operations occurs. > > But, maybe we could have a system of this type that only applies to > the very hottest buffers. Suppose we introduce two new buffer flags, > BUF_NAILED and BUF_NAIL_REMOVAL. 
When we detect excessive contention > on the buffer header spinlock, we set BUF_NAILED. Once we do that, > the buffer can't be evicted until that flag is removed, and backends > are permitted to record pins in a per-backend area protected by a > per-backend spinlock or lwlock, rather than in the buffer header. > When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL. Hm, couple questions: how do you determine if/when to un-nail a buffer, and who makes that decision (bgwriter?) Is there a limit to how many buffers you are allowed to nail? It seems like a much stronger idea, but one downside I see vs the 'pin for longer idea' I was kicking around was how to deal with stale nailed buffers and keep them from growing uncontrollably to the point where you have to either stop nailing or forcibly evict them. merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Thu, May 31, 2012 at 3:25 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > Hm, couple questions: how do you determine if/when to un-nail a > buffer, and who makes that decision (bgwriter?) Well, I think some experimentation might be required, but my first thought is to tie it into buffer eviction. If we record a pin in some side array in lieu of acquiring the buffer header spinlock, then we're not bumping the usage count. So perhaps what we might do is - when we first nail the buffer, we set its usage count to something much higher than what's normally allowed, like say 50. When it gets down to 0 and somebody tries to evict it, we scan the per-backend arrays and count the number of pins we find there, and set the usage count to 5 * that number. The details could be completely wrong here. The point is that I think every so often we can just reconsider whether the buffer is still worthy of being nailed. Exactly how often we should reconsider it seems like an empirical question. > Is there a limit to > how many buffers you are allowed to nail? Probably not. The system had better be designed so that the threshold for nailing a buffer is pretty high. It should be exceedingly unlikely for more than a small percentage of buffers to meet the criteria; if it happens, then there's probably a good reason. > It seems like a much > stronger idea, but one downside I see vs the 'pin for longer idea' i > was kicking around was how to deal stale nailed buffers and keeping > them from uncontrollably growing so that you have to either stop > nailing or forcibly evicting them. I think we're really talking about variants on the same theme. My idea is basically "pin forever", which is taking "longer" to its logical extreme, but even if it were something short of forever I think you'd still need some way to flush the stale pins. Otherwise, vacuum can get stuck, etc. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
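Spelled out, the reconsideration step could look roughly like this (a sketch of the heuristic as just described; the constants and helper names are invented and, as said above, the details could be completely wrong):

    #define NAIL_INITIAL_USAGE   50     /* granted when the buffer is first nailed */
    #define NAIL_USAGE_PER_PIN    5     /* granted per pin found at reconsideration */

    typedef struct NailedBuf { int usage_count; } NailedBuf;

    /* hypothetical helpers, not real functions */
    extern int  count_pins_in_per_backend_areas(NailedBuf *buf);
    extern void start_nail_removal(NailedBuf *buf);

    /* Called when the clock sweep finds a nailed buffer with usage_count == 0. */
    static void
    reconsider_nailed_buffer(NailedBuf *buf)
    {
        int     pins = count_pins_in_per_backend_areas(buf);

        if (pins > 0)
        {
            /* still hot: keep it nailed, give it another lease on life */
            buf->usage_count = NAIL_USAGE_PER_PIN * pins;
        }
        else
        {
            /* nobody is using it through the fast path any more: un-nail it,
             * migrating any straggling pins back into the buffer header */
            start_nail_removal(buf);
        }
    }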
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Sun, May 27, 2012 at 11:45 AM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > Hi, > > I did another test using the same data and the same code, which I've > provided before and the performance of the single thread seems to be > degrading quadratically with the number of threads. > > Here are the results: > Nthreads Time_to_execute_one_thread > 1 8.1 > 2 7.8 > 3 8.1 > 4 9.0 > 5 10.2 > 6 11.4 > 7 13.3 > 8 16.1 > 9 19.0 > 10 21.4 > 11 23.8 > 12 27.3 > 13 30.2 > 14 32.0 > 15 34.1 > 16 37.5 This looks more like a skewed hyperbola than a quadratic. In the higher region, it doesn't seem to be increasing at an increasing rate, but rather increasing at a constant rate of about 2.5 sec per additional backend, i.e. converging to an asymptote with a slope of 2.5. A pedantic point, perhaps, but if you are trying to infer the nature of the bottleneck from the shape of the curve, they lead to rather different sets of hypotheses. Cheers, Jeff
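For what it's worth, the step-to-step increments over the last six rows of the quoted table are about 2.4, 3.5, 2.9, 1.8, 2.1 and 3.4 seconds -- noisy, but hovering around a constant ~2.7 s per added backend rather than growing, which is the signature of an asymptotically linear (hyperbolic) curve rather than a quadratic one.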
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Thu, May 31, 2012 at 9:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > Oh, ho. So from this we can see that the problem is that we're > getting huge amounts of spinlock contention when pinning and unpinning > index pages. > > It would be nice to have a self-contained reproducible test case for > this, so that we could experiment with it on other systems. I just posted a patch under subject "pgbench--new transaction type" that introduces a pgbench -P option. I think that that would do a good job of simulating unique-key look-ups on the inner side of a nested loop (which is basically what we have here) and so creating contention on index pages. Right now I don't have anything with more than 2 CPUs and 2 is not high enough to get much contention so I can't post any meaningful numbers. (pgbench -P might also be of interest in hash index investigation) Cheers, Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Thu, May 31, 2012 at 11:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > This test case is unusual because it hits a whole series of buffers > very hard. However, there are other cases where this happens on a > single buffer that is just very, very hot, like the root block of a > btree index, where the pin/unpin overhead hurts us. I think that a very, very hot page is also the problem here, not a whole sequence of hot pages. Most of his buffer content share lwlocks are on just two buffers, and most of his blocked buffer mapping lwlocks are on just two partitions. So I am guessing that almost all of his spin-lock contention from Pin and Unpin is also coming from those same two buffers. Why there are two buffers when there is only one index root block involved, I don't know. Cheers, Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Wed, May 30, 2012 at 6:10 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Wed, 30 May 2012, Jeff Janes wrote: > >> But anyway, is idt_match a fairly static table? If so, I'd partition >> that into 16 tables, and then have each one of your tasks join against >> a different one of those tables. That should relieve the contention >> on the index root block, and might have some other benefits as well. > > > No, idt_match is getting filled by multi-threaded copy() and then joined > with 4 other big tables like idt_phot. The result is then split into > partitions. That does make things more complicated. But could you partition it at that level and then do the joins partition-wise? I don't have much experience at data partitioning (well, I do, but the experience is with partitioning in Perl with terabytes of flat files, not in PG :) ) but I think that once you have your partitioning keys you want to apply them the same way up and down the data set. Cheers, Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Thu, 31 May 2012, Jeff Janes wrote: >> >> No, idt_match is getting filled by multi-threaded copy() and then joined >> with 4 other big tables like idt_phot. The result is then split into >> partitions. > > That does make things more complicated. But you could you partition > it at that level and then do the joins partition-wise? Unfortunately the information to understand in which partition the data needs to be put in is only available from the idt_match table. So I have to do at least one join with idt_match. But I will experiment further with ways to perform queries, I just stopped doing that because I saw these problems with scalability. > I don't have much experience at data partitioning (well, I do, but the > experience is with partitioning in Perl with terabytes of flat files, > not in PG :) ) but I think that once you have your partitioning keys > you want to apply them the same way up and down the data set. I'm not sure what you mean by "the same way up and down the data set". Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Florian Pflug
Date:
On May 31, 2012, at 20:50, Robert Haas wrote: > Suppose we introduce two new buffer flags, > BUF_NAILED and BUF_NAIL_REMOVAL. When we detect excessive contention > on the buffer header spinlock, we set BUF_NAILED. Once we do that, > the buffer can't be evicted until that flag is removed, and backends > are permitted to record pins in a per-backend area protected by a > per-backend spinlock or lwlock, rather than in the buffer header. > When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL. At that > point, it's no longer permissible to record new pins in the > per-backend areas, but old ones may still exist. So then we scan all > the per-backend areas and transfer the pins to the buffer header, or > else just wait until no more exist; then, we clear both BUF_NAILED and > BUF_NAIL_REMOVAL. A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs by queueing UnpinBuffer() requests for a while before actually updating shared state. I'm imagining having a small unpin queue with, say, 32 entries in backend-local memory. When we unpin a buffer, we add the buffer at the front of the queue. If the queue overflows, we dequeue a buffer from the back of the queue and actually call UnpinBuffer(). If PinBuffer() is called for a queued buffer, we simply remove the buffer from the queue. We'd drain the unpin queue whenever we don't expect a PinBuffer() request to happen for a while. Returning to the main loop is an obvious such place, but there might be others. We could, for example, drain the queue every time we block on a lock or signal, and maybe also before we go do I/O. Or, we could have one such queue per resource owner, and drain it when we release the resource owner. We already avoid calling PinBuffer() multiple times for multiple overlapping pins of a single buffer by a single backend. The strategy above would extend that to not-quite-overlapping pins. best regards, Florian Pflug
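As a concrete sketch of that queue (illustrative only; the names, the linear scan, and the drain policy are assumptions, not a patch):

    #include <string.h>

    #define UNPIN_QUEUE_SIZE 32

    static int  unpin_queue[UNPIN_QUEUE_SIZE];  /* buffer ids, most recent first */
    static int  unpin_queue_len = 0;

    extern void ReallyPinBuffer(int buf_id);     /* the existing shared-state paths */
    extern void ReallyUnpinBuffer(int buf_id);

    static void
    LazyUnpinBuffer(int buf_id)
    {
        if (unpin_queue_len == UNPIN_QUEUE_SIZE)
            ReallyUnpinBuffer(unpin_queue[--unpin_queue_len]);  /* overflow: drop oldest */
        memmove(&unpin_queue[1], &unpin_queue[0], unpin_queue_len * sizeof(int));
        unpin_queue[0] = buf_id;
        unpin_queue_len++;
    }

    static void
    LazyPinBuffer(int buf_id)
    {
        int     i;

        for (i = 0; i < unpin_queue_len; i++)
        {
            if (unpin_queue[i] == buf_id)
            {
                /* still pinned from last time: just take it off the queue */
                memmove(&unpin_queue[i], &unpin_queue[i + 1],
                        (unpin_queue_len - i - 1) * sizeof(int));
                unpin_queue_len--;
                return;
            }
        }
        ReallyPinBuffer(buf_id);
    }

    /* Called when we don't expect another PinBuffer() soon, e.g. on returning
     * to the main loop or before blocking on a lock, a signal, or I/O. */
    static void
    DrainUnpinQueue(void)
    {
        while (unpin_queue_len > 0)
            ReallyUnpinBuffer(unpin_queue[--unpin_queue_len]);
    }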
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Fri, Jun 1, 2012 at 7:47 AM, Florian Pflug <fgp@phlo.org> wrote: > On May31, 2012, at 20:50 , Robert Haas wrote: >> Suppose we introduce two new buffer flags, >> BUF_NAILED and BUF_NAIL_REMOVAL. When we detect excessive contention >> on the buffer header spinlock, we set BUF_NAILED. Once we do that, >> the buffer can't be evicted until that flag is removed, and backends >> are permitted to record pins in a per-backend area protected by a >> per-backend spinlock or lwlock, rather than in the buffer header. >> When we want to un-nail the buffer, we set BUF_NAIL_REMOVAL. At that >> point, it's no longer permissible to record new pins in the >> per-backend areas, but old ones may still exist. So then we scan all >> the per-backend areas and transfer the pins to the buffer header, or >> else just wait until no more exist; then, we clear both BUF_NAILED and >> BUF_NAIL_REMOVAL. > > A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs > by queing UnpinBuffer() requests for a while before actually updating > shared state. > > I'm imagining having a small unpin queue with, say, 32 entries in > backend-local memory. When we unpin a buffer, we add the buffer at the > front of the queue. If the queue overflows, we dequeue a buffer from the > back of the queue and actually call UnpinBuffer(). If PinBuffer() is called > for a queued buffer, we simply remove the buffer from the queue. > > We'd drain the unpin queue whenever we don't expect a PinBuffer() request > to happen for a while. Returning to the main loop is an obvious such place, > but there might be others. We could, for example, drain the queue every time > we block on a lock or signal, and maybe also before we go do I/O. Or, we > could have one such queue per resource owner, and drain it when we release > the resource owner. > > We already avoid calling PinBuffer() multiple times for multiple overlapping > pins of a single buffer by a single backend. The strategy above would extend > that to not-quite-overlapping pins. A potential issue with this line of thinking is that your pin delay queue could get highly pressured by outer portions of the query (as in the OP's case) that will get little or no benefit from the delayed pin. But choosing a sufficiently sized drain queue would work for most reasonable cases assuming 32 isn't enough? Why not something much larger, for example the lesser of 1024, (NBuffers * .25) / max_connections? In other words, for you to get much benefit, you have to pin the buffer sufficiently more than 1/N times among all buffers. Or, maybe you only put contended buffers in the delay queue by watching for the delay count that is going to be returned from s_lock. That forces a lookup on each pin though. merlin
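For scale (a back-of-the-envelope example, assuming the default 8 kB block size and, say, max_connections = 100): with shared_buffers = 48g, NBuffers is about 6.3 million, so (NBuffers * .25) / max_connections comes out around 15,700 and the 1024 cap is what would actually apply; the per-backend allowance only drops below 1024 once shared_buffers falls under roughly 3 GB at that connection count.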
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes: > A potential issue with this line of thinking is that your pin delay > queue could get highly pressured by outer portions of the query (as in > the OP's case) that will get little or no benefit from the delayed > pin. But choosing a sufficiently sized drain queue would work for > most reasonable cases assuming 32 isn't enough? Why not something > much larger, for example the lesser of 1024, (NBuffers * .25) / > max_connections? In other words, for you to get much benefit, you > have to pin the buffer sufficiently more than 1/N times among all > buffers. Allowing each backend to pin a large fraction of shared buffers sounds like a seriously bad idea to me. That's just going to increase thrashing of what remains. More generally, I don't believe that we have any way to know which buffers would be good candidates to keep pinned for a long time. Typically, we don't drop the pin in the first place if we know we're likely to touch that buffer again soon. btree root pages might be an exception, but I'm not even convinced of that one. regards, tom lane
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Florian Pflug
Date:
On Jun 1, 2012, at 15:45, Tom Lane wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> A potential issue with this line of thinking is that your pin delay >> queue could get highly pressured by outer portions of the query (as in >> the OP's case) that will get little or no benefit from the delayed >> pin. But choosing a sufficiently sized drain queue would work for >> most reasonable cases assuming 32 isn't enough? Why not something >> much larger, for example the lesser of 1024, (NBuffers * .25) / >> max_connections? In other words, for you to get much benefit, you >> have to pin the buffer sufficiently more than 1/N times among all >> buffers. > > Allowing each backend to pin a large fraction of shared buffers sounds > like a seriously bad idea to me. That's just going to increase > thrashing of what remains. Right, that was one of the motivations for suggesting the small queue. At least that way, the number of buffers optimistically pinned by each backend is limited. The other was that once the outer portions plough through more than a few pages per iteration of the sub-plan, the cost of doing that should dominate the cost of pinning and unpinning. > More generally, I don't believe that we have any way to know which > buffers would be good candidates to keep pinned for a long time. I'd think that pinning a buffer which we've only recently unpinned is a pretty good indication that the same thing will happen again. My proposed algorithm could be made to use exactly that criterion by tracking a little bit more state. We'd have to tag queue entries with a flag indicating whether they are

    Unpinned (COLD)
    Pinned, and unpinning should be delayed (HOT)
    Waiting to be unpinned (LUKEWARM)

UnpinBuffer() would check if the buffer is HOT, and if so add it to the queue with flag LUKEWARM. Otherwise, it'd get immediately unpinned and flagged as COLD (adding it to the queue if necessary). PinBuffer() would pin the buffer and mark it as HOT if it was COLD, and just mark it as HOT if it was LUKEWARM. If the buffer isn't on the queue already, PinBuffer() would simply pin it and be done. This would give the following behaviour for a buffer that is pinned repeatedly:

    PinBuffer():   <not on queue> -> <not on queue>  (refcount incremented)
    UnpinBuffer(): <not on queue> -> COLD            (refcount decremented)
    ...
    PinBuffer():   COLD -> HOT                       (refcount incremented)
    UnpinBuffer(): HOT -> LUKEWARM                   (refcount *not* decremented)
    ...
    PinBuffer():   LUKEWARM -> HOT                   (refcount *not* incremented)
    UnpinBuffer(): HOT -> LUKEWARM                   (refcount *not* decremented)
    ...

> Typically, we don't drop the pin in the first place if we know we're > likely to touch that buffer again soon. btree root pages might be an > exception, but I'm not even convinced of that one. But Sergey's use-case pretty convincingly shows that, more generally, inner sides of a nested loop join are also an exception, no? At least if the inner side is either an index scan, or a seqscan of a really small table. best regards, Florian Pflug
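In code form, the state machine above might look something like this (a sketch only; the queue lookup/insert helpers are assumed, and the names are invented):

    typedef enum PinQueueState
    {
        PQ_COLD,        /* unpinned */
        PQ_HOT,         /* pinned, and unpinning should be delayed */
        PQ_LUKEWARM     /* waiting to be unpinned */
    } PinQueueState;

    extern void ReallyPinBuffer(int buf_id);     /* increments the shared refcount */
    extern void ReallyUnpinBuffer(int buf_id);   /* decrements the shared refcount */
    extern PinQueueState *queue_lookup(int buf_id);             /* NULL if not queued */
    extern PinQueueState *queue_add(int buf_id, PinQueueState s);

    static void
    SketchPin(int buf_id)
    {
        PinQueueState *st = queue_lookup(buf_id);

        if (st == NULL)
            ReallyPinBuffer(buf_id);             /* not on queue: plain pin */
        else if (*st == PQ_COLD)
        {
            ReallyPinBuffer(buf_id);             /* re-pinned soon after an unpin */
            *st = PQ_HOT;
        }
        else if (*st == PQ_LUKEWARM)
            *st = PQ_HOT;                        /* reuse the still-held pin */
    }

    static void
    SketchUnpin(int buf_id)
    {
        PinQueueState *st = queue_lookup(buf_id);

        if (st != NULL && *st == PQ_HOT)
            *st = PQ_LUKEWARM;                   /* keep the shared pin for now */
        else
        {
            ReallyUnpinBuffer(buf_id);
            if (st == NULL)
                queue_add(buf_id, PQ_COLD);
            else
                *st = PQ_COLD;
        }
    }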
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Fri, Jun 1, 2012 at 8:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> A potential issue with this line of thinking is that your pin delay >> queue could get highly pressured by outer portions of the query (as in >> the OP's case) that will get little or no benefit from the delayed >> pin. But choosing a sufficiently sized drain queue would work for >> most reasonable cases assuming 32 isn't enough? Why not something >> much larger, for example the lesser of 1024, (NBuffers * .25) / >> max_connections? In other words, for you to get much benefit, you >> have to pin the buffer sufficiently more than 1/N times among all >> buffers. > > Allowing each backend to pin a large fraction of shared buffers sounds > like a seriously bad idea to me. That's just going to increase > thrashing of what remains. By 'large fraction', you mean 25%? You could always set it lower, say 5%. But if you can be smarter about which buffers to put in, I agree: a smaller queue is better. > More generally, I don't believe that we have any way to know which > buffers would be good candidates to keep pinned for a long time. > Typically, we don't drop the pin in the first place if we know we're > likely to touch that buffer again soon. btree root pages might be an > exception, but I'm not even convinced of that one. Why not (unless Florian's warming concept is a better bet) hook it to spinlock contention? That's what we're trying to avoid after all. s_lock can be modified to return if it had to delay. PinBuffer could watch for that and stick it in the queue. Both Florian's idea (AIUI) and an s_lock based implementation require you to search your local queue on every pin/unpin, which I think is the real cost. Robert's doesn't, although it is a more complicated approach. merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Ants Aasma
Date:
On Fri, Jun 1, 2012 at 5:57 PM, Florian Pflug <fgp@phlo.org> wrote: > My proposed algorithm could be made to use exactly that criterion > by tracking a little bit more state. We'd have to tag queue entries > with a flag indicating whether they are > > Unpinned (COLD) > > Pinned, and unpinning should be delayed (HOT) > > Waiting to be unpinned (LUKEWARM) This sounds awfully similar to the LIRS/CLOCK-Pro cache replacement algorithms. They manage buffers by reuse distance based on the last two accesses. Because both algorithms demonstrate very good performance over a wide range of cache sizes and workloads, it might be worth exploring how they could be applicable here. On the other hand, I agree with Merlin that searching the local queue for every pin could be too heavy. Robert's approach amounts to categorizing some buffers as so hot that we basically lift them out of the regular replacement algorithm management and don't even bother tracking their usage, on the grounds that it will be cheaper to figure out their usage state after the fact. This might have some interesting interactions with replacement algorithms that naturally keep separate tiers of buffers. I think I'm going to try to see where that train of thought takes me. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote: > A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs > by queing UnpinBuffer() requests for a while before actually updating > shared state. So, what happens when somebody wants a cleanup lock on the buffer you've decided to keep pinned? We have this problem already; I'm wary of making it worse. > We'd drain the unpin queue whenever we don't expect a PinBuffer() request > to happen for a while. Returning to the main loop is an obvious such place, > but there might be others. However, on a workload like pgbench -S, dropping the pin when you return to the main loop would render the optimization useless. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Jeff Janes
Date:
On Fri, Jun 1, 2012 at 10:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote: > >> We'd drain the unpin queue whenever we don't expect a PinBuffer() request >> to happen for a while. Returning to the main loop is an obvious such place, >> but there might be others. > > However, on a workload like pgbench -S, dropping the pin when you > return to the main loop would render the optimization useless. But do we need the optimization on a workload like pgbench -S? You did some pretty massive scalability tests on that front, and I don't think this problem was identified in them. Cheers, Jeff
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Florian Pflug
Date:
On Jun1, 2012, at 19:51 , Robert Haas wrote: > On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote: >> A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs >> by queing UnpinBuffer() requests for a while before actually updating >> shared state. > > So, what happens when somebody wants a cleanup lock on the buffer > you've decided to keep pinned? We have this problem already; I'm wary > of making it worse. She has to wait, or skip the cleanup. We already keep buffers pinned for potentially a long time - if cursors enter the game even over multiple commands, I think. And even without cursors, a nestloop plan can keep buffers used by the outer sub-plan pinned while the inner plan is executing. So giving the inner plan the chance to do the same seems not unreasonable. Plus, the more refined version of that idea from my reply to Tom only delays the unpinning after the buffer was pinned twice within a short period of time. After that, it seems to me that the probability of it getting pinned again is high enough to warrant slightly penalizing a potential cleanup. >> We'd drain the unpin queue whenever we don't expect a PinBuffer() request >> to happen for a while. Returning to the main loop is an obvious such place, >> but there might be others. > > However, on a workload like pgbench -S, dropping the pin when you > return to the main loop would render the optimization useless. Well, we could drain the queue only if the read() from the socket blocks. But does pgbench -S really manage to cause significant spinlock contention due to buffer pinning/unpinning? I'd expect other things to dominate there.. best regards, Florian Pflug
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Fri, Jun 1, 2012 at 2:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Fri, Jun 1, 2012 at 10:51 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote: >> >>> We'd drain the unpin queue whenever we don't expect a PinBuffer() request >>> to happen for a while. Returning to the main loop is an obvious such place, >>> but there might be others. >> >> However, on a workload like pgbench -S, dropping the pin when you >> return to the main loop would render the optimization useless. > > But do we need the optimization on a workload like pgbench -S? You > did some pretty massive scalability tests on that front, and I don't > think this problem was identified in them. I don't know if I mentioned it on list, but I believe there is an effect, and I believe it will get stronger as we continue to make improvements in other areas, and as we continue to ramp up the number of cores. Even now, you can get s_lock taking >1% of the CPU time, which is not trivial, and not all of that is from buffer-pinning, but it's in the mix, so I'm kind of reluctant to accept a solution that involves assuming that the problem always has to occur within a transaction rather than across transactions. I think it's a dead-end anyway on other grounds: we ought to be trying to find ways to hold buffer pins for less time, not more time. Another thought is that if the problem is limited to the root index block, then we could consider less general solutions, like allowing backends to cache the root index block and sending some kind of invalidation when it gets split. But I don't know that the root index block is the only place where we can have this problem, so I'm a little nervous about that angle, too. Still, it might be worth some research. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Fri, Jun 1, 2012 at 2:54 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jun1, 2012, at 19:51 , Robert Haas wrote: >> On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote: >>> A simpler idea would be to collapse UnpinBuffer() / PinBuffer() pairs >>> by queing UnpinBuffer() requests for a while before actually updating >>> shared state. >> >> So, what happens when somebody wants a cleanup lock on the buffer >> you've decided to keep pinned? We have this problem already; I'm wary >> of making it worse. > > She has to wait, or skip the cleanup. We already keep buffers pinned for > potentially a long time - if cursors enter the game even over multiple > commands, I think. And even without cursors, a nestloop plan can keep > buffers used by the outer sub-plan pinned while the inner plan is > executing. So giving the inner plan the chance to do the same seems not > unreasonable. Maybe. TBH, I'm unhappy about how easy it is to leave pins floating around and would like to reduce that further, not increase it. > Plus, the more refined version of that idea from my reply to Tom only > delays the unpinning after the buffer was pinned twice within a short > period of time. After that, it seems to me that the probability of it > getting pinned again is high enough to warrant slightly penalizing a > potential cleanup. > >>> We'd drain the unpin queue whenever we don't expect a PinBuffer() request >>> to happen for a while. Returning to the main loop is an obvious such place, >>> but there might be others. >> >> However, on a workload like pgbench -S, dropping the pin when you >> return to the main loop would render the optimization useless. > > Well, we could drain the queue only if the read() from the socket blocks. I believe that it's going to block frequently on this type of workload. pgbench isn't fast enough to get the next query over there by the time postgres is ready for the next query. > But does pgbench -S really manage to cause significant spinlock contention > due to buffer pinning/unpinning? I'd expect other things to dominate there.. See previous note. It may not be the biggest effect, but I think it's in the mix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > Another thought is that if the problem is limited to the root index > block, then we could consider less general solutions, like allowing > backends to cache the root index block and sending some kind of > invalidation when it gets split. Possibly worth noting here that we already do exactly that for btree metapages. However, I believe that case is a bit different, in that using slightly stale metapage data is nonfatal. Also, anything like this can be a win only if the cached data can be expected to have pretty long lifetime; so you would not want to cache the root page until the index had gotten large enough that root page changes were infrequent. Not sure how that could be mechanized ... regards, tom lane
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Florian Pflug
Date:
On Jun 1, 2012, at 21:07, Robert Haas wrote: > On Fri, Jun 1, 2012 at 2:54 PM, Florian Pflug <fgp@phlo.org> wrote: >> On Jun1, 2012, at 19:51 , Robert Haas wrote: >>> On Fri, Jun 1, 2012 at 8:47 AM, Florian Pflug <fgp@phlo.org> wrote: >>>> We'd drain the unpin queue whenever we don't expect a PinBuffer() request >>>> to happen for a while. Returning to the main loop is an obvious such place, >>>> but there might be others. >>> >>> However, on a workload like pgbench -S, dropping the pin when you >>> return to the main loop would render the optimization useless. >> >> Well, we could drain the queue only if the read() from the socket blocks. > > I believe that it's going to block frequently on this type of > workload. pgbench isn't fast enough to get the next query over there > by the time postgres is ready for the next query. > >> But does pgbench -S really manage to cause significant spinlock contention >> due to buffer pinning/unpinning? I'd expect other things to dominate there.. > > See previous note. It may not be the biggest effect, but I think it's > in the mix. Ok, now you've lost me. If the read() blocks, how on earth can a few additional pins/unpins ever account for any meaningful execution time? It seems to me that once read() blocks we're talking about a delay in the order of the scheduling granularity (i.e., milliseconds, in the best case), while even in the worst case pinning a buffer shouldn't take more than 1000 cycles (for comparison, I think a cache miss across all layers costs a few hundred cycles). So there's at the very least 3 orders of magnitude between those two... best regards, Florian Pflug
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Fri, Jun 1, 2012 at 3:16 PM, Florian Pflug <fgp@phlo.org> wrote: > Ok, now you've lost me. If the read() blocks, how on earth can a few > additional pins/unpins ever account for any meaningful execution time? > > It seems to me that once read() blocks we're talking about a delay in the > order of the scheduling granularity (i.e., milliseconds, in the best case), > while even in the word case pinning a buffer shouldn't take more than > 1000 cycles (for comparison, I think a cache miss across all layers costs > a few hundred cycles). So there's at the very least 3 order of magnitude > between those two... I'm not sure what you want me to say here. s_lock shows up in the profile, and some of that is from PinBuffer. I think any detectable number of calls to s_lock is a sign of Bad Things (TM). I can't reproduce anything as severe as what the OP is seeing, but what does that prove? In a couple years we'll have systems with 128 cores floating around, and things that are minor problems at 32 or even 64 cores will be crippling at 128 cores. IME, spinlock contention has a very sharp tipping point. It's only a minor annoyance and then you hit some threshold number of cores and, bam, you're spending 70-90% of your time across all cores fighting over that one spinlock. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Fri, Jun 1, 2012 at 3:40 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jun 1, 2012 at 3:16 PM, Florian Pflug <fgp@phlo.org> wrote: >> Ok, now you've lost me. If the read() blocks, how on earth can a few >> additional pins/unpins ever account for any meaningful execution time? >> >> It seems to me that once read() blocks we're talking about a delay in the >> order of the scheduling granularity (i.e., milliseconds, in the best case), >> while even in the word case pinning a buffer shouldn't take more than >> 1000 cycles (for comparison, I think a cache miss across all layers costs >> a few hundred cycles). So there's at the very least 3 order of magnitude >> between those two... > > I'm not sure what you want me to say here. s_lock shows up in the > profile, and some of that is from PinBuffer. I think any detectable > number of calls to s_lock is a sign of Bad Things (TM). I can't > reproduce anything as severe as what the OP is seeing, but what does > that prove? In a couple years we'll have systems with 128 cores > floating around, and things that are minor problems at 32 or even 64 > cores will be crippling at 128 cores. IME, spinlock contention has a > very sharp tipping point. It's only a minor annoyance and then you > hit some threshold number of cores and, bam, you're spending 70-90% of > your time across all cores fighting over that one spinlock. I think your approach, nailing buffers, is really the way to go. It nails buffers based on detected contention which is very desirable -- uncontended spinlocks aren't broken and don't need to be fixed. It also doesn't add overhead in the general case whereas a side by side backend queue does. Another nice aspect is that you're not changing the lifetime of the pin as the backend sees it but storing the important stuff (the interplay with usage_count is a nice touch) on the buffer itself -- you want to keep as little as possible in the backend private side and your patch does that; it's more amenable to 3rd party intervention (flush your buffers right now!) than extended pins. It exploits the fact that pins can overlap and that the reference count is useless if the buffer is always in memory anyways. It immediately self corrects when the first backend gripes whereas a per backend solution will grind down as each backend independently determines it's got a problem -- not pleasant if your case is 'walking' a set of buffers. Buffer pins aren't a cache: with a cache you are trying to mask a slow operation (like a disk i/o) with a faster one such that the number of slow operations is minimized. Buffer pins however are very different in that we only care about contention on the reference count (the buffer itself is not locked!) which makes me suspicious that caching type algorithms are the wrong place to be looking. I think it comes down to picking between your relatively complex but general lock displacement approach or a specific strategy based on known bottlenecks. merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Ants Aasma
Date:
On Sat, Jun 2, 2012 at 1:48 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > Buffer pins aren't a cache: with a cache you are trying to mask a slow > operation (like a disk i/o) with a faster one such that the number of slow > operations is minimized. Buffer pins however are very different in > that we only care about contention on the reference count (the buffer > itself is not locked!) which makes me suspicious that caching type > algorithms are the wrong place to be looking. I think it comes down to > picking between your relatively complex but general lock displacement > approach and a specific strategy based on known bottlenecks. I agree that pins aren't like a cache. I mentioned the caching algorithms because they work based on access frequency and highly contended locks are likely to be accessed frequently even from a single backend. However this only makes sense for the delayed unpinning method, and I also have come to the conclusion that it's not likely to work well. Besides delaying cleanup, the overhead for the common case of uncontended access is just too much. It seems to me that even the nailing approach will need a replacement algorithm. The local pins still need to be published globally and because shared memory size is fixed, the maximum amount of locally pinned nailed buffers needs to be limited as well. But anyway, I managed to completely misread the profile that Sergey gave. Somehow I missed that the time went into the retry TAS in slock instead of the inlined TAS. This shows that the issue isn't just cacheline ping-pong but cacheline stealing. This could be somewhat mitigated by making pinning lock-free. The Nb-GCLOCK paper that Robert posted earlier in another thread describes an approach for this. I have a WIP patch (attached) that makes the clock sweep lock-free in the common case. This patch gave a 40% performance increase for an extremely allocation heavy load running with 64 clients on a 4 core 1 socket system, lesser gains across the board. Pinning has a shorter lock duration (and a different lock type) so the gain might be less, or it might be a larger problem and post a higher gain. Either way, I think the nailing approach should be explored further; cacheline ping-pong could still be a problem with a higher number of processors and losing the spinlock also loses the ability to detect contention. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Attachment
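For readers who do not want to dig through the attachment, the core idea can be sketched roughly like this (invented names and GCC builtins; the real WIP patch differs in detail): the clock hand advances with an atomic fetch-and-add and usage counts are decremented with compare-and-swap, so the common sweep path holds no lock at all.

    #include <stdint.h>

    #define NBUFFERS 1024

    typedef struct
    {
        volatile uint32_t usage_count;
        volatile uint32_t refcount;
    } BufSketch;

    static BufSketch buffers[NBUFFERS];
    static volatile uint64_t clock_hand;

    static int
    clock_sweep_victim(void)
    {
        for (;;)    /* real code would give up after a few full sweeps */
        {
            /* advance the shared clock hand without taking any LWLock */
            uint64_t  pos = __sync_fetch_and_add(&clock_hand, 1) % NBUFFERS;
            BufSketch *buf = &buffers[pos];
            uint32_t  usage = buf->usage_count;

            if (usage == 0)
            {
                if (buf->refcount == 0)
                    return (int) pos;   /* candidate victim; the caller must still
                                         * lock the buffer header and re-check
                                         * before actually evicting it */
            }
            else
            {
                /* decrement usage_count with CAS; if the CAS fails, someone
                 * else already changed it and we simply move on */
                __sync_bool_compare_and_swap(&buf->usage_count, usage, usage - 1);
            }
        }
    }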
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Robert Haas
Date:
On Fri, Jun 1, 2012 at 9:55 PM, Ants Aasma <ants@cybertec.at> wrote: > On Sat, Jun 2, 2012 at 1:48 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> Buffer pins aren't a cache: with a cache you are trying to mask a slow >> operation (like a disk i/o) with a faster one such that the number of slow >> operations is minimized. Buffer pins however are very different in >> that we only care about contention on the reference count (the buffer >> itself is not locked!) which makes me suspicious that caching type >> algorithms are the wrong place to be looking. I think it comes down to >> picking between your relatively complex but general lock displacement >> approach and a specific strategy based on known bottlenecks. > > I agree that pins aren't like a cache. I mentioned the caching > algorithms because they work based on access frequency and highly > contended locks are likely to be accessed frequently even from a > single backend. However this only makes sense for the delayed > unpinning method, and I also have come to the conclusion that it's not > likely to work well. Besides delaying cleanup, the overhead for the > common case of uncontended access is just too much. > > It seems to me that even the nailing approach will need a replacement > algorithm. The local pins still need to be published globally and > because shared memory size is fixed, the maximum amount of locally > pinned nailed buffers needs to be limited as well. > > But anyway, I managed to completely misread the profile that Sergey > gave. Somehow I missed that the time went into the retry TAS in slock > instead of the inlined TAS. This shows that the issue isn't just > cacheline ping-pong but cacheline stealing. This could be somewhat > mitigated by making pinning lock-free. The Nb-GCLOCK paper that Robert > posted earlier in another thread describes an approach for this. I > have a WIP patch (attached) that makes the clock sweep lock-free in > the common case. This patch gave a 40% performance increase for an > extremely allocation heavy load running with 64 clients on a 4 core 1 > socket system, lesser gains across the board. Pinning has a shorter > lock duration (and a different lock type) so the gain might be less, > or it might be a larger problem and post a higher gain. Either way, I > think the nailing approach should be explored further; cacheline > ping-pong could still be a problem with a higher number of processors > and losing the spinlock also loses the ability to detect contention. Not sure about the rest of this patch, but this part is definitely bogus: +#if !defined(pg_atomic_fetch_and_set) +#define pg_atomic_fetch_and_set(dst, src, value) \ + do { S_LOCK(&dummy_spinlock); \ + dst = src; \ + src = value; \ + S_UNLOCK(&dummy_spinlock); } while (0) +#endif Locking a dummy backend-local spinlock doesn't provide atomicity across multiple processes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
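A working fallback would either use a real atomic-exchange primitive where the compiler provides one, or place the fallback spinlock in shared memory so that every backend serializes on the same lock. A hedged sketch of what that could look like (the __atomic builtin is GCC-specific; "atomics_fallback_lock" is a hypothetical slock_t that would have to be allocated in shared memory at startup, using the S_LOCK/S_UNLOCK primitives from s_lock.h):

    #if defined(__GNUC__) && defined(__ATOMIC_SEQ_CST)
    /* true atomic exchange: store value into src and return the old value */
    #define pg_atomic_fetch_and_set(dst, src, value) \
        do { (dst) = __atomic_exchange_n(&(src), (value), __ATOMIC_SEQ_CST); } while (0)
    #else
    /* fallback: the lock must live in shared memory, not in a single backend */
    extern slock_t atomics_fallback_lock;

    #define pg_atomic_fetch_and_set(dst, src, value) \
        do { S_LOCK(&atomics_fallback_lock); \
             (dst) = (src); \
             (src) = (value); \
             S_UNLOCK(&atomics_fallback_lock); } while (0)
    #endif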
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Ants Aasma
Date:
On Mon, Jun 4, 2012 at 5:12 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Not sure about the rest of this patch, but this part is definitely bogus: > > +#if !defined(pg_atomic_fetch_and_set) > +#define pg_atomic_fetch_and_set(dst, src, value) \ > + do { S_LOCK(&dummy_spinlock); \ > + dst = src; \ > + src = value; \ > + S_UNLOCK(&dummy_spinlock); } while (0) > +#endif > > Locking a dummy backend-local spinlock doesn't provide atomicity > across multiple processes. Right, of course. I don't know why I assumed that dummy_spinlock would be global. In any case, this is very WIP and doesn't even aspire to be portable yet. The main point was to see if there's any significant performance to be gained by this method. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Mon, Jun 4, 2012 at 9:20 AM, Ants Aasma <ants@cybertec.at> wrote: > On Mon, Jun 4, 2012 at 5:12 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Not sure about the rest of this patch, but this part is definitely bogus: >> >> +#if !defined(pg_atomic_fetch_and_set) >> +#define pg_atomic_fetch_and_set(dst, src, value) \ >> + do { S_LOCK(&dummy_spinlock); \ >> + dst = src; \ >> + src = value; \ >> + S_UNLOCK(&dummy_spinlock); } while (0) >> +#endif >> >> Locking a dummy backend-local spinlock doesn't provide atomicity >> across multiple processes. > > Right, of course. I don't know why I assumed that dummy_spinlock would > be global. In any case, this is very WIP and doesn't even aspire to be > portable yet. The main point was to see if there's any significant > performance to be gained by this method. Yeah -- those are fallback routines in case the compiler primitives don't exist. I think I understand what Ants is doing here: he's reducing the coverage of the free list lock to only cover actually popping a buffer off the free list; it no longer covers the clock sweep. That's a massive win if it works. In order to get away with that he had to decompose all of the clock sweep's manipulations of the strategy state into thread-safe atomic operations. What happens (in the very unlikely, but possible case?) if another backend races to the buffer you've pointed at with 'victim'? It looks like multiple backends share the clock sweep now, but don't you need an extra test to ensure it's still a candidate victim buffer? merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > What happens (in the very unlikely, but possible case?) if another > backend races to the buffer you've pointed at with 'victim'? It looks > like multiple backends share the clock sweep now, but don't you need > an extra test to ensure it's still a candidate victim buffer? Actually, I don't think you do: the existing check on refcount is probably good enough. Hm, why did you get rid of BufferStrategyControl.lastFreeBuffer? merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Ants Aasma
Date:
On Mon, Jun 4, 2012 at 6:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> What happens (in the very unlikely, but possible case?) if another >> backend races to the buffer you've pointed at with 'victim'? It looks >> like multiple backends share the clock sweep now, but don't you need >> an extra test to ensure it's still a candidate victim buffer? > > Actually, I don't think you do: the existing check on refcount is > probably good enough. Hm, why did you get rid of > BufferStrategyControl.lastFreeBuffer? It was dead code as far as I could tell. That change isn't actually relevant for this patch because free-list management is still protected by a lock (except the initial unlocked test that is double-checked under lock) and so doesn't need any adjustment. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
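The unlocked-then-rechecked test mentioned above is the classic double-checked pattern. A self-contained sketch with made-up names, using a pthread mutex to stand in for the free-list lock: the unlocked read lets the common "list is empty" case skip the lock entirely, and the test is repeated under the lock because the first read can race with a concurrent pop.

    #include <pthread.h>

    typedef struct
    {
        int next_free;      /* index of the next buffer on the free list, or -1 */
    } FreeListEntrySketch;

    static FreeListEntrySketch entries[1024];
    static volatile int first_free = -1;
    static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Pop one buffer index off the free list, or return -1 if it is empty. */
    static int
    freelist_pop(void)
    {
        int victim = -1;

        if (first_free < 0)                 /* unlocked fast-path test */
            return -1;

        pthread_mutex_lock(&freelist_lock);
        if (first_free >= 0)                /* re-check under the lock */
        {
            victim = first_free;
            first_free = entries[victim].next_free;
        }
        pthread_mutex_unlock(&freelist_lock);

        return victim;
    }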
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Mon, Jun 4, 2012 at 10:42 AM, Ants Aasma <ants@cybertec.at> wrote: > On Mon, Jun 4, 2012 at 6:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Mon, Jun 4, 2012 at 10:17 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> What happens (in the very unlikely, but possible case?) if another >>> backend races to the buffer you've pointed at with 'victim'? It looks >>> like multiple backends share the clock sweep now, but don't you need >>> an extra test to ensure it's still a candidate victim buffer? >> >> Actually, I don't think you do: the existing check on refcount is >> probably good enough. Hm, why did you get rid of >> BufferStrategyControl.lastFreeBuffer? > > It was dead code as far as I could tell. That change isn't actually > relevant for this patch because free-list management is still > protected by a lock (except the initial unlocked test that is > double-checked under lock) and so doesn't need any adjustment. I have to admit -- this is pretty cool. The changes look pretty clean but I think it's going to need some benchmarking to prove it's actually faster under all workloads, and some more review, though. If there's heavy contention on getting victim buffers you're going to be making a lot of atomic operations -- each one of which will lock down the cache line. A yielding lwlock is obviously going to have very different behaviors under contention, so there needs to be a performance test that really burns up the freelist. If it's Well And Truly faster, you have to immediately start wondering if the freelist lock can't be completely scrapped: buffer invalidation only occurs in fairly rare events -- but when it does ISTM to be fairly counterproductive. In trying to avoid the sweep you end up pushing everyone towards a locky free list. I don't know about you, but I'd prefer to have more consistent behavior (as in, not "drop big table, concurrency drops 50%"). Noted, this doesn't really address the OP's problem :-). I bet it'd supplement the nailing strategy nicely, however. merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
Hi, On Mon, 4 Jun 2012, Ants Aasma wrote: > On Mon, Jun 4, 2012 at 7:44 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > I tried to keep it simple at first to find an answer to the question > if it's even worth trying before expending large effort on it. If > anyone with a multisocket machine would chip in, I'd love to know how > this patch handles on larger machines. I think the most interesting > workloads are read only loads with heavy buffer thrashing but inside OS > memory. Select only pgbench with 32MB shared_buffers was within error > margin, although slightly faster on my machine (Intel i2500K). The > workload that I used to demonstrate gain was an aggregated index scan > to minimise other overheads. I've quickly tested your lockfree-getbuffer.patch patch with the test case you provided and I barely see any improvement (2% at max) https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U tested with 24 core (48 ht cores, Xeon E7- 4807). Although the tps vs number of threads looks weird.... Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Ants Aasma
Date:
On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > I've quickly tested your lockfree-getbuffer.patch patch with the test case > you provided and I barely see any improvement (2% at max) > https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U > tested with 24 core (48 ht cores, Xeon E7- 4807). > Although the tps vs number of threads looks weird.... Was this the range scan on the test table? (sorry about the error in the query, the x should really be id) In that case the results look really suspicious. My machine (4 cores, no ht, @ 4GHz, newer arch) peaked at 90tps with the stated configuration. Even when upping the shared_buffers and enabling indexonlyscan I didn't see more than about 540tps per thread. The test is designed to exercise buffer eviction, doing about 9800 buffer reads per transaction with 32MB of buffers. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Wed, 6 Jun 2012, Ants Aasma wrote: > On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: >> I've quickly tested your lockfree-getbuffer.patch patch with the test case >> you provided and I barely see any improvement (2% at max) >> https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U >> tested with 24 core (48 ht cores, Xeon E7- 4807). >> Although the tps vs number of threads looks weird.... > > Was this the range scan on the test table? (sorry about the error in > the query, the x should really be id) In that case the results look > really suspicious. Yes, my fault partially, because without much thought I've put "value" instead of "x" in the script. Now after replacing it by "id" the tps are much smaller. Here is the tps vs nthreads; I tested up to 10 threads on my 24 cpu system (I disabled HT though): https://docs.google.com/open?id=0B7koR68V2nM1Nk9OcWNJOTRrYVE Your patch clearly improves the situation (the peak tps is ~ 10% higher), but the general picture is the same: flattening of tps vs nthreads. Cheers, S ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Merlin Moncure
Date:
On Wed, Jun 6, 2012 at 2:53 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Wed, 6 Jun 2012, Ants Aasma wrote: > >> On Wed, Jun 6, 2012 at 2:27 PM, Sergey Koposov <koposov@ast.cam.ac.uk> >> wrote: >>> >>> I've quickly tested your lockfree-getbuffer.patch patch with the test >>> case >>> you provided and I barely see any improvement (2% at max) >>> https://docs.google.com/open?id=0B7koR68V2nM1QVBxWGpZdW4wd0U >>> tested with 24 core (48 ht cores, Xeon E7- 4807). >>> Although the tps vs number of threads looks weird.... >> >> >> Was this the range scan on the test table? (sorry about the error in >> the query, the x should really be id) In that case the results look >> really suspicious. > > > Yes, my fault partially, because without much thought I've put "value" > instead of "x" in the script. Now after replacing it by "id" the tps are > much smaller. > > Here is the tps vs nthreads; I tested up to 10 threads on my 24 cpu system > (I disabled HT though): > https://docs.google.com/open?id=0B7koR68V2nM1Nk9OcWNJOTRrYVE > > Your patch clearly improves the situation (the peak tps is ~ 10% higher), but > the general picture is the same: flattening of tps vs nthreads. I think this is the expected result. In the single-user case the spinlock never spins and only has to execute the cache-locking CPU instructions once. Can we see results @24 threads? merlin
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Sergey Koposov
Date:
On Wed, 6 Jun 2012, Merlin Moncure wrote: > I think this is the expected result. In the single-user case the > spinlock never spins and only has to execute the cache-locking CPU > instructions once. Can we see results @24 threads? Here https://docs.google.com/open?id=0B7koR68V2nM1NDJHLUhNSS0zbUk ***************************************************** Sergey E. Koposov, PhD, Research Associate Institute of Astronomy, University of Cambridge Madingley road, CB3 0HA, Cambridge, UK Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/
Re: 9.2beta1, parallel queries, ReleasePredicateLocks, CheckForSerializableConflictIn in the oprofile
From
Ants Aasma
Date:
On Wed, Jun 6, 2012 at 11:42 PM, Sergey Koposov <koposov@ast.cam.ac.uk> wrote: > On Wed, 6 Jun 2012, Merlin Moncure wrote: >> >> I think this is the expected result. In the single-user case the >> spinlock never spins and only has to execute the cache-locking CPU >> instructions once. Can we see results @24 threads? > > > Here https://docs.google.com/open?id=0B7koR68V2nM1NDJHLUhNSS0zbUk Thank you for testing; really interesting results. It looks like this workload really isn't bound by BufFreelistLock, despite my best efforts. I did some perf analysis on my machine. The first thing I noticed was that the profile was dominated by copying buffers into PostgreSQL shared_buffers. Seems obvious after the fact. To see BufFreelistLock contention there needs to be some noticeable work under that lock besides it being frequently acquired. I bumped shared_buffers to 512M and the table size to 30M rows to compensate, making the dataset 4.7G, with 640M of index. Looking at instruction-level annotation of StrategyGetBuffer I saw about 1/3 of the time taken up by the numBufferAllocs and bgwriterLatch atomic changes. So I added a lock-free test for bgwriterLatch, and used a local var for allocation counting, updating the shared var every 16 allocations, making the common case only do one locked instruction. I added padding to separate frequently and less frequently modified vars to different cache lines. This should be good enough for performance testing purposes; a proper solution should at the very least handle flushing the buffer allocation count on backend termination. Attached are perf reports for master and the patched version running the enlarged workload at -c 4. BufFreelistLock doesn't appear under LWLockAcquire anymore, but all the overhead shifted to StrategyGetBuffer. Basically it's the locked op overhead that is being moved around. Some other observations: * While there isn't any noticeable contention, the locking overhead is still huge; looking at instruction-level timings, it seems to be just the cost of getting the lock cache line in exclusive mode. * After patching, StrategyGetBuffer is 59% buffer header lock acquisition. * All of the top 11 items in the profile are dominated by one or two cache misses. Specifically, navigating collision chains for hash_search_with_hash_value, the lock for LWLockAcquire, LWLockRelease, buffer header for StrategyGetBuffer, ReadBuffer_common, PinBuffer, UnpinBuffer, TerminateBufferIO, PrivateRefCount for PinBuffer, PinBuffer_locked, UnpinBuffer, heap tuple for heap_hot_search_buffer. Based on this information, it looks like modifying anything in the buffer header can be a pretty heavy operation. Making heavily accessed buffers read-mostly (e.g. by nailing them) could give a healthy boost to performance. Atomic ops probably won't help significantly to alleviate the cost because the cache lines need to bounce around anyway, except maybe in extreme contention cases where spinning causes stolen cache lines. I also attached a patch with the most recent changes; it should at least fix the regression for the single-client case and maybe improve scaling. This patch doesn't seem useful until the other sources of contention are alleviated. I won't pursue getting this into a committable state unless someone can show a workload where the clock sweep is an actual bottleneck. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
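To make the two changes described above concrete, here is a hedged sketch (invented names, not the attached patch): a backend-local allocation counter that is folded into the shared counter only every 16 allocations, and padding so the hot counter does not share a cache line with the read-mostly fields.

    #include <stdint.h>

    #define CACHE_LINE_SIZE 64
    #define ALLOC_BATCH     16

    typedef struct
    {
        /* read-mostly fields */
        int num_buffers;
        /* padding keeps the hot counter off the read-mostly cache line */
        char pad[CACHE_LINE_SIZE];
        /* frequently modified field */
        volatile uint64_t total_allocs;
    } StrategySketch;

    static StrategySketch strategy;        /* would live in shared memory */
    static int local_allocs;               /* per-backend, needs no locking */

    static void
    count_buffer_alloc(void)
    {
        if (++local_allocs >= ALLOC_BATCH)
        {
            /* one locked instruction per 16 allocations instead of one per allocation */
            __sync_fetch_and_add(&strategy.total_allocs, local_allocs);
            local_allocs = 0;
        }
    }

As the message notes, a real version of this would also have to flush the leftover local count when a backend exits, otherwise up to 15 allocations per backend would go unreported.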