Thread: Speed up transaction completion faster after many relations are accessed in a transaction
Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
Hello,

The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session.

A transaction records each lock it acquires in a LOCALLOCK structure (a pair of lock object and lock mode), and stores the LOCALLOCKs in a hash table in backend-local memory. The hash table automatically expands when the transaction acquires many relations, and it never shrinks. When the transaction commits or aborts, it scans the hash table to find the LOCALLOCKs whose locks need to be released.

The problem is that once some transaction accesses many relations, even subsequent transactions in the same session that only access a few relations take an unreasonably long time to complete, because each of them has to scan the expanded hash table. The attached patch links the LOCALLOCKs to PGPROC, so that releasing locks only needs to scan that list instead of the hash table. The hash table is kept because some functions want to find a particular LOCALLOCK quickly based on its hash value.

This problem was uncovered while evaluating partitioning performance. When the application PREPAREs a statement once and then runs EXECUTE-COMMIT repeatedly, the server creates a generic plan on the 6th EXECUTE. Unfortunately, creation of the generic plan for UPDATE/DELETE currently accesses all partitions of the target table (this itself needs improvement), expanding the LOCALLOCK hash table. As a result, the 7th and later EXECUTEs get slower.

Imai-san confirmed the performance improvement with this patch: https://commitfest.postgresql.org/22/1993/

Regards
Takayuki Tsunakawa
Attachment
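In rough outline, the proposed approach can be sketched as follows. This is only an illustrative sketch, not the attached patch: the procLink field and the session_locallocks variable are invented names, and whether the list header lives in PGPROC (as in this patch) or somewhere backend-local doesn't change the shape. It uses PostgreSQL's doubly-linked list primitives from lib/ilist.h.

#include "lib/ilist.h"

/* Sketch only -- field and variable names here are placeholders. */
typedef struct LOCALLOCK
{
    LOCALLOCKTAG tag;           /* unique identifier of the locked object */
    /* ... existing fields ... */
    dlist_node  procLink;       /* links this entry into the per-session list */
} LOCALLOCK;

/* every LOCALLOCK currently present in the local lock hash table */
static dlist_head session_locallocks = DLIST_STATIC_INIT(session_locallocks);

/* In LockAcquire(), right after the hash entry is created: */
    dlist_push_tail(&session_locallocks, &locallock->procLink);

/* In LockReleaseAll(), walk the list of locks actually held instead of
 * hash_seq_search() over a possibly huge, mostly empty hash table: */
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &session_locallocks)
    {
        LOCALLOCK  *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);

        /* release the lock, unlink the node, and remove the hash entry */
    }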
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes: > The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session. Hm. Putting a list header for a purely-local data structure into shared memory seems quite ugly. Isn't there a better place to keep that? Do we really want a dlist here at all? I'm concerned that bloating LOCALLOCK will cost us when there are many locks involved. This patch increases the size of LOCALLOCK by 25% if I counted right, which does not seem like a negligible penalty. My own thought about how to improve this situation was just to destroy and recreate LockMethodLocalHash at transaction end (or start) if its size exceeded $some-value. Leaving it permanently bloated seems like possibly a bad idea, even if we get rid of all the hash_seq_searches on it. regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My own thought about how to improve this situation was just to destroy
> and recreate LockMethodLocalHash at transaction end (or start)
> if its size exceeded $some-value. Leaving it permanently bloated seems
> like possibly a bad idea, even if we get rid of all the hash_seq_searches
> on it.

That seems like a good idea. Although, it would be good to know that it didn't add too much overhead dropping and recreating the table when every transaction happened to obtain more locks than $some-value. If it did, then maybe we could track the average number of locks taken by recent transactions and just ditch the table after the locks are released if the locks held by the last transaction exceeded the average * 1.something. No need to go near shared memory to do that.

--
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
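For illustration, such a heuristic might be shaped roughly like this. Everything here is a placeholder: the function name, the moving-average weights, and the 1.5 multiplier are assumptions, not anything concretely proposed in this thread.

/* Sketch only: decide, at transaction end, whether the local lock table has
 * grown far beyond what this session typically needs. */
static double avg_locks_per_xact = 16.0;    /* arbitrary starting estimate */

static void
MaybeShrinkLocalLockTable(void)
{
    long    nlocks = hash_get_num_entries(LockMethodLocalHash);

    /* exponential moving average of locks used per transaction */
    avg_locks_per_xact = 0.9 * avg_locks_per_xact + 0.1 * (double) nlocks;

    /*
     * If this transaction used far more locks than recent ones did, rebuild
     * the table at its default size once all the locks have been released,
     * instead of leaving it permanently bloated.  No shared memory involved.
     */
    if ((double) nlocks > avg_locks_per_xact * 1.5)
        request_local_lock_table_rebuild();     /* hypothetical helper */
}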
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Andres Freund
Date:
Hi,

On 2019-02-19 12:52:08 +1300, David Rowley wrote:
> On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > My own thought about how to improve this situation was just to destroy
> > and recreate LockMethodLocalHash at transaction end (or start)
> > if its size exceeded $some-value. Leaving it permanently bloated seems
> > like possibly a bad idea, even if we get rid of all the hash_seq_searches
> > on it.
>
> That seems like a good idea. Although, it would be good to know that
> it didn't add too much overhead dropping and recreating the table when
> every transaction happened to obtain more locks than $some-value. If
> it did, then maybe we could track the average number of locks taken by
> recent transactions and just ditch the table after the locks are released
> if the locks held by the last transaction exceeded the average *
> 1.something. No need to go near shared memory to do that.

Isn't a large portion of benefits in this patch going to be mooted by the locking improvements discussed in the other threads? I.e. there's hopefully not going to be a ton of cases with low overhead where we acquire a lot of locks and release them very soon after. Sure, for DDL etc we will, but I can't see this mattering from a performance POV?

I'm not against doing something like Tom proposes, but heuristics with magic constants like this tend to age poorly / are hard to tune well across systems.

Greetings,

Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My own thought about how to improve this situation was just to destroy
>> and recreate LockMethodLocalHash at transaction end (or start)
>> if its size exceeded $some-value. Leaving it permanently bloated seems
>> like possibly a bad idea, even if we get rid of all the hash_seq_searches
>> on it.

> That seems like a good idea. Although, it would be good to know that
> it didn't add too much overhead dropping and recreating the table when
> every transaction happened to obtain more locks than $some-value. If
> it did, then maybe we could track the average number of locks taken by
> recent transactions and just ditch the table after the locks are released
> if the locks held by the last transaction exceeded the average *
> 1.something. No need to go near shared memory to do that.

Yeah, I'd deliberately avoided saying how we'd choose $some-value ;-). Making it adaptive might not be a bad plan.

regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Isn't a large portion of benefits in this patch going to be mooted by
> the locking improvements discussed in the other threads? I.e. there's
> hopefully not going to be a ton of cases with low overhead where we
> acquire a lot of locks and release them very soon after. Sure, for DDL
> etc we will, but I can't see this mattering from a performance POV?

Mmm ... AIUI, the patches currently proposed can only help for what David called "point lookup" queries. There are still going to be queries that scan a large proportion of a partition tree, so if you've got tons of partitions, you'll be concerned about this sort of thing.

> I'm not against doing something like Tom proposes, but heuristics with
> magic constants like this tend to age poorly / are hard to tune well
> across systems.

I didn't say it had to be a constant ...

regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Tue, 19 Feb 2019 at 12:56, Andres Freund <andres@anarazel.de> wrote:
> Isn't a large portion of benefits in this patch going to be mooted by
> the locking improvements discussed in the other threads? I.e. there's
> hopefully not going to be a ton of cases with low overhead where we
> acquire a lot of locks and release them very soon after. Sure, for DDL
> etc we will, but I can't see this mattering from a performance POV?

I think this patch was born from Amit's partition planner improvement patch. If not that one, which other threads did you have in mind?

A problem exists where, when using a PREPAREd statement against a partitioned table containing many partitions, a generic plan will never be favoured over a custom plan, since the generic plan may not be able to prune partitions the way the custom plan can. The actual problem is that, at some point, we do need to generate a generic plan in order to learn that it is more costly, and doing so requires locking possibly every partition. When plan_cache_mode = auto, this is done on the 6th execution of the statement. After Amit's partition planner changes [1], the custom plan will only lock partitions that are not pruned, so the 6th execution of the statement bloats the local lock table.

[1] https://commitfest.postgresql.org/22/1778/

--
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Andres Freund
Date:
Hi,

On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Isn't a large portion of benefits in this patch going to be mooted by
> > the locking improvements discussed in the other threads? I.e. there's
> > hopefully not going to be a ton of cases with low overhead where we
> > acquire a lot of locks and release them very soon after. Sure, for DDL
> > etc we will, but I can't see this mattering from a performance POV?
>
> Mmm ... AIUI, the patches currently proposed can only help for what
> David called "point lookup" queries. There are still going to be
> queries that scan a large proportion of a partition tree, so if you've
> got tons of partitions, you'll be concerned about this sort of thing.

Agreed - but it seems not unlikely that for those the rest of the planner / executor overhead will entirely swamp any improvement we could make here. If I understand correctly the benchmarks here were made with "point" update and select queries, although the reference in the first post in this thread is a bit vague.

> > I'm not against doing something like Tom proposes, but heuristics with
> > magic constants like this tend to age poorly / are hard to tune well
> > across systems.
>
> I didn't say it had to be a constant ...

Do you have a good idea?

Greetings, Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On 2019-02-18 19:01:06 -0500, Tom Lane wrote: >> Mmm ... AIUI, the patches currently proposed can only help for what >> David called "point lookup" queries. There are still going to be >> queries that scan a large proportion of a partition tree, so if you've >> got tons of partitions, you'll be concerned about this sort of thing. > Agreed - but it seems not unlikely that for those the rest of the > planner / executor overhead will entirely swamp any improvement we could > make here. If I understand correctly the benchmarks here were made with > "point" update and select queries, although the reference in the first > post in this thread is a bit vague. I think what Maumau-san is on about here is that not only does your $giant-query take a long time, but it has a permanent negative effect on all subsequent transactions in the session. That seems worth doing something about. >> I didn't say it had to be a constant ... > Do you have good idea? I think David's on the right track --- keep some kind of moving average of the LOCALLOCK table size for each transaction, and nuke it if it exceeds some multiple of the recent average. Not sure offhand about how to get the data cheaply --- it might not be sufficient to look at transaction end, if we release LOCALLOCK entries before that (but do we?) regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Andres Freund
Date:
Hi,

On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
> "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> > The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same session.
>
> Hm. Putting a list header for a purely-local data structure into shared
> memory seems quite ugly. Isn't there a better place to keep that?

Yea, I think it'd be just as fine to store that in a static variable (best defined directly beside LockMethodLocalHash).

(Btw, I'd be entirely unsurprised if moving away from a dynahash for LockMethodLocalHash would be beneficial.)

> Do we really want a dlist here at all? I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved. This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

It's currently

struct LOCALLOCK {
        LOCALLOCKTAG         tag;                  /*     0    20 */
        /* XXX 4 bytes hole, try to pack */
        LOCK *               lock;                 /*    24     8 */
        PROCLOCK *           proclock;             /*    32     8 */
        uint32               hashcode;             /*    40     4 */
        /* XXX 4 bytes hole, try to pack */
        int64                nLocks;               /*    48     8 */
        _Bool                holdsStrongLockCount; /*    56     1 */
        _Bool                lockCleared;          /*    57     1 */
        /* XXX 2 bytes hole, try to pack */
        int                  numLockOwners;        /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        int                  maxLockOwners;        /*    64     4 */
        /* XXX 4 bytes hole, try to pack */
        LOCALLOCKOWNER *     lockOwners;           /*    72     8 */

        /* size: 80, cachelines: 2, members: 10 */
        /* sum members: 66, holes: 4, sum holes: 14 */
        /* last cacheline: 16 bytes */
};

seems we could trivially squeeze most of the bytes for a dlist node out of padding.

> My own thought about how to improve this situation was just to destroy
> and recreate LockMethodLocalHash at transaction end (or start)
> if its size exceeded $some-value. Leaving it permanently bloated seems
> like possibly a bad idea, even if we get rid of all the hash_seq_searches
> on it.

OTOH, that'll force constant incremental resizing of the hashtable for workloads that regularly need a lot of locks. And I'd assume in most cases that if one transaction needs a lot of locks, it's quite likely that future ones will need a lot of locks, too.

Greetings, Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Andres Freund
Date:
Hi, On 2019-02-18 19:13:31 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2019-02-18 19:01:06 -0500, Tom Lane wrote: > >> Mmm ... AIUI, the patches currently proposed can only help for what > >> David called "point lookup" queries. There are still going to be > >> queries that scan a large proportion of a partition tree, so if you've > >> got tons of partitions, you'll be concerned about this sort of thing. > > > Agreed - but it seems not unlikely that for those the rest of the > > planner / executor overhead will entirely swamp any improvement we could > > make here. If I understand correctly the benchmarks here were made with > > "point" update and select queries, although the reference in the first > > post in this thread is a bit vague. > > I think what Maumau-san is on about here is that not only does your > $giant-query take a long time, but it has a permanent negative effect > on all subsequent transactions in the session. That seems worth > doing something about. Ah, yes, that makes sense. I'm inclined to think however that the original approach presented in this thread is better than the reset-the-whole-hashtable approach. Because: > I think David's on the right track --- keep some kind of moving average of > the LOCALLOCK table size for each transaction, and nuke it if it exceeds > some multiple of the recent average. Not sure offhand about how to get > the data cheaply --- it might not be sufficient to look at transaction > end, if we release LOCALLOCK entries before that (but do we?) Seems too complicated for my taste. And it doesn't solve the issue of having some transactions with few locks (say because the plan can be nicely pruned) interspersed with transactions where a lot of locks are acquired. Greetings, Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On 2019-02-18 18:42:32 -0500, Tom Lane wrote: >> Do we really want a dlist here at all? I'm concerned that bloating >> LOCALLOCK will cost us when there are many locks involved. This patch >> increases the size of LOCALLOCK by 25% if I counted right, which does >> not seem like a negligible penalty. > It's currently [ 80 bytes with several padding holes ] > seems we could trivially squeeze most of the bytes for a dlist node out > of padding. Yeah, but if we want to rearrange the members into an illogical order to save some space, we should do that independently of this patch --- and then the overhead of this patch would be even worse than 25%. regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Andres Freund
Date:
Hi,

On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
> Yeah, but if we want to rearrange the members into an illogical order
> to save some space, we should do that independently of this patch ---

Sure, we should do that. I don't buy the "illogical" bit, just moving hashcode up to after tag isn't more or less logical, and saves most of the padding, and moving the booleans to the end isn't better/worse either.

You always bring up that argument. While I agree that sometimes the most optimal ordering can be less natural, I think most of the time it vastly overstates how intelligent the original ordering was. Often new elements were either just added iteratively without consideration for padding, or the attention to padding was paid in 32bit times.

I don't find

struct LOCALLOCK {
        LOCALLOCKTAG         tag;                  /*     0    20 */
        uint32               hashcode;             /*    20     4 */
        LOCK *               lock;                 /*    24     8 */
        PROCLOCK *           proclock;             /*    32     8 */
        int64                nLocks;               /*    40     8 */
        int                  numLockOwners;        /*    48     4 */
        int                  maxLockOwners;        /*    52     4 */
        LOCALLOCKOWNER *     lockOwners;           /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        _Bool                holdsStrongLockCount; /*    64     1 */
        _Bool                lockCleared;          /*    65     1 */

        /* size: 72, cachelines: 2, members: 10 */
        /* padding: 6 */
        /* last cacheline: 8 bytes */
};

less clear than

struct LOCALLOCK {
        LOCALLOCKTAG         tag;                  /*     0    20 */
        /* XXX 4 bytes hole, try to pack */
        LOCK *               lock;                 /*    24     8 */
        PROCLOCK *           proclock;             /*    32     8 */
        uint32               hashcode;             /*    40     4 */
        /* XXX 4 bytes hole, try to pack */
        int64                nLocks;               /*    48     8 */
        _Bool                holdsStrongLockCount; /*    56     1 */
        _Bool                lockCleared;          /*    57     1 */
        /* XXX 2 bytes hole, try to pack */
        int                  numLockOwners;        /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        int                  maxLockOwners;        /*    64     4 */
        /* XXX 4 bytes hole, try to pack */
        LOCALLOCKOWNER *     lockOwners;           /*    72     8 */

        /* size: 80, cachelines: 2, members: 10 */
        /* sum members: 66, holes: 4, sum holes: 14 */
        /* last cacheline: 16 bytes */
};

but it's smaller (although there's plenty of trailing space).

> and then the overhead of this patch would be even worse than 25%.

IDK, we, including you, very often make largely independent improvements to make the cost of something else more palatable. Why's that not OK here? Especially because we're not comparing to an alternative where no cost is added, keeping track of e.g. a running average of the hashtable size isn't free either; nor does it help in the intermittent cases.

- Andres
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
>> Yeah, but if we want to rearrange the members into an illogical order
>> to save some space, we should do that independently of this patch ---

> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.

I hadn't looked at the details closely, but if we can squeeze out the padding space without any loss of intelligibility, sure let's do so. I still say that's independent of whether to adopt this patch though.

> but it's smaller (although there's plenty of trailing space).

I think you're supposing that these things are independently palloc'd, but they aren't --- dynahash lays them out in arrays without palloc padding.

> IDK, we, including you, very often make largely independent improvements
> to make the cost of something else more palatable. Why's that not OK
> here?

When we do that, we aren't normally talking about overheads as high as 25% (even more, if it's measured as I think it ought to be). What I'm concerned about is that the patch is being advocated for cases where there are lots of LOCALLOCK entries --- which is exactly where the space overhead is going to hurt the most.

> Especially because we're not comparing to an alternative where no
> cost is added, keeping track of e.g. a running average of the hashtable
> size isn't free either; nor does it help in the intermittent cases.

What I was hoping for --- though perhaps it's not achievable --- was statistical overhead amounting to just a few more instructions per transaction. Adding dlist linking would add more instructions per hashtable entry/removal, which seems like it'd be a substantially bigger time penalty. As for the intermittent-usage issue, that largely depends on the details of the when-to-reset heuristic, which we don't have a concrete design for yet. But I could certainly imagine it waiting for a few transactions before deciding to chomp.

Anyway, I'm not trying to veto the patch in this form, just suggesting that there are alternatives worth thinking about.

regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Andres Freund
Date:
Hi,

On 2019-02-18 20:29:29 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > but it's smaller (although there's plenty of trailing space).
>
> I think you're supposing that these things are independently palloc'd, but
> they aren't --- dynahash lays them out in arrays without palloc padding.

I don't think that matters, given that the trailing six bytes are included in sizeof() (and have to be, to guarantee suitable alignment in arrays etc).

Greetings, Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Simon Riggs
Date:
On Tue, 19 Feb 2019 at 00:20, Andres Freund <andres@anarazel.de> wrote:
> On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
> > >> Mmm ... AIUI, the patches currently proposed can only help for what
> > >> David called "point lookup" queries. There are still going to be
> > >> queries that scan a large proportion of a partition tree, so if you've
> > >> got tons of partitions, you'll be concerned about this sort of thing.
> > I think what Maumau-san is on about here is that not only does your
> > $giant-query take a long time, but it has a permanent negative effect
> > on all subsequent transactions in the session. That seems worth
> > doing something about.
> Ah, yes, that makes sense. I'm inclined to think however that the
> original approach presented in this thread is better than the
> reset-the-whole-hashtable approach.
If it was just many-tables then blowing away the hash table would work fine.
The main issue seems to be with partitioning, not with the general case of many-tables. For that case, it seems like resetting the whole hash table is too much.
Can we use our knowledge of the structure of locks, i.e. partition locks are all children of the partitioned table, to do a better job?
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tomas Vondra
Date:
On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote: > ... > > This problem was uncovered while evaluating partitioning performance. > When the application PREPAREs a statement once and then > EXECUTE-COMMIT repeatedly, the server creates a generic plan on the > 6th EXECUTE. Unfortunately, creation of the generic plan of > UPDATE/DELETE currently accesses all partitions of the target table > (this itself needs improvement), expanding the LOCALLOCK hash table. > As a result, 7th and later EXECUTEs get slower. > > Imai-san confirmed performance improvement with this patch: > > https://commitfest.postgresql.org/22/1993/ > Can you quantify the effects? That is, how much slower/faster does it get? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
> On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
> > Imai-san confirmed performance improvement with this patch:
> >
> > https://commitfest.postgresql.org/22/1993/
> >
>
> Can you quantify the effects? That is, how much slower/faster does it get?

Ugh, sorry, I wrote the wrong URL. The correct page is:
https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512787EC%40g01jpexmbkw24

The figures quoted there are:

[v20 + faster-locallock-scan.patch]
auto:   9,069 TPS
custom: 9,015 TPS

[v20]
auto:   8,037 TPS
custom: 8,933 TPS

In the original problematic case, plan_cache_mode = auto (the default), we can see about a 13% improvement.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Hm. Putting a list header for a purely-local data structure into shared
> memory seems quite ugly. Isn't there a better place to keep that?

Agreed. I put it in a global variable.

> Do we really want a dlist here at all? I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved. This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

To delete the LOCALLOCK in RemoveLocalLock(), we need a dlist. An slist requires the list iterator to be passed in from callers.

From: Andres Freund [mailto:andres@anarazel.de]
> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.
>
> I don't find

Thanks, I've done it.

From: Simon Riggs [mailto:simon@2ndquadrant.com]
> Can we use our knowledge of the structure of locks, i.e. partition locks
> are all children of the partitioned table, to do a better job?

I couldn't come up with an idea.

Regards
Takayuki Tsunakawa
Attachment
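The dlist-versus-slist point can be illustrated with a small sketch (procLink is again an invented field name): RemoveLocalLock() is handed only the LOCALLOCK pointer, and a doubly-linked node can unlink itself without knowing its predecessor.

static void
RemoveLocalLock(LOCALLOCK *locallock)
{
    /* O(1) removal given only the element; no iterator or predecessor needed */
    dlist_delete(&locallock->procLink);

    /* ... existing cleanup: free lockOwners, remove the dynahash entry ... */
}

/*
 * With an slist, deletion is only convenient as slist_delete_current()
 * inside an slist_foreach_modify() loop, i.e. the caller must already be
 * iterating and must carry the iterator; that doesn't fit call sites that
 * drop a single entry directly.
 */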
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Peter Eisentraut
Date:
On 2019-02-20 07:20, Tsunakawa, Takayuki wrote:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> Hm. Putting a list header for a purely-local data structure into shared
>> memory seems quite ugly. Isn't there a better place to keep that?
>
> Agreed. I put it in a global variable.

I think there is agreement on the principles of this patch. Perhaps it could be polished a bit.

Your changes in LOCALLOCK still refer to PGPROC, from your first version of the patch.

I think the reordering of struct members could be done as a separate preliminary patch.

Some more documentation in the comment before dlist_head LocalLocks to explain this whole mechanism would be nice.

You posted a link to some performance numbers, but I didn't see the test setup explained there. I'd like to get some more information on the impact of this. Is there an effect with 100 tables, or do you need 100000?

--
Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com] > Your changes in LOCALLOCK still refer to PGPROC, from your first version > of the patch. > > I think the reordering of struct members could be done as a separate > preliminary patch. > > Some more documentation in the comment before dlist_head LocalLocks to > explain this whole mechanism would be nice. Fixed. > You posted a link to some performance numbers, but I didn't see the test > setup explained there. I'd like to get some more information on this > impact of this. Is there an effect with 100 tables, or do you need 100000? Imai-san, can you tell us the test setup? Regards Takayuki Tsunakawa
Attachment
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Imai, Yoshikazu"
Date:
Hi Tsunakawa-san, Peter,

On Tue, Mar 19, 2019 at 7:53 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
> > You posted a link to some performance numbers, but I didn't see the
> > test setup explained there. I'd like to get some more information on
> > the impact of this. Is there an effect with 100 tables, or do you
> > need 100000?
>
> Imai-san, can you tell us the test setup?

Maybe I used this test setup [1]. I tested again with those settings for prepared transactions. I used Tsunakawa-san's locallock patch [2] (which couldn't be applied to current master, so I fixed it) and Amit's v32 patch for speeding up the planner [3].

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioned table definition (with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[pgbench (with 4096 partitions)]
pgbench -n -f select4096.sql -T 60 -M prepared

[results (tps)]
         master   locallock   v32     v32+locallock
         ------   ---------   ---     -------------
auto     21.9     22.9        6,834   7,355
custom   19.7     20.0        7,415   7,252

[1] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51256276%40g01jpexmbkw24
[2] https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FBDFA00%40G01JPEXMBYT05
[3] https://www.postgresql.org/message-id/9feacaf6-ddb3-96dd-5b98-df5e927b1439%40lab.ntt.co.jp

--
Yoshikazu Imai
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com] > Fixed. Rebased on HEAD. Regards Takayuki Tsunakawa
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Peter Eisentraut
Date:
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: > From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com] >> Fixed. > > Rebased on HEAD. I have committed the first patch that reorganizes the struct. I'll have to spend some time evaluating the performance impact of the second patch, but it seems OK in principle. Performance tests from others welcome. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Peter Eisentraut
Date:
On 2019-03-19 16:38, Peter Eisentraut wrote: > On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: >> From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com] >>> Fixed. >> >> Rebased on HEAD. > > I have committed the first patch that reorganizes the struct. I'll have > to spend some time evaluating the performance impact of the second > patch, but it seems OK in principle. Performance tests from others welcome. I did a bit of performance testing, both a plain pgbench and the suggested test case with 4096 partitions. I can't detect any performance improvements. In fact, within the noise, it tends to be just a bit on the slower side. So I'd like to kick it back to the patch submitter now and ask for more justification and performance analysis. Perhaps "speeding up planning with partitions" needs to be accepted first? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> I did a bit of performance testing, both a plain pgbench and the
> suggested test case with 4096 partitions. I can't detect any
> performance improvements. In fact, within the noise, it tends to be
> just a bit on the slower side.
>
> So I'd like to kick it back to the patch submitter now and ask for more
> justification and performance analysis.
>
> Perhaps "speeding up planning with partitions" needs to be accepted first?

Yeah, I think it likely will require that patch to be able to measure the gains from this patch.

If planning a SELECT to a partitioned table with a large number of partitions using PREPAREd statements, when we attempt the generic plan on the 6th execution, it does cause the local lock table to expand to fit all the locks for each partition. This does cause the LockReleaseAll() to become slow due to the hash_seq_search having to skip over many empty buckets. Since generating a custom plan for a partitioned table with many partitions is still slow in master, then I very much imagine you'll struggle to see the gains brought by this patch.

I did a quick benchmark too and couldn't measure anything:

create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with (modulus 4096, remainder ' || x || ');' from generate_series(0,4095) x;

bench.sql:
\set p_a 13315
select * from hp where a = :p_a;

Master:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 31.844468 (excluding connections establishing)
tps = 32.950154 (excluding connections establishing)
tps = 31.534761 (excluding connections establishing)

Patched:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 30.099133 (excluding connections establishing)
tps = 32.157328 (excluding connections establishing)
tps = 32.329884 (excluding connections establishing)

The situation won't be any better with plan_cache_mode = force_generic_plan either. In this case, we'll only plan once, but we'll also have to obtain and release a lock for each partition for each execution of the prepared statement. LockReleaseAll() is going to be slow in that case because it actually has to release a lot of locks.

--
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
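For readers who haven't looked at that code path, the loop being discussed has roughly this shape (simplified from lock.c); the cost of the hash_seq_search() walk grows with the number of buckets the table has ever expanded to, not with the number of locks the current transaction actually holds:

    HASH_SEQ_STATUS status;
    LOCALLOCK  *locallock;

    hash_seq_init(&status, LockMethodLocalHash);

    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
    {
        /*
         * Release each entry as appropriate.  Once one big transaction has
         * bloated the table, most buckets visited here are empty, yet every
         * later transaction still pays to skip over them.
         */
    }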
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > Perhaps "speeding up planning with partitions" needs to be accepted first?
>
> Yeah, I think it likely will require that patch to be able to measure
> the gains from this patch.
>
> If planning a SELECT to a partitioned table with a large number of
> partitions using PREPAREd statements, when we attempt the generic plan
> on the 6th execution, it does cause the local lock table to expand to
> fit all the locks for each partition. This does cause the
> LockReleaseAll() to become slow due to the hash_seq_search having to
> skip over many empty buckets. Since generating a custom plan for a
> partitioned table with many partitions is still slow in master, then I
> very much imagine you'll struggle to see the gains brought by this
> patch.

Thank you David for explaining. Although I may not understand the effect of the "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:

1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.

Regards
Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Amit Langote
Date:
Tsunakawa-san,

On 2019/03/26 17:21, Tsunakawa, Takayuki wrote:
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
>> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>>> Perhaps "speeding up planning with partitions" needs to be accepted first?
>>
>> Yeah, I think it likely will require that patch to be able to measure
>> the gains from this patch.
>>
>> If planning a SELECT to a partitioned table with a large number of
>> partitions using PREPAREd statements, when we attempt the generic plan
>> on the 6th execution, it does cause the local lock table to expand to
>> fit all the locks for each partition. This does cause the
>> LockReleaseAll() to become slow due to the hash_seq_search having to
>> skip over many empty buckets. Since generating a custom plan for a
>> partitioned table with many partitions is still slow in master, then I
>> very much imagine you'll struggle to see the gains brought by this
>> patch.
>
> Thank you David for explaining. Although I may not understand the effect of the "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:
>
> 1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
> 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.

My understanding of what David wrote is that the slowness of the bloated hash table is hard to notice, because planning itself is pretty slow. With the "speeding up planning with partitions" patch, planning becomes quite fast, so the bloated hash table's overhead, and hence your patch's benefit, is easier to notice. This patch is clearly helpful, but it's just hard to notice it when the other big bottleneck is standing in the way.

Thanks,
Amit
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> Thank you David for explaining. Although I may not understand the effect of the "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:
>
> 1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
> 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.

Oh. I think I see what you're saying. Really the table in #2 would have to be some completely different table that's not partitioned. I think in that case it should make a difference.

--
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Tue, 26 Mar 2019 at 21:55, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > Thank you David for explaining. Although I may not understand the effect of the "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:
> >
> > 1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
> > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.
>
> Oh. I think I see what you're saying. Really the table in #2 would
> have to be some completely different table that's not partitioned. I
> think in that case it should make a difference.

Here's a benchmark doing that using pgbench's script weight feature. I've set this up so the query that hits the partitioned table runs once for every 10k times the other script runs. I picked that number so the lock table was expanded fairly early on in the benchmark.

setup:
create table t1 (a int primary key);
create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with (modulus 4096, remainder ' || x || ');' from generate_series(0,4095) x;
\gexec

hp.sql:
select count(*) from hp;

t1.sql:
\set p 1
select a from t1 where a = :p;

Master = c8c885b7a5

Master:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@10000 postgres
SQL script 2: t1.sql
 - 1057306 transactions (100.0% of total, tps = 17621.756473)
 - 1081905 transactions (100.0% of total, tps = 18021.449914)
 - 1122420 transactions (100.0% of total, tps = 18690.804699)

Master + 0002-speed-up-LOCALLOCK-scan.patch:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@10000 postgres
SQL script 2: t1.sql
 - 1277014 transactions (100.0% of total, tps = 21283.551615)
 - 1184052 transactions (100.0% of total, tps = 19734.185872)
 - 1188523 transactions (100.0% of total, tps = 19785.835662)

--
David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
Hi Peter, From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com] > I did a bit of performance testing, both a plain pgbench and the > suggested test case with 4096 partitions. I can't detect any > performance improvements. In fact, within the noise, it tends to be > just a bit on the slower side. > > So I'd like to kick it back to the patch submitter now and ask for more > justification and performance analysis. > > Perhaps "speeding up planning with partitions" needs to be accepted first? David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Amit Langote
Date:
Hi, On 2019/04/04 13:37, Tsunakawa, Takayuki wrote: > Hi Peter, > > From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com] >> I did a bit of performance testing, both a plain pgbench and the >> suggested test case with 4096 partitions. I can't detect any >> performance improvements. In fact, within the noise, it tends to be >> just a bit on the slower side. >> >> So I'd like to kick it back to the patch submitter now and ask for more >> justification and performance analysis. >> >> Perhaps "speeding up planning with partitions" needs to be accepted first? > > David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check. Also, since the "speed up partition planning" patch went in (428b260f8), it might be possible to see the performance boost even with the partitioning example you cited upthread. Thanks, Amit
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Peter Eisentraut
Date:
On 2019-04-04 06:58, Amit Langote wrote: > Also, since the "speed up partition planning" patch went in (428b260f8), > it might be possible to see the performance boost even with the > partitioning example you cited upthread. I can't detect any performance improvement with the patch applied to current master, using the test case from Yoshikazu Imai (2019-03-19). -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com] > I can't detect any performance improvement with the patch applied to > current master, using the test case from Yoshikazu Imai (2019-03-19). That's strange... Peter, Imai-san, can you compare your test procedures? Peter, can you check and see the performance improvement with David's method on March 26 instead? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Amit Langote
Date:
On 2019/04/05 5:42, Peter Eisentraut wrote:
> On 2019-04-04 06:58, Amit Langote wrote:
>> Also, since the "speed up partition planning" patch went in (428b260f8),
>> it might be possible to see the performance boost even with the
>> partitioning example you cited upthread.
>
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

I was able to detect it as follows.

* partitioned table setup:

$ cat ht.sql
drop table ht cascade;
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0, 8191) x;
\gexec

* pgbench script:

$ cat select.sql
\set param random(1, 8192)
select * from ht where a = :param

* pgbench (5 minute run with -M prepared)

pgbench -n -M prepared -T 300 -f select.sql

* tps:

plan_cache_mode = auto

HEAD: 1915 tps
Patched: 2394 tps

plan_cache_mode = custom (non-problematic: generic plan is never created)

HEAD: 2402 tps
Patched: 2393 tps

Thanks,
Amit
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Imai, Yoshikazu"
Date:
On Fri, Apr 5, 2019 at 1:31 AM, Amit Langote wrote:
> On 2019/04/05 5:42, Peter Eisentraut wrote:
> > On 2019-04-04 06:58, Amit Langote wrote:
> >> Also, since the "speed up partition planning" patch went in
> >> (428b260f8), it might be possible to see the performance boost even
> >> with the partitioning example you cited upthread.
> >
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
>
> I was able to detect it as follows.
>
> * partitioned table setup:
>
> $ cat ht.sql
> drop table ht cascade;
> create table ht (a int primary key, b int, c int) partition by hash (a);
> select 'create table ht' || x::text || ' partition of ht for values with
> (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,
> 8191) x;
> \gexec
>
> * pgbench script:
>
> $ cat select.sql
> \set param random(1, 8192)
> select * from ht where a = :param
>
> * pgbench (5 minute run with -M prepared)
>
> pgbench -n -M prepared -T 300 -f select.sql
>
> * tps:
>
> plan_cache_mode = auto
>
> HEAD: 1915 tps
> Patched: 2394 tps
>
> plan_cache_mode = custom (non-problematic: generic plan is never created)
>
> HEAD: 2402 tps
> Patched: 2393 tps

Amit-san, thanks for testing this.

I also re-ran my tests (3/19) with HEAD(413ccaa) and HEAD(413ccaa) + patched, and I can still detect the performance difference with plan_cache_mode = auto.

Thanks
--
Yoshikazu Imai
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
Hi Amit-san, Imai-san,

From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> I was able to detect it as follows.
> plan_cache_mode = auto
>
> HEAD: 1915 tps
> Patched: 2394 tps
>
> plan_cache_mode = custom (non-problematic: generic plan is never created)
>
> HEAD: 2402 tps
> Patched: 2393 tps

Thanks a lot for the very quick confirmation. I'm relieved to still see the good results.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Imai, Yoshikazu"
Date:
On Fri, Apr 5, 2019 at 0:05 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
>
> That's strange... Peter, Imai-san, can you compare your test procedures?

Just to make sure, I describe my test procedures in detail. I installed and set up HEAD and patched as follows.

[HEAD(413ccaa)]
(git pull)
./configure --prefix=/usr/local/pgsql-dev --enable-depend
make clean
make
make install
su postgres
export PATH=/usr/local/pgsql-dev/bin:$PATH
initdb -D /var/lib/pgsql/data-dev
vi /var/lib/pgsql/data-dev/postgresql.conf
====
port = 44201
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096
====
pg_ctl -D /var/lib/pgsql/data-dev start

[HEAD(413ccaa) + patch]
(git pull)
patch -u -p1 < 0002.patch
./configure --prefix=/usr/local/pgsql-locallock --enable-depend
make clean
make
make install
su postgres
export PATH=/usr/local/pgsql-locallock/bin:$PATH
initdb -D /var/lib/pgsql/data-locallock
vi /var/lib/pgsql/data-locallock/postgresql.conf
====
port = 44301
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096
====
pg_ctl -D /var/lib/pgsql/data-locallock start

And I tested as follows.

(creating partitioned table for port 44201)
(creating partitioned table for port 44301)
(creating select4096.sql)
for i in `seq 1 5`; do pgbench -n -f select4096.sql -T 60 -M prepared -p 44201 | grep including; pgbench -n -f select4096.sql -T 60 -M prepared -p 44301 | grep including; done
tps = 8146.039546 (including connections establishing)
tps = 9021.340872 (including connections establishing)
tps = 8011.186017 (including connections establishing)
tps = 8926.191054 (including connections establishing)
tps = 8006.769690 (including connections establishing)
tps = 9028.716806 (including connections establishing)
tps = 8057.709961 (including connections establishing)
tps = 9017.713714 (including connections establishing)
tps = 7956.332863 (including connections establishing)
tps = 9126.650533 (including connections establishing)

Thanks
--
Yoshikazu Imai
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Peter Eisentraut
Date:
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: > From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com] >> Fixed. > > Rebased on HEAD. Do you need the dlist_foreach_modify() calls? You are not actually modifying the list structure. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at <be25cadf-982e-3f01-88b4-443a6667e16a@lab.ntt.co.jp> I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5% which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks.

On the whole I don't think there's an adequate case for committing this patch.

I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-a-time patches. I'd rather see us do something to let hash_seq_search win across the board.

I spent some time wondering whether we could adjust the data structure so that all the live entries in a hashtable are linked into one chain, but I don't quite see how to do it without adding another list link to struct HASHELEMENT, which seems pretty expensive.

I'll sketch the idea I had, just in case it triggers a better idea in someone else. Assuming we are willing to add another pointer to HASHELEMENT, use the two pointers to create a doubly-linked circular list that includes all live entries in the hashtable, with a list header in the hashtable's control area. (Maybe we'd use a dlist for this, but it's not essential.) Require this list to be organized so that all entries that belong to the same hash bucket are consecutive in the list, and make each non-null hash bucket header point to the first entry in the list for its bucket. To allow normal searches to detect when they've run through their bucket, add a flag to HASHELEMENT that is set only in entries that are the first, or perhaps last, of their bucket (so you'd detect end-of-bucket by checking the flag instead of testing for a null pointer). Adding a bool field is free due to alignment considerations, at least on 64-bit machines. Given this, I think normal hash operations are more-or-less the same cost as before, while hash_seq_search just has to follow the circular list.

I tried to figure out how to do the same thing with a singly-linked instead of doubly-linked list, but it doesn't quite work: if you need to remove the first element of a bucket, you have no cheap way to find its predecessor in the overall list (which belongs to some other bucket, but you don't know which one).
Maybe we could just mark such entries dead (there's plenty of room for another flag bit) and plan to clean them up later? But it's not clear how to ensure that they'd get cleaned up in any sort of timely fashion. Another issue is that probably none of this works for the partitioned hash tables we use for some of the larger shared-memory hashes. But I'm not sure we care about hash_seq_search for those, so maybe we just say those are a different data structure. regards, tom lane
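As a rough sketch of that idea (the added field names are invented here; dynahash's real HASHELEMENT carries only the bucket link and the hash value):

typedef struct HASHELEMENT
{
    struct HASHELEMENT *link;       /* existing: next element in bucket chain */
    uint32      hashvalue;          /* existing: hash value of this element's key */
    /* proposed additions: */
    struct HASHELEMENT *all_prev;   /* circular doubly-linked list over every */
    struct HASHELEMENT *all_next;   /*   live entry in the table */
    bool        first_in_bucket;    /* marks the start of a bucket's run */
} HASHELEMENT;

/*
 * Entries belonging to one bucket stay consecutive in the circular list, and
 * each non-empty bucket header points at the first entry of its run.  A
 * normal lookup follows all_next and stops when it reaches an entry flagged
 * first_in_bucket (instead of testing for a NULL bucket link), while
 * hash_seq_search() simply follows the circular list from a header in the
 * table's control area, visiting only live entries.
 */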
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Andres Freund
Date:
Hi, On 2019-04-05 23:03:11 -0400, Tom Lane wrote: > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > > I can't detect any performance improvement with the patch applied to > > current master, using the test case from Yoshikazu Imai (2019-03-19). > > FWIW, I tried this patch against current HEAD (959d00e9d). > Using the test case described by Amit at > <be25cadf-982e-3f01-88b4-443a6667e16a@lab.ntt.co.jp> > I do measure an undeniable speedup, close to 35%. > > However ... I submit that that's a mighty extreme test case. > (I had to increase max_locks_per_transaction to get it to run > at all.) We should not be using that sort of edge case to drive > performance optimization choices. > > If I reduce the number of partitions in Amit's example from 8192 > to something more real-world, like 128, I do still measure a > performance gain, but it's ~ 1.5% which is below what I'd consider > a reproducible win. I'm accustomed to seeing changes up to 2% > in narrow benchmarks like this one, even when "nothing changes" > except unrelated code. I'm not sure it's actually that narrow these days. With all the partitioning improvements happening, the numbers of locks commonly held are going to rise. And while 8192 partitions is maybe on the more extreme side, it's a workload with only a single table, and plenty workloads touch more than a single partitioned table. > Trying a standard pgbench test case (pgbench -M prepared -S with > one client and an -s 10 database), it seems that the patch is about > 0.5% slower than HEAD. Again, that's below the noise threshold, > but it's not promising for the net effects of this patch on workloads > that aren't specifically about large and prunable partition sets. Yea, that's concerning. > I'm also fairly concerned about the effects of the patch on > sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 > bytes, a 22% increase. That's a lot if you're considering cases > with many locks. I'm not sure I'm quite that concerned. For one, a good bit of that space was up for grabs until the recent reordering of LOCALLOCK and nobody complained. But more importantly, I think commonly the amount of locks around is fairly constrained, isn't it? We can't really have that many concurrently held locks, due to the shared memory space, and the size of a LOCALLOCK isn't that big compared to say relcache entries. We also probably fairly easily could win some space back - e.g. make nLocks 32 bits. I wonder if one approach to solve this wouldn't be to just make the hashtable drastically smaller. Right now we'll often have have lots empty entries that are 72 bytes + dynahash overhead. That's plenty of memory that needs to be skipped over. I wonder if we instead should have an array of held locallocks, and a hashtable with {hashcode, offset_in_array} + custom comparator for lookups. That'd mean we could either scan the array of locallocks at release (which'd need to skip over entries that have already been released), or we could scan the much smaller hashtable sequentially. I don't think the above idea is quite there, and I'm tired, but I thought it might still be worth bringing up. > I spent some time wondering whether we could adjust the data structure > so that all the live entries in a hashtable are linked into one chain, > but I don't quite see how to do it without adding another list link to > struct HASHELEMENT, which seems pretty expensive. Yea :( Greetings, Andres Freund
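Sketching that idea (all of the names below are invented, and this is not proposed code):

typedef struct LocalLockIndexEntry
{
    uint32      hashcode;           /* hash of the LOCALLOCKTAG */
    int         slot;               /* index into locallock_array below */
} LocalLockIndexEntry;

static LOCALLOCK   *locallock_array;    /* dense, resizable array of held locks */
static int          locallock_count;

/*
 * Lookups would go through a much smaller hash table of LocalLockIndexEntry
 * (just hash code plus offset), using a custom comparator that follows slot
 * and compares the full tag.  LockReleaseAll() would then walk
 * locallock_array[0 .. locallock_count), skipping entries already released
 * retail, instead of seq-scanning a dynahash whose bucket count reflects the
 * largest lock set the session ever held.
 */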
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > I wonder if one approach to solve this wouldn't be to just make the > hashtable drastically smaller. Right now we'll often have have lots > empty entries that are 72 bytes + dynahash overhead. That's plenty of > memory that needs to be skipped over. I wonder if we instead should > have an array of held locallocks, and a hashtable with {hashcode, > offset_in_array} + custom comparator for lookups. Well, that's not going to work all that well for retail lock releases; you'll end up with holes in the array, maybe a lot of them. However, it led me to think of another way we might approach the general hashtable problem: right now, we are not making any use of the fact that the hashtable's entries are laid out in big slabs (arrays). What if we tried to ensure that the live entries are allocated fairly compactly in those arrays, and then implemented hash_seq_search as a scan over the arrays, ignoring the hash bucket structure per se? We'd need a way to reliably tell a live entry from a free entry, but again, there's plenty of space for a flag bit or two. This might perform poorly if you allocated a bunch of entries, freed most-but-not-all, and then wanted to seqscan the remainder; you'd end up with the same problem I complained of above that you're iterating over an array that's mostly uninteresting. In principle we could keep count of the live vs free entries and dynamically decide to scan via the hash bucket structure instead of searching the storage array when the array is too sparse; but that might be overly complicated. I haven't tried to work this out in detail, it's just a late night brainstorm. But, again, I'd much rather solve this in dynahash.c than by layering some kind of hack on top of it. regards, tom lane
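As an aside, the storage-array scan Tom describes can be pictured with a toy structure like the one below. dynahash's real element arrays and freelists are more involved, and the field names here are invented purely for illustration:

    /* Toy model of scanning slab-allocated entries directly. */
    typedef struct ToyEntry
    {
        bool        live;       /* set on insert, cleared on delete */
        /* ... key and payload would follow ... */
    } ToyEntry;

    typedef struct ToySlab
    {
        struct ToySlab *next;       /* next allocated slab, if any */
        int             nentries;   /* entries allocated in this slab */
        ToyEntry        entries[];  /* flexible array member */
    } ToySlab;

    /* Sequential search that ignores the bucket structure entirely. */
    static ToyEntry *
    toy_seq_search(ToySlab **slabp, int *posp)
    {
        for (ToySlab *slab = *slabp; slab != NULL; slab = slab->next, *posp = 0)
        {
            for (int i = *posp; i < slab->nentries; i++)
            {
                if (slab->entries[i].live)
                {
                    *slabp = slab;      /* remember where we stopped */
                    *posp = i + 1;
                    return &slab->entries[i];
                }
            }
        }
        return NULL;            /* no more live entries */
    }

The sparseness problem Tom mentions is visible here: if most entries are dead, the scan still has to touch all of them, which is why he suggests falling back to the bucket-chain scan when the array is too sparse.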
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
Peter Eisentraut
Date:
On 2019-04-06 05:03, Tom Lane wrote: > Trying a standard pgbench test case (pgbench -M prepared -S with > one client and an -s 10 database), it seems that the patch is about > 0.5% slower than HEAD. Again, that's below the noise threshold, > but it's not promising for the net effects of this patch on workloads > that aren't specifically about large and prunable partition sets. In my testing, I've also noticed that it seems to be slightly on the slower side for these simple tests. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Sat, 6 Apr 2019 at 16:03, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'd also point out that this is hardly the only place where we've > seen hash_seq_search on nearly-empty hash tables become a bottleneck. > So I'm not thrilled about attacking that with one-table-at-time patches. > I'd rather see us do something to let hash_seq_search win across > the board. Rewinding back to mid-Feb: You wrote: > My own thought about how to improve this situation was just to destroy > and recreate LockMethodLocalHash at transaction end (or start) > if its size exceeded $some-value. Leaving it permanently bloated seems > like possibly a bad idea, even if we get rid of all the hash_seq_searches > on it. Which I thought was an okay idea. I think the one advantage that would have over making hash_seq_search() faster for large and mostly empty tables is that over-sized hash tables are just not very cache efficient, and if we don't need it to be that large then we should probably consider making it smaller again. I've had a go at implementing this and using Amit's benchmark the performance looks pretty good. I can't detect any slowdown for the general case. master: plan_cache_mode = auto: $ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 9373.698212 (excluding connections establishing) tps = 9356.993148 (excluding connections establishing) tps = 9367.579806 (excluding connections establishing) plan_cache_mode = force_custom_plan: $ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 12863.758185 (excluding connections establishing) tps = 12787.766054 (excluding connections establishing) tps = 12817.878940 (excluding connections establishing) shrink_bloated_locallocktable.patch: plan_cache_mode = auto: $ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 12756.021211 (excluding connections establishing) tps = 12800.939518 (excluding connections establishing) tps = 12804.501977 (excluding connections establishing) plan_cache_mode = force_custom_plan: $ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 12763.448836 (excluding connections establishing) tps = 12901.673271 (excluding connections establishing) tps = 12856.512745 (excluding connections establishing) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes: > On Sat, 6 Apr 2019 at 16:03, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> My own thought about how to improve this situation was just to destroy >> and recreate LockMethodLocalHash at transaction end (or start) >> if its size exceeded $some-value. Leaving it permanently bloated seems >> like possibly a bad idea, even if we get rid of all the hash_seq_searches >> on it. > Which I thought was an okay idea. I think the one advantage that > would have over making hash_seq_search() faster for large and mostly > empty tables is that over-sized hash tables are just not very cache > efficient, and if we don't need it to be that large then we should > probably consider making it smaller again. > I've had a go at implementing this and using Amit's benchmark the > performance looks pretty good. I can't detect any slowdown for the > general case. I like the concept ... but the particular implementation, not so much. It seems way overcomplicated. In the first place, why should we add code to copy entries? Just don't do it except when the table is empty. In the second, I think we could probably have a far cheaper test for how big the table is --- maybe we'd need to expose some function in dynahash.c, but the right way here is just to see how many buckets there are. I don't like adding statistics counting for this, because it's got basically nothing to do with what the actual problem is. (If you acquire and release one lock, and do that over and over, you don't have a bloat problem no matter how many times you do it.) LockMethodLocalHash is special in that it predictably goes to empty at the end of every transaction, so that de-bloating at that point is a workable strategy. I think we'd probably need something more robust if we were trying to fix this generally for all hash tables. But if we're going to go with the one-off hack approach, we should certainly try to keep that hack as simple as possible. regards, tom lane
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 8 Apr 2019 at 02:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I like the concept ... but the particular implementation, not so much. > It seems way overcomplicated. In the first place, why should we > add code to copy entries? Just don't do it except when the table > is empty. In the second, I think we could probably have a far > cheaper test for how big the table is --- maybe we'd need to > expose some function in dynahash.c, but the right way here is just > to see how many buckets there are. I don't like adding statistics > counting for this, because it's got basically nothing to do with > what the actual problem is. (If you acquire and release one lock, > and do that over and over, you don't have a bloat problem no > matter how many times you do it.) hash_get_num_entries() looks cheap enough to me. Can you explain why you think that's too expensive? > LockMethodLocalHash is special in that it predictably goes to empty > at the end of every transaction, so that de-bloating at that point > is a workable strategy. I think we'd probably need something more > robust if we were trying to fix this generally for all hash tables. > But if we're going to go with the one-off hack approach, we should > certainly try to keep that hack as simple as possible. As cheap as possible sounds good, but I'm confused at why you think the table will always be empty at the end of transaction. It's my understanding and I see from debugging that session level locks remain in there. If I don't copy those into the new table they'll be lost. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 8 Apr 2019 at 02:36, David Rowley <david.rowley@2ndquadrant.com> wrote: > > LockMethodLocalHash is special in that it predictably goes to empty > > at the end of every transaction, so that de-bloating at that point > > is a workable strategy. I think we'd probably need something more > > robust if we were trying to fix this generally for all hash tables. > > But if we're going to go with the one-off hack approach, we should > > certainly try to keep that hack as simple as possible. > > As cheap as possible sounds good, but I'm confused at why you think > the table will always be empty at the end of transaction. It's my > understanding and I see from debugging that session level locks remain > in there. If I don't copy those into the new table they'll be lost. Or we could just skip the table recreation if there are any session-level locks. That would require calling hash_get_num_entries() on the table again and just recreating the table if there are 0 locks. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes: > On Mon, 8 Apr 2019 at 02:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I like the concept ... but the particular implementation, not so much. >> It seems way overcomplicated. In the first place, why should we >> add code to copy entries? Just don't do it except when the table >> is empty. In the second, I think we could probably have a far >> cheaper test for how big the table is --- maybe we'd need to >> expose some function in dynahash.c, but the right way here is just >> to see how many buckets there are. I don't like adding statistics >> counting for this, because it's got basically nothing to do with >> what the actual problem is. (If you acquire and release one lock, >> and do that over and over, you don't have a bloat problem no >> matter how many times you do it.) > hash_get_num_entries() looks cheap enough to me. Can you explain why > you think that's too expensive? What I objected to cost-wise was counting the number of lock acquisitions/releases, which seems entirely beside the point. We *should* be using hash_get_num_entries(), but only to verify that the table is empty before resetting it. The additional bit that is needed is to see whether the number of buckets is large enough to justify calling the table bloated. > As cheap as possible sounds good, but I'm confused at why you think > the table will always be empty at the end of transaction. It's conceivable that it won't be, which is why we need a test. I'm simply arguing that if it is not, we can just postpone de-bloating till it is. Session-level locks are so rarely used that there's no need to sweat about that case. regards, tom lane
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 8 Apr 2019 at 02:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <david.rowley@2ndquadrant.com> writes: > > hash_get_num_entries() looks cheap enough to me. Can you explain why > > you think that's too expensive? > > What I objected to cost-wise was counting the number of lock > acquisitions/releases, which seems entirely beside the point. > > We *should* be using hash_get_num_entries(), but only to verify > that the table is empty before resetting it. The additional bit > that is needed is to see whether the number of buckets is large > enough to justify calling the table bloated. The reason I thought it was a good idea to track some history there was to stop the lock table from constantly being shrunk back to the default size every time a simple single-table query was executed. For example, a workload repeatedly doing: SELECT * FROM table_with_lots_of_partitions; SELECT * FROM non_partitioned_table; I was worried that obtaining locks on the partitioned table would become a little slower because it would have to expand the hash table each time the query is executed. > > As cheap as possible sounds good, but I'm confused at why you think > > the table will always be empty at the end of transaction. > > It's conceivable that it won't be, which is why we need a test. > I'm simply arguing that if it is not, we can just postpone de-bloating > till it is. Session-level locks are so rarely used that there's no > need to sweat about that case. That seems fair. It would certainly simplify the patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes: > On Mon, 8 Apr 2019 at 02:59, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> We *should* be using hash_get_num_entries(), but only to verify >> that the table is empty before resetting it. The additional bit >> that is needed is to see whether the number of buckets is large >> enough to justify calling the table bloated. > The reason I thought it was a good idea to track some history there > was to stop the lock table constantly being shrunk back to the default > size every time a simple single table query was executed. I think that's probably gilding the lily, considering that this whole issue is pretty new. There's no evidence that expanding the local lock table is a significant drag on queries that need a lot of locks. regards, tom lane
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 8 Apr 2019 at 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <david.rowley@2ndquadrant.com> writes: > > The reason I thought it was a good idea to track some history there > > was to stop the lock table constantly being shrunk back to the default > > size every time a simple single table query was executed. > > I think that's probably gilding the lily, considering that this whole > issue is pretty new. There's no evidence that expanding the local > lock table is a significant drag on queries that need a lot of locks. Okay. Here's another version with all the average locks code removed that only recreates the table when it's completely empty. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
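To make the shape of this approach concrete, a minimal sketch of the end-of-release check being described might look as follows. The function and helper names here are assumptions: CreateLocalLockHash() is the helper mentioned later in the thread, while LocalLockHashIsBloated() is only a placeholder for however the bloat test ends up being expressed:

    /* Called from LockReleaseAll() after all releasable locks are gone. */
    static void
    MaybeShrinkLocalLockHash(void)
    {
        /* Session-level locks may still be present; if so, do nothing. */
        if (hash_get_num_entries(LockMethodLocalHash) != 0)
            return;

        /* Placeholder test: "has the table grown past its initial size?" */
        if (LocalLockHashIsBloated())
        {
            hash_destroy(LockMethodLocalHash);
            CreateLocalLockHash();      /* rebuild at the default size */
        }
    }

How exactly to measure the table's size cheaply is the part of the design discussed in the messages that follow.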
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
Andres Freund
Date:
Hi, On 2019-04-08 03:40:52 +1200, David Rowley wrote: > On Mon, 8 Apr 2019 at 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > David Rowley <david.rowley@2ndquadrant.com> writes: > > > The reason I thought it was a good idea to track some history there > > > was to stop the lock table constantly being shrunk back to the default > > > size every time a simple single table query was executed. > > > > I think that's probably gilding the lily, considering that this whole > > issue is pretty new. There's no evidence that expanding the local > > lock table is a significant drag on queries that need a lot of locks. > > Okay. Here's another version with all the average locks code removed > that only recreates the table when it's completely empty. Could you benchmark your adversarial case? - Andres
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 8 Apr 2019 at 03:47, Andres Freund <andres@anarazel.de> wrote: > Could you benchmark your adversarial case? Which case? I imagine the worst case for v2 is a query that just constantly asks for over 16 locks. Perhaps a prepared plan, so as not to add planner overhead. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes: > Okay. Here's another version with all the average locks code removed > that only recreates the table when it's completely empty. Um ... I don't see where you're destroying the old hash? Also, I entirely dislike wiring in assumptions about hash_seq_search's private state structure here. I think it's worth having an explicit entry point in dynahash.c to get the current number of buckets. Also, I would not define "significantly bloated" as "the table has grown at all". I think the threshold ought to be at least ~100 buckets, if we're starting at 16. Probably we ought to try to gather some evidence to inform the choice of cutoff here. Maybe instrument the regression tests to see how big the table typically gets? regards, tom lane
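The dynahash entry point Tom asks for is essentially a one-line accessor; the patch that follows adds something along these lines (the exact name and placement are the patch author's choice):

    /*
     * hash_get_max_bucket -- in dynahash.c
     *
     * Return the highest bucket number currently in use.  dynahash already
     * tracks this in its header for bucket addressing, so exposing it costs
     * almost nothing, and callers can use it as a cheap "how big has this
     * table grown" measure.
     */
    uint32
    hash_get_max_bucket(HTAB *hashp)
    {
        return hashp->hctl->max_bucket;
    }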
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 8 Apr 2019 at 04:09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Um ... I don't see where you're destroying the old hash? In CreateLocalLockHash. > Also, I entirely dislike wiring in assumptions about hash_seq_search's > private state structure here. I think it's worth having an explicit > entry point in dynahash.c to get the current number of buckets. Okay. Added hash_get_max_bucket(). > Also, I would not define "significantly bloated" as "the table has > grown at all". I think the threshold ought to be at least ~100 > buckets, if we're starting at 16. I wouldn't either. I don't think the comment says that. It says there can be slowdowns when it's significantly bloated, and then goes on to say that we just resize when it's bigger than standard. > Probably we ought to try to gather some evidence to inform the > choice of cutoff here. Maybe instrument the regression tests to > see how big the table typically gets? In partition_prune.sql I see use of a bucket as high as 285 on my machine with: drop table lp, coll_pruning, rlp, mc3p, mc2p, boolpart, rp, coll_pruning_multi, like_op_noprune, lparted_by_int2, rparted_by_int2; I've not added any sort of cut-off though as I benchmarked it and surprisingly I don't see any slowdown with the worst case. So I'm thinking there might not be any point.

alter system set plan_cache_mode = 'force_generic_plan';
create table hp (a int primary key) partition by hash (a);
select 'create table hp' || x::text || ' partition of hp for values with (modulus 32, remainder ' || (x)::text || ');' from generate_series(0,31) x;
\gexec

select.sql:
\set p 1
select * from hp where a = :p

Master:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 11834.764309 (excluding connections establishing)
tps = 12279.212223 (excluding connections establishing)
tps = 12007.263547 (excluding connections establishing)

Patched:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 13380.684817 (excluding connections establishing)
tps = 12790.999279 (excluding connections establishing)
tps = 12568.892558 (excluding connections establishing)

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > On the whole I don't think there's an adequate case for committing > this patch. From: Andres Freund [mailto:andres@anarazel.de] > On 2019-04-05 23:03:11 -0400, Tom Lane wrote: > > If I reduce the number of partitions in Amit's example from 8192 > > to something more real-world, like 128, I do still measure a > > performance gain, but it's ~ 1.5% which is below what I'd consider > > a reproducible win. I'm accustomed to seeing changes up to 2% > > in narrow benchmarks like this one, even when "nothing changes" > > except unrelated code. > > I'm not sure it's actually that narrow these days. With all the > partitioning improvements happening, the numbers of locks commonly held > are going to rise. And while 8192 partitions is maybe on the more > extreme side, it's a workload with only a single table, and plenty > workloads touch more than a single partitioned table. I would feel happy if I could say such a many-partitions use case is narrow or impractical and ignore it, but it's not narrow. Two of our customers are actually requesting such usage: one uses 5,500 partitions and is trying to migrate from a commercial database on Linux, and the other requires 200,000 partitions to migrate from a legacy database on a mainframe. At first, I thought so many partitions indicated a bad application design, but it sounded valid (or at least I can't insist that it's bad). PostgreSQL is now expected to handle such huge workloads. From: Andres Freund [mailto:andres@anarazel.de] > I'm not sure I'm quite that concerned. For one, a good bit of that space > was up for grabs until the recent reordering of LOCALLOCK and nobody > complained. But more importantly, I think commonly the amount of locks > around is fairly constrained, isn't it? We can't really have that many > concurrently held locks, due to the shared memory space, and the size of > a LOCALLOCK isn't that big compared to say relcache entries. We also > probably fairly easily could win some space back - e.g. make nLocks 32 > bits. +1 From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > I'd also point out that this is hardly the only place where we've > seen hash_seq_search on nearly-empty hash tables become a bottleneck. > So I'm not thrilled about attacking that with one-table-at-time patches. > I'd rather see us do something to let hash_seq_search win across > the board. > > I spent some time wondering whether we could adjust the data structure > so that all the live entries in a hashtable are linked into one chain, > but I don't quite see how to do it without adding another list link to > struct HASHELEMENT, which seems pretty expensive. I think the linked list of LOCALLOCK approach is natural, simple, and good. In Jim Gray's classic book "Transaction Processing: Concepts and Techniques", we can find the following sentence in "8.4.5 Lock Manager Internal Logic." The sample implementation code in the book uses a similar linked list to remember and release a transaction's acquired locks. "All the locks of a transaction are kept in a list so they can be quickly found and released at commit or rollback." And handling this issue with the LOCALLOCK linked list is more natural than resizing the hash table. We just want to quickly find all grabbed locks, so we use a linked list. A hash table is a mechanism to find a particular item quickly. So it was simply wrong to use the hash table to iterate over all grabbed locks.
Also, the hash table got big because some operation in the session needed it, and some subsequent operations in the same session may need it again. So shrinking the hash table wouldn't give us much relief. Regards Takayuki Tsunakawa
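For reference, the linked-list idea argued for here can be sketched with PostgreSQL's intrusive dlist machinery. The member and function names below are illustrative only, and the original patch links the LOCALLOCKs from PGPROC rather than from a file-local list:

    #include "lib/ilist.h"

    /* Illustrative: each LOCALLOCK would carry a list link. */
    typedef struct LOCALLOCK
    {
        /* ... existing fields: tag, hashcode, nLocks, lockOwners, ... */
        dlist_node  procLink;       /* position in this backend's lock list */
    } LOCALLOCK;

    static dlist_head heldLocalLocks = DLIST_STATIC_INIT(heldLocalLocks);

    /* Remember a lock when it is first acquired. */
    static void
    RememberHeldLocalLock(LOCALLOCK *locallock)
    {
        dlist_push_tail(&heldLocalLocks, &locallock->procLink);
    }

    /* At commit or abort, walk the list instead of hash_seq_search(). */
    static void
    ReleaseAllHeldLocalLocks(void)
    {
        dlist_mutable_iter iter;

        dlist_foreach_modify(iter, &heldLocalLocks)
        {
            LOCALLOCK  *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);

            dlist_delete(&locallock->procLink);
            /* ... release the lock and remove its hash table entry ... */
        }
    }

This keeps lock release proportional to the number of locks actually held, at the cost of the extra dlist_node per LOCALLOCK that the sizeof(LOCALLOCK) discussion earlier in the thread is about.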
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
'Andres Freund'
Date:
Hi, On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote: > I think the linked list of LOCALLOCK approach is natural, simple, and > good. Did you see that people measured slowdowns? Greetings, Andres Freund
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: 'Andres Freund' [mailto:andres@anarazel.de] > On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote: > > I think the linked list of LOCALLOCK approach is natural, simple, and > > good. > > Did you see that people measured slowdowns? Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And that might be within noise as was mentioned. If we want to remove even the noise, we may have to think of removing the LocalLockHash completely. But it doesn't seem feasible... Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 8 Apr 2019 at 14:54, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > > From: 'Andres Freund' [mailto:andres@anarazel.de] > > Did you see that people measured slowdowns? > > Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And thatmight be within noise as was mentioned. > > If we want to remove even the noise, we may have to think of removing the LocalLockHash completely. But it doesn't seemfeasible... It would be good to get your view on the shrink_bloated_locallocktable_v3.patch I worked on last night. I was unable to measure any overhead to solving the problem that way. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com] > It would be good to get your view on the > shrink_bloated_locallocktable_v3.patch I worked on last night. I was > unable to measure any overhead to solving the problem that way. Thanks, it looks super simple and good. My understanding of the idea behind your patch is: * Transactions that touch many partitions and/or tables are a special event and not the norm, and the hash table bloat is an unlucky accident. So it's reasonable to revert the bloated hash table back to its original size. * Repeated transactions that acquire many locks have to enlarge the hash table every time. However, the overhead of hash table expansion should be hidden behind the other processing involved (acquiring/releasing locks, reading/writing the relations, accessing the catalogs of those relations). TBH, I think the linked list approach feels more intuitive because the resulting code looks like what it wants to do (efficiently iterate over acquired locks) and is based on the classic book. But your approach seems to relieve people's concerns, so I'm OK with your patch. I'm registering you as another author and me as a reviewer, and marking this ready for committer. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
Peter Eisentraut
Date:
On 2019-04-08 05:46, Tsunakawa, Takayuki wrote: > I'm registering you as another author and me as a reviewer, and marking this ready for committer. Moved to next commit fest. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 8 Apr 2019 at 04:09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, I would not define "significantly bloated" as "the table has > grown at all". I think the threshold ought to be at least ~100 > buckets, if we're starting at 16. I've revised the patch to add a new constant named LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash table grows over that size we shrink it back down to LOCKMETHODLOCALHASH_INIT_SIZE, which I've kept at 16. I'm not opposed to setting it to 128. For this particular benchmark, it won't make any difference as it's only going to affect something that does not quite use 128 locks and has to work with a slightly bloated local lock table. I think hitting 64 locks in a transaction is a good indication that it's not a simple transaction so users are probably unlikely to notice the small slowdown from the hash table reinitialisation. Since quite a bit has changed around partition planning lately, I've taken a fresh set of benchmarks on today's master. I'm using something very close to Amit's benchmark from upthread. I just changed the query so we hit the same partition each time instead of a random one. create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p master @ a193cbec119 + shrink_bloated_locallocktable_v4.patch: plan_cache_mode = 'auto'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 14101.226982 (excluding connections establishing) tps = 14034.250962 (excluding connections establishing) tps = 14107.937755 (excluding connections establishing) plan_cache_mode = 'force_custom_plan'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 14240.366770 (excluding connections establishing) tps = 14272.244886 (excluding connections establishing) tps = 14130.684315 (excluding connections establishing) master @ a193cbec119: plan_cache_mode = 'auto'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 10467.027666 (excluding connections establishing) tps = 10333.700917 (excluding connections establishing) tps = 10633.084426 (excluding connections establishing) plan_cache_mode = 'force_custom_plan'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 13938.083272 (excluding connections establishing) tps = 14143.241802 (excluding connections establishing) tps = 14097.406758 (excluding connections establishing) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com] > I've revised the patch to add a new constant named > LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash Thank you, and good performance. The patch passed make check. I'm OK with the current patch, but I have a few comments. Please take them as you see fit (I wouldn't mind if you don't.)

(1)
+#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64

How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value to trigger shrinkage? Code in PostgreSQL seems to use the term threshold.

(2)
+/* Complain if the above are not set to something sane */
+#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
+#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
+#endif

I don't think these are necessary, because these are fixed and not configurable. FYI, src/include/utils/memutils.h doesn't have #error to test these macros.

#define ALLOCSET_DEFAULT_MINSIZE 0
#define ALLOCSET_DEFAULT_INITSIZE (8 * 1024)
#define ALLOCSET_DEFAULT_MAXSIZE (8 * 1024 * 1024)

(3)
+ if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_SIZE)
+ CreateLocalLockHash();

I get an impression that Create just creates something where there's nothing. How about Init or Recreate? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 17 Jun 2019 at 15:05, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > (1) > +#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64 > > How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value to trigger shrinkage? Code in PostgreSQL seems to use the term threshold. That's probably better. I've renamed it to that. > (2) > +/* Complain if the above are not set to something sane */ > +#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE > +#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE" > +#endif > > I don't think these are necessary, because these are fixed and not configurable. FYI, src/include/utils/memutils.h doesn't have #error to test these macros. Yeah. I was thinking it was overkill when I wrote it, but somehow couldn't bring myself to remove it. Done now. > (3) > + if (hash_get_num_entries(LockMethodLocalHash) == 0 && > + hash_get_max_bucket(LockMethodLocalHash) > > + LOCKMETHODLOCALHASH_SHRINK_SIZE) > + CreateLocalLockHash(); > > I get an impression that Create just creates something where there's nothing. How about Init or Recreate? Renamed to InitLocalLockHash(). v5 is attached. Thank you for the review. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com] v5 is attached. Thank you, looks good. I find it ready for committer (I noticed the status is already set so.) Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Thu, 27 Jun 2019 at 12:59, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > > From: David Rowley [mailto:david.rowley@2ndquadrant.com] > Thank you, looks good. I find it ready for committer (I noticed the status is already set so.) Thanks for looking. I've just been looking at this again and I thought I'd better check the performance of the worst case for the patch, where the hash table is rebuilt each query. To do this I first created a single column 70 partition partitioned table ("p") and left it empty. I then checked the performance of: SELECT * FROM p; Having 70 partitions means that the lock table's max bucket goes over the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD which is set to 64 and results in the table being rebuilt each time the query is run. The performance was as follows: 70 partitions: LOCKMETHODLOCALHASH_SHRINK_THRESHOLD = 64 master + shrink_bloated_locallocktable_v5.patch: ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 8427.053378 (excluding connections establishing) tps = 8583.251821 (excluding connections establishing) tps = 8569.587268 (excluding connections establishing) tps = 8552.988483 (excluding connections establishing) tps = 8527.735108 (excluding connections establishing) master (93907478): ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 8712.919411 (excluding connections establishing) tps = 8760.190372 (excluding connections establishing) tps = 8755.069470 (excluding connections establishing) tps = 8747.389735 (excluding connections establishing) tps = 8758.275202 (excluding connections establishing) patched is 2.45% slower If I increase the partition count to 140 and put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up to 128, then the performance is as follows: master + shrink_bloated_locallocktable_v5.patch: ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 2548.917856 (excluding connections establishing) tps = 2561.283564 (excluding connections establishing) tps = 2549.669870 (excluding connections establishing) tps = 2421.971864 (excluding connections establishing) tps = 2428.983660 (excluding connections establishing) Master (93907478): ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 2605.407529 (excluding connections establishing) tps = 2600.691426 (excluding connections establishing) tps = 2594.123983 (excluding connections establishing) tps = 2455.745644 (excluding connections establishing) tps = 2450.061483 (excluding connections establishing) patched is 1.54% slower I'd rather not put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up any higher than 128 since it can detract from the improvement we're trying to make with this patch. Now, this case of querying a partitioned table that happens to be completely empty seems a bit unrealistic. Something more realistic might be index scanning all partitions to find a value that only exists in a single partition. Assuming the partitions actually have some records, then that's going to be a more expensive query, so the overhead of rebuilding the table will be less noticeable. A previous version of the patch has already had some heuristics to try to only rebuild the hash table when it's likely beneficial. I'd rather not go exploring in that area again. Is anyone particularly concerned about the worst-case slowdown here being about 1.54%? The best case, and arguably a more realistic case above showed a 34% speedup for the best case. 
-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Thu, 18 Jul 2019 at 14:53, David Rowley <david.rowley@2ndquadrant.com> wrote: > Is anyone particularly concerned about the worst-case slowdown here > being about 1.54%? The best case, and arguably a more realistic case > above showed a 34% speedup for the best case. I took a bit more time to test the performance on this. I thought I might have been a bit unfair on the patch by giving it completely empty tables to look at. It just seems too unrealistic to have a large number of completely empty partitions. I decided to come up with a more realistic case where there are 140 partitions but we're performing an index scan to find a record that can exist in only 1 of the 140 partitions. create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 140000) x; analyze hp; select3.sql: select * from hp where b = 1 Master: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2124.591367 (excluding connections establishing) tps = 2158.754837 (excluding connections establishing) tps = 2146.348465 (excluding connections establishing) tps = 2148.995800 (excluding connections establishing) tps = 2154.223600 (excluding connections establishing) Patched: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2002.480729 (excluding connections establishing) tps = 1997.272944 (excluding connections establishing) tps = 1992.527079 (excluding connections establishing) tps = 1995.789367 (excluding connections establishing) tps = 2001.195760 (excluding connections establishing) so it turned out it's even slower, and not by a small amount either! Patched is 6.93% slower on average with this case :-( I went back to the drawing board on this and I've added some code that counts the number of times we've seen the table to be oversized and just shrinks the table back down on the 1000th time. 6.93% / 1000 is not all that much. Of course, not all the extra overhead might be from rebuilding the table, so here's a test with the updated patch. $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2142.414323 (excluding connections establishing) tps = 2139.666899 (excluding connections establishing) tps = 2138.744789 (excluding connections establishing) tps = 2138.300299 (excluding connections establishing) tps = 2137.346278 (excluding connections establishing) Just a 0.34% drop. Pretty hard to pick that out the noise. Testing the original case that shows the speedup: create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p Master: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 10172.035036 (excluding connections establishing) tps = 10192.780529 (excluding connections establishing) tps = 10331.306003 (excluding connections establishing) Patched: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 15080.765549 (excluding connections establishing) tps = 14994.404069 (excluding connections establishing) tps = 14982.923480 (excluding connections establishing) That seems fine, 46% faster. v6 is attached. I plan to push this in a few days unless someone objects. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
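Based on the description above, the v6 behaviour can be summarised in a sketch like the following. The constant values and the counter handling are assumptions drawn from the mail, not the patch text itself:

    #define LOCKMETHODLOCALHASH_SHRINK_THRESHOLD   64      /* buckets */
    #define LOCALLOCK_SHRINK_EVERY_N               1000    /* assumed */

    static int  timesSeenOversized = 0;

    static void
    MaybeShrinkLocalLockHash(void)
    {
        if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
            hash_get_max_bucket(LockMethodLocalHash) >
            LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
        {
            /* Only pay the rebuild cost once per ~1000 oversized sightings. */
            if (++timesSeenOversized >= LOCALLOCK_SHRINK_EVERY_N)
            {
                timesSeenOversized = 0;
                hash_destroy(LockMethodLocalHash);
                InitLocalLockHash();    /* recreate at the initial size */
            }
        }
    }

This is what amortises the roughly 7% rebuild cost down to a negligible per-transaction overhead, at the price of the delayed reaction that is criticised later in the thread.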
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com] > I went back to the drawing board on this and I've added some code that counts > the number of times we've seen the table to be oversized and just shrinks > the table back down on the 1000th time. 6.93% / 1000 is not all that much. I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder "Why is the same query fast at first in the session (5 or 6 executions), then slower for a while, and then faster again? Is there something to tune? Am I doing something wrong in my app (e.g. how I use prepared statements)?" So I prefer v5. > Of course, not all the extra overhead might be from rebuilding the table, > so here's a test with the updated patch. Where else does the extra overhead come from? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > > From: David Rowley [mailto:david.rowley@2ndquadrant.com] > > I went back to the drawing board on this and I've added some code that counts > > the number of times we've seen the table to be oversized and just shrinks > > the table back down on the 1000th time. 6.93% / 1000 is not all that much. > > I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder "Why is the same query fast at first in the session (5 or 6 executions), then slower for a while, and then faster again? Is there something to tune? Am I doing something wrong in my app (e.g. how I use prepared statements)?" So I prefer v5. I personally don't think that's true. The only way you'll notice the LockReleaseAll() overhead is to execute very fast queries with a bloated lock table. It's pretty hard to notice that a single 0.1ms query is slow. You'll need to execute thousands of them before you'll be able to measure it, and once you've done that, the lock shrink code will have run and the query will be performing optimally again. I voiced my concerns with v5 and I wasn't really willing to push it with a known performance regression of 7% in a fairly valid case. v6 does not suffer from that. > > Of course, not all the extra overhead might be from rebuilding the table, > so here's a test with the updated patch. > > Where else does the extra overhead come from?

  hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)

that's executed every time, not every 1000 times. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com] > I personally don't think that's true. The only way you'll notice the > LockReleaseAll() overhead is to execute very fast queries with a > bloated lock table. It's pretty hard to notice that a single 0.1ms > query is slow. You'll need to execute thousands of them before you'll > be able to measure it, and once you've done that, the lock shrink code > will have run and the query will be performing optimally again. Maybe so. Will the difference be noticeable between plan_cache_mode=auto (default) and plan_cache_mode=force_custom_plan? > I voiced my concerns with v5 and I wasn't really willing to push it > with a known performance regression of 7% in a fairly valid case. v6 > does not suffer from that. You're right. We may have to accept the unpredictability this hidden behavior presents to users as a compromise for higher throughput. > > Where else does the extra overhead come from? > > hash_get_num_entries(LockMethodLocalHash) == 0 && > + hash_get_max_bucket(LockMethodLocalHash) > > + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD) > > that's executed every time, not every 1000 times. I see. Thanks. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 22 Jul 2019 at 14:21, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > > From: David Rowley [mailto:david.rowley@2ndquadrant.com] > > I personally don't think that's true. The only way you'll notice the > > LockReleaseAll() overhead is to execute very fast queries with a > > bloated lock table. It's pretty hard to notice that a single 0.1ms > > query is slow. You'll need to execute thousands of them before you'll > > be able to measure it, and once you've done that, the lock shrink code > > will have run and the query will be performing optimally again. > > Maybe so. Will the difference be noticeable between plan_cache_mode=auto (default) and plan_cache_mode=custom? For the use case we've been measuring with partitioned tables and the generic plan generation causing a sudden spike in the number of obtained locks, then having plan_cache_mode = force_custom_plan will cause the lock table not to become bloated. I'm not sure there's anything interesting to measure there. The only additional code that gets executed is the hash_get_num_entries() and possibly hash_get_max_bucket. Maybe it's worth swapping the order of those calls since most of the time the entry will be 0 and the max bucket count threshold won't be exceeded. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
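The reordering David suggests is a plain short-circuit tweak; a sketch of the swapped condition, using the names from the patch under discussion, might read:

    /*
     * Test the max-bucket condition first: it is usually false, so the
     * common path skips the hash_get_num_entries() call entirely.
     */
    if (hash_get_max_bucket(LockMethodLocalHash) >
            LOCKMETHODLOCALHASH_SHRINK_THRESHOLD &&
        hash_get_num_entries(LockMethodLocalHash) == 0)
    {
        /* rebuild the local lock table at its initial size */
    }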
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com] > For the use case we've been measuring with partitioned tables and the > generic plan generation causing a sudden spike in the number of > obtained locks, then having plan_cache_mode = force_custom_plan will > cause the lock table not to become bloated. I'm not sure there's > anything interesting to measure there. I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ... WHERE pkey = value): * plan_cache_mode=force_custom_plan: LocalLockHash won't bloat. The query execution time is steady. * plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan. The generic plan is not adopted because its cost is high. Later executions of the query will suffer from the bloat until the 1006th execution, when LocalLockHash is shrunk. Depending on the number of transactions and what each transaction does, I thought the difference might or might not be noticeable. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 22 Jul 2019 at 16:36, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > > From: David Rowley [mailto:david.rowley@2ndquadrant.com] > > For the use case we've been measuring with partitioned tables and the > > generic plan generation causing a sudden spike in the number of > > obtained locks, then having plan_cache_mode = force_custom_plan will > > cause the lock table not to become bloated. I'm not sure there's > > anything interesting to measure there. > > I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ... WHEREpkey = value): > > * plan_cache_mode=force_custom_plan: LocalLockHash won't bloat. The query execution time is steady. > > * plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan. The genericplan is not adopted because its cost is high. Later executions of the query will suffer from the bloat until the1006th execution when LocalLockHash is shrunk. I measured this again in https://www.postgresql.org/message-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=J-b5A2MegQ@mail.gmail.com where I posted the v6 patch. It's the final results in the email. I didn't measure plan_cache_mode = force_custom_plan. There'd be no lock table bloat in that case and the additional overhead would just be from the hash_get_num_entries() && hash_get_max_bucket() calls, which the first results show next to no overhead for. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > > From: David Rowley [mailto:david.rowley@2ndquadrant.com] > > I went back to the drawing board on this and I've added some code that counts > > the number of times we've seen the table to be oversized and just shrinks > > the table back down on the 1000th time. 6.93% / 1000 is not all that much. > > I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder "Why is the same query fastat first in the session (5 or 6 times of execution), then gets slower for a while, and gets faster again? Is there somethingto tune? Am I missing something wrong with my app (e.g. how to use prepared statements)?" So I prefer v5. Another counter-argument to this is that there's already an unexplainable slowdown after you run a query which obtains a large number of locks in a session or use prepared statements and a partitioned table with the default plan_cache_mode setting. Are we not just righting a wrong here? Albeit, possibly 1000 queries later. I am, of course, open to other ideas which solve the problem that v5 has, but failing that, I don't see v6 as all that bad. At least all the logic is contained in one function. I know Tom wanted to steer clear of heuristics to reinitialise the table, but most of the stuff that was in the patch back when he complained was trying to track the average number of locks over the previous N transactions, and his concerns were voiced before I showed the 7% performance regression with unconditionally rebuilding the table. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
Andres Freund
Date:
Hi, On 2019-07-21 21:37:28 +1200, David Rowley wrote: > select.sql: > \set p 1 > select * from ht where a = :p > > Master: > > $ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 10172.035036 (excluding connections establishing) > tps = 10192.780529 (excluding connections establishing) > tps = 10331.306003 (excluding connections establishing) > > Patched: > > $ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 15080.765549 (excluding connections establishing) > tps = 14994.404069 (excluding connections establishing) > tps = 14982.923480 (excluding connections establishing) > > That seems fine, 46% faster. > > v6 is attached. > > I plan to push this in a few days unless someone objects. It does seem far less objectionable than the other case. I hate to throw in one more wrench into a topic finally making progress, but: Have either of you considered just replacing the dynahash table with a simplehash style one? Given the obvious speed sensitivity, and the fact that for it (in contrast to the shared lock table) no partitioning is needed, that seems like a good thing to try. It seems quite possible that both the iteration and plain manipulations are going to be faster, due to far less indirections - e.g. the iteration through the array will just be an array walk with a known stride, far easier for the CPU to prefetch. Greetings, Andres Freund
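To make Andres's suggestion more tangible, here is a rough sketch of what a simplehash-based local lock table could look like. The element layout, names, and header paths are illustrative only; the simplehash macro interface is shown as commonly used elsewhere in the tree, so treat the details as unverified:

    #include "postgres.h"
    #include <string.h>
    #include "storage/lock.h"       /* LOCALLOCKTAG, LOCALLOCK */
    #include "utils/hashutils.h"    /* tag_hash; header location varies by version */

    typedef struct LocalLockHashEntry
    {
        LOCALLOCKTAG tag;           /* key */
        char        status;         /* bookkeeping field required by simplehash */
        LOCALLOCK  *locallock;      /* payload */
    } LocalLockHashEntry;

    #define SH_PREFIX        locallockhash
    #define SH_ELEMENT_TYPE  LocalLockHashEntry
    #define SH_KEY_TYPE      LOCALLOCKTAG
    #define SH_KEY           tag
    #define SH_HASH_KEY(tb, key)  tag_hash(&(key), sizeof(LOCALLOCKTAG))
    #define SH_EQUAL(tb, a, b)    (memcmp(&(a), &(b), sizeof(LOCALLOCKTAG)) == 0)
    #define SH_SCOPE         static inline
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"

    /*
     * Releasing all locks then iterates the open-addressed element array:
     *
     *     locallockhash_iterator it;
     *     LocalLockHashEntry *ent;
     *
     *     locallockhash_start_iterate(table, &it);
     *     while ((ent = locallockhash_iterate(table, &it)) != NULL)
     *         ... release ent->locallock ...
     */

The appeal, as Andres says, is that iteration becomes a predictable array walk rather than chasing dynahash's chained buckets.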
RE: Speed up transaction completion faster after many relations areaccessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: David Rowley [mailto:david.rowley@2ndquadrant.com] > Another counter-argument to this is that there's already an > unexplainable slowdown after you run a query which obtains a large > number of locks in a session or use prepared statements and a > partitioned table with the default plan_cache_mode setting. Are we not > just righting a wrong here? Albeit, possibly 1000 queries later. > > I am, of course, open to other ideas which solve the problem that v5 > has, but failing that, I don't see v6 as all that bad. At least all > the logic is contained in one function. I know Tom wanted to steer > clear of heuristics to reinitialise the table, but most of the stuff > that was in the patch back when he complained was trying to track the > average number of locks over the previous N transactions, and his > concerns were voiced before I showed the 7% performance regression > with unconditionally rebuilding the table. I think I understand what you mean. Sorry, I don't have a better idea. This unexplainability is probably what we should accept to avoid the shocking 7% slowdown. OTOH, how about my original patch that is based on the local lock list? I expect that it won't cause that significant a slowdown in the same test case. If it's not satisfactory, then v6 is the best to commit. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations areaccessed in a transaction
From
David Rowley
Date:
On Tue, 23 Jul 2019 at 15:47, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > OTOH, how about my original patch that is based on the local lock list? I expect that it won't that significant slowdownin the same test case. If it's not satisfactory, then v6 is the best to commit. I think we need to move beyond the versions that have been reviewed and rejected. I don't think the reasons for their rejection will have changed. I've attached v7, which really is v6 with some comments adjusted and the order of the hash_get_num_entries and hash_get_max_bucket function calls swapped. I think hash_get_num_entries() will return 0 most of the time where we're calling it, so it makes sense to put the condition that's less likely to be true first in the if condition. I'm pretty happy with v7 now. If anyone has any objections to it, please speak up very soon. Otherwise, I plan to push it about this time tomorrow. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes: > I've attached v7, which really is v6 with some comments adjusted and > the order of the hash_get_num_entries and hash_get_max_bucket function > calls swapped. I think hash_get_num_entries() will return 0 most of > the time where we're calling it, so it makes sense to put the > condition that's less likely to be true first in the if condition. > I'm pretty happy with v7 now. If anyone has any objections to it, > please speak up very soon. Otherwise, I plan to push it about this > time tomorrow. I dunno, this seems close to useless in this form. As it stands, once hash_get_max_bucket has exceeded the threshold, you will arbitrarily reset the table 1000 transactions later (since the max bucket is certainly not gonna decrease otherwise). So that's got two big problems: 1. In the assumed-common case where most of the transactions take few locks, you wasted cycles for 999 transactions. 2. You'll reset the table independently of subsequent history, even if the session's usage is pretty much always over the threshold. Admittedly, if you do this only once per 1K transactions, it's probably not a horrible overhead --- but you can't improve point 1 without making it a bigger overhead. I did complain about the cost of tracking the stats proposed by some earlier patches, but I don't think the answer to that is to not track any stats at all. The proposed hash_get_max_bucket() function is quite cheap, so I think it wouldn't be out of line to track the average value of that at transaction end over the session's lifespan, and reset if the current value is more than some-multiple of the average. The tricky part here is that if some xact kicks that up to say 64 entries, and we don't choose to reset, then the reading for subsequent transactions will be 64 even if they use very few locks. So we probably need to not use a plain average, but account for that effect somehow. Maybe we could look at how quickly the number goes up after we reset? [ thinks for awhile... ] As a straw-man proposal, I suggest the following (completely untested) idea: * Make the table size threshold variable, not constant. * If, at end-of-transaction when the table is empty, the table bucket count exceeds the threshold, reset immediately; but if it's been less than K transactions since the last reset, increase the threshold (by say 10%). I think K can be a constant; somewhere between 10 and 100 would probably work. At process start, we should act as though the last reset were more than K transactions ago (i.e., don't increase the threshold at the first reset). The main advantage this has over v7 is that we don't have the 1000-transaction delay before reset, which ISTM is giving up much of the benefit of the whole idea. Also, if the session is consistently using lots of locks, we'll adapt to that after awhile and not do useless table resets. regards, tom lane
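Tom's straw-man translates fairly directly into code; the following untested sketch uses illustrative names, with K set arbitrarily within the suggested 10 to 100 range:

    #define SHRINK_K    50          /* transactions; anywhere in 10..100 */

    static long localHashShrinkThreshold = 64;      /* starting bucket threshold */
    static long xactsSinceLastReset = SHRINK_K;     /* "last reset was long ago" */

    static void
    AtEOXact_MaybeShrinkLocalLockHash(void)
    {
        xactsSinceLastReset++;

        if (hash_get_num_entries(LockMethodLocalHash) != 0)
            return;                 /* session-level locks remain; wait */

        if (hash_get_max_bucket(LockMethodLocalHash) > localHashShrinkThreshold)
        {
            /* Resetting again soon after the last reset? Raise the bar by 10%. */
            if (xactsSinceLastReset < SHRINK_K)
                localHashShrinkThreshold += localHashShrinkThreshold / 10;

            hash_destroy(LockMethodLocalHash);
            InitLocalLockHash();
            xactsSinceLastReset = 0;
        }
    }

David's objection in the reply that follows is visible in this sketch: localHashShrinkThreshold only ever grows, so a session that once pushed it up never gets the cheap small table back.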
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
Thanks for having a look at this. On Wed, 24 Jul 2019 at 04:13, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <david.rowley@2ndquadrant.com> writes: > > I'm pretty happy with v7 now. If anyone has any objections to it, > > please speak up very soon. Otherwise, I plan to push it about this > > time tomorrow. > > I dunno, this seems close to useless in this form. As it stands, > once hash_get_max_bucket has exceeded the threshold, you will > arbitrarily reset the table 1000 transactions later (since the > max bucket is certainly not gonna decrease otherwise). So that's > got two big problems: > > 1. In the assumed-common case where most of the transactions take > few locks, you wasted cycles for 999 transactions. > > 2. You'll reset the table independently of subsequent history, > even if the session's usage is pretty much always over the > threshold. Admittedly, if you do this only once per 1K > transactions, it's probably not a horrible overhead --- but you > can't improve point 1 without making it a bigger overhead. This is true, but I think you might be overestimating just how much effort is wasted with #1. We're only seeing this overhead in small, very fast to execute xacts. In the test case in [1], I was getting about 10k TPS unpatched and about 15k patched. This means that, on average, 1 unpatched xact takes 100 microseconds and 1 average patched xact takes 66 microseconds, so the additional time spent doing the hash_seq_search() must be about 34 microseconds. So we'll waste a total of 34 *milliseconds* if we wait for 1000 xacts before we reset the table. With 10k TPS we're going to react to the change in 0.1 seconds. I think you'd struggle to measure that 1 xact is taking 34 microseconds longer without running a few thousand queries. My view is that nobody is ever going to notice that it takes 1k xacts to shrink the table, and I've already shown that the overhead of the shrink every 1k xacts is tiny. I mentioned 0.34% in [1] using v6. This is likely a bit smaller in v7 due to swapping the order of the if condition to put the less likely case first. Since the overhead of rebuilding the table was 7% when done every xact, it stands to reason that this has become 0.007% doing it every 1k xacts and that the additional overhead to make up that 0.34% is from testing if the reset condition has been met (or noise). That's not something we can remove completely with any solution that resets the hash table. > I did complain about the cost of tracking the stats proposed by > some earlier patches, but I don't think the answer to that is to > not track any stats at all. The proposed hash_get_max_bucket() > function is quite cheap, so I think it wouldn't be out of line to > track the average value of that at transaction end over the > session's lifespan, and reset if the current value is more than > some-multiple of the average. > > The tricky part here is that if some xact kicks that up to > say 64 entries, and we don't choose to reset, then the reading > for subsequent transactions will be 64 even if they use very > few locks. So we probably need to not use a plain average, > but account for that effect somehow. Maybe we could look at > how quickly the number goes up after we reset? > > [ thinks for awhile... ] As a straw-man proposal, I suggest > the following (completely untested) idea: > > * Make the table size threshold variable, not constant. 
> > * If, at end-of-transaction when the table is empty, > the table bucket count exceeds the threshold, reset > immediately; but if it's been less than K transactions > since the last reset, increase the threshold (by say 10%). > > I think K can be a constant; somewhere between 10 and 100 would > probably work. At process start, we should act as though the last > reset were more than K transactions ago (i.e., don't increase the > threshold at the first reset). I think the problem with this idea is that once the threshold has been enlarged, there is no way to recover and work better for workloads that require very few locks again. If we end up with some large value for the variable threshold, there's no way to decrease that again. All this method stops is the needless hash table resets if the typical case requires many locks. The only way to know if we can reduce the threshold again is to count the locks released during LockReleaseAll(). Some version of the patch did that, and you objected. > The main advantage this has over v7 is that we don't have the > 1000-transaction delay before reset, which ISTM is giving up > much of the benefit of the whole idea. Also, if the session > is consistently using lots of locks, we'll adapt to that after > awhile and not do useless table resets. True, but you neglected to mention the looming and critical drawback, which pretty much makes that idea useless. All we'd need to do is give this a workload that throws that variable threshold up high so that it can't recover. It would be pretty simple then to show that LockReleaseAll() is still slow with workloads that just require a small number of locks... permanently with no means to recover. To be able to reduce the threshold down again we'd need to make a hash_get_num_entries(LockMethodLocalHash) call before performing the guts of LockReleaseAll(). We could then weight that onto some running average counter with a weight of, say... 10, so we react to changes fairly quickly, but not instantly. We could then have some sort of logic like "rebuild the hash table if the running average is 4 times less than max_bucket". I've attached a spreadsheet of that idea and the algorithm we could use to track the running average. Initially, I mocked up a series of transactions that use 1000 locks, then at row 123 dropped that to 10 locks. If we assume the max_bucket is 1000, then it takes until row 136 for the running average to drop below the max_bucket count, i.e. 13 xacts. We'd then reset at the init size of 16. If the average went up again, then we'd automatically expand the table as we do now. To make this work we'd need an additional call to hash_get_num_entries() before we release the locks, so there is more overhead. [1] https://www.postgresql.org/message-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=J-b5A2MegQ@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
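A rough sketch of that reset test follows. The weight of 10, the factor of 4, and the 64-bucket floor are all illustrative rather than final, and hash_get_max_bucket() is the accessor from the patch under discussion:

/*
 * Sketch only: called from LockReleaseAll() before the locks are released,
 * so hash_get_num_entries() still reflects this transaction's lock count.
 */
static double	running_avg_locks = 0.0;

static bool
LocalLockHashIsBloated(void)
{
	long		numLocksHeld = hash_get_num_entries(LockMethodLocalHash);
	double		threshold;

	/* running average with a weight of 10 */
	running_avg_locks -= running_avg_locks / 10.0;
	running_avg_locks += numLocksHeld / 10.0;

	/* rebuild once the table is ~4x bigger than recent usage suggests */
	threshold = Max(64.0, running_avg_locks * 4.0);

	return hash_get_max_bucket(LockMethodLocalHash) > threshold;
}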
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Wed, 24 Jul 2019 at 15:05, David Rowley <david.rowley@2ndquadrant.com> wrote: > To be able to reduce the threshold down again we'd need to make a > hash_get_num_entries(LockMethodLocalHash) call before performing the > guts of LockReleaseAll(). We could then weight that onto some running > average counter with a weight of, say... 10, so we react to changes > fairly quickly, but not instantly. We could then have some sort of > logic like "rebuild the hash table if running average 4 times less > than max_bucket" > > I've attached a spreadsheet of that idea and the algorithm we could > use to track the running average. Initially, I've mocked it up a > series of transactions that use 1000 locks, then at row 123 dropped > that to 10 locks. If we assume the max_bucket is 1000, then it takes > until row 136 for the running average to drop below the max_bucket > count, i.e 13 xacts. There we'd reset there at the init size of 16. If > the average went up again, then we'd automatically expand the table as > we do now. To make this work we'd need an additional call to > hash_get_num_entries(), before we release the locks, so there is more > overhead. Here's a patch with this implemented. I've left a NOTICE in there to make it easier for people to follow along at home and see when the lock table is reset. There will be a bit of additional overhead to the reset detection logic over the v7 patch. Namely: additional hash_get_num_entries() call before releasing the locks, and more complex and floating-point maths instead of more simple integer maths in v7. Here's a demo with the debug NOTICE in there to show us what's going on. -- setup create table a (a int) partition by range (a); select 'create table a'||x||' partition of a for values from('||x||') to ('||x+1||');' from generate_Series(1,1000) x; \gexec $ psql postgres NOTICE: max_bucket = 15, threshold = 64.000000, running_avg_locks 0.100000 Reset? No psql (13devel) # \o /dev/null # select * from a where a < 100; NOTICE: max_bucket = 101, threshold = 64.000000, running_avg_locks 10.090000 Reset? Yes # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 76.324005, running_avg_locks 19.081001 Reset? Yes # select * from a where a < 100; A couple of needless resets there... Maybe we can get rid of those by setting the initial running average up to something higher than 0.0. NOTICE: max_bucket = 99, threshold = 108.691605, running_avg_locks 27.172901 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 137.822449, running_avg_locks 34.455612 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 164.040207, running_avg_locks 41.010052 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 187.636185, running_avg_locks 46.909046 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 208.872559, running_avg_locks 52.218140 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 227.985306, running_avg_locks 56.996326 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 245.186768, running_avg_locks 61.296692 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 260.668091, running_avg_locks 65.167023 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 274.601288, running_avg_locks 68.650322 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 287.141174, running_avg_locks 71.785294 Reset? 
No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 298.427063, running_avg_locks 74.606766 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 308.584351, running_avg_locks 77.146088 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 317.725922, running_avg_locks 79.431480 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 325.953339, running_avg_locks 81.488335 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 333.358002, running_avg_locks 83.339500 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 340.022217, running_avg_locks 85.005554 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 346.019989, running_avg_locks 86.504997 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 351.417999, running_avg_locks 87.854500 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 356.276184, running_avg_locks 89.069046 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 360.648560, running_avg_locks 90.162140 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 364.583710, running_avg_locks 91.145927 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 368.125336, running_avg_locks 92.031334 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 371.312805, running_avg_locks 92.828201 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 374.181519, running_avg_locks 93.545380 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 376.763367, running_avg_locks 94.190842 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 379.087036, running_avg_locks 94.771759 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 381.178345, running_avg_locks 95.294586 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 383.060516, running_avg_locks 95.765129 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 384.754456, running_avg_locks 96.188614 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 386.279022, running_avg_locks 96.569756 Reset? No -- Here I switch to only selecting from 9 partitions instead of 99. # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 351.651123, running_avg_locks 87.912781 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 320.486023, running_avg_locks 80.121506 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 292.437408, running_avg_locks 73.109352 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 267.193665, running_avg_locks 66.798416 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 244.474304, running_avg_locks 61.118576 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 224.026871, running_avg_locks 56.006718 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 205.624176, running_avg_locks 51.406044 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 189.061752, running_avg_locks 47.265438 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 174.155579, running_avg_locks 43.538895 Reset? 
No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 160.740021, running_avg_locks 40.185005 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 148.666016, running_avg_locks 37.166504 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 137.799408, running_avg_locks 34.449852 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 128.019470, running_avg_locks 32.004868 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 119.217522, running_avg_locks 29.804380 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 111.295769, running_avg_locks 27.823942 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 104.166191, running_avg_locks 26.041548 Reset? No # select * from a where a < 10; NOTICE: max_bucket = 99, threshold = 97.749573, running_avg_locks 24.437393 Reset? Yes It took 17 xacts to react to the change and reset the lock table. # select * from a where a < 10; NOTICE: max_bucket = 15, threshold = 91.974617, running_avg_locks 22.993654 Reset? No notice max_bucket is back at 15 again. Any thoughts on this? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Wed, 24 Jul 2019 at 16:16, David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Wed, 24 Jul 2019 at 15:05, David Rowley <david.rowley@2ndquadrant.com> wrote: > > To be able to reduce the threshold down again we'd need to make a > > hash_get_num_entries(LockMethodLocalHash) call before performing the > > guts of LockReleaseAll(). We could then weight that onto some running > > average counter with a weight of, say... 10, so we react to changes > > fairly quickly, but not instantly. We could then have some sort of > > logic like "rebuild the hash table if running average 4 times less > > than max_bucket" > > > > I've attached a spreadsheet of that idea and the algorithm we could > > use to track the running average. Initially, I've mocked it up a > > series of transactions that use 1000 locks, then at row 123 dropped > > that to 10 locks. If we assume the max_bucket is 1000, then it takes > > until row 136 for the running average to drop below the max_bucket > > count, i.e 13 xacts. There we'd reset there at the init size of 16. If > > the average went up again, then we'd automatically expand the table as > > we do now. To make this work we'd need an additional call to > > hash_get_num_entries(), before we release the locks, so there is more > > overhead. > > Here's a patch with this implemented. I've left a NOTICE in there to > make it easier for people to follow along at home and see when the > lock table is reset. Here's a more polished version with the debug code removed, complete with benchmarks. -- Test 1. Select 1 record from a 140-partition table. Tests creating a large number of locks with a fast query. create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 140000) x; analyze hp; select3.sql: select * from hp where b = 1 - Master: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2098.628975 (excluding connections establishing) tps = 2101.797672 (excluding connections establishing) tps = 2085.317292 (excluding connections establishing) tps = 2094.931999 (excluding connections establishing) tps = 2092.328908 (excluding connections establishing) Patched: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2101.691459 (excluding connections establishing) tps = 2104.533249 (excluding connections establishing) tps = 2106.499123 (excluding connections establishing) tps = 2104.033459 (excluding connections establishing) tps = 2105.463629 (excluding connections establishing) (I'm surprised there is not more overhead in the additional tracking added to calculate the running average) -- Test 2. Tests a prepared query which will perform a generic plan on the 6th execution then fall back on a custom plan due to it pruning all but one partition. Master suffers from the lock table becoming bloated after locking all partitions when planning the generic plan. 
create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p Master: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 10207.780843 (excluding connections establishing) tps = 10205.772688 (excluding connections establishing) tps = 10214.896449 (excluding connections establishing) tps = 10157.572153 (excluding connections establishing) tps = 10147.764477 (excluding connections establishing) Patched: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 14677.636570 (excluding connections establishing) tps = 14661.437186 (excluding connections establishing) tps = 14647.202877 (excluding connections establishing) tps = 14784.165753 (excluding connections establishing) tps = 14850.355344 (excluding connections establishing) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes: > Here's a more polished version with the debug code removed, complete > with benchmarks. A few gripes: You're measuring the number of locks held at completion of the transaction, which fails to account for locks transiently taken and released, so that the actual peak usage will be more. I think we mostly only do that for system catalog accesses, so *maybe* the number of extra locks involved isn't very large, but that's a very shaky assumption. I don't especially like the fact that this seems to have a hard-wired (and undocumented) assumption that buckets == entries, ie that the fillfactor of the table is set at 1.0. lock.c has no business knowing that. Perhaps instead of returning the raw bucket count, you could have dynahash.c return bucket count times fillfactor, so that the number is in the same units as the entry count. This: running_avg_locks -= running_avg_locks / 10.0; running_avg_locks += numLocksHeld / 10.0; seems like a weird way to do the calculation. Personally I'd write running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0; which is more the way I'm used to seeing exponential moving averages computed --- at least, it seems clearer to me why this will converge towards the average value of numLocksHeld over time. It also makes it clear that it wouldn't be sane to use two different divisors, whereas your formulation makes it look like maybe they could be set independently. Your compiler might not complain that LockReleaseAll's numLocksHeld is potentially uninitialized, but other people's compilers will. On the whole, I don't especially like this approach, because of the confusion between peak lock count and end-of-xact lock count. That seems way too likely to cause problems. regards, tom lane
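For the fillfactor point, the kind of dynahash accessor being suggested might look roughly like the following. This is a hypothetical function (name and placement invented for illustration); it would have to live in dynahash.c, where HASHHDR's fields are visible:

/*
 * Hypothetical: return the bucket count scaled by the table's fill factor,
 * so callers can compare the result directly with hash_get_num_entries().
 */
long
hash_get_max_entries(HTAB *hashp)
{
	HASHHDR    *hctl = hashp->hctl;

	return (long) ((hctl->max_bucket + 1) * hctl->ffactor);
}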
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Thomas Munro
Date:
On Thu, Jul 25, 2019 at 5:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: > > Here's a more polished version with the debug code removed, complete > > with benchmarks. > > A few gripes: > > [gripes] Based on the above, I've set this to "Waiting on Author", in the next CF. -- Thomas Munro https://enterprisedb.com
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Thu, 25 Jul 2019 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote: > On the whole, I don't especially like this approach, because of the > confusion between peak lock count and end-of-xact lock count. That > seems way too likely to cause problems. Thanks for having a look at this. I've not addressed the points you mentioned, because of what you mention above. The only way I can think of so far to resolve that would be to add something to track peak lock usage. The best I can think of to do that, short of adding something to dynahash.c, is to check how many locks are held each time we obtain a lock, and if that count is higher than the previous time we checked, update the maximum locks held (probably in a global variable). That seems pretty horrible to me and adds overhead each time we obtain a lock, which is a pretty performance-critical path. I've not tested what Andres mentioned about simplehash instead of dynahash yet. I did a quick scan of simplehash and it looked like SH_START_ITERATE would suffer the same problems as dynahash's hash_seq_search(), albeit perhaps with some more efficient memory lookups, i.e. it still has to skip over empty buckets, which might be costly in a bloated table. For now, I'm out of ideas. If anyone else feels like suggesting something or picking this up, feel free. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tomas Vondra
Date:
On Wed, Aug 14, 2019 at 07:25:10PM +1200, David Rowley wrote: >On Thu, 25 Jul 2019 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> On the whole, I don't especially like this approach, because of the >> confusion between peak lock count and end-of-xact lock count. That >> seems way too likely to cause problems. > >Thanks for having a look at this. I've not addressed the points >you've mentioned due to what you mention above. The only way I can >think of so far to resolve that would be to add something to track >peak lock usage. The best I can think of to do that, short of adding >something to dynahash.c is to check how many locks are held each time >we obtain a lock, then if that count is higher than the previous time >we checked, then update the maximum locks held, (probably a global >variable). That seems pretty horrible to me and adds overhead each >time we obtain a lock, which is a pretty performance-critical path. > Would it really be a measurable overhead? I mean, we only really need one int counter, and you don't need to do the check on every lock acquisition - you just need to recheck on the first lock release. But maybe I'm underestimating how expensive it is ... Talking about dynahash - doesn't it already track this information? Maybe not directly but surely it has to track the number of entries in the hash table, in order to compute fill factor. Can't we piggy-back on that and track the highest fill-factor for a particular period of time? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
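For what it's worth, the single-counter idea could be as simple as the following sketch. The names are illustrative, and where exactly the hooks would be placed in lock.c (and whether that overhead is acceptable) is precisely the open question:

/* Sketch: a high-water mark for the number of LOCALLOCKs in this backend. */
static long localLocksHeld = 0;
static long localLocksPeak = 0;

static inline void
TrackLocalLockAdded(void)
{
	if (++localLocksHeld > localLocksPeak)
		localLocksPeak = localLocksHeld;
}

static inline void
TrackLocalLockRemoved(void)
{
	localLocksHeld--;
}

/*
 * LockReleaseAll() would consult localLocksPeak (and reset it afterwards)
 * when deciding whether the hash table is worth rebuilding.
 */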
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Alvaro Herrera
Date:
On 2019-Aug-14, David Rowley wrote: > For now, I'm out of ideas. If anyone else feels like suggesting > something of picking this up, feel free. Hmm ... is this patch rejected, or is somebody still trying to get it to committable state? David, you're listed as committer. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com] > Hmm ... is this patch rejected, or is somebody still trying to get it to > committable state? David, you're listed as committer. I don't think it's rejected. It would be a pity (mottainai) to refuse this, because it provides significant speedup despite its simple modification. Again, I think the v2 patch is OK. Tom's comment was as follows: [Tom's comment against v2] ---------------------------------------- FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at <be25cadf-982e-3f01-88b4-443a6667e16a(at)lab(dot)ntt(dot)co(dot)jp> I do measure an undeniable speedup, close to 35%. However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices. If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5% which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code. Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets. I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks. On the whole I don't think there's an adequate case for committing this patch. I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-a-time patches. I'd rather see us do something to let hash_seq_search win across the board. ---------------------------------------- * Extreme test case: Not extreme. Two of our customers, who are considering using PostgreSQL, are using thousands of partitions now. We hit this issue -- a point query gets nearly 20% slower after automatically creating a generic plan. That's the reason for this proposal. * 0.5% slowdown with pgbench: I think it's below the noise, as Tom said. * sizeof(LOCALLOCK): As Andres replied to Tom in the immediately following mail, LOCALLOCK was bigger in PG 11. * Use case is narrow: No. The bloated LockMethodLocalHash affects the performance of the items below as well as transaction commit/abort: - AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice in PREPARE! - LockReleaseSession: advisory lock - LockReleaseCurrentOwner: ?? - LockReassignCurrentOwner: ?? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Alvaro Herrera
Date:
On 2019-Sep-03, Tsunakawa, Takayuki wrote: > From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com] > > Hmm ... is this patch rejected, or is somebody still trying to get it to > > committable state? David, you're listed as committer. > > I don't think it's rejected. It would be a pity (mottainai) to refuse > this, because it provides significant speedup despite its simple > modification. I don't necessarily disagree with your argumentation, but Travis is complaining thusly:
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -Wall -Werror -I../../../../src/include -I/usr/include/x86_64-linux-gnu -D_GNU_SOURCE -c -o lock.o lock.c
lock.c:486:1: error: conflicting types for ‘TryShrinkLocalLockHash’
 TryShrinkLocalLockHash(long numLocksHeld)
 ^
lock.c:351:20: note: previous declaration of ‘TryShrinkLocalLockHash’ was here
 static inline void TryShrinkLocalLockHash(void);
 ^
<builtin>: recipe for target 'lock.o' failed
Please fix. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From
"Tsunakawa, Takayuki"
Date:
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com] > On 2019-Sep-03, Tsunakawa, Takayuki wrote: > > I don't think it's rejected. It would be a pity (mottainai) to refuse > > this, because it provides significant speedup despite its simple > > modification. > > I don't necessarily disagree with your argumentation, but Travis is > complaining thusly: I tried to revise David's latest patch (v8) and address Tom's comments in his last mail. But I'm a bit at a loss. First, to accurately count the maximum number of acquired locks in a transaction, we need to track the maximum entries in the hash table, and make it available via a new function like hash_get_max_entries(). However, to cover the shared partitioned hash table (that is not necessary for LockMethodLocalHash), we must add a spin lock in hashhdr and lock/unlock it when entering and removing entries in the hash table. It spoils the effort to decrease contention via hashhdr->freelists[].mutex. Do we want to track the maximum number of acquired locks in a global variable in lock.c, not in the hash table? Second, I couldn't understand the comment about the fill factor well. I can understand that it's not correct to compare the number of hash buckets and the number of locks. But what can we do? I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's approach is based on the database textbook and seems intuitive. So I attached the rebased version. Regards Takayuki Tsunakawa
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Michael Paquier
Date:
On Thu, Sep 26, 2019 at 07:11:53AM +0000, Tsunakawa, Takayuki wrote: > I'm sorry to repeat what I mentioned in my previous mail, but my v2 > patch's approach is based on the database textbook and seems > intuitive. So I attached the rebased version. If you wish to do so, that's fine by me but I have not dived into the details of the thread much. Please note anyway that the patch does not apply anymore and that it needs a rebase. So for now I have moved the patch to the next CF, waiting on author. -- Michael
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Tomas Vondra
Date:
Hi, the patch has been in WoA since December, waiting for a rebase. I've marked it as returned with feedback. Feel free to re-submit an updated version into the next CF. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Wed, 14 Aug 2019 at 19:25, David Rowley <david.rowley@2ndquadrant.com> wrote: > For now, I'm out of ideas. If anyone else feels like suggesting > something of picking this up, feel free. This is a pretty old thread, so we might need a recap: # Recap Basically LockReleaseAll() becomes slow after a large number of locks have all been held at once by a backend. The problem is that the LockMethodLocalHash dynahash table must grow to store all the locks and when later transactions only take a few locks, LockReleaseAll() is slow due to hash_seq_search() having to skip the sparsely filled buckets in the bloated hash table. The following things were tried on this thread. Each one failed: 1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK. Simply loop over the dlist in LockReleaseAll(). 2) Try dropping and recreating the dynahash table if it becomes bloated using some heuristics to decide what "bloated" means and if recreating is worthwhile. #1 failed due to concerns with increasing the size of LOCALLOCK to store the dlist pointers. Performance regressions were seen too. Possibly due to size increase or additional overhead from pushing onto the dlist. #2 failed because it was difficult to agree on what the heuristics would be and we had no efficient way to determine the maximum number of locks that a given transaction held at any one time. We only know how many were still held at LockReleaseAll(). There were also some suggestions to fix dynahash's hash_seq_search() slowness, and also a suggestion to try using simplehash.h instead of dynahash.c. Unfortunately simplehash.h would suffer the same issues as it too would have to skip over empty buckets in a sparsely populated hash table. I'd like to revive this effort as I have a new idea on how to solve the problem. # Background Over in [1] I'm trying to improve the performance of smgropen() during recovery. The slowness there comes from the dynahash table lookups to find the correct SMgrRelation. Over there I proposed to use simplehash instead of dynahash because it's quite a good bit faster and far lessens the hash lookup overhead during recovery. One problem on that thread is that relcache keeps a pointer into the SMgrRelation (RelationData.rd_smgr) and because simplehash moves things around during inserts and deletes, then we can't have anything point to simplehash entries, they're unstable. I fixed that over on the other thread by having the simplehash entry point to a palloced SMgrRelationData... My problem is, I don't really like that idea as it means we need to palloc() pfree() lots of little chunks of memory. To fix the above, I really think we need a version of simplehash that has stable pointers. Providing that implementation is faster than dynahash, then it will help solve the smgropen() slowness during recovery. # A new hashtable implementation I ended up thinking of this thread again because the implementation of the stable pointer hash that I ended up writing for [1] happens to be lightning fast for hash table sequential scans, even if the table has become bloated. The reason the seq scans are so fast is that the implementation loops over the data arrays, which are tightly packed and store the actual data rather than pointers to the data. The code does not need to loop over the bucket array for this at all, so how large that has become is irrelevant to hash table seq scan performance. The patch stores elements in "segments" which is set to some power of 2 value. 
When we run out of space to store new items in a segment, we just allocate another segment. When we remove items from the table, new items reuse the first unused item in the first segment with free space. This helps to keep the used elements tightly packed. A segment keeps a bitmap of used items so that means scanning all used items is very fast. If you flip the bits in the used_item bitmap, then you get a free list for that segment, so it's also very fast to find a free element when inserting a new item into the table. I've called the hash table implementation "generichash". It uses the same preprocessor tricks as simplehash.h does and borrows the same linear probing code that's used in simplehash. The bucket array just stores the hash value and a uint32 index into the segment item that stores the data. Since segments store a power-of-2 number of items, we can easily address both the segment number and the item within the segment from the single uint32 index value. The 'index' field just stores a special value when the bucket is empty. No need to add another field for that. This means the bucket type is just 8 bytes wide. One thing I will mention about the new hash table implementation is that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means that's the minimum size for the table. I could drop this down to 64, but that's still quite a bit more than the default size of the dynahash table of 16. I think 16 is a bit on the low side and that it might be worth making this 64 anyway. I'd just need to lower GH_ITEMS_PER_SEGMENT. The current code does not allow it to go lower as I've done nothing to allow partial bitmap words; they're 64 bits on a 64-bit machine. I've not done too much benchmarking between it and simplehash.h, but I think in some cases it will be faster. Since the bucket type is just 8 bytes, moving stuff around during insert/deletes will be cheaper than with simplehash. Lookups are likely to be a bit slower due to having to look up the item within the segment, which is a few pointer dereferences away. A quick test shows an improvement when compared to dynahash. # select count(pg_try_advisory_lock(99999,99999)) from generate_series(1,1000000); Master: Time: 190.257 ms Time: 189.440 ms Time: 188.768 ms Patched: Time: 182.779 ms Time: 182.267 ms Time: 186.902 ms This is just hitting the local lock table. The advisory lock key is the same each time, so it remains a lock check. Also, it's a pretty small gain, but I'm mostly trying to show that the new hash table is not any slower than dynahash for probing for inserts. The real wins come from what we're trying to solve in this thread -- the performance of LockReleaseAll(). Benchmarking results measuring the TPS of a simple select from an empty table after another transaction has bloated the locallock table. Master: 127544 tps 113971 tps 123255 tps 121615 tps Patched: 170803 tps 167305 tps 165002 tps 171976 tps About 38% faster. The benchmark I used was: t1.sql: \set p 1 select a from t1 where a = :p hp.sql: select count(*) from hp "hp" is a hash partitioned table with 10k partitions. pgbench -j 20 -c 20 -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@100000 postgres I'm using the query to the hp table to bloat the locallock table. It's only executed every 1 in 100,000 queries. The tps numbers above are the ones for t1.sql. I've not quite looked into why yet, but the hp.sql improved performance by 58%. It went from an average of 1.061377 in master to an average of 1.683616 in the patched version. 
I can't quite see where this gain is coming from. It's pretty hard to get good stable performance results out of this AMD machine, so it might be related to that. That's why I ran 20 threads. It seems slightly better. The machine seems to have trouble waking up properly for a single thread. It would be good if someone could repeat the tests to see if the gains appear on other hardware. Also, it would be good to hear what people think about solving the problem this way. Patch attached. David [1] https://www.postgresql.org/message-id/flat/CAApHDvpkWOGLh_bYg7jproXN8B2g2T9dWDcqsmKsXG5+WwZaqw@mail.gmail.com
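To make the layout easier to picture, here is a simplified sketch of the structures described above. The names follow the patch, but the details are paraphrased from this description rather than copied from the attached code:

#define GH_ITEMS_PER_SEGMENT	256
#define GH_BITMAP_WORDS			(GH_ITEMS_PER_SEGMENT / 64)
#define GH_UNUSED_BUCKET_INDEX	PG_UINT32_MAX	/* marks an empty bucket */

/* translate a uint32 item index into a segment number and offset */
#define GH_INDEX_SEGMENT(i)	((i) / GH_ITEMS_PER_SEGMENT)
#define GH_INDEX_ITEM(i)	((i) % GH_ITEMS_PER_SEGMENT)

typedef struct GH_BUCKET
{
	uint32		hashvalue;		/* hash of the stored key */
	uint32		index;			/* which segment item holds the data, or
								 * GH_UNUSED_BUCKET_INDEX when empty; 8 bytes
								 * per bucket in total */
} GH_BUCKET;

typedef struct GH_SEGMENT
{
	uint32		nitems;							/* number of used items */
	uint64		used_items[GH_BITMAP_WORDS];	/* one bit per used item */
	GH_ELEMENT_TYPE items[GH_ITEMS_PER_SEGMENT];	/* the data, tightly packed */
} GH_SEGMENT;

/*
 * Sequential scans walk the segments and their used_items bitmaps only; the
 * bucket array, however bloated it has become, is never visited.
 */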
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Zhihong Yu
Date:
On Sun, Jun 20, 2021 at 6:56 AM David Rowley <dgrowleyml@gmail.com> wrote:
Hi,
+ * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each
+ * instance of GH_ELEMENT_TYPE which is stored in the hash table is done so
+ * inside a GH_SEGMENT.
I think the second sentence can be written as (since done means stored, it is redundant):
Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a GH_SEGMENT.
+ * Macros for translating a bucket's index into the segment and another to
+ * determine the item number within the segment.
+ */
+#define GH_INDEX_SEGMENT(i) (i) / GH_ITEMS_PER_SEGMENT
into the segment -> into the segment number (in the code I see segidx but I wonder if segment index may cause slight confusion).
+ GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item
+ * in the items array */
'A 1-bit' -> One bit (A and 1 mean the same)
+ uint32 first_free_segment;
Since the segment may not be totally free, maybe name the field first_segment_with_free_slot
+ * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
+ * each GH_BITMAP_WORD.
It seems the only difference from GH_NEXT_ONEBIT is in this line:
+ GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
If a 4th parameter is added to signify whether the flipping should be done, these two functions can be unified.
+ * next insert will store in this segment. If it's already pointing to an
+ * earlier segment, then leave it be.
The last sentence is confusing: first_free_segment cannot point to some segment and an earlier segment at the same time.
Maybe drop the last sentence.
Cheers
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
Thanks for having a look at this. On Mon, 21 Jun 2021 at 05:02, Zhihong Yu <zyu@yugabyte.com> wrote: > + * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each > + * instance of GH_ELEMENT_TYPE which is stored in the hash table is done so > + * inside a GH_SEGMENT. > > I think the second sentence can be written as (since done means stored, it is redundant): I've reworded this entire paragraph slightly. > Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a GH_SEGMENT. > > + * Macros for translating a bucket's index into the segment and another to > + * determine the item number within the segment. > + */ > +#define GH_INDEX_SEGMENT(i) (i) / GH_ITEMS_PER_SEGMENT > > into the segment -> into the segment number (in the code I see segidx but I wonder if segment index may cause slight confusion). I've adjusted this comment. > + GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item > + * in the items array */ > > 'A 1-bit' -> One bit (A and 1 mean the same) I think you might have misread this. We're storing a 1-bit for each used item rather than a 0-bit. If I remove the 'A' then it's not clear what the meaning of each bit's value is. > + uint32 first_free_segment; > > Since the segment may not be totally free, maybe name the field first_segment_with_free_slot I don't really like that. It feels too long to me. > + * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on > + * each GH_BITMAP_WORD. > > It seems the only difference from GH_NEXT_ONEBIT is in this line: > > + GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */ > > If a 4th parameter is added to signify whether the flipping should be done, these two functions can be unified. I don't want to do that. I'd rather have them separate to ensure the compiler does not create any additional needless branching. Those functions are pretty hot. > + * next insert will store in this segment. If it's already pointing to an > + * earlier segment, then leave it be. > > The last sentence is confusing: first_free_segment cannot point to some segment and earlier segment at the same time. > Maybe drop the last sentence. I've adjusted this comment to become: * Check if we need to lower the next_free_segment. We want new inserts * to fill the lower segments first, so only change the first_free_segment * if the removed entry was stored in an earlier segment. Thanks for having a look at this. I'll attach an updated patch soon. David
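For readers without the patch to hand, the shape of the function pair being discussed is roughly the following. This is only a sketch, assuming a 64-bit GH_BITMAP_WORD; pg_rightmost_one_pos64() is from pg_bitutils.h. GH_NEXT_ZEROBIT would be identical except that it complements the word before the bit search, which is why a runtime "flip" flag would put a branch inside a hot loop:

/* Return the index of the next 1-bit after 'prevbit', or -1 if none. */
static inline int32
GH_NEXT_ONEBIT(uint64 *words, uint32 nwords, int32 prevbit)
{
	uint32		wordnum;

	prevbit++;					/* start just after the previous match */

	for (wordnum = prevbit / 64; wordnum < nwords; wordnum++)
	{
		uint64		mask = ~UINT64CONST(0) << (prevbit % 64);
		uint64		word = words[wordnum] & mask;

		if (word != 0)
			return wordnum * 64 + pg_rightmost_one_pos64(word);

		prevbit = (wordnum + 1) * 64;	/* no masking for later words */
	}
	return -1;
}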
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Mon, 21 Jun 2021 at 01:56, David Rowley <dgrowleyml@gmail.com> wrote: > # A new hashtable implementation > Also, it would be good to hear what people think about solving the > problem this way. Because over on [1] I'm also trying to improve the performance of smgropen(), I posted the patch for the new hash table over there too. Between that thread and discussions with Thomas and Andres off-list, I get the idea that pretty much nobody likes the idea of having another hash table implementation. Thomas wants to solve it another way and Andres has concerns that working with bitmaps is too slow. Andres suggested freelists would be faster, but I'm not really a fan of the idea as, unless I have a freelist per array segment, I won't be able to quickly identify the earliest segment slot to re-fill unused slots with. That would mean memory would get more fragmented over time instead of the fragmentation being slowly fixed as new items are added after deletes. So I've not really tried implementing that to see how it performs. Both Andres and Thomas expressed a dislike of the name "generichash" too. Anyway, since I did make a few small changes to the hash table implementation before doing all that off-list talking, I thought I should at least post where I got to, so that anything else that comes up can be compared against that rather than against where I was before. I did end up renaming the hash table to "densehash" rather than generichash. Naming is hard, but I went with dense as memory density was on my mind when I wrote it: it has a compact 8-byte bucket width and packs the data into arrays in a dense way. The word dense came up a few times, so I went with that. I also adjusted the hash seq scan code so that it performs better when faced with a non-sparsely populated table. Previously my benchmark for that case didn't do well [2]. I've attached the benchmark results from running the benchmark that's included in hashbench.tar.bz2. I ran this 10 times using the included test.sh with ./test.sh 10. I included the results I got on my AMD machine in the attached bz2 file in results.csv. You can see from the attached dense_vs_generic_vs_simple.png that dense hash is quite comparable to simplehash for inserts/deletes and lookups. It's not quite as fast as simplehash at iterations when the table is not bloated, but it blows simplehash out of the water when the hash tables have become bloated because they once contained a large number of records but no longer do. Anyway, given the general feedback received on [1], I'm not planning on pushing this any further unless there is some interest in taking the idea further. I'll leave the commitfest entry as is for now to give others a chance to see this. David [1] https://postgr.es/m/CAApHDvowgRaQupC%3DL37iZPUzx1z7-N8deD7TxQSm8LR%2Bf4L3-A%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAApHDvpuzJTQNKQ_bnAccvi-68xuh%2Bv87B4P6ycU-UiN0dqyTg%40mail.gmail.com
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Mon, 12 Jul 2021 at 19:23, David Rowley <dgrowleyml@gmail.com> wrote: > I also adjusted the hash seq scan code so that it performs better when > faced with a non-sparsely populated table. Previously my benchmark for > that case didn't do well [2]. I was running some select-only pgbench tests today on an AMD 3990x machine with a large number of processes. I saw that LockReleaseAll was coming up on the profile a bit at:
Master: 0.77% postgres [.] LockReleaseAll
I wondered if this patch would help, so I tried it and got:
dense hash LockReleaseAll: 0.67% postgres [.] LockReleaseAll
It's a very small improvement, which translated to about a 0.62% gain in tps. It made me think it might be worth doing something about this, since LockReleaseAll can show up even when releasing small numbers of locks.
pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared --random-seed=12345 postgres
Units = tps
Sec   master      dense hash LockReleaseAll   patched/master
10    3758201.2   3713521.5    98.81%
20    3810125.5   3844142.9   100.89%
30    3806505.1   3848458     101.10%
40    3816094.8   3855706.6   101.04%
50    3820317.2   3851717.7   100.82%
60    3827809     3851499.4   100.62%
70    3828757.9   3849312     100.54%
80    3824492.1   3852378.8   100.73%
90    3816502.1   3854793.8   101.00%
100   3819124.1   3860418.6   101.08%
110   3816154.3   3845327.7   100.76%
120   3817070.5   3845842.5   100.75%
130   3815424.7   3847626     100.84%
140   3823631.1   3846760.6   100.60%
150   3820963.8   3840196.6   100.50%
160   3827737     3841149.3   100.35%
170   3827779.2   3840130.9   100.32%
180   3829352     3842814.5   100.35%
190   3825518.3   3841991     100.43%
200   3823477.2   3839390.7   100.42%
210   3809304.3   3836433.5   100.71%
220   3814328.5   3842073.7   100.73%
230   3811399.3   3843780.7   100.85%
avg   3816959.53  3840672.478 100.62%
David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Michael Paquier
Date:
On Tue, Jul 20, 2021 at 05:04:19PM +1200, David Rowley wrote: > On Mon, 12 Jul 2021 at 19:23, David Rowley <dgrowleyml@gmail.com> wrote: > > I also adjusted the hash seq scan code so that it performs better when > > faced a non-sparsely populated table. Previously my benchmark for > > that case didn't do well [2]. > > I was running some select only pgbench tests today on an AMD 3990x > machine with a large number of processes. > > I saw that LockReleaseAll was coming up on the profile a bit at: This last update was two months ago, and the patch has not moved since: https://commitfest.postgresql.org/34/3220/ Do you have plans to work more on that or perhaps the CF entry should be withdrawn or RwF'd? -- Michael
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Yura Sokolov
Date:
I've made some remarks in a related thread: https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1FB976EF@G01JPEXMBYT05

The new status of this patch is: Waiting on Author
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Michael Paquier
Date:
On Fri, Oct 01, 2021 at 04:03:09PM +0900, Michael Paquier wrote: > This last update was two months ago, and the patch has not moved > since: > https://commitfest.postgresql.org/34/3220/ > > Do you have plans to work more on that or perhaps the CF entry should > be withdrawn or RwF'd? Two months later, this has been switched to RwF. -- Michael
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Fri, 3 Dec 2021 at 20:36, Michael Paquier <michael@paquier.xyz> wrote:
> Two months later, this has been switched to RwF.

I was discussing this patch with Andres. He's not very keen on my densehash hash table idea and suggested that instead of relying on trying to make the hash table iteration faster, why don't we just ditch the lossiness of the resource owner code, which only records the first 16 locks, and just make it have a linked list of all locks. This is a little more along the lines of the original patch; however, it does not increase the size of the LOCALLOCK struct.

I've attached a patch which does this. This was mostly written (quickly) by Andres; I just did a cleanup, fixed up a few mistakes and fixed a few bugs.

I ran the same performance tests on this patch as I did back in [1]:

-- Test 1. Select 1 record from a table with 140 partitions. Tests creating a large number of locks with a fast query.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
create index on hp (b);
insert into hp select x,x from generate_series(1, 140000) x;
analyze hp;

select3.sql: select * from hp where b = 1

select3.sql master
drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2099.708748 (without initial connection time)
tps = 2100.398516 (without initial connection time)
tps = 2094.882341 (without initial connection time)
tps = 2113.218820 (without initial connection time)
tps = 2104.717597 (without initial connection time)

select3.sql patched
drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2010.070738 (without initial connection time)
tps = 1994.963606 (without initial connection time)
tps = 1994.668849 (without initial connection time)
tps = 1995.948168 (without initial connection time)
tps = 1985.650945 (without initial connection time)

You can see that there's a performance regression here. I've not yet studied why this happens.

-- Test 2. Tests a prepared query which will build a generic plan on the 6th execution and then fall back on a custom plan due to it pruning all but one partition. Master suffers from the lock table becoming bloated after locking all partitions when planning the generic plan.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

select.sql master
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 18014.460090 (without initial connection time)
tps = 17973.358889 (without initial connection time)
tps = 17847.480647 (without initial connection time)
tps = 18038.332507 (without initial connection time)
tps = 17776.143206 (without initial connection time)

select.sql patched
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 32393.457106 (without initial connection time)
tps = 32277.204349 (without initial connection time)
tps = 32160.719830 (without initial connection time)
tps = 32530.038130 (without initial connection time)
tps = 32299.019657 (without initial connection time)

You can see that there are some quite good performance gains with this test.

I'm going to add this to the January commitfest.
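To give an idea of the shape of the change without opening the patch: each lock/owner pair gets a small struct that's chained into both the LOCALLOCK and the owning ResourceOwner. The following is only a rough sketch; the dlist field names are illustrative rather than necessarily what the attached patch uses:

typedef struct LOCALLOCKOWNER
{
    dlist_node      locallock_node; /* link in LOCALLOCK->locallockowners */
    dlist_node      resowner_node;  /* link in the ResourceOwner's lock list */
    ResourceOwner   owner;          /* owning ResourceOwner, or NULL for a
                                     * session-level lock */
    int64           nLocks;         /* # of times this owner holds the lock */
    struct LOCALLOCK *locallock;    /* the lock this entry refers to */
} LOCALLOCKOWNER;

/* Remembering a lock in the owner is then just an O(1) list push ... */
dlist_push_tail(&owner->locks, &locallockowner->resowner_node);

/* ... and releasing an owner's locks walks only what that owner holds: */
dlist_mutable_iter iter;

dlist_foreach_modify(iter, &owner->locks)
{
    LOCALLOCKOWNER *lockowner = dlist_container(LOCALLOCKOWNER,
                                                resowner_node, iter.cur);

    /* release lockowner->locallock on behalf of this owner */
}

So transaction end no longer needs to iterate over the (possibly bloated) local lock hash table at all.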
David

[1] https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Zhihong Yu
Date:
Hi,
+ locallock->nLocks -= locallockowner->nLocks;
+ Assert(locallock->nLocks >= 0);

I think the assertion is not needed since the above code is inside this if block:

+ if (locallockowner->nLocks < locallock->nLocks)

The condition locallock->nLocks >= 0 would always hold after the subtraction.
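In other words, paraphrasing the surrounding code rather than quoting it exactly:

if (locallockowner->nLocks < locallock->nLocks)
{
    /* this owner held only some of the references to the lock */
    locallock->nLocks -= locallockowner->nLocks;
    /* locallock->nLocks is strictly greater than zero here, so the
     * Assert(locallock->nLocks >= 0) can never fire */
}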
Cheers
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Sat, 1 Jan 2022 at 15:40, Zhihong Yu <zyu@yugabyte.com> wrote:
> + locallock->nLocks -= locallockowner->nLocks;
> + Assert(locallock->nLocks >= 0);
>
> I think the assertion is not needed since the above code is inside this if block:
>
> + if (locallockowner->nLocks < locallock->nLocks)
>
> The condition locallock->nLocks >= 0 would always hold after the subtraction.

That makes sense. I've removed the Assert in the attached patch. Thanks for looking at the patch.

I've also spent a bit more time on the patch. There were quite a few outdated comments remaining. Also, the PROCLOCK releaseMask field appears to no longer be needed.

I also did a round of benchmarking on the attached patch using a very recent master. I've attached the .sql files and the script I used to benchmark. With 1024 partitions, lock1.sql shows about a 4.75% performance increase. This would become larger with more partitions and smaller with fewer partitions. With the patch, lock2.sql shows about a 10% performance increase over master. lock3.sql does not seem to have changed much, and lock4.sql shows a small regression with the patch of about 2.5%.

I'm not quite sure how worried we should be about lock4.sql slowing down slightly. 2.5% is fairly small given how hard I'm exercising the locking code in that test. There's also nothing much to say that the slowdown is not just due to code alignment changes.

I also understand that Amit L is working on another patch that will improve the situation for lock1.sql by not taking the locks on relations that will be run-time pruned at executor startup. I think it's still worth solving this regardless of Amit's patch, as with current master we still have a situation where short fast queries which access a small number of tables can become slower once the backend has obtained a large number of locks concurrently and bloated the local lock table.

As for the patch, I feel it's a pretty invasive change to how we release locks and the resowner code. I'd be quite happy for some review of it.

Here are the full results as output by the attached script.
drowley@amd3990x:~$ echo master
master
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 38078.011433 (without initial connection time)
tps = 38070.016792 (without initial connection time)
tps = 39223.118769 (without initial connection time)
tps = 37510.105287 (without initial connection time)
tps = 38164.394128 (without initial connection time)
lock2.sql
tps = 247.963797 (without initial connection time)
tps = 247.374174 (without initial connection time)
tps = 248.412744 (without initial connection time)
tps = 248.192629 (without initial connection time)
tps = 248.503728 (without initial connection time)
lock3.sql
tps = 1162.937692 (without initial connection time)
tps = 1160.968689 (without initial connection time)
tps = 1166.908643 (without initial connection time)
tps = 1160.288547 (without initial connection time)
tps = 1160.336572 (without initial connection time)
lock4.sql
tps = 282.173560 (without initial connection time)
tps = 284.470330 (without initial connection time)
tps = 286.089644 (without initial connection time)
tps = 285.548487 (without initial connection time)
tps = 284.313505 (without initial connection time)

drowley@amd3990x:~$ echo Patched
Patched
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 40338.975219 (without initial connection time)
tps = 39803.433365 (without initial connection time)
tps = 39504.824194 (without initial connection time)
tps = 39843.422438 (without initial connection time)
tps = 40624.483013 (without initial connection time)
lock2.sql
tps = 274.413309 (without initial connection time)
tps = 271.978813 (without initial connection time)
tps = 275.795091 (without initial connection time)
tps = 273.628649 (without initial connection time)
tps = 273.049977 (without initial connection time)
lock3.sql
tps = 1168.557054 (without initial connection time)
tps = 1168.139469 (without initial connection time)
tps = 1166.366440 (without initial connection time)
tps = 1165.464214 (without initial connection time)
tps = 1167.250809 (without initial connection time)
lock4.sql
tps = 274.842298 (without initial connection time)
tps = 277.911394 (without initial connection time)
tps = 278.702620 (without initial connection time)
tps = 275.715606 (without initial connection time)
tps = 278.816060 (without initial connection time)

David
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Yura Sokolov
Date:
Good day, David.

I'm looking at the patch and there are a few points I don't understand.

`GrantLockLocal` allocates a `LOCALLOCKOWNER` and links it into `locallock->locallockowners`. It links it regardless of the fact that `owner` could be NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`. Why shouldn't that fail?

`GrantLockLocal` allocates the `LOCALLOCKOWNER` in `TopMemoryContext`. But there is only a single `pfree(locallockowner)`, in `LockReassignOwner`. It looks like there should be more `pfree` calls. Shouldn't there be?

`GrantLockLocal` does `dlist_push_tail`, but isn't it better to do `dlist_push_head`? Resource owners usually form a stack, so when an owner searches for itself it is usually the one most recently added to the list. Then `dlist_foreach` would find it sooner if it were added at the head.

regards

---------
Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Wed, 6 Apr 2022 at 03:40, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> I'm looking at the patch and there are a few points I don't understand.
>
> `GrantLockLocal` allocates a `LOCALLOCKOWNER` and links it into
> `locallock->locallockowners`. It links it regardless of the fact that
> `owner` could be NULL. But then `RemoveLocalLock` does
> `Assert(locallockowner->owner != NULL);`. Why shouldn't that fail?
>
> `GrantLockLocal` allocates the `LOCALLOCKOWNER` in `TopMemoryContext`.
> But there is only a single `pfree(locallockowner)`, in `LockReassignOwner`.
> It looks like there should be more `pfree` calls. Shouldn't there be?
>
> `GrantLockLocal` does `dlist_push_tail`, but isn't it better to do
> `dlist_push_head`? Resource owners usually form a stack, so when an
> owner searches for itself it is usually the one most recently added to
> the list. Then `dlist_foreach` would find it sooner if it were added at
> the head.

Thanks for having a look at this. It's a bit unrealistic for me to get to addressing these for v15. I've pushed this one out to the next CF.

David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Jacob Champion
Date:
This entry has been waiting on author input for a while (our current threshold is roughly two weeks), so I've marked it Returned with Feedback. Once you think the patchset is ready for review again, you (or any interested party) can resurrect the patch entry by visiting https://commitfest.postgresql.org/38/3501/ and changing the status to "Needs Review", and then changing the status again to "Move to next CF". (Don't forget the second step; hopefully we will have streamlined this in the near future!) Thanks, --Jacob
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Wed, 3 Aug 2022 at 07:04, Jacob Champion <jchampion@timescale.com> wrote: > This entry has been waiting on author input for a while (our current > threshold is roughly two weeks), so I've marked it Returned with > Feedback. Thanks for taking care of this. You dealt with this correctly based on the fact that I'd failed to rebase before or during the entire July CF. I'm still interested in having the LockReleaseAll slowness fixed, so here's a rebased patch. David
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Ankit Kumar Pandey
Date:
Hi David,

This is a review of the speed-up-releasing-of-locks patch.

Contents & Purpose:
A subject line is missing from the patch. It would have been easier to understand the purpose had it been included. Included in the patch is a change to the README, but no new tests are included.

Initial Run:
The patch applies cleanly to HEAD. The regression tests all pass successfully against the new patch.

Nitpicking & conclusion:
I don't see any performance improvement in my tests. Lots of comments were removed which were not fully replaced. The change of log level for "ReleaseLockIfHeld: failed" from WARNING to PANIC is a mystery. The change in the README doesn't look right: `Any subsequent lockers are share lockers wait waiting for the VXID to terminate via some other method) is for deadlock`. This sentence could be rewritten. Also, more comments could be added to explain the new methods added.

Thanks,
Ankit
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
Thank you for looking at the patch.

On Fri, 4 Nov 2022 at 04:43, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> I don't see any performance improvement in my tests.

Are you able to share what your test was?

In order to see a performance improvement you're likely going to have to obtain a large number of locks in the session so that the local lock table becomes bloated, then continue to run some fast query and observe that LockReleaseAll has become slower as a result of the hash table becoming bloated.

Try running pgbench with -M prepared, running a SELECT on a hash-partitioned table with a good number of partitions to look up a single row. The reason this becomes slow is that the planner will try a generic plan on the 6th execution, which will lock every partition and bloat the local lock table. From then on it will use a custom plan which only locks a single leaf partition.

I just tried the following:

$ pgbench -i --partition-method=hash --partitions=1000 postgres

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 21286.172326 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 23034.063261 (without initial connection time)

If I try again with 10,000 partitions, I get:

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 13044.290903 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 22683.545686 (without initial connection time)

David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
vignesh C
Date:
On Wed, 3 Aug 2022 at 09:04, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 3 Aug 2022 at 07:04, Jacob Champion <jchampion@timescale.com> wrote:
> > This entry has been waiting on author input for a while (our current
> > threshold is roughly two weeks), so I've marked it Returned with
> > Feedback.
>
> Thanks for taking care of this. You dealt with this correctly based on
> the fact that I'd failed to rebase before or during the entire July
> CF.
>
> I'm still interested in having the LockReleaseAll slowness fixed, so
> here's a rebased patch.

CFBot shows some compilation errors as in [1], please post an updated version for the same:

[15:40:00.287] [1239/1809] Linking target src/backend/postgres
[15:40:00.287] FAILED: src/backend/postgres
[15:40:00.287] cc @src/backend/postgres.rsp
[15:40:00.287] /usr/bin/ld: src/backend/postgres_lib.a.p/replication_logical_launcher.c.o: in function `logicalrep_worker_onexit':
[15:40:00.287] /tmp/cirrus-ci-build/build/../src/backend/replication/logical/launcher.c:773: undefined reference to `LockReleaseAll'
[15:40:00.287] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/4562493863886848

Regards,
Vignesh
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Fri, 20 Jan 2023 at 00:26, vignesh C <vignesh21@gmail.com> wrote:
> CFBot shows some compilation errors as in [1], please post an updated
> version for the same:

I've attached a rebased patch.

While reading over this again, I wondered if instead of allocating the memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should create a Slab context as a child of TopMemoryContext and perform the allocations there. I feel like slab might be a better option here as it'll use slightly less memory due to it not rounding up allocations to the next power of 2. sizeof(LOCALLOCKOWNER) == 56, so it's not a great deal of memory, but more than nothing.

The primary reason that I think this might be a good idea is mostly around better handling of chunk on block fragmentation in slab.c than in aset.c. If we have transactions which create a large number of locks then we may end up growing the TopMemoryContext and never releasing the AllocBlocks, just having a high number of 64-byte chunks left on the freelist that'll maybe never be used again. I'm thinking slab.c might handle that better as it'll only keep around 10 completely empty SlabBlocks before it'll start free'ing them. The slab allocator is quite a bit faster now as a result of d21ded75f.

I would like to get this LockReleaseAll problem finally fixed in PG16, but I'd feel much better about this patch if it had some review from someone who has more in-depth knowledge of the locking code.

I've also gone and adjusted all the places that upgraded the elog(WARNING)s of local lock table corruption to PANIC and put them back to use WARNING again. While I think it might be a good idea to do that, it seems to be adding a bit more resistance to this patch which I don't think it really needs. Maybe we can consider that in a separate effort.

David
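PS: in code form, what I mean is roughly the following. This is just a sketch; the context variable name is made up, and the real code would live in lock.c:

/* created lazily, the first time a LOCALLOCKOWNER is needed */
static MemoryContext LockOwnerContext = NULL;

if (LockOwnerContext == NULL)
    LockOwnerContext = SlabContextCreate(TopMemoryContext,
                                         "LOCALLOCKOWNER objects",
                                         SLAB_DEFAULT_BLOCK_SIZE,
                                         sizeof(LOCALLOCKOWNER));

/* then in GrantLockLocal(), instead of allocating in TopMemoryContext: */
locallockowner = MemoryContextAlloc(LockOwnerContext,
                                    sizeof(LOCALLOCKOWNER));

/* pfree() returns the chunk to the slab, and slab.c gives entire blocks
 * back once enough of them become completely empty, so a one-off burst of
 * lock-heavy transactions doesn't leave the memory pinned forever. */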
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Amit Langote
Date:
Hi David,

On Tue, Jan 24, 2023 at 12:58 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 20 Jan 2023 at 00:26, vignesh C <vignesh21@gmail.com> wrote:
> > CFBot shows some compilation errors as in [1], please post an updated
> > version for the same:
>
> I've attached a rebased patch.

Thanks for the new patch.

Maybe you're planning to do it once this patch is past the PoC phase (isn't it?), but it would be helpful to have commentary on all the new dlist fields. Especially, I think the following warrants a bit more explanation than the others:

-   /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
-   int         nlocks;         /* number of owned locks */
-   LOCALLOCK  *locks[MAX_RESOWNER_LOCKS]; /* list of owned locks */
+   dlist_head  locks;          /* dlist of owned locks */

This seems to be replacing what is a cache with an upper limit on the number of cached locks with something that has no limit on how many per-owner locks are remembered. I wonder whether we'd be doing additional work in some cases with the new no-limit implementation that wasn't being done before (where the owner's locks array is overflowed), or maybe not much, because the new implementation of ResourceOwner{Remember|Forget}Lock() is a simple push/delete of a dlist node from the owner's dlist?

The following comment is now obsolete:

/*
 * LockReassignCurrentOwner
 *      Reassign all locks belonging to CurrentResourceOwner to belong
 *      to its parent resource owner.
 *
 * If the caller knows what those locks are, it can pass them as an array.
 * That speeds up the call significantly, when a lot of locks are held
 * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
 * and we'll traverse through our hash table to find them.
 */

--
Thanks,
Amit Langote
EDB: http://www.enterprisedb.com
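PS: for context, master's ResourceOwnerRememberLock() is roughly the following (a paraphrase from memory, not the exact code), so the fast path being replaced is already trivial; the interesting case is what happens at release time once the array has overflowed:

void
ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock)
{
    if (owner->nlocks > MAX_RESOWNER_LOCKS)
        return;                 /* already overflowed; releasing or
                                 * reassigning then has to scan the whole
                                 * local lock hash table instead */
    if (owner->nlocks < MAX_RESOWNER_LOCKS)
        owner->locks[owner->nlocks] = locallock;
    owner->nlocks++;
}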
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Andres Freund
Date:
Hi, On 2023-01-24 16:57:37 +1300, David Rowley wrote: > I've attached a rebased patch. Looks like there's some issue causing tests to fail probabilistically: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3501 Several failures are when testing a 32bit build. > While reading over this again, I wondered if instead of allocating the > memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should > create a Slab context as a child of TopMemoryContext and perform the > allocations there. Yes, that does make sense. > I would like to get this LockReleaseAll problem finally fixed in PG16, > but I'd feel much better about this patch if it had some review from > someone who has more in-depth knowledge of the locking code. I feel my review wouldn't be independent, but I'll give it a shot if nobody else does. Greetings, Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
Thanks for having a look at this.

On Wed, 1 Feb 2023 at 03:07, Amit Langote <amitlangote09@gmail.com> wrote:
> Maybe you're planning to do it once this patch is past the PoC phase
> (isn't it?), but it would be helpful to have commentary on all the new
> dlist fields.

I've added comments on the new fields. Maybe we can say the patch is "wip".

> This seems to be replacing what is a cache with an upper limit on the
> number of cached locks with something that has no limit on how many
> per-owner locks are remembered. I wonder whether we'd be doing
> additional work in some cases with the new no-limit implementation
> that wasn't being done before (where the owner's locks array is
> overflowed), or maybe not much, because the new implementation of
> ResourceOwner{Remember|Forget}Lock() is a simple push/delete of a dlist
> node from the owner's dlist?

It's a good question. The problem is I don't really have a good test to find out: we'd need to benchmark taking fewer than 16 locks. On trying that, I find that there's just too much variability in the performance between runs to determine if there's any slowdown.

$ cat 10_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10) x;

$ pgbench -f 10_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps)"
tps = 47809.306088 (without initial connection time)
tps = 66859.789072 (without initial connection time)
tps = 37885.924616 (without initial connection time)

On trying with more locks, I see there are good wins from the patched version.

$ cat 100_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,100) x;

$ cat 1k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,1000) x;

$ cat 10k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10000) x;

Test 1: Take 100 locks but periodically take 10k locks to bloat the local lock table.

master:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 2726.197037 (without initial connection time)
SQL script 1: 100_locks.sql
 - 27219 transactions (99.9% of total, tps = 2722.496227)
SQL script 2: 10k_locks.sql
 - 37 transactions (0.1% of total, tps = 3.700810)

patched:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34047.297822 (without initial connection time)
SQL script 1: 100_locks.sql
 - 340039 transactions (99.9% of total, tps = 34012.688879)
SQL script 2: 10k_locks.sql
 - 346 transactions (0.1% of total, tps = 34.608943)

patched without slab context:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34851.770846 (without initial connection time)
SQL script 1: 100_locks.sql
 - 348097 transactions (99.9% of total, tps = 34818.662324)
SQL script 2: 10k_locks.sql
 - 331 transactions (0.1% of total, tps = 33.108522)

Test 2: Always take just 100 locks and don't bloat the local lock table.

master:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 32682.491548 (without initial connection time)

patched:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 35637.241815 (without initial connection time)

patched without slab context:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 36192.185181 (without initial connection time)

The attached 0003 patch is an experiment to see if using a slab memory context has any advantages for storing the LOCALLOCKOWNER structs. There seems to be a small performance hit from doing this.

> The following comment is now obsolete:
>
> /*
>  * LockReassignCurrentOwner
>  *      Reassign all locks belonging to CurrentResourceOwner to belong
>  *      to its parent resource owner.
>  *
>  * If the caller knows what those locks are, it can pass them as an array.
>  * That speeds up the call significantly, when a lot of locks are held
>  * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
>  * and we'll traverse through our hash table to find them.
>  */

I've removed the obsolete part.

I've attached another set of patches. I do need to spend longer looking at this. I'm mainly attaching these as CI seems to be highlighting a problem that I'm unable to recreate locally and I wanted to see if the attached fixes it.

David
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Heikki Linnakangas
Date:
On 10/02/2023 04:51, David Rowley wrote:
> I've attached another set of patches. I do need to spend longer
> looking at this. I'm mainly attaching these as CI seems to be
> highlighting a problem that I'm unable to recreate locally and I
> wanted to see if the attached fixes it.

I like this patch's approach.

> index 296dc82d2ee..edb8b6026e5 100644
> --- a/src/backend/commands/discard.c
> +++ b/src/backend/commands/discard.c
> @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
>  	Async_UnlistenAll();
> -	LockReleaseAll(USER_LOCKMETHOD, true);
> +	LockReleaseSession(USER_LOCKMETHOD);
>  	ResetPlanCache();

This assumes that there are no transaction-level advisory locks. I think that's OK. It took me a while to convince myself of that, though. I think we need a high level comment somewhere that explains what assumptions we make on which locks can be held in session mode and which in transaction mode.

> @@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
>  		Assert(lock->nGranted <= lock->nRequested);
>  		Assert((proclock->holdMask & ~lock->grantMask) == 0);
>
> -		/* Ignore it if nothing to release (must be a session lock) */
> -		if (proclock->releaseMask == 0)
> -			continue;
> -
> -		/* Else we should be releasing all locks */
> -		if (proclock->releaseMask != proclock->holdMask)
> -			elog(PANIC, "we seem to have dropped a bit somewhere");
> -
>  		/*
>  		 * We cannot simply modify proclock->tag.myProc to reassign
>  		 * ownership of the lock, because that's part of the hash key and

This looks wrong. If you prepare a transaction that is holding any session locks, we will now transfer them to the prepared transaction. And its locallock entry will be out of sync. To fix, I think we could keep around the hash table that CheckForSessionAndXactLocks() builds, and use that here.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
Thank you for having a look at this. Apologies for not getting back to you sooner. On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 10/02/2023 04:51, David Rowley wrote: > > I've attached another set of patches. I do need to spend longer > > looking at this. I'm mainly attaching these as CI seems to be > > highlighting a problem that I'm unable to recreate locally and I > > wanted to see if the attached fixes it. > > I like this patch's approach. > > > index 296dc82d2ee..edb8b6026e5 100644 > > --- a/src/backend/commands/discard.c > > +++ b/src/backend/commands/discard.c > > @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel) > > Async_UnlistenAll(); > > - LockReleaseAll(USER_LOCKMETHOD, true); > > + LockReleaseSession(USER_LOCKMETHOD); > > ResetPlanCache(); > > This assumes that there are no transaction-level advisory locks. I think > that's OK. It took me a while to convince myself of that, though. I > think we need a high level comment somewhere that explains what > assumptions we make on which locks can be held in session mode and which > in transaction mode. Isn't it ok because DISCARD ALL cannot run inside a transaction block, so there should be no locks taken apart from possibly session-level locks? I've added a call to LockAssertNoneHeld(false) in there. > > @@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid) > > Assert(lock->nGranted <= lock->nRequested); > > Assert((proclock->holdMask & ~lock->grantMask) == 0); > > > > - /* Ignore it if nothing to release (must be a session lock) */ > > - if (proclock->releaseMask == 0) > > - continue; > > - > > - /* Else we should be releasing all locks */ > > - if (proclock->releaseMask != proclock->holdMask) > > - elog(PANIC, "we seem to have dropped a bit somewhere"); > > - > > /* > > * We cannot simply modify proclock->tag.myProc to reassign > > * ownership of the lock, because that's part of the hash key and > > This looks wrong. If you prepare a transaction that is holding any > session locks, we will now transfer them to the prepared transaction. > And its locallock entry will be out of sync. To fix, I think we could > keep around the hash table that CheckForSessionAndXactLocks() builds, > and use that here. Good catch. I've modified the patch to keep the hash table built in CheckForSessionAndXactLocks around for longer so that we can check for session locks. I've attached an updated patch mainly to get CI checking this. I suspect something is wrong as subscription/015_stream is timing out. I've not gotten to the bottom of that yet. David
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Heikki Linnakangas
Date:
On 11/09/2023 15:00, David Rowley wrote: > On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> >>> index 296dc82d2ee..edb8b6026e5 100644 >>> --- a/src/backend/commands/discard.c >>> +++ b/src/backend/commands/discard.c >>> @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel) >>> Async_UnlistenAll(); >>> - LockReleaseAll(USER_LOCKMETHOD, true); >>> + LockReleaseSession(USER_LOCKMETHOD); >>> ResetPlanCache(); >> >> This assumes that there are no transaction-level advisory locks. I think >> that's OK. It took me a while to convince myself of that, though. I >> think we need a high level comment somewhere that explains what >> assumptions we make on which locks can be held in session mode and which >> in transaction mode. > > Isn't it ok because DISCARD ALL cannot run inside a transaction block, > so there should be no locks taken apart from possibly session-level > locks? Hmm, sounds valid. I think I convinced myself that it's OK through some other reasoning, but I don't remember it now. > I've added a call to LockAssertNoneHeld(false) in there. I don't see it in the patch? -- Heikki Linnakangas Neon (https://neon.tech)
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
David Rowley
Date:
On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I've added a call to LockAssertNoneHeld(false) in there.
>
> I don't see it in the patch?

Hmm. I must've run git format-patch before committing that part.

I'll try that again... see attached.

David
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
Heikki Linnakangas
Date:
On 18/09/2023 07:08, David Rowley wrote: > On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> I've added a call to LockAssertNoneHeld(false) in there. >> >> I don't see it in the patch? > > hmm. I must've git format-patch before committing that part. > > I'll try that again... see attached. This needed a rebase after my ResourceOwner refactoring. Attached. A few quick comments: - It would be nice to add a test for the issue that you fixed in patch v7, i.e. if you prepare a transaction while holding session-level locks. - GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you are out of memory. Is that handled gracefully or is the lock leaked? -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
Re: Speed up transaction completion faster after many relations are accessed in a transaction
From
vignesh C
Date:
On Thu, 9 Nov 2023 at 21:48, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 18/09/2023 07:08, David Rowley wrote: > > On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >>> I've added a call to LockAssertNoneHeld(false) in there. > >> > >> I don't see it in the patch? > > > > hmm. I must've git format-patch before committing that part. > > > > I'll try that again... see attached. > > This needed a rebase after my ResourceOwner refactoring. Attached. > > A few quick comments: > > - It would be nice to add a test for the issue that you fixed in patch > v7, i.e. if you prepare a transaction while holding session-level locks. > > - GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you > are out of memory. Is that handled gracefully or is the lock leaked? CFBot shows one of the test has aborted at [1] with: [20:54:28.535] Core was generated by `postgres: subscriber: logical replication apply worker for subscription 16397 '. [20:54:28.535] Program terminated with signal SIGABRT, Aborted. [20:54:28.535] #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 [20:54:28.535] Download failed: Invalid argument. Continuing without source file ./signal/../sysdeps/unix/sysv/linux/raise.c. [20:54:28.627] [20:54:28.627] Thread 1 (Thread 0x7f0ea02d1a40 (LWP 50984)): [20:54:28.627] #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 ... ... [20:54:28.627] #2 0x00005618e989d62f in ExceptionalCondition (conditionName=conditionName@entry=0x5618e9b40f70 "dlist_is_empty(&(MyProc->myProcLocks[i]))", fileName=fileName@entry=0x5618e9b40ec0 "../src/backend/storage/lmgr/proc.c", lineNumber=lineNumber@entry=856) at ../src/backend/utils/error/assert.c:66 [20:54:28.627] No locals. [20:54:28.627] #3 0x00005618e95e6847 in ProcKill (code=<optimized out>, arg=<optimized out>) at ../src/backend/storage/lmgr/proc.c:856 [20:54:28.627] i = <optimized out> [20:54:28.627] proc = <optimized out> [20:54:28.627] procgloballist = <optimized out> [20:54:28.627] __func__ = "ProcKill" [20:54:28.627] #4 0x00005618e959ebcc in shmem_exit (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:276 [20:54:28.627] __func__ = "shmem_exit" [20:54:28.627] #5 0x00005618e959ecd0 in proc_exit_prepare (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:198 [20:54:28.627] __func__ = "proc_exit_prepare" [20:54:28.627] #6 0x00005618e959ee8e in proc_exit (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:111 [20:54:28.627] __func__ = "proc_exit" [20:54:28.627] #7 0x00005618e94aa54d in BackgroundWorkerMain () at ../src/backend/postmaster/bgworker.c:805 [20:54:28.627] local_sigjmp_buf = {{__jmpbuf = {94665009627112, -3865857745677845768, 0, 0, 140732736634980, 1, 3865354362587970296, 7379258256398875384}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964099, 94665025527920, 94665025527920, 94665025527920, 0, 94665025528120, 8192, 1, 94664997686410, 94665009627040, 94664997622076, 94665025527920, 1, 0, 0, 140732736634980}}}} [20:54:28.627] worker = 0x5618eb37c570 [20:54:28.627] entrypt = <optimized out> [20:54:28.627] __func__ = "BackgroundWorkerMain" [20:54:28.627] #8 0x00005618e94b495c in do_start_bgworker (rw=rw@entry=0x5618eb3b73c8) at ../src/backend/postmaster/postmaster.c:5697 [20:54:28.627] worker_pid = <optimized out> [20:54:28.627] __func__ = "do_start_bgworker" [20:54:28.627] #9 0x00005618e94b4c32 in maybe_start_bgworkers () at ../src/backend/postmaster/postmaster.c:5921 [20:54:28.627] rw = 0x5618eb3b73c8 [20:54:28.627] num_launched = 0 
[20:54:28.627] now = 0 [20:54:28.627] iter = {cur = 0x5618eb3b79a8, next = 0x5618eb382a20, prev = 0x5618ea44a980 <BackgroundWorkerList>} [20:54:28.627] #10 0x00005618e94b574a in process_pm_pmsignal () at ../src/backend/postmaster/postmaster.c:5073 [20:54:28.627] __func__ = "process_pm_pmsignal" [20:54:28.627] #11 0x00005618e94b5f4a in ServerLoop () at ../src/backend/postmaster/postmaster.c:1760 [1] - https://cirrus-ci.com/task/5118173163290624?logs=cores#L51 Regards, Vignesh