Thread: Speed up transaction completion faster after many relations are accessed in a transaction

Speed up transaction completion faster after many relations are accessed in a transaction

From: "Tsunakawa, Takayuki"
Hello,

The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same
session.

The transaction records its own acquired lock information in LOCALLOCK structures (a pair of lock object and lock
mode).  It stores the LOCALLOCKs in a hash table in its local memory.  The hash table automatically expands when the
transaction acquires locks on many relations, and it doesn't shrink.  When the transaction commits or aborts, it scans
the hash table to find the LOCALLOCKs whose locks need to be released.
 

The problem is that once some transaction accesses many relations, even subsequent transactions in the same session
that only access a few relations take an unreasonably long time to complete, because they still have to scan the
expanded hash table.

The attached patch links LOCALLOCKs to PGPROC, so that releasing locks only needs to scan that list instead of the hash
table.  The hash table is kept because some functions want to find a particular LOCALLOCK quickly based on its hash
value.
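To illustrate, here is a minimal sketch of that idea, using PostgreSQL's dlist API from lib/ilist.h.  The member name
procLink, the list header, and the helper functions are made up for the illustration and are not the patch's actual
identifiers; the real LOCALLOCK members other than the link are elided:

#include "lib/ilist.h"

typedef struct LOCALLOCK
{
    /* ... existing members: tag, hashcode, lock, proclock, nLocks, ... */
    dlist_node  procLink;       /* hypothetical link chaining held LOCALLOCKs */
} LOCALLOCK;

static dlist_head SessionLocalLocks = DLIST_STATIC_INIT(SessionLocalLocks);

/* called when a LOCALLOCK hash entry is created */
static void
remember_locallock(LOCALLOCK *locallock)
{
    dlist_push_tail(&SessionLocalLocks, &locallock->procLink);
}

/* at transaction end: visit only the locks actually held */
static void
release_all_locallocks(void)
{
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &SessionLocalLocks)
    {
        LOCALLOCK  *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);

        dlist_delete(&locallock->procLink);
        /* ... release the lock itself and remove the hash table entry ... */
    }
}

The release loop then touches only entries that were actually acquired, however large the hash table has grown.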

This problem was uncovered while evaluating partitioning performance.  When the application PREPAREs a statement once
and then runs EXECUTE-COMMIT repeatedly, the server creates a generic plan on the 6th EXECUTE.  Unfortunately, creating
the generic plan for UPDATE/DELETE currently accesses all partitions of the target table (this itself needs improvement),
expanding the LOCALLOCK hash table.  As a result, the 7th and later EXECUTEs get slower.
 

Imai-san confirmed performance improvement with this patch:

https://commitfest.postgresql.org/22/1993/


Regards
Takayuki Tsunakawa


"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same
session.

Hm.  Putting a list header for a purely-local data structure into shared
memory seems quite ugly.  Isn't there a better place to keep that?

Do we really want a dlist here at all?  I'm concerned that bloating
LOCALLOCK will cost us when there are many locks involved.  This patch
increases the size of LOCALLOCK by 25% if I counted right, which does
not seem like a negligible penalty.

My own thought about how to improve this situation was just to destroy
and recreate LockMethodLocalHash at transaction end (or start)
if its size exceeded $some-value.  Leaving it permanently bloated seems
like possibly a bad idea, even if we get rid of all the hash_seq_searches
on it.
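As a purely illustrative sketch of that idea (the threshold, the counter, and the function are placeholders, not a
concrete proposal; the hash_create() arguments roughly mirror how lock.c sets up the table):

#define LOCAL_LOCK_HASH_RESET_THRESHOLD 64      /* placeholder for $some-value */

static long peak_locallocks_this_xact = 0;      /* maintained where locks are acquired */

/* called once per transaction, after all local locks have been released */
static void
maybe_reset_local_lock_hash(void)
{
    if (peak_locallocks_this_xact > LOCAL_LOCK_HASH_RESET_THRESHOLD)
    {
        HASHCTL     info;

        /* throw away the bloated table and start over with a small one */
        hash_destroy(LockMethodLocalHash);

        memset(&info, 0, sizeof(info));
        info.keysize = sizeof(LOCALLOCKTAG);
        info.entrysize = sizeof(LOCALLOCK);
        LockMethodLocalHash = hash_create("LOCALLOCK hash", 16, &info,
                                          HASH_ELEM | HASH_BLOBS);
    }
    peak_locallocks_this_xact = 0;
}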

            regards, tom lane


On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My own thought about how to improve this situation was just to destroy
> and recreate LockMethodLocalHash at transaction end (or start)
> if its size exceeded $some-value.  Leaving it permanently bloated seems
> like possibly a bad idea, even if we get rid of all the hash_seq_searches
> on it.

That seems like a good idea. Although, it would be good to know that
it didn't add too much overhead dropping and recreating the table when
every transaction happened to obtain more locks than $some-value.  If
it did, then maybe we could track the average locks per transaction of recent
transactions and just ditch the table after the locks are released if
the locks held by the last transaction exceeded the average *
1.something. No need to go near shared memory to do that.
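One illustrative way such a heuristic could be coded (names and constants are placeholders):

static double avg_locks_per_xact = 0.0;     /* cheap exponential moving average */

/* called at transaction end with the number of local locks just released */
static bool
should_ditch_local_lock_table(long locks_this_xact)
{
    bool    ditch;

    /* weight the most recent transaction at 10% */
    avg_locks_per_xact = 0.9 * avg_locks_per_xact + 0.1 * (double) locks_this_xact;

    /* "average * 1.something": only react to unusually lock-heavy transactions */
    ditch = (double) locks_this_xact > avg_locks_per_xact * 1.5;

    return ditch;
}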

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Hi,

On 2019-02-19 12:52:08 +1300, David Rowley wrote:
> On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > My own thought about how to improve this situation was just to destroy
> > and recreate LockMethodLocalHash at transaction end (or start)
> > if its size exceeded $some-value.  Leaving it permanently bloated seems
> > like possibly a bad idea, even if we get rid of all the hash_seq_searches
> > on it.
> 
> That seems like a good idea. Although, it would be good to know that
> it didn't add too much overhead dropping and recreating the table when
> every transaction happened to obtain more locks than $some-value.  If
> it did, then maybe we could track the average locks per transaction of recent
> transactions and just ditch the table after the locks are released if
> the locks held by the last transaction exceeded the average *
> 1.something. No need to go near shared memory to do that.

Isn't a large portion of benefits in this patch going to be mooted by
the locking improvements discussed in the other threads? I.e. there's
hopefully not going to be a ton of cases with low overhead where we
acquire a lot of locks and release them very soon after. Sure, for DDL
etc we will, but I can't see this mattering from a performance POV?

I'm not against doing something like Tom proposes, but heuristics with
magic constants like this tend to age poorly / are hard to tune well
across systems.

Greetings,

Andres Freund


David Rowley <david.rowley@2ndquadrant.com> writes:
> On Tue, 19 Feb 2019 at 12:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My own thought about how to improve this situation was just to destroy
>> and recreate LockMethodLocalHash at transaction end (or start)
>> if its size exceeded $some-value.  Leaving it permanently bloated seems
>> like possibly a bad idea, even if we get rid of all the hash_seq_searches
>> on it.

> That seems like a good idea. Although, it would be good to know that
> it didn't add too much overhead dropping and recreating the table when
> every transaction happened to obtain more locks than $some-value.  If
> it did, then maybe we could track the average locks per transaction of recent
> transactions and just ditch the table after the locks are released if
> the locks held by the last transaction exceeded the average *
> 1.something. No need to go near shared memory to do that.

Yeah, I'd deliberately avoided saying how we'd choose $some-value ;-).
Making it adaptive might not be a bad plan.

            regards, tom lane


Andres Freund <andres@anarazel.de> writes:
> Isn't a large portion of benefits in this patch going to be mooted by
> the locking improvements discussed in the other threads? I.e. there's
> hopefully not going to be a ton of cases with low overhead where we
> acquire a lot of locks and release them very soon after. Sure, for DDL
> etc we will, but I can't see this mattering from a performance POV?

Mmm ... AIUI, the patches currently proposed can only help for what
David called "point lookup" queries.  There are still going to be
queries that scan a large proportion of a partition tree, so if you've
got tons of partitions, you'll be concerned about this sort of thing.

> I'm not against doing something like Tom proposes, but heuristics with
> magic constants like this tend to age poorly / are hard to tune well
> across systems.

I didn't say it had to be a constant ...

            regards, tom lane


On Tue, 19 Feb 2019 at 12:56, Andres Freund <andres@anarazel.de> wrote:
> Isn't a large portion of benefits in this patch going to be mooted by
> the locking improvements discussed in the other threads? I.e. there's
> hopefully not going to be a ton of cases with low overhead where we
> acquire a lot of locks and release them very soon after. Sure, for DDL
> etc we will, but I can't see this mattering from a performance POV?

I think this patch was born from Amit's partition planner improvement
patch. If not that one, which other threads did you have in mind?

A problem exists where, when using a PREPAREd statement to query a
partitioned table containing many partitions, a generic plan will
never be favoured over a custom plan, since the generic plan might
not be able to prune partitions like the custom plan can.  The actual
problem is that at some point we do need to generate a generic plan
in order to know that it's more costly, and that requires locking
possibly every partition.  When plan_cache_mode = auto, this is done
on the 6th execution of the statement.  After Amit's partition planner
changes [1], the custom plan will only lock partitions that are not
pruned, so the 6th execution of the statement bloats the local lock
table.

[1] https://commitfest.postgresql.org/22/1778/

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Hi,

On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Isn't a large portion of benefits in this patch going to be mooted by
> > the locking improvements discussed in the other threads? I.e. there's
> > hopefully not going to be a ton of cases with low overhead where we
> > acquire a lot of locks and release them very soon after. Sure, for DDL
> > etc we will, but I can't see this mattering from a performance POV?
> 
> Mmm ... AIUI, the patches currently proposed can only help for what
> David called "point lookup" queries.  There are still going to be
> queries that scan a large proportion of a partition tree, so if you've
> got tons of partitions, you'll be concerned about this sort of thing.

Agreed - but it seems not unlikely that for those the rest of the
planner / executor overhead will entirely swamp any improvement we could
make here. If I understand correctly the benchmarks here were made with
"point" update and select queries, although the reference in the first
post in this thread is a bit vague.


> > I'm not against doing something like Tom proposes, but heuristics with
> > magic constants like this tend to age poorly / are hard to tune well
> > across systems.
> 
> I didn't say it had to be a constant ...

Do you have a good idea?

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
>> Mmm ... AIUI, the patches currently proposed can only help for what
>> David called "point lookup" queries.  There are still going to be
>> queries that scan a large proportion of a partition tree, so if you've
>> got tons of partitions, you'll be concerned about this sort of thing.

> Agreed - but it seems not unlikely that for those the rest of the
> planner / executor overhead will entirely swamp any improvement we could
> make here. If I understand correctly the benchmarks here were made with
> "point" update and select queries, although the reference in the first
> post in this thread is a bit vague.

I think what Maumau-san is on about here is that not only does your
$giant-query take a long time, but it has a permanent negative effect
on all subsequent transactions in the session.  That seems worth
doing something about.

>> I didn't say it had to be a constant ...

> Do you have a good idea?

I think David's on the right track --- keep some kind of moving average of
the LOCALLOCK table size for each transaction, and nuke it if it exceeds
some multiple of the recent average.  Not sure offhand about how to get
the data cheaply --- it might not be sufficient to look at transaction
end, if we release LOCALLOCK entries before that (but do we?)

            regards, tom lane


Hi,

On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
> "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> > The attached patch speeds up transaction completion when any prior transaction accessed many relations in the same
session.
> 
> Hm.  Putting a list header for a purely-local data structure into shared
> memory seems quite ugly.  Isn't there a better place to keep that?

Yea, I think it'd be just as fine to store that in a static
variable (best defined directly besides LockMethodLocalHash).

(Btw, I'd be entirely unsurprised if moving away from a dynahash for
LockMethodLocalHash would be beneficial)


> Do we really want a dlist here at all?  I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved.  This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

It's currently

struct LOCALLOCK {
        LOCALLOCKTAG               tag;                  /*     0    20 */

        /* XXX 4 bytes hole, try to pack */

        LOCK *                     lock;                 /*    24     8 */
        PROCLOCK *                 proclock;             /*    32     8 */
        uint32                     hashcode;             /*    40     4 */

        /* XXX 4 bytes hole, try to pack */

        int64                      nLocks;               /*    48     8 */
        _Bool                      holdsStrongLockCount; /*    56     1 */
        _Bool                      lockCleared;          /*    57     1 */

        /* XXX 2 bytes hole, try to pack */

        int                        numLockOwners;        /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        int                        maxLockOwners;        /*    64     4 */

        /* XXX 4 bytes hole, try to pack */

        LOCALLOCKOWNER *           lockOwners;           /*    72     8 */

        /* size: 80, cachelines: 2, members: 10 */
        /* sum members: 66, holes: 4, sum holes: 14 */
        /* last cacheline: 16 bytes */
};

seems we could trivially squeeze most of the bytes for a dlist node out
of padding.


> My own thought about how to improve this situation was just to destroy
> and recreate LockMethodLocalHash at transaction end (or start)
> if its size exceeded $some-value.  Leaving it permanently bloated seems
> like possibly a bad idea, even if we get rid of all the hash_seq_searches
> on it.

OTOH, that'll force constant incremental resizing of the hashtable, for
workloads that regularly need a lot of locks. And I'd assume in most
cases if one transaction needs a lot of locks it's quite likely that
future ones will need a lot of locks, too.

Greetings,

Andres Freund


Hi,

On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
> >> Mmm ... AIUI, the patches currently proposed can only help for what
> >> David called "point lookup" queries.  There are still going to be
> >> queries that scan a large proportion of a partition tree, so if you've
> >> got tons of partitions, you'll be concerned about this sort of thing.
> 
> > Agreed - but it seems not unlikely that for those the rest of the
> > planner / executor overhead will entirely swamp any improvement we could
> > make here. If I understand correctly the benchmarks here were made with
> > "point" update and select queries, although the reference in the first
> > post in this thread is a bit vague.
> 
> I think what Maumau-san is on about here is that not only does your
> $giant-query take a long time, but it has a permanent negative effect
> on all subsequent transactions in the session.  That seems worth
> doing something about.

Ah, yes, that makes sense.  I'm inclined to think however that the
original approach presented in this thread is better than the
reset-the-whole-hashtable approach. Because:


> I think David's on the right track --- keep some kind of moving average of
> the LOCALLOCK table size for each transaction, and nuke it if it exceeds
> some multiple of the recent average.  Not sure offhand about how to get
> the data cheaply --- it might not be sufficient to look at transaction
> end, if we release LOCALLOCK entries before that (but do we?)

Seems too complicated for my taste. And it doesn't solve the issue of
having some transactions with few locks (say because the plan can be
nicely pruned) interspersed with transactions where a lot of locks are
acquired.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2019-02-18 18:42:32 -0500, Tom Lane wrote:
>> Do we really want a dlist here at all?  I'm concerned that bloating
>> LOCALLOCK will cost us when there are many locks involved.  This patch
>> increases the size of LOCALLOCK by 25% if I counted right, which does
>> not seem like a negligible penalty.

> It's currently [ 80 bytes with several padding holes ]
> seems we could trivially squeeze most of the bytes for a dlist node out
> of padding.

Yeah, but if we want to rearrange the members into an illogical order
to save some space, we should do that independently of this patch ---
and then the overhead of this patch would be even worse than 25%.

            regards, tom lane


Hi,

On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
> Yeah, but if we want to rearrange the members into an illogical order
> to save some space, we should do that independently of this patch ---

Sure, we should do that. I don't buy the "illogical" bit, just moving
hashcode up to after tag isn't more or less logical, and saves most of
the padding, and moving the booleans to the end isn't better/worse
either.

You always bring up that argument. While I agree that sometimes the most
optimal ordering can be less natural, I think most of the time it vastly
overstates how intelligent the original ordering was. Often new elements
were either just added iteratively without consideration for padding, or
the attention to padding was paid in 32bit times.

I don't find

struct LOCALLOCK {
        LOCALLOCKTAG               tag;                  /*     0    20 */
        uint32                     hashcode;             /*    20     4 */
        LOCK *                     lock;                 /*    24     8 */
        PROCLOCK *                 proclock;             /*    32     8 */
        int64                      nLocks;               /*    40     8 */
        int                        numLockOwners;        /*    48     4 */
        int                        maxLockOwners;        /*    52     4 */
        LOCALLOCKOWNER *           lockOwners;           /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        _Bool                      holdsStrongLockCount; /*    64     1 */
        _Bool                      lockCleared;          /*    65     1 */

        /* size: 72, cachelines: 2, members: 10 */
        /* padding: 6 */
        /* last cacheline: 8 bytes */
};

less clear than

struct LOCALLOCK {
        LOCALLOCKTAG               tag;                  /*     0    20 */

        /* XXX 4 bytes hole, try to pack */

        LOCK *                     lock;                 /*    24     8 */
        PROCLOCK *                 proclock;             /*    32     8 */
        uint32                     hashcode;             /*    40     4 */

        /* XXX 4 bytes hole, try to pack */

        int64                      nLocks;               /*    48     8 */
        _Bool                      holdsStrongLockCount; /*    56     1 */
        _Bool                      lockCleared;          /*    57     1 */

        /* XXX 2 bytes hole, try to pack */

        int                        numLockOwners;        /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        int                        maxLockOwners;        /*    64     4 */

        /* XXX 4 bytes hole, try to pack */

        LOCALLOCKOWNER *           lockOwners;           /*    72     8 */

        /* size: 80, cachelines: 2, members: 10 */
        /* sum members: 66, holes: 4, sum holes: 14 */
        /* last cacheline: 16 bytes */
};

but it's smaller (although there's plenty of trailing space).


> and then the overhead of this patch would be even worse than 25%.

IDK, we, including you, very often make largely independent improvements
to make the cost of something else more palatable. Why's that not OK
here?  Especially because we're not comparing to an alternative where no
cost is added, keeping track of e.g. a running average of the hashtable
size isn't free either; nor does it help in the intermittent cases.

- Andres


Andres Freund <andres@anarazel.de> writes:
> On 2019-02-18 19:24:54 -0500, Tom Lane wrote:
>> Yeah, but if we want to rearrange the members into an illogical order
>> to save some space, we should do that independently of this patch ---

> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.

I hadn't looked at the details closely, but if we can squeeze out the
padding space without any loss of intelligibility, sure let's do so.
I still say that's independent of whether to adopt this patch though.

> but it's smaller (although there's plenty of trailing space).

I think you're supposing that these things are independently palloc'd, but
they aren't --- dynahash lays them out in arrays without palloc padding.

> IDK, we, including you, very often make largely independent improvements
> to make the cost of something else more palatable. Why's that not OK
> here?

When we do that, we aren't normally talking about overheads as high as
25% (even more, if it's measured as I think it ought to be).  What I'm
concerned about is that the patch is being advocated for cases where
there are lots of LOCALLOCK entries --- which is exactly where the
space overhead is going to hurt the most.

> Especially because we're not comparing to an alternative where no
> cost is added, keeping track of e.g. a running average of the hashtable
> size isn't free either; nor does it help in the intermittent cases.

What I was hoping for --- though perhaps it's not achievable --- was
statistical overhead amounting to just a few more instructions per
transaction.  Adding dlist linking would add more instructions per
hashtable entry/removal, which seems like it'd be a substantially
bigger time penalty.  As for the intermittent-usage issue, that largely
depends on the details of the when-to-reset heuristic, which we don't
have a concrete design for yet.  But I could certainly imagine it waiting
for a few transactions before deciding to chomp.

Anyway, I'm not trying to veto the patch in this form, just suggesting
that there are alternatives worth thinking about.

            regards, tom lane


Hi,

On 2019-02-18 20:29:29 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > but it's smaller (although there's plenty of trailing space).
> 
> I think you're supposing that these things are independently palloc'd, but
> they aren't --- dynahash lays them out in arrays without palloc padding.

I don't think that matters, given that the trailing six bytes are
included in sizeof() (and have to be, to guarantee suitable alignment in
arrays etc).

Greetings,

Andres Freund


On Tue, 19 Feb 2019 at 00:20, Andres Freund <andres@anarazel.de> wrote:
 
On 2019-02-18 19:13:31 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-02-18 19:01:06 -0500, Tom Lane wrote:
> >> Mmm ... AIUI, the patches currently proposed can only help for what
> >> David called "point lookup" queries.  There are still going to be
> >> queries that scan a large proportion of a partition tree, so if you've
> >> got tons of partitions, you'll be concerned about this sort of thing.
 
> I think what Maumau-san is on about here is that not only does your
> $giant-query take a long time, but it has a permanent negative effect
> on all subsequent transactions in the session.  That seems worth
> doing something about.

Ah, yes, that makes sense.  I'm inclined to think however that the
original approach presented in this thread is better than the
reset-the-whole-hashtable approach.

If it was just many-tables then blowing away the hash table would work fine.

The main issue seems to be with partitioning, not with the general case of many tables. For that case, it seems like resetting the whole hash table is too much.

Can we use our knowledge of the structure of locks, i.e. partition locks are all children of the partitioned table, to do a better job?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
> ...
> 
> This problem was uncovered while evaluating partitioning performance.
> When the application PREPAREs a statement once and then
> EXECUTE-COMMIT repeatedly, the server creates a generic plan on the
> 6th EXECUTE.  Unfortunately, creation of the generic plan of
> UPDATE/DELETE currently accesses all partitions of the target table
> (this itself needs improvement), expanding the LOCALLOCK hash table.
> As a result, 7th and later EXECUTEs get slower.
> 
> Imai-san confirmed performance improvement with this patch:
> 
> https://commitfest.postgresql.org/22/1993/
> 

Can you quantify the effects? That is, how much slower/faster does it get?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: Speed up transaction completion faster after many relations are accessed in a transaction

From: "Tsunakawa, Takayuki"
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
> On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
> > Imai-san confirmed performance improvement with this patch:
> >
> > https://commitfest.postgresql.org/22/1993/
> >
> 
> Can you quantify the effects? That is, how much slower/faster does it get?

Ugh, sorry, I wrote a wrong URL.  The correct page is:

https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512787EC%40g01jpexmbkw24

The quoted figures are:

[v20 + faster-locallock-scan.patch]
auto:   9,069 TPS
custom: 9,015 TPS

[v20]
auto:   8,037 TPS
custom: 8,933 TPS


In the original problematic case, plan_cache_mode = auto (default), we can see about 13% improvement.


Regards
Takayuki Tsunakawa




RE: Speed up transaction completion faster after many relations are accessed in a transaction

From: "Tsunakawa, Takayuki"
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Hm.  Putting a list header for a purely-local data structure into shared
> memory seems quite ugly.  Isn't there a better place to keep that?

Agreed.  I put it in the global variable.


> Do we really want a dlist here at all?  I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved.  This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

To delete the LOCALLOCK in RemoveLocalLock(), we need a dlist; an slist would require the list iterator to be passed in
from callers.
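To illustrate the difference (procLink being the same hypothetical member name as in the sketch near the top of the
thread): with a dlist, an entry can be unlinked given only a pointer to the entry itself:

static void
forget_locallock(LOCALLOCK *locallock)
{
    /* O(1): a dlist_node knows its neighbours, so no iterator is needed */
    dlist_delete(&locallock->procLink);
}

An slist node has no back-pointer, so removal needs either the mutable iterator of an in-progress scan
(slist_delete_current()) or an O(n) walk to find the predecessor (slist_delete()).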


From: Andres Freund [mailto:andres@anarazel.de]
> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.
> 
> I don't find

Thanks, I've done it.  


From: Simon Riggs [mailto:simon@2ndquadrant.com]
> Can we use our knowledge of the structure of locks, i.e. partition locks
> are all children of the partitioned table, to do a better job?

I couldn't come up with an idea.


Regards
Takayuki Tsunakawa


On 2019-02-20 07:20, Tsunakawa, Takayuki wrote:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> Hm.  Putting a list header for a purely-local data structure into shared
>> memory seems quite ugly.  Isn't there a better place to keep that?
> 
> Agreed.  I put it in the global variable.

I think there is agreement on the principles of this patch.  Perhaps it
could be polished a bit.

Your changes in LOCALLOCK still refer to PGPROC, from your first version
of the patch.

I think the reordering of struct members could be done as a separate
preliminary patch.

Some more documentation in the comment before dlist_head LocalLocks to
explain this whole mechanism would be nice.

You posted a link to some performance numbers, but I didn't see the test
setup explained there.  I'd like to get some more information on the
impact of this.  Is there an effect with 100 tables, or do you need 100000?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
> Your changes in LOCALLOCK still refer to PGPROC, from your first version
> of the patch.
> 
> I think the reordering of struct members could be done as a separate
> preliminary patch.
> 
> Some more documentation in the comment before dlist_head LocalLocks to
> explain this whole mechanism would be nice.

Fixed.


> You posted a link to some performance numbers, but I didn't see the test
> setup explained there.  I'd like to get some more information on the
> impact of this.  Is there an effect with 100 tables, or do you need 100000?

Imai-san, can you tell us the test setup?


Regards
Takayuki Tsunakawa


Hi Tsunakawa-san, Peter

On Tue, Mar 19, 2019 at 7:53 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
> > You posted a link to some performance numbers, but I didn't see the
> > test setup explained there.  I'd like to get some more information on
> > the impact of this.  Is there an effect with 100 tables, or do you
> > need 100000?
> 
> Imai-san, can you tell us the test setup?

Maybe I used this test setup[1].

I tested again with those settings using prepared statements.
I used Tsunakawa-san's patch for locallock[2] (which couldn't be applied to current master, so I fixed it) and Amit's
v32 patch for speeding up the planner[3].
 

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions(with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);

\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[pgbench(with 4096 partitions)]
pgbench -n -f select4096.sql -T 60 -M prepared

[results (tps)]
          master  locallock      v32  v32+locallock
          ------  ---------      ---  -------------
auto        21.9       22.9    6,834          7,355
custom      19.7       20.0    7,415          7,252


[1] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51256276%40g01jpexmbkw24
[2] https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FBDFA00%40G01JPEXMBYT05
[3] https://www.postgresql.org/message-id/9feacaf6-ddb3-96dd-5b98-df5e927b1439%40lab.ntt.co.jp

--
Yoshikazu Imai


From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
> Fixed.

Rebased on HEAD.


Regards
Takayuki Tsunakawa


On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
> From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
>> Fixed.
> 
> Rebased on HEAD.

I have committed the first patch that reorganizes the struct.  I'll have
to spend some time evaluating the performance impact of the second
patch, but it seems OK in principle.  Performance tests from others welcome.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 2019-03-19 16:38, Peter Eisentraut wrote:
> On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
>> From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
>>> Fixed.
>>
>> Rebased on HEAD.
> 
> I have committed the first patch that reorganizes the struct.  I'll have
> to spend some time evaluating the performance impact of the second
> patch, but it seems OK in principle.  Performance tests from others welcome.

I did a bit of performance testing, both a plain pgbench and the
suggested test case with 4096 partitions.  I can't detect any
performance improvements.  In fact, within the noise, it tends to be
just a bit on the slower side.

So I'd like to kick it back to the patch submitter now and ask for more
justification and performance analysis.

Perhaps "speeding up planning with partitions" needs to be accepted first?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> I did a bit of performance testing, both a plain pgbench and the
> suggested test case with 4096 partitions.  I can't detect any
> performance improvements.  In fact, within the noise, it tends to be
> just a bit on the slower side.
>
> So I'd like to kick it back to the patch submitter now and ask for more
> justification and performance analysis.
>
> Perhaps "speeding up planning with partitions" needs to be accepted first?

Yeah, I think it likely will require that patch to be able to measure
the gains from this patch.

If planning a SELECT to a partitioned table with a large number of
partitions using PREPAREd statements, when we attempt the generic plan
on the 6th execution, it does cause the local lock table to expand to
fit all the locks for each partition. This does cause the
LockReleaseAll() to become slow due to the hash_seq_search having to
skip over many empty buckets.   Since generating a custom plan for a
partitioned table with many partitions is still slow in master, then I
very much imagine you'll struggle to see the gains brought by this
patch.

I did a quick benchmark too and couldn't measure anything:

create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 4096, remainder ' || x || ');' from generate_series(0,4095)
x;

bench.sql
\set p_a 13315
select * from hp where a = :p_a;

Master:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 31.844468 (excluding connections establishing)
tps = 32.950154 (excluding connections establishing)
tps = 31.534761 (excluding connections establishing)

Patched:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 30.099133 (excluding connections establishing)
tps = 32.157328 (excluding connections establishing)
tps = 32.329884 (excluding connections establishing)

The situation won't be any better with plan_cache_mode =
force_generic_plan either. In this case, we'll only plan once but
we'll also have to obtain and release a lock for each partition for
each execution of the prepared statement. LockReleaseAll() is going to
be slow in that case because it actually has to release a lot of
locks.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > Perhaps "speeding up planning with partitions" needs to be accepted first?
> 
> Yeah, I think it likely will require that patch to be able to measure
> the gains from this patch.
> 
> If planning a SELECT to a partitioned table with a large number of
> partitions using PREPAREd statements, when we attempt the generic plan
> on the 6th execution, it does cause the local lock table to expand to
> fit all the locks for each partition. This does cause the
> LockReleaseAll() to become slow due to the hash_seq_search having to
> skip over many empty buckets.   Since generating a custom plan for a
> partitioned table with many partitions is still slow in master, then I
> very much imagine you'll struggle to see the gains brought by this
> patch.

Thank you David for explaining.  Although I may not understand the effect of the "speeding up planning with partitions"
patch, this patch takes effect even without it.  That is, perform the following in the same session:

1. SELECT count(*) FROM table; on a table with many partitions.  That bloats the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction.  Without the patch, each
transaction's LockReleaseAll() has to scan the bloated large hash table.
 


Regards
Takayuki Tsunakawa



Tsunakawa-san,

On 2019/03/26 17:21, Tsunakawa, Takayuki wrote:
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
>> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>>> Perhaps "speeding up planning with partitions" needs to be accepted first?
>>
>> Yeah, I think it likely will require that patch to be able to measure
>> the gains from this patch.
>>
>> If planning a SELECT to a partitioned table with a large number of
>> partitions using PREPAREd statements, when we attempt the generic plan
>> on the 6th execution, it does cause the local lock table to expand to
>> fit all the locks for each partition. This does cause the
>> LockReleaseAll() to become slow due to the hash_seq_search having to
>> skip over many empty buckets.   Since generating a custom plan for a
>> partitioned table with many partitions is still slow in master, then I
>> very much imagine you'll struggle to see the gains brought by this
>> patch.
> 
> Thank you David for explaining.  Although I may not understand the effect of the "speeding up planning with partitions"
> patch, this patch takes effect even without it.  That is, perform the following in the same session:
> 
> 1. SELECT count(*) FROM table; on a table with many partitions.  That bloats the LocalLockHash.
> 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction.  Without the patch, each
> transaction's LockReleaseAll() has to scan the bloated large hash table.
 

My understanding of what David wrote is that the slowness of bloated hash
table is hard to notice, because planning itself is pretty slow.  With the
"speeding up planning with partitions" patch, planning becomes quite fast,
so the bloated hash table overhead and so your patch's benefit is easier
to notice.  This patch is clearly helpful, but it's just hard to notice it
when the other big bottleneck is standing in the way.

Thanks,
Amit



On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> Thank you David for explaining.  Although I may not understand the effect of the "speeding up planning with partitions"
> patch, this patch takes effect even without it.  That is, perform the following in the same session:
>
> 1. SELECT count(*) FROM table; on a table with many partitions.  That bloats the LocalLockHash.
> 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction.  Without the patch, each
> transaction's LockReleaseAll() has to scan the bloated large hash table.
 

Oh. I think I see what you're saying.  Really the table in #2 would
have to be some completely different table that's not partitioned. I
think in that case it should make a difference.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Tue, 26 Mar 2019 at 21:55, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > Thank you David for explaining.  Although I may not understand the effect of the "speeding up planning with partitions"
> > patch, this patch takes effect even without it.  That is, perform the following in the same session:
> >
> > 1. SELECT count(*) FROM table; on a table with many partitions.  That bloats the LocalLockHash.
> > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
> > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction.  Without the patch, each
> > transaction's LockReleaseAll() has to scan the bloated large hash table.
 
>
> Oh. I think I see what you're saying.  Really the table in #2 would
> have to be some completely different table that's not partitioned. I
> think in that case it should make a difference.

Here's a benchmark doing that, using pgbench's script weight feature.

I've set this up so the query that hits the partitioned table runs
once for every 10k times the other script runs.  I picked that number
so the lock table was expanded fairly early on in the benchmark.

setup:
create table t1 (a int primary key);
create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 4096, remainder ' || x || ');' from generate_series(0,4095)
x;
\gexec

hp.sql
select count(*) from hp;

t1.sql
\set p 1
select a from t1 where a = :p;

Master = c8c885b7a5

Master:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@10000 postgres
SQL script 2: t1.sql
 - 1057306 transactions (100.0% of total, tps = 17621.756473)
 - 1081905 transactions (100.0% of total, tps = 18021.449914)
 - 1122420 transactions (100.0% of total, tps = 18690.804699)

Master + 0002-speed-up-LOCALLOCK-scan.patch

$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@10000 postgres
SQL script 2: t1.sql
 - 1277014 transactions (100.0% of total, tps = 21283.551615)
 - 1184052 transactions (100.0% of total, tps = 19734.185872)
 - 1188523 transactions (100.0% of total, tps = 19785.835662)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Hi Peter,

From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
> I did a bit of performance testing, both a plain pgbench and the
> suggested test case with 4096 partitions.  I can't detect any
> performance improvements.  In fact, within the noise, it tends to be
> just a bit on the slower side.
> 
> So I'd like to kick it back to the patch submitter now and ask for more
> justification and performance analysis.
> 
> Perhaps "speeding up planning with partitions" needs to be accepted first?

David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs
review. I'd appreciate it if you could continue the final check.
 


Regards
Takayuki Tsunakawa



Hi,

On 2019/04/04 13:37, Tsunakawa, Takayuki wrote:
> Hi Peter,
> 
> From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
>> I did a bit of performance testing, both a plain pgbench and the
>> suggested test case with 4096 partitions.  I can't detect any
>> performance improvements.  In fact, within the noise, it tends to be
>> just a bit on the slower side.
>>
>> So I'd like to kick it back to the patch submitter now and ask for more
>> justification and performance analysis.
>>
>> Perhaps "speeding up planning with partitions" needs to be accepted first?
> 
> David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs
review. I'd appreciate it if you could continue the final check.
 

Also, since the "speed up partition planning" patch went in (428b260f8),
it might be possible to see the performance boost even with the
partitioning example you cited upthread.

Thanks,
Amit




On 2019-04-04 06:58, Amit Langote wrote:
> Also, since the "speed up partition planning" patch went in (428b260f8),
> it might be possible to see the performance boost even with the
> partitioning example you cited upthread.

I can't detect any performance improvement with the patch applied to
current master, using the test case from Yoshikazu Imai (2019-03-19).

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

That's strange...  Peter, Imai-san, can you compare your test procedures?

Peter, can you check and see the performance improvement with David's method on March 26 instead?


Regards
Takayuki Tsunakawa


On 2019/04/05 5:42, Peter Eisentraut wrote:
> On 2019-04-04 06:58, Amit Langote wrote:
>> Also, since the "speed up partition planning" patch went in (428b260f8),
>> it might be possible to see the performance boost even with the
>> partitioning example you cited upthread.
> 
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

I was able to detect it as follows.

* partitioned table setup:

$ cat ht.sql
drop table ht cascade;
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with
(modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,
8191) x;
\gexec

* pgbench script:

$ cat select.sql
\set param random(1, 8192)
select * from ht where a = :param

* pgbench (5 minute run with -M prepared)

pgbench -n -M prepared -T 300 -f select.sql

* tps:

plan_cache_mode = auto

   HEAD: 1915 tps
Patched: 2394 tps

plan_cache_mode = custom (non-problematic: generic plan is never created)

   HEAD: 2402 tps
Patched: 2393 tps

Thanks,
Amit




On Fri, Apr 5, 2019 at 1:31 AM, Amit Langote wrote:
> On 2019/04/05 5:42, Peter Eisentraut wrote:
> > On 2019-04-04 06:58, Amit Langote wrote:
> >> Also, since the "speed up partition planning" patch went in
> >> (428b260f8), it might be possible to see the performance boost even
> >> with the partitioning example you cited upthread.
> >
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
> 
> I was able to detect it as follows.
> 
> * partitioned table setup:
> 
> $ cat ht.sql
> drop table ht cascade;
> create table ht (a int primary key, b int, c int) partition by hash (a);
> select 'create table ht' || x::text || ' partition of ht for values with
> (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,
> 8191) x;
> \gexec
> 
> * pgbench script:
> 
> $ cat select.sql
> \set param random(1, 8192)
> select * from ht where a = :param
> 
> * pgbench (5 minute run with -M prepared)
> 
> pgbench -n -M prepared -T 300 -f select.sql
> 
> * tps:
> 
> plan_cache_mode = auto
> 
>    HEAD: 1915 tps
> Patched: 2394 tps
> 
> plan_cache_mode = custom (non-problematic: generic plan is never created)
> 
>    HEAD: 2402 tps
> Patched: 2393 tps

Amit-san, thanks for testing this.

I also re-ran my tests (3/19) with HEAD (413ccaa) and HEAD (413ccaa) + patched, and I can still detect the performance
difference with plan_cache_mode = auto.
 

Thanks
--
Yoshikazu Imai 


Hi Amit-san, Imai-snan,

From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> I was able to detect it as follows.
> plan_cache_mode = auto
> 
>    HEAD: 1915 tps
> Patched: 2394 tps
> 
> plan_cache_mode = custom (non-problematic: generic plan is never created)
> 
>    HEAD: 2402 tps
> Patched: 2393 tps

Thanks a lot for very quick confirmation.  I'm relieved to still see the good results.


Regards
Takayuki Tsunakawa


On Fri, Apr 5, 2019 at 0:05 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentraut@2ndquadrant.com]
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
> 
> That's strange...  Peter, Imai-san, can you compare your test procedures?

Just to make sure, I described my test procedures in detail.

I install and setup HEAD and patched as follows.

[HEAD(413ccaa)]
(git pull)
./configure --prefix=/usr/local/pgsql-dev --enable-depend
make clean
make

make install

su postgres
export PATH=/usr/local/pgsql-dev/bin:$PATH
initdb -D /var/lib/pgsql/data-dev
vi /var/lib/pgsql/data-dev/postgresql.conf
====
port = 44201
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096
====
pg_ctl -D /var/lib/pgsql/data-dev start



[HEAD(413ccaa) + patch]
(git pull)
patch -u -p1 < 0002.patch
./configure --prefix=/usr/local/pgsql-locallock --enable-depend
make clean
make

make install

su postgres
export PATH=/usr/local/pgsql-locallock/bin:$PATH
initdb -D /var/lib/pgsql/data-locallock
vi /var/lib/pgsql/data-locallock/postgresql.conf
====
port = 44301
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096
====
pg_ctl -D /var/lib/pgsql/data-locallock start


And I tested as follows.

(creating partitioned table for port 44201)
(creating partitioned table for port 44301)
(creating select4096.sql)
for i in `seq 1 5`; do
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44201 | grep including;
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44301 | grep including;
done
tps = 8146.039546 (including connections establishing)
tps = 9021.340872 (including connections establishing)
tps = 8011.186017 (including connections establishing)
tps = 8926.191054 (including connections establishing)
tps = 8006.769690 (including connections establishing)
tps = 9028.716806 (including connections establishing)
tps = 8057.709961 (including connections establishing)
tps = 9017.713714 (including connections establishing)
tps = 7956.332863 (including connections establishing)
tps = 9126.650533 (including connections establishing)


Thanks
--
Yoshikazu Imai

On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
> From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
>> Fixed.
> 
> Rebased on HEAD.

Do you need the dlist_foreach_modify() calls?  You are not actually
modifying the list structure.
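For reference, the distinction being asked about, sketched with the same hypothetical names as earlier:
dlist_foreach_modify() uses a dlist_mutable_iter that fetches the next node before the loop body runs, so it is only
required when the body may delete the current node; plain dlist_foreach() is otherwise sufficient.

static void
scan_locallocks_examples(void)
{
    dlist_iter          iter;       /* plain traversal; list must stay intact */
    dlist_mutable_iter  miter;      /* traversal that may delete the current node */

    dlist_foreach(iter, &SessionLocalLocks)
    {
        LOCALLOCK  *l = dlist_container(LOCALLOCK, procLink, iter.cur);

        (void) l;                   /* inspect l, but do not unlink it here */
    }

    dlist_foreach_modify(miter, &SessionLocalLocks)
    {
        LOCALLOCK  *l = dlist_container(LOCALLOCK, procLink, miter.cur);

        dlist_delete(&l->procLink); /* safe: miter already holds the next node */
    }
}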

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

FWIW, I tried this patch against current HEAD (959d00e9d).
Using the test case described by Amit at
<be25cadf-982e-3f01-88b4-443a6667e16a@lab.ntt.co.jp>
I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case.
(I had to increase max_locks_per_transaction to get it to run
at all.)  We should not be using that sort of edge case to drive
performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win.  I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD.  Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on
sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
bytes, a 22% increase.  That's a lot if you're considering cases
with many locks.

On the whole I don't think there's an adequate case for committing
this patch.

I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.

I spent some time wondering whether we could adjust the data structure
so that all the live entries in a hashtable are linked into one chain,
but I don't quite see how to do it without adding another list link to
struct HASHELEMENT, which seems pretty expensive.

I'll sketch the idea I had, just in case it triggers a better idea
in someone else.  Assuming we are willing to add another pointer to
HASHELEMENT, use the two pointers to create a doubly-linked circular
list that includes all live entries in the hashtable, with a list
header in the hashtable's control area.  (Maybe we'd use a dlist for
this, but it's not essential.)  Require this list to be organized so
that all entries that belong to the same hash bucket are consecutive in
the list, and make each non-null hash bucket header point to the first
entry in the list for its bucket.  To allow normal searches to detect
when they've run through their bucket, add a flag to HASHELEMENT that
is set only in entries that are the first, or perhaps last, of their
bucket (so you'd detect end-of-bucket by checking the flag instead of
testing for a null pointer).  Adding a bool field is free due to
alignment considerations, at least on 64-bit machines.  Given this,
I think normal hash operations are more-or-less the same cost as
before, while hash_seq_search just has to follow the circular list.
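A rough sketch of that layout (the field names are invented for illustration; today's dynahash HASHELEMENT holds only a
next-link and the hash value):

typedef struct HASHELEMENT
{
    struct HASHELEMENT *live_next;      /* next live entry, possibly in another bucket */
    struct HASHELEMENT *live_prev;      /* previous live entry */
    uint32      hashvalue;
    bool        last_of_bucket;         /* replaces the NULL end-of-chain test */
} HASHELEMENT;

/*
 * Bucket headers point at the first list member belonging to that bucket;
 * a lookup walks live_next until it sees last_of_bucket.  hash_seq_search()
 * just follows live_next from a table-wide list header, touching only live
 * entries regardless of how many buckets exist.
 */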

I tried to figure out how to do the same thing with a singly-linked
instead of doubly-linked list, but it doesn't quite work: if you need
to remove the first element of a bucket, you have no cheap way to find
its predecessor in the overall list (which belongs to some other
bucket, but you don't know which one).  Maybe we could just mark such
entries dead (there's plenty of room for another flag bit) and plan
to clean them up later?  But it's not clear how to ensure that they'd
get cleaned up in any sort of timely fashion.

Another issue is that probably none of this works for the partitioned
hash tables we use for some of the larger shared-memory hashes.  But
I'm not sure we care about hash_seq_search for those, so maybe we just
say those are a different data structure.

            regards, tom lane



Hi,

On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
> 
> FWIW, I tried this patch against current HEAD (959d00e9d).
> Using the test case described by Amit at
> <be25cadf-982e-3f01-88b4-443a6667e16a@lab.ntt.co.jp>
> I do measure an undeniable speedup, close to 35%.
> 
> However ... I submit that that's a mighty extreme test case.
> (I had to increase max_locks_per_transaction to get it to run
> at all.)  We should not be using that sort of edge case to drive
> performance optimization choices.
> 
> If I reduce the number of partitions in Amit's example from 8192
> to something more real-world, like 128, I do still measure a
> performance gain, but it's ~ 1.5% which is below what I'd consider
> a reproducible win.  I'm accustomed to seeing changes up to 2%
> in narrow benchmarks like this one, even when "nothing changes"
> except unrelated code.

I'm not sure it's actually that narrow these days. With all the
partitioning improvements happening, the numbers of locks commonly held
are going to rise. And while 8192 partitions is maybe on the more
extreme side, it's a workload with only a single table, and plenty of
workloads touch more than a single partitioned table.


> Trying a standard pgbench test case (pgbench -M prepared -S with
> one client and an -s 10 database), it seems that the patch is about
> 0.5% slower than HEAD.  Again, that's below the noise threshold,
> but it's not promising for the net effects of this patch on workloads
> that aren't specifically about large and prunable partition sets.

Yea, that's concerning.


> I'm also fairly concerned about the effects of the patch on
> sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
> bytes, a 22% increase.  That's a lot if you're considering cases
> with many locks.

I'm not sure I'm quite that concerned. For one, a good bit of that space
was up for grabs until the recent reordering of LOCALLOCK and nobody
complained. But more importantly, I think commonly the amount of locks
around is fairly constrained, isn't it? We can't really have that many
concurrently held locks, due to the shared memory space, and the size of
a LOCALLOCK isn't that big compared to say relcache entries.  We also
probably fairly easily could win some space back - e.g. make nLocks 32
bits.

I wonder if one approach to solve this wouldn't be to just make the
hashtable drastically smaller. Right now we'll often have lots of
empty entries that are 72 bytes + dynahash overhead. That's plenty of
memory that needs to be skipped over.   I wonder if we instead should
have an array of held locallocks, and a hashtable with {hashcode,
offset_in_array} + custom comparator for lookups.  That'd mean we could
either scan the array of locallocks at release (which'd need to skip
over entries that have already been released), or we could scan the much
smaller hashtable sequentially.
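
To make that layout concrete, a minimal sketch (type and variable names here
are hypothetical, not from any posted patch):

typedef struct LocalLockRef
{
    uint32      hashcode;       /* lock tag hash; doubles as the hash key */
    int         offset;         /* index into LocalLockArray below */
} LocalLockRef;

static LOCALLOCK *LocalLockArray;   /* compact array of this backend's LOCALLOCKs */
static int        NumLocalLocks;    /* live entries in LocalLockArray */

LockReleaseAll() could then walk LocalLockArray directly, or seq-scan the much
smaller table of LocalLockRef entries, rather than a bloated dynahash full of
72-byte LOCALLOCKs.  Retail releases would leave holes in the array, which is
the objection raised just below.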

I don't think the above idea is quite there, and I'm tired, but I
thought it might still be worth bringing up.


> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

Yea :(

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> I wonder if one approach to solve this wouldn't be to just make the
> hashtable drastically smaller. Right now we'll often have lots of
> empty entries that are 72 bytes + dynahash overhead. That's plenty of
> memory that needs to be skipped over.   I wonder if we instead should
> have an array of held locallocks, and a hashtable with {hashcode,
> offset_in_array} + custom comparator for lookups.

Well, that's not going to work all that well for retail lock releases;
you'll end up with holes in the array, maybe a lot of them.

However, it led me to think of another way we might approach the general
hashtable problem: right now, we are not making any use of the fact that
the hashtable's entries are laid out in big slabs (arrays).  What if we
tried to ensure that the live entries are allocated fairly compactly in
those arrays, and then implemented hash_seq_search as a scan over the
arrays, ignoring the hash bucket structure per se?

We'd need a way to reliably tell a live entry from a free entry, but
again, there's plenty of space for a flag bit or two.
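
As a sketch of what hash_seq_search might become under that scheme (entirely
hypothetical internals; dynahash has no such fields today):

/* Hypothetical: walk the element storage arrays instead of the buckets. */
for (;;)
{
    HASHELEMENT *elem = next_stored_element(hashp, status); /* hypothetical */

    if (elem == NULL)
        return NULL;            /* ran off the end of the last storage array */
    if (elem->live)             /* hypothetical "live" flag bit */
        return ELEMENTKEY(elem);    /* return a pointer to the live entry */
    /* otherwise it's a free entry; keep scanning */
}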

This might perform poorly if you allocated a bunch of entries,
freed most-but-not-all, and then wanted to seqscan the remainder;
you'd end up with the same problem I complained of above that
you're iterating over an array that's mostly uninteresting.
In principle we could keep count of the live vs free entries and
dynamically decide to scan via the hash bucket structure instead of
searching the storage array when the array is too sparse; but that
might be overly complicated.

I haven't tried to work this out in detail, it's just a late
night brainstorm.  But, again, I'd much rather solve this in
dynahash.c than by layering some kind of hack on top of it.

            regards, tom lane



On 2019-04-06 05:03, Tom Lane wrote:
> Trying a standard pgbench test case (pgbench -M prepared -S with
> one client and an -s 10 database), it seems that the patch is about
> 0.5% slower than HEAD.  Again, that's below the noise threshold,
> but it's not promising for the net effects of this patch on workloads
> that aren't specifically about large and prunable partition sets.

In my testing, I've also noticed that it seems to be slightly on the
slower side for these simple tests.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Sat, 6 Apr 2019 at 16:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.

Rewinding back to mid-Feb:

You wrote:
> My own thought about how to improve this situation was just to destroy
> and recreate LockMethodLocalHash at transaction end (or start)
> if its size exceeded $some-value.  Leaving it permanently bloated seems
> like possibly a bad idea, even if we get rid of all the hash_seq_searches
> on it.

Which I thought was an okay idea.  I think the one advantage that
would have over making hash_seq_search() faster for large and mostly
empty tables is that over-sized hash tables are just not very cache
efficient, and if we don't need it to be that large then we should
probably consider making it smaller again.

I've had a go at implementing this and using Amit's benchmark the
performance looks pretty good. I can't detect any slowdown for the
general case.

master:

plan_cache_mode = auto:

$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 9373.698212 (excluding connections establishing)
tps = 9356.993148 (excluding connections establishing)
tps = 9367.579806 (excluding connections establishing)

plan_cache_mode = force_custom_plan:

$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12863.758185 (excluding connections establishing)
tps = 12787.766054 (excluding connections establishing)
tps = 12817.878940 (excluding connections establishing)

shrink_bloated_locallocktable.patch:

plan_cache_mode = auto:

$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12756.021211 (excluding connections establishing)
tps = 12800.939518 (excluding connections establishing)
tps = 12804.501977 (excluding connections establishing)

plan_cache_mode = force_custom_plan:

$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12763.448836 (excluding connections establishing)
tps = 12901.673271 (excluding connections establishing)
tps = 12856.512745 (excluding connections establishing)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
David Rowley <david.rowley@2ndquadrant.com> writes:
> On Sat, 6 Apr 2019 at 16:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My own thought about how to improve this situation was just to destroy
>> and recreate LockMethodLocalHash at transaction end (or start)
>> if its size exceeded $some-value.  Leaving it permanently bloated seems
>> like possibly a bad idea, even if we get rid of all the hash_seq_searches
>> on it.

> Which I thought was an okay idea.  I think the one advantage that
> would have over making hash_seq_search() faster for large and mostly
> empty tables is that over-sized hash tables are just not very cache
> efficient, and if we don't need it to be that large then we should
> probably consider making it smaller again.

> I've had a go at implementing this and using Amit's benchmark the
> performance looks pretty good. I can't detect any slowdown for the
> general case.

I like the concept ... but the particular implementation, not so much.
It seems way overcomplicated.  In the first place, why should we
add code to copy entries?  Just don't do it except when the table
is empty.  In the second, I think we could probably have a far
cheaper test for how big the table is --- maybe we'd need to
expose some function in dynahash.c, but the right way here is just
to see how many buckets there are.  I don't like adding statistics
counting for this, because it's got basically nothing to do with
what the actual problem is.  (If you acquire and release one lock,
and do that over and over, you don't have a bloat problem no
matter how many times you do it.)

LockMethodLocalHash is special in that it predictably goes to empty
at the end of every transaction, so that de-bloating at that point
is a workable strategy.  I think we'd probably need something more
robust if we were trying to fix this generally for all hash tables.
But if we're going to go with the one-off hack approach, we should
certainly try to keep that hack as simple as possible.

            regards, tom lane



On Mon, 8 Apr 2019 at 02:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I like the concept ... but the particular implementation, not so much.
> It seems way overcomplicated.  In the first place, why should we
> add code to copy entries?  Just don't do it except when the table
> is empty.  In the second, I think we could probably have a far
> cheaper test for how big the table is --- maybe we'd need to
> expose some function in dynahash.c, but the right way here is just
> to see how many buckets there are.  I don't like adding statistics
> counting for this, because it's got basically nothing to do with
> what the actual problem is.  (If you acquire and release one lock,
> and do that over and over, you don't have a bloat problem no
> matter how many times you do it.)

hash_get_num_entries() looks cheap enough to me. Can you explain why
you think that's too expensive?

> LockMethodLocalHash is special in that it predictably goes to empty
> at the end of every transaction, so that de-bloating at that point
> is a workable strategy.  I think we'd probably need something more
> robust if we were trying to fix this generally for all hash tables.
> But if we're going to go with the one-off hack approach, we should
> certainly try to keep that hack as simple as possible.

As cheap as possible sounds good, but I'm confused at why you think
the table will always be empty at the end of transaction. It's my
understanding and I see from debugging that session level locks remain
in there. If I don't copy those into the new table they'll be lost.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On Mon, 8 Apr 2019 at 02:36, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > LockMethodLocalHash is special in that it predictably goes to empty
> > at the end of every transaction, so that de-bloating at that point
> > is a workable strategy.  I think we'd probably need something more
> > robust if we were trying to fix this generally for all hash tables.
> > But if we're going to go with the one-off hack approach, we should
> > certainly try to keep that hack as simple as possible.
>
> As cheap as possible sounds good, but I'm confused at why you think
> the table will always be empty at the end of transaction. It's my
> understanding and I see from debugging that session level locks remain
> in there. If I don't copy those into the new table they'll be lost.

Or we could just skip the table recreation if there are no
session-level locks.  That would require calling hash_get_num_entries()
on the table again and just recreating the table if there are 0 locks.
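
In code terms that's roughly (a sketch; hash_get_num_entries() is the existing
dynahash accessor, while the bloat test and the rebuild function are
placeholders):

/* At the end of LockReleaseAll(), after all releasable locks are gone: */
if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
    LocalLockHashIsBloated())       /* placeholder for a bucket-count check */
    RecreateLocalLockHash();        /* placeholder: destroy and re-create small */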

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



David Rowley <david.rowley@2ndquadrant.com> writes:
> On Mon, 8 Apr 2019 at 02:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I like the concept ... but the particular implementation, not so much.
>> It seems way overcomplicated.  In the first place, why should we
>> add code to copy entries?  Just don't do it except when the table
>> is empty.  In the second, I think we could probably have a far
>> cheaper test for how big the table is --- maybe we'd need to
>> expose some function in dynahash.c, but the right way here is just
>> to see how many buckets there are.  I don't like adding statistics
>> counting for this, because it's got basically nothing to do with
>> what the actual problem is.  (If you acquire and release one lock,
>> and do that over and over, you don't have a bloat problem no
>> matter how many times you do it.)

> hash_get_num_entries() looks cheap enough to me. Can you explain why
> you think that's too expensive?

What I objected to cost-wise was counting the number of lock
acquisitions/releases, which seems entirely beside the point.

We *should* be using hash_get_num_entries(), but only to verify
that the table is empty before resetting it.  The additional bit
that is needed is to see whether the number of buckets is large
enough to justify calling the table bloated.

> As cheap as possible sounds good, but I'm confused at why you think
> the table will always be empty at the end of transaction.

It's conceivable that it won't be, which is why we need a test.
I'm simply arguing that if it is not, we can just postpone de-bloating
till it is.  Session-level locks are so rarely used that there's no
need to sweat about that case.

            regards, tom lane



On Mon, 8 Apr 2019 at 02:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > hash_get_num_entries() looks cheap enough to me. Can you explain why
> > you think that's too expensive?
>
> What I objected to cost-wise was counting the number of lock
> acquisitions/releases, which seems entirely beside the point.
>
> We *should* be using hash_get_num_entries(), but only to verify
> that the table is empty before resetting it.  The additional bit
> that is needed is to see whether the number of buckets is large
> enough to justify calling the table bloated.

The reason I thought it was a good idea to track some history there
was to stop the lock table constantly being shrunk back to the default
size every time a simple single table query was executed. For example,
a workload repeatedly doing:

SELECT * FROM table_with_lots_of_partitions;
SELECT * FROM non_partitioned_table;

I was worried that obtaining locks on the partitioned table would
become a little slower because it would have to expand the hash table
each time the query is executed.

> > As cheap as possible sounds good, but I'm confused at why you think
> > the table will always be empty at the end of transaction.
>
> It's conceivable that it won't be, which is why we need a test.
> I'm simply arguing that if it is not, we can just postpone de-bloating
> till it is.  Session-level locks are so rarely used that there's no
> need to sweat about that case.

That seems fair.  It would certainly simplify the patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



David Rowley <david.rowley@2ndquadrant.com> writes:
> On Mon, 8 Apr 2019 at 02:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We *should* be using hash_get_num_entries(), but only to verify
>> that the table is empty before resetting it.  The additional bit
>> that is needed is to see whether the number of buckets is large
>> enough to justify calling the table bloated.

> The reason I thought it was a good idea to track some history there
> was to stop the lock table constantly being shrunk back to the default
> size every time a simple single table query was executed.

I think that's probably gilding the lily, considering that this whole
issue is pretty new.  There's no evidence that expanding the local
lock table is a significant drag on queries that need a lot of locks.

            regards, tom lane



On Mon, 8 Apr 2019 at 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > The reason I thought it was a good idea to track some history there
> > was to stop the lock table constantly being shrunk back to the default
> > size every time a simple single table query was executed.
>
> I think that's probably gilding the lily, considering that this whole
> issue is pretty new.  There's no evidence that expanding the local
> lock table is a significant drag on queries that need a lot of locks.

Okay.  Here's another version with all the average locks code removed
that only recreates the table when it's completely empty.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Hi,

On 2019-04-08 03:40:52 +1200, David Rowley wrote:
> On Mon, 8 Apr 2019 at 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > David Rowley <david.rowley@2ndquadrant.com> writes:
> > > The reason I thought it was a good idea to track some history there
> > > was to stop the lock table constantly being shrunk back to the default
> > > size every time a simple single table query was executed.
> >
> > I think that's probably gilding the lily, considering that this whole
> > issue is pretty new.  There's no evidence that expanding the local
> > lock table is a significant drag on queries that need a lot of locks.
> 
> Okay.  Here's another version with all the average locks code removed
> that only recreates the table when it's completely empty.

Could you benchmark your adversarial case?

- Andres



On Mon, 8 Apr 2019 at 03:47, Andres Freund <andres@anarazel.de> wrote:
> Could you benchmark your adversarial case?

Which case?

I imagine the worst case for v2 is a query that just constantly asks
for over 16 locks. Perhaps a prepared plan, so as not to add planner
overhead.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



David Rowley <david.rowley@2ndquadrant.com> writes:
> Okay.  Here's another version with all the average locks code removed
> that only recreates the table when it's completely empty.

Um ... I don't see where you're destroying the old hash?

Also, I entirely dislike wiring in assumptions about hash_seq_search's
private state structure here.  I think it's worth having an explicit
entry point in dynahash.c to get the current number of buckets.

Also, I would not define "significantly bloated" as "the table has
grown at all".  I think the threshold ought to be at least ~100
buckets, if we're starting at 16.

Probably we ought to try to gather some evidence to inform the
choice of cutoff here.  Maybe instrument the regression tests to
see how big the table typically gets?

            regards, tom lane



On Mon, 8 Apr 2019 at 04:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Um ... I don't see where you're destroying the old hash?

In CreateLocalLockHash.

> Also, I entirely dislike wiring in assumptions about hash_seq_search's
> private state structure here.  I think it's worth having an explicit
> entry point in dynahash.c to get the current number of buckets.

Okay. Added hash_get_max_bucket()
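
Presumably something along these lines inside dynahash.c, where the header
struct is visible (sketch only):

uint32
hash_get_max_bucket(HTAB *hashp)
{
    return hashp->hctl->max_bucket; /* highest bucket ID currently in use */
}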

> Also, I would not define "significantly bloated" as "the table has
> grown at all".  I think the threshold ought to be at least ~100
> buckets, if we're starting at 16.

I wouldn't either. I don't think the comment says that. It says there
can be slowdowns when it's significantly bloated, and then goes on to
say that we just resize when it's bigger than standard.

> Probably we ought to try to gather some evidence to inform the
> choice of cutoff here.  Maybe instrument the regression tests to
> see how big the table typically gets?

In partition_prune.sql I see use of a bucket as high as 285 on my machine with:

drop table lp, coll_pruning, rlp, mc3p, mc2p, boolpart, rp,
coll_pruning_multi, like_op_noprune, lparted_by_int2, rparted_by_int2;

I've not added any sort of cut-off though as I benchmarked it and
surprisingly I don't see any slowdown with the worst case. So I'm
thinking there might not be any point.

alter system set plan_cache_mode = 'force_generic_plan';
create table hp (a int primary key) partition by hash (a);
select 'create table hp' || x::text || ' partition of hp for values
with (modulus 32, remainder ' || (x)::text || ');' from
generate_series(0,31) x;
\gexec

select.sql
\set p 1
select * from hp where a = :p

Master
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 11834.764309 (excluding connections establishing)
tps = 12279.212223 (excluding connections establishing)
tps = 12007.263547 (excluding connections establishing)

Patched:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 13380.684817 (excluding connections establishing)
tps = 12790.999279 (excluding connections establishing)
tps = 12568.892558 (excluding connections establishing)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> On the whole I don't think there's an adequate case for committing
> this patch.

From: Andres Freund [mailto:andres@anarazel.de]
> On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> > If I reduce the number of partitions in Amit's example from 8192
> > to something more real-world, like 128, I do still measure a
> > performance gain, but it's ~ 1.5% which is below what I'd consider
> > a reproducible win.  I'm accustomed to seeing changes up to 2%
> > in narrow benchmarks like this one, even when "nothing changes"
> > except unrelated code.
> 
> I'm not sure it's actually that narrow these days. With all the
> partitioning improvements happening, the numbers of locks commonly held
> are going to rise. And while 8192 partitions is maybe on the more
> extreme side, it's a workload with only a single table, and plenty of
> workloads touch more than a single partitioned table.

I would feel happy if I could say such a many-partitions use case is narrow or impractical and ignore it, but it's not
narrow.  Two of our customers are actually requesting such usage: one uses 5,500 partitions and is trying to migrate
from a commercial database on Linux, and the other requires 200,000 partitions to migrate from a legacy database on a
mainframe.  At first, I thought so many partitions indicated a bad application design, but it sounded valid (or at
least I can't insist that it's bad).  PostgreSQL is now expected to handle such huge workloads.


From: Andres Freund [mailto:andres@anarazel.de]
> I'm not sure I'm quite that concerned. For one, a good bit of that space
> was up for grabs until the recent reordering of LOCALLOCK and nobody
> complained. But more importantly, I think commonly the amount of locks
> around is fairly constrained, isn't it? We can't really have that many
> concurrently held locks, due to the shared memory space, and the size of
> a LOCALLOCK isn't that big compared to say relcache entries.  We also
> probably fairly easily could win some space back - e.g. make nLocks 32
> bits.

+1



From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.
> 
> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

I think the linked list of LOCALLOCK approach is natural, simple, and good.  In Jim Gray's classic book
"Transaction Processing: Concepts and Techniques", we can find the following sentence in "8.4.5 Lock Manager Internal
Logic."  The sample implementation code in the book uses a similar linked list to remember and release a transaction's
acquired locks.

"All the locks of a transaction are kept in a list so they can be quickly found and released at commit or rollback."

And handling this issue with the LOCALLOCK linked list is more natural than resizing the hash table.  We just want
to quickly find all acquired locks, so we use a linked list.  A hash table is a mechanism to find a particular item
quickly, so it was simply the wrong structure for iterating over all acquired locks.  Also, the hash table got big
because some operation in the session needed it, and some subsequent operations in the same session may need it again,
so we wouldn't get much relief from shrinking the hash table.
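
For reference, the list-based approach amounts to roughly the following (a
sketch using the existing ilist.h dlist primitives, with the list header kept
in backend-local memory as suggested upthread; field names and placement are
illustrative, not the posted patch):

/* LOCALLOCK gains a dlist_node member (call it procLink), and the backend
 * keeps a list head covering every LOCALLOCK it currently holds. */
static dlist_head SessionLocalLocks = DLIST_STATIC_INIT(SessionLocalLocks);

/* In LockAcquire(), when a new LOCALLOCK is entered into the hash table: */
dlist_push_tail(&SessionLocalLocks, &locallock->procLink);

/* LockReleaseAll() walks the list instead of seq-scanning the hash table: */
dlist_mutable_iter iter;

dlist_foreach_modify(iter, &SessionLocalLocks)
{
    LOCALLOCK  *locallock = dlist_container(LOCALLOCK, procLink, iter.cur);

    /* release the lock, unlink it, and remove it from the hash table */
    dlist_delete(iter.cur);
}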
 


Regards
Takayuki Tsunakawa








Hi,

On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
> I think the linked list of LOCALLOCK approach is natural, simple, and
> good.

Did you see that people measured slowdowns?

Greetings,

Andres Freund



From: 'Andres Freund' [mailto:andres@anarazel.de]
> On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
> > I think the linked list of LOCALLOCK approach is natural, simple, and
> > good.
> 
> Did you see that people measured slowdowns?

Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case.  And that
might be within noise as was mentioned.

If we want to remove even the noise, we may have to think of removing the LocalLockHash completely.  But it doesn't
seem feasible...


Regards
Takayuki Tsunakawa








On Mon, 8 Apr 2019 at 14:54, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>
> From: 'Andres Freund' [mailto:andres@anarazel.de]
> > Did you see that people measured slowdowns?
>
> Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case.  And
> that might be within noise as was mentioned.
>
> If we want to remove even the noise, we may have to think of removing the LocalLockHash completely.  But it doesn't
> seem feasible...

It would be good to get your view on the
shrink_bloated_locallocktable_v3.patch I worked on last night. I was
unable to measure any overhead to solving the problem that way.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> It would be good to get your view on the
> shrink_bloated_locallocktable_v3.patch I worked on last night. I was
> unable to measure any overhead to solving the problem that way.

Thanks, it looks super simple and good.  I understand the idea behind your patch to be:

* Transactions that touch many partitions and/or tables are a special event and not normal, and the hash table bloat is
an unlucky accident.  So it's reasonable to revert the bloated hash table back to the original size.

* Repeated transactions that acquire many locks have to enlarge the hash table every time.  However, the overhead of
hash table expansion should be hidden behind various other processing (acquiring/releasing locks, reading/writing the
relations, accessing the catalogs of those relations).


TBH, I think the linked list approach feels more intuitive because the resulting code looks like what it wants to do
(efficiently iterate over acquired locks) and is based on the classic book.  But your approach seems to address people's
concerns.  So I'm OK with your patch.

I'm registering you as another author and me as a reviewer, and marking this ready for committer.


Regards
Takayuki Tsunakawa





On 2019-04-08 05:46, Tsunakawa, Takayuki wrote:
> I'm registering you as another author and me as a reviewer, and marking this ready for committer.

Moved to next commit fest.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Mon, 8 Apr 2019 at 04:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also, I would not define "significantly bloated" as "the table has
> grown at all".  I think the threshold ought to be at least ~100
> buckets, if we're starting at 16.

I've revised the patch to add a new constant named
LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the
hash table grows over that size we shrink it back down to
LOCKMETHODLOCALHASH_INIT_SIZE, which I've kept at 16.

I'm not opposed to setting it to 128. For this particular benchmark,
it won't make any difference as it's only going to affect something
that does not quite use 128 locks and has to work with a slightly
bloated local lock table. I think hitting 64 locks in a transaction is
a good indication that it's not a simple transaction so users are
probably unlikely to notice the small slowdown from the hash table
reinitialisation.

Since quite a bit has changed around partition planning lately, I've
taken a fresh set of benchmarks on today's master. I'm using something
very close to Amit's benchmark from upthread. I just changed the query
so we hit the same partition each time instead of a random one.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

master @ a193cbec119 + shrink_bloated_locallocktable_v4.patch:

plan_cache_mode = 'auto';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 14101.226982 (excluding connections establishing)
tps = 14034.250962 (excluding connections establishing)
tps = 14107.937755 (excluding connections establishing)

plan_cache_mode = 'force_custom_plan';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 14240.366770 (excluding connections establishing)
tps = 14272.244886 (excluding connections establishing)
tps = 14130.684315 (excluding connections establishing)

master @ a193cbec119:

plan_cache_mode = 'auto';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 10467.027666 (excluding connections establishing)
tps = 10333.700917 (excluding connections establishing)
tps = 10633.084426 (excluding connections establishing)

plan_cache_mode = 'force_custom_plan';

ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 13938.083272 (excluding connections establishing)
tps = 14143.241802 (excluding connections establishing)
tps = 14097.406758 (excluding connections establishing)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> I've revised the patch to add a new constant named
> LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash

Thank you, and good performance.  The patch passed make check.

I'm OK with the current patch, but I have a few comments.  Please take them as you see fit (I wouldn't mind if you
don't.)


(1)
+#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64

How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value to trigger shrinkage?  Code
in PostgreSQL seems to use the term threshold.


(2)
+/* Complain if the above are not set to something sane */
+#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
+#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
+#endif

I don't think these are necessary, because these are fixed and not configurable.  FYI, src/include/utils/memutils.h
doesn't have #error to test these macros.

#define ALLOCSET_DEFAULT_MINSIZE   0
#define ALLOCSET_DEFAULT_INITSIZE  (8 * 1024)
#define ALLOCSET_DEFAULT_MAXSIZE   (8 * 1024 * 1024)


(3)
+    if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+        hash_get_max_bucket(LockMethodLocalHash) >
+        LOCKMETHODLOCALHASH_SHRINK_SIZE)
+        CreateLocalLockHash();

I get an impression that Create just creates something where there's nothing.  How about Init or Recreate?


Regards
Takayuki Tsunakawa


On Mon, 17 Jun 2019 at 15:05, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> (1)
> +#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64
>
> How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value to trigger shrinkage?
> Code in PostgreSQL seems to use the term threshold.

That's probably better. I've renamed it to that.

> (2)
> +/* Complain if the above are not set to something sane */
> +#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
> +#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
> +#endif
>
> I don't think these are necessary, because these are fixed and not configurable.  FYI, src/include/utils/memutils.h
> doesn't have #error to test these macros.

Yeah. I was thinking it was overkill when I wrote it, but somehow
couldn't bring myself to remove it. Done now.

> (3)
> +       if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
> +               hash_get_max_bucket(LockMethodLocalHash) >
> +               LOCKMETHODLOCALHASH_SHRINK_SIZE)
> +               CreateLocalLockHash();
>
> I get an impression that Create just creates something where there's nothing.  How about Init or Recreate?

Renamed to InitLocalLockHash()

v5 is attached.

Thank you for the review.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> v5 is attached.


Thank you, looks good.  I find it ready for committer (I noticed the status is already set so.)


Regards
Takayuki Tsunakawa



On Thu, 27 Jun 2019 at 12:59, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> Thank you, looks good.  I find it ready for committer (I noticed the status is already set so.)

Thanks for looking.

I've just been looking at this again and I thought I'd better check
the performance of the worst case for the patch, where the hash table
is rebuilt each query.

To do this I first created a single column 70 partition partitioned
table ("p") and left it empty.

I then checked the performance of:

SELECT * FROM p;

Having 70 partitions means that the lock table's max bucket goes over
the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD which is set to 64 and
results in the table being rebuilt each time the query is run.

The performance was as follows:

70 partitions: LOCKMETHODLOCALHASH_SHRINK_THRESHOLD = 64

master + shrink_bloated_locallocktable_v5.patch:

ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 8427.053378 (excluding connections establishing)
tps = 8583.251821 (excluding connections establishing)
tps = 8569.587268 (excluding connections establishing)
tps = 8552.988483 (excluding connections establishing)
tps = 8527.735108 (excluding connections establishing)

master (93907478):

ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 8712.919411 (excluding connections establishing)
tps = 8760.190372 (excluding connections establishing)
tps = 8755.069470 (excluding connections establishing)
tps = 8747.389735 (excluding connections establishing)
tps = 8758.275202 (excluding connections establishing)

patched is 2.45% slower


If I increase the partition count to 140 and put the
LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up to 128, then the performance
is as follows:

master + shrink_bloated_locallocktable_v5.patch:

ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 2548.917856 (excluding connections establishing)
tps = 2561.283564 (excluding connections establishing)
tps = 2549.669870 (excluding connections establishing)
tps = 2421.971864 (excluding connections establishing)
tps = 2428.983660 (excluding connections establishing)

Master (93907478):

ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres
tps = 2605.407529 (excluding connections establishing)
tps = 2600.691426 (excluding connections establishing)
tps = 2594.123983 (excluding connections establishing)
tps = 2455.745644 (excluding connections establishing)
tps = 2450.061483 (excluding connections establishing)

patched is 1.54% slower

I'd rather not put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up any
higher than 128 since it can detract from the improvement we're trying
to make with this patch.

Now, this case of querying a partitioned table that happens to be
completely empty seems a bit unrealistic. Something more realistic
might be index scanning all partitions to find a value that only
exists in a single partition. Assuming the partitions actually have
some records, then that's going to be a more expensive query, so the
overhead of rebuilding the table will be less noticeable.

A previous version of the patch already had some heuristics to try
to only rebuild the hash table when it's likely beneficial. I'd rather
not go exploring in that area again.

Is anyone particularly concerned about the worst-case slowdown here
being about 1.54%?  The best case, and arguably a more realistic case
above, showed a 34% speedup.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On Thu, 18 Jul 2019 at 14:53, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Is anyone particularly concerned about the worst-case slowdown here
> being about 1.54%?  The best case, and arguably a more realistic case
> above, showed a 34% speedup.

I took a bit more time to test the performance on this. I thought I
might have been a bit unfair on the patch by giving it completely
empty tables to look at. It just seems too unrealistic to have a large
number of completely empty partitions. I decided to come up with a
more realistic case where there are 140 partitions but we're
performing an index scan to find a record that can exist in only 1 of
the 140 partitions.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
create index on hp (b);
insert into hp select x,x from generate_series(1, 140000) x;
analyze hp;

select3.sql: select * from hp where b = 1

Master:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2124.591367 (excluding connections establishing)
tps = 2158.754837 (excluding connections establishing)
tps = 2146.348465 (excluding connections establishing)
tps = 2148.995800 (excluding connections establishing)
tps = 2154.223600 (excluding connections establishing)

Patched:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2002.480729 (excluding connections establishing)
tps = 1997.272944 (excluding connections establishing)
tps = 1992.527079 (excluding connections establishing)
tps = 1995.789367 (excluding connections establishing)
tps = 2001.195760 (excluding connections establishing)

so it turned out it's even slower, and not by a small amount either!
Patched is 6.93% slower on average with this case :-(

I went back to the drawing board on this and I've added some code that
counts the number of times we've seen the table to be oversized and
just shrinks the table back down on the 1000th time.  6.93% / 1000 is
not all that much. Of course, not all the extra overhead might be from
rebuilding the table, so here's a test with the updated patch.

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2142.414323 (excluding connections establishing)
tps = 2139.666899 (excluding connections establishing)
tps = 2138.744789 (excluding connections establishing)
tps = 2138.300299 (excluding connections establishing)
tps = 2137.346278 (excluding connections establishing)

Just a 0.34% drop. Pretty hard to pick that out of the noise.
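
For reference, the counting idea described above amounts to roughly this (a
sketch with made-up names; the actual v6 patch may differ in its details):

#define MAX_OVERSIZED_BEFORE_SHRINK 1000    /* hypothetical constant */

static int  times_seen_oversized = 0;

/* At the end of LockReleaseAll(), once the table is known to be empty: */
if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
    hash_get_max_bucket(LockMethodLocalHash) >
    LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)   /* the existing 64-bucket threshold */
{
    if (++times_seen_oversized >= MAX_OVERSIZED_BEFORE_SHRINK)
    {
        times_seen_oversized = 0;
        InitLocalLockHash();    /* drop and rebuild at the initial 16 buckets */
    }
}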

Testing the original case that shows the speedup:

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

Master:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 10172.035036 (excluding connections establishing)
tps = 10192.780529 (excluding connections establishing)
tps = 10331.306003 (excluding connections establishing)

Patched:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 15080.765549 (excluding connections establishing)
tps = 14994.404069 (excluding connections establishing)
tps = 14982.923480 (excluding connections establishing)

That seems fine, 46% faster.

v6 is attached.

I plan to push this in a few days unless someone objects.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> I went back to the drawing board on this and I've added some code that counts
> the number of times we've seen the table to be oversized and just shrinks
> the table back down on the 1000th time.  6.93% / 1000 is not all that much.

I'm afraid this kind of hidden behavior would appear mysterious to users.  They may wonder "Why is the same query fast
at first in the session (the first 5 or 6 executions), then slower for a while, and then faster again?  Is there
something to tune?  Am I doing something wrong in my app (e.g. how to use prepared statements)?"  So I prefer v5.


> Of course, not all the extra overhead might be from rebuilding the table,
> so here's a test with the updated patch.

Where else does the extra overhead come from?


Regards
Takayuki Tsunakawa


On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> > I went back to the drawing board on this and I've added some code that counts
> > the number of times we've seen the table to be oversized and just shrinks
> > the table back down on the 1000th time.  6.93% / 1000 is not all that much.
>
> I'm afraid this kind of hidden behavior would appear mysterious to users.  They may wonder "Why is the same query
> fast at first in the session (the first 5 or 6 executions), then slower for a while, and then faster again?  Is there
> something to tune?  Am I doing something wrong in my app (e.g. how to use prepared statements)?"  So I prefer v5.

I personally don't think that's true.  The only way you'll notice the
LockReleaseAll() overhead is to execute very fast queries with a
bloated lock table.  It's pretty hard to notice that a single 0.1ms
query is slow. You'll need to execute thousands of them before you'll
be able to measure it, and once you've done that, the lock shrink code
will have run and the query will be performing optimally again.

I voiced my concerns with v5 and I wasn't really willing to push it
with a known performance regression of 7% in a fairly valid case. v6
does not suffer from that.

> > Of course, not all the extra overhead might be from rebuilding the table,
> > so here's a test with the updated patch.
>
> Where else does the extra overhead come from?

hash_get_num_entries(LockMethodLocalHash) == 0 &&
+ hash_get_max_bucket(LockMethodLocalHash) >
+ LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)

that's executed every time, not every 1000 times.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> I personally don't think that's true.  The only way you'll notice the
> LockReleaseAll() overhead is to execute very fast queries with a
> bloated lock table.  It's pretty hard to notice that a single 0.1ms
> query is slow. You'll need to execute thousands of them before you'll
> be able to measure it, and once you've done that, the lock shrink code
> will have run and the query will be performing optimally again.

Maybe so.  Will the difference be noticeable between plan_cache_mode=auto (default) and plan_cache_mode=force_custom_plan?


> I voiced my concerns with v5 and I wasn't really willing to push it
> with a known performance regression of 7% in a fairly valid case. v6
> does not suffer from that.

You're right.  We may have to accept the unpredictability this hidden behavior presents to users as a compromise for
higher throughput.


> > Where else does the extra overhead come from?
> 
> hash_get_num_entries(LockMethodLocalHash) == 0 &&
> + hash_get_max_bucket(LockMethodLocalHash) >
> + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
> 
> that's executed every time, not every 1000 times.

I see.  Thanks.


Regards
Takayuki Tsunakawa


On Mon, 22 Jul 2019 at 14:21, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> > I personally don't think that's true.  The only way you'll notice the
> > LockReleaseAll() overhead is to execute very fast queries with a
> > bloated lock table.  It's pretty hard to notice that a single 0.1ms
> > query is slow. You'll need to execute thousands of them before you'll
> > be able to measure it, and once you've done that, the lock shrink code
> > will have run and the query will be performing optimally again.
>
> Maybe so.  Will the difference be noticeable between plan_cache_mode=auto (default) and plan_cache_mode=force_custom_plan?

For the use case we've been measuring with partitioned tables and the
generic plan generation causing a sudden spike in the number of
obtained locks, then having plan_cache_mode = force_custom_plan will
cause the lock table not to become bloated. I'm not sure there's
anything interesting to measure there. The only additional code that
gets executed is the hash_get_num_entries() and possibly
hash_get_max_bucket. Maybe it's worth swapping the order of those
calls since most of the time the entry count will be 0 and the max bucket
count threshold won't be exceeded.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> For the use case we've been measuring with partitioned tables and the
> generic plan generation causing a sudden spike in the number of
> obtained locks, then having plan_cache_mode = force_custom_plan will
> cause the lock table not to become bloated. I'm not sure there's
> anything interesting to measure there.

I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ...
WHERE pkey = value):

* plan_cache_mode=force_custom_plan: LocalLockHash won't bloat.  The query execution time is steady.

* plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan.  The
generic plan is not adopted because its cost is high.  Later executions of the query will suffer from the bloat until
the 1006th execution, when LocalLockHash is shrunk.


Depending on the number of transactions and what each transaction does, I thought the difference might or might not be
noticeable.


Regards
Takayuki Tsunakawa



On Mon, 22 Jul 2019 at 16:36, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> > For the use case we've been measuring with partitioned tables and the
> > generic plan generation causing a sudden spike in the number of
> > obtained locks, then having plan_cache_mode = force_custom_plan will
> > cause the lock table not to become bloated. I'm not sure there's
> > anything interesting to measure there.
>
> I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ...
> WHERE pkey = value):
>
> * plan_cache_mode=force_custom_plan: LocalLockHash won't bloat.  The query execution time is steady.
>
> * plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan.  The
> generic plan is not adopted because its cost is high.  Later executions of the query will suffer from the bloat until
> the 1006th execution, when LocalLockHash is shrunk.

I measured this again in
https://www.postgresql.org/message-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=J-b5A2MegQ@mail.gmail.com
where I posted the v6 patch.  It's the final results in the email.   I
didn't measure plan_cache_mode = force_custom_plan. There'd be no lock
table bloat in that case and the additional overhead would just be
from the hash_get_num_entries() && hash_get_max_bucket() calls, which
the first results show next to no overhead for.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>
> From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> > I went back to the drawing board on this and I've added some code that counts
> > the number of times we've seen the table to be oversized and just shrinks
> > the table back down on the 1000th time.  6.93% / 1000 is not all that much.
>
> I'm afraid this kind of hidden behavior would appear mysterious to users.  They may wonder "Why is the same query
> fast at first in the session (the first 5 or 6 executions), then slower for a while, and then faster again?  Is there
> something to tune?  Am I doing something wrong in my app (e.g. how to use prepared statements)?"  So I prefer v5.

Another counter-argument to this is that there's already an
unexplainable slowdown after you run a query which obtains a large
number of locks in a session or use prepared statements and a
partitioned table with the default plan_cache_mode setting. Are we not
just righting a wrong here? Albeit, possibly 1000 queries later.

I am, of course, open to other ideas which solve the problem that v5
has, but failing that, I don't see v6 as all that bad.  At least all
the logic is contained in one function.  I know Tom wanted to steer
clear of heuristics to reinitialise the table, but most of the stuff
that was in the patch back when he complained was trying to track the
average number of locks over the previous N transactions, and his
concerns were voiced before I showed the 7% performance regression
with unconditionally rebuilding the table.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Hi,

On 2019-07-21 21:37:28 +1200, David Rowley wrote:
> select.sql:
> \set p 1
> select * from ht where a = :p
> 
> Master:
> 
> $ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 10172.035036 (excluding connections establishing)
> tps = 10192.780529 (excluding connections establishing)
> tps = 10331.306003 (excluding connections establishing)
> 
> Patched:
> 
> $ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 15080.765549 (excluding connections establishing)
> tps = 14994.404069 (excluding connections establishing)
> tps = 14982.923480 (excluding connections establishing)
> 
> That seems fine, 46% faster.
> 
> v6 is attached.
> 
> I plan to push this in a few days unless someone objects.

It does seem far less objectionable than the other case.  I hate to
throw one more wrench into a topic finally making progress, but: Have
either of you considered just replacing the dynahash table with a
simplehash style one?  Given the obvious speed sensitivity, and the fact
that for it (in contrast to the shared lock table) no partitioning is
needed, that seems like a good thing to try.  It seems quite possible
that both the iteration and plain manipulations are going to be faster,
due to far fewer indirections - e.g. the iteration through the array will
just be an array walk with a known stride, far easier for the CPU to
prefetch.
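
Not the actual simplehash.h instantiation, but to illustrate the layout being
suggested: the entries (status byte, cached hash, payload) sit inline in one
array, so releasing everything is a plain array walk (names here are
illustrative):

/* Illustrative only; simplehash.h generates comparable structures via macros. */
typedef struct LocalLockEnt
{
    char        status;     /* empty / in-use marker, kept inline */
    uint32      hash;       /* cached hash of the lock tag */
    LOCALLOCK   lock;       /* payload stored directly, no extra pointer hop */
} LocalLockEnt;

static LocalLockEnt *entries;   /* one flat allocation */
static uint32        tb_size;   /* current capacity */

/* Iteration is a straight, prefetch-friendly scan with a known stride: */
for (uint32 i = 0; i < tb_size; i++)
{
    if (entries[i].status == 0)     /* illustrative: 0 means "empty" */
        continue;
    /* ... release entries[i].lock ... */
}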

Greetings,

Andres Freund



From: David Rowley [mailto:david.rowley@2ndquadrant.com]
> Another counter-argument to this is that there's already an
> unexplainable slowdown after you run a query which obtains a large
> number of locks in a session or use prepared statements and a
> partitioned table with the default plan_cache_mode setting. Are we not
> just righting a wrong here? Albeit, possibly 1000 queries later.
> 
> I am, of course, open to other ideas which solve the problem that v5
> has, but failing that, I don't see v6 as all that bad.  At least all
> the logic is contained in one function.  I know Tom wanted to steer
> clear of heuristics to reinitialise the table, but most of the stuff
> that was in the patch back when he complained was trying to track the
> average number of locks over the previous N transactions, and his
> concerns were voiced before I showed the 7% performance regression
> with unconditionally rebuilding the table.

I think I understand what you mean.  Sorry, I don't have a better idea.  This unexplainability is probably what we
should accept to avoid the shocking 7% slowdown.

OTOH, how about my original patch that is based on the local lock list?  I expect that it won't show that significant a
slowdown in the same test case.  If it's not satisfactory, then v6 is the best to commit.


Regards
Takayuki Tsunakawa



On Tue, 23 Jul 2019 at 15:47, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> OTOH, how about my original patch that is based on the local lock list?  I expect that it won't show that significant
> a slowdown in the same test case.  If it's not satisfactory, then v6 is the best to commit.

I think we need to move beyond the versions that have been reviewed
and rejected. I don't think the reasons for their rejection will have
changed.

I've attached v7, which really is v6 with some comments adjusted and
the order of the hash_get_num_entries and hash_get_max_bucket function
calls swapped.  I think hash_get_num_entries() will return 0 most of
the time where we're calling it, so it makes sense to put the
condition that's less likely to be true first in the if condition.

I'm pretty happy with v7 now. If anyone has any objections to it,
please speak up very soon.  Otherwise, I plan to push it about this
time tomorrow.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
David Rowley <david.rowley@2ndquadrant.com> writes:
> I've attached v7, which really is v6 with some comments adjusted and
> the order of the hash_get_num_entries and hash_get_max_bucket function
> calls swapped.  I think hash_get_num_entries() will return 0 most of
> the time where we're calling it, so it makes sense to put the
> condition that's less likely to be true first in the if condition.

> I'm pretty happy with v7 now. If anyone has any objections to it,
> please speak up very soon.  Otherwise, I plan to push it about this
> time tomorrow.

I dunno, this seems close to useless in this form.  As it stands,
once hash_get_max_bucket has exceeded the threshold, you will
arbitrarily reset the table 1000 transactions later (since the
max bucket is certainly not gonna decrease otherwise).  So that's
got two big problems:

1. In the assumed-common case where most of the transactions take
few locks, you wasted cycles for 999 transactions.

2. You'll reset the table independently of subsequent history,
even if the session's usage is pretty much always over the
threshold.  Admittedly, if you do this only once per 1K
transactions, it's probably not a horrible overhead --- but you
can't improve point 1 without making it a bigger overhead.

I did complain about the cost of tracking the stats proposed by
some earlier patches, but I don't think the answer to that is to
not track any stats at all.  The proposed hash_get_max_bucket()
function is quite cheap, so I think it wouldn't be out of line to
track the average value of that at transaction end over the
session's lifespan, and reset if the current value is more than
some-multiple of the average.

The tricky part here is that if some xact kicks that up to
say 64 entries, and we don't choose to reset, then the reading
for subsequent transactions will be 64 even if they use very
few locks.  So we probably need to not use a plain average,
but account for that effect somehow.  Maybe we could look at
how quickly the number goes up after we reset?

[ thinks for awhile... ]  As a straw-man proposal, I suggest
the following (completely untested) idea:

* Make the table size threshold variable, not constant.

* If, at end-of-transaction when the table is empty,
the table bucket count exceeds the threshold, reset
immediately; but if it's been less than K transactions
since the last reset, increase the threshold (by say 10%).

I think K can be a constant; somewhere between 10 and 100 would
probably work.  At process start, we should act as though the last
reset were more than K transactions ago (i.e., don't increase the
threshold at the first reset).
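
Read literally, that straw-man might look about like this (untested sketch,
hypothetical names, with K as a constant in the 10..100 range):

#define SHRINK_RESET_INTERVAL 50        /* "K" transactions */

static long shrink_threshold = 16;      /* variable threshold, starts small */
static int  xacts_since_reset = SHRINK_RESET_INTERVAL; /* as if the last reset
                                                         * were long ago */

/* At end of transaction, once the local lock table is known to be empty: */
xacts_since_reset++;
if (hash_get_max_bucket(LockMethodLocalHash) > shrink_threshold)
{
    if (xacts_since_reset < SHRINK_RESET_INTERVAL)
        shrink_threshold += shrink_threshold / 10;  /* raise threshold ~10% */
    InitLocalLockHash();                            /* reset immediately */
    xacts_since_reset = 0;
}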

The main advantage this has over v7 is that we don't have the
1000-transaction delay before reset, which ISTM is giving up
much of the benefit of the whole idea.  Also, if the session
is consistently using lots of locks, we'll adapt to that after
awhile and not do useless table resets.

            regards, tom lane



Thanks for having a look at this.

On Wed, 24 Jul 2019 at 04:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > I'm pretty happy with v7 now. If anyone has any objections to it,
> > please speak up very soon.  Otherwise, I plan to push it about this
> > time tomorrow.
>
> I dunno, this seems close to useless in this form.  As it stands,
> once hash_get_max_bucket has exceeded the threshold, you will
> arbitrarily reset the table 1000 transactions later (since the
> max bucket is certainly not gonna decrease otherwise).  So that's
> got two big problems:
>
> 1. In the assumed-common case where most of the transactions take
> few locks, you wasted cycles for 999 transactions.
>
> 2. You'll reset the table independently of subsequent history,
> even if the session's usage is pretty much always over the
> threshold.  Admittedly, if you do this only once per 1K
> transactions, it's probably not a horrible overhead --- but you
> can't improve point 1 without making it a bigger overhead.

This is true, but I think you might be overestimating just how much
effort is wasted with #1. We're only seeing this overhead in small,
very fast-to-execute xacts. In the test case in [1], I was getting
about 10k TPS unpatched and about 15k patched. This means that, on
average, an unpatched xact takes about 100 microseconds and a patched
xact takes about 66 microseconds, so the additional time spent doing the
hash_seq_search() must be about 34 microseconds. So we'll waste a
total of 34 *milliseconds* if we wait for 1000 xacts before we reset
the table.  With 10k TPS we're going to react to the change in 0.1
seconds.

I think you'd struggle to measure that 1 xact is taking 34
microseconds longer without running a few thousand queries. My view is
that nobody is ever going to notice that it takes 1k xacts to shrink
the table, and I've already shown that the overhead of the shrink
every 1k xacts is tiny. I mentioned 0.34% in [1] using v6. This is
likely a bit smaller in v7 due to swapping the order of the if
condition to put the less likely case first. Since the overhead of
rebuilding the table was 7% when done every xact, then it stands to
reason that this has become 0.007% doing it every 1k xacts and that the
additional overhead to make up that 0.34% is from testing if the reset
condition has been met (or noise).  That's not something we can remove
completely with any solution that resets the hash table.

> I did complain about the cost of tracking the stats proposed by
> some earlier patches, but I don't think the answer to that is to
> not track any stats at all.  The proposed hash_get_max_bucket()
> function is quite cheap, so I think it wouldn't be out of line to
> track the average value of that at transaction end over the
> session's lifespan, and reset if the current value is more than
> some-multiple of the average.
>
> The tricky part here is that if some xact kicks that up to
> say 64 entries, and we don't choose to reset, then the reading
> for subsequent transactions will be 64 even if they use very
> few locks.  So we probably need to not use a plain average,
> but account for that effect somehow.  Maybe we could look at
> how quickly the number goes up after we reset?
>
> [ thinks for awhile... ]  As a straw-man proposal, I suggest
> the following (completely untested) idea:
>
> * Make the table size threshold variable, not constant.
>
> * If, at end-of-transaction when the table is empty,
> the table bucket count exceeds the threshold, reset
> immediately; but if it's been less than K transactions
> since the last reset, increase the threshold (by say 10%).
>
> I think K can be a constant; somewhere between 10 and 100 would
> probably work.  At process start, we should act as though the last
> reset were more than K transactions ago (i.e., don't increase the
> threshold at the first reset).

I think the problem with this idea is that once the threshold has been
enlarged, there is no way to recover and work well again for workloads
that require very few locks. If we end up with some large value for
the variable threshold, there's no way to decrease it again.  All this
method stops is the needless hash table resets
if the typical case requires many locks. The only way to know if we
can reduce the threshold again is to count the locks released during
LockReleaseAll().  Some version of the patch did that, and you
objected.

> The main advantage this has over v7 is that we don't have the
> 1000-transaction delay before reset, which ISTM is giving up
> much of the benefit of the whole idea.  Also, if the session
> is consistently using lots of locks, we'll adapt to that after
> awhile and not do useless table resets.

True, but you neglected to mention the looming and critical drawback,
which pretty much makes that idea useless. All we'd need to do is give
this a workload that throws that variable threshold up high so that it
can't recover. It would be pretty simple then to show that
LockReleaseAll() is still slow with workloads that just require a
small number of locks... permanently with no means to recover.

To be able to reduce the threshold down again we'd need to make a
hash_get_num_entries(LockMethodLocalHash) call before performing the
guts of LockReleaseAll(). We could then weight that onto some running
average counter with a weight of, say... 10, so we react to changes
fairly quickly, but not instantly. We could then have some sort of
logic like "rebuild the hash table if running average 4 times less
than max_bucket"

I've attached a spreadsheet of that idea and the algorithm we could
use to track the running average.  Initially, I've mocked up a
series of transactions that use 1000 locks, then at row 123 dropped
that to 10 locks. If we assume the max_bucket is 1000, then it takes
until row 136 for the running average to drop below the max_bucket
count, i.e. 13 xacts. We'd reset there to the init size of 16. If
the average went up again, then we'd automatically expand the table as
we do now.  To make this work we'd need an additional call to
hash_get_num_entries(), before we release the locks, so there is more
overhead.
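
For anyone following along, a standalone C sketch of that heuristic is
below.  The weight of 10 and the "4 times" factor are just the numbers
from the text above; num_locks would come from hash_get_num_entries()
and max_bucket from the proposed hash_get_max_bucket().  Names are
illustrative only, not lock.c code.

#include <stdbool.h>

static double running_avg_locks = 0.0;

bool
should_rebuild_local_lock_table(long num_locks, long max_bucket)
{
    /* exponential running average with a weight of 10 */
    running_avg_locks -= running_avg_locks / 10.0;
    running_avg_locks += num_locks / 10.0;

    /* rebuild if the table has grown well beyond what we typically need */
    return (double) max_bucket > running_avg_locks * 4.0;
}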

[1] https://www.postgresql.org/message-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=J-b5A2MegQ@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
On Wed, 24 Jul 2019 at 15:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
> To be able to reduce the threshold down again we'd need to make a
> hash_get_num_entries(LockMethodLocalHash) call before performing the
> guts of LockReleaseAll(). We could then weight that onto some running
> average counter with a weight of, say... 10, so we react to changes
> fairly quickly, but not instantly. We could then have some sort of
> logic like "rebuild the hash table if running average 4 times less
> than max_bucket"
>
> I've attached a spreadsheet of that idea and the algorithm we could
> use to track the running average.  Initially, I've mocked it up a
> series of transactions that use 1000 locks, then at row 123 dropped
> that to 10 locks. If we assume the max_bucket is 1000, then it takes
> until row 136 for the running average to drop below the max_bucket
> count, i.e 13 xacts. There we'd reset there at the init size of 16. If
> the average went up again, then we'd automatically expand the table as
> we do now.  To make this work we'd need an additional call to
> hash_get_num_entries(), before we release the locks, so there is more
> overhead.

Here's a patch with this implemented. I've left a NOTICE in there to
make it easier for people to follow along at home and see when the
lock table is reset.

There will be a bit of additional overhead to the reset detection
logic over the v7 patch. Namely: additional hash_get_num_entries()
call before releasing the locks, and more complex, floating-point
maths instead of the simpler integer maths in v7.

Here's a demo with the debug NOTICE in there to show us what's going on.

-- setup
create table a (a int) partition by range (a);
select 'create table a'||x||' partition of a for values from('||x||')
to ('||x+1||');' from generate_Series(1,1000) x;
\gexec

$ psql postgres
NOTICE:  max_bucket = 15, threshold = 64.000000, running_avg_locks
0.100000 Reset? No
psql (13devel)
# \o /dev/null
# select * from a where a < 100;
NOTICE:  max_bucket = 101, threshold = 64.000000, running_avg_locks
10.090000 Reset? Yes
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 76.324005, running_avg_locks
19.081001 Reset? Yes
# select * from a where a < 100;

A couple of needless resets there... Maybe we can get rid of those by
setting the initial running average up to something higher than 0.0.

NOTICE:  max_bucket = 99, threshold = 108.691605, running_avg_locks
27.172901 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 137.822449, running_avg_locks
34.455612 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 164.040207, running_avg_locks
41.010052 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 187.636185, running_avg_locks
46.909046 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 208.872559, running_avg_locks
52.218140 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 227.985306, running_avg_locks
56.996326 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 245.186768, running_avg_locks
61.296692 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 260.668091, running_avg_locks
65.167023 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 274.601288, running_avg_locks
68.650322 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 287.141174, running_avg_locks
71.785294 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 298.427063, running_avg_locks
74.606766 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 308.584351, running_avg_locks
77.146088 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 317.725922, running_avg_locks
79.431480 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 325.953339, running_avg_locks
81.488335 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 333.358002, running_avg_locks
83.339500 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 340.022217, running_avg_locks
85.005554 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 346.019989, running_avg_locks
86.504997 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 351.417999, running_avg_locks
87.854500 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 356.276184, running_avg_locks
89.069046 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 360.648560, running_avg_locks
90.162140 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 364.583710, running_avg_locks
91.145927 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 368.125336, running_avg_locks
92.031334 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 371.312805, running_avg_locks
92.828201 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 374.181519, running_avg_locks
93.545380 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 376.763367, running_avg_locks
94.190842 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 379.087036, running_avg_locks
94.771759 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 381.178345, running_avg_locks
95.294586 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 383.060516, running_avg_locks
95.765129 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 384.754456, running_avg_locks
96.188614 Reset? No
# select * from a where a < 100;
NOTICE:  max_bucket = 99, threshold = 386.279022, running_avg_locks
96.569756 Reset? No

-- Here I switch to only selecting from 9 partitions instead of 99.

# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 351.651123, running_avg_locks
87.912781 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 320.486023, running_avg_locks
80.121506 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 292.437408, running_avg_locks
73.109352 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 267.193665, running_avg_locks
66.798416 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 244.474304, running_avg_locks
61.118576 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 224.026871, running_avg_locks
56.006718 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 205.624176, running_avg_locks
51.406044 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 189.061752, running_avg_locks
47.265438 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 174.155579, running_avg_locks
43.538895 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 160.740021, running_avg_locks
40.185005 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 148.666016, running_avg_locks
37.166504 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 137.799408, running_avg_locks
34.449852 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 128.019470, running_avg_locks
32.004868 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 119.217522, running_avg_locks
29.804380 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 111.295769, running_avg_locks
27.823942 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 104.166191, running_avg_locks
26.041548 Reset? No
# select * from a where a < 10;
NOTICE:  max_bucket = 99, threshold = 97.749573, running_avg_locks
24.437393 Reset? Yes

It took 17 xacts to react to the change and reset the lock table.

# select * from a where a < 10;
NOTICE:  max_bucket = 15, threshold = 91.974617, running_avg_locks
22.993654 Reset? No

notice max_bucket is back at 15 again.

Any thoughts on this?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
On Wed, 24 Jul 2019 at 16:16, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Wed, 24 Jul 2019 at 15:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > To be able to reduce the threshold down again we'd need to make a
> > hash_get_num_entries(LockMethodLocalHash) call before performing the
> > guts of LockReleaseAll(). We could then weight that onto some running
> > average counter with a weight of, say... 10, so we react to changes
> > fairly quickly, but not instantly. We could then have some sort of
> > logic like "rebuild the hash table if running average 4 times less
> > than max_bucket"
> >
> > I've attached a spreadsheet of that idea and the algorithm we could
> > use to track the running average.  Initially, I've mocked it up a
> > series of transactions that use 1000 locks, then at row 123 dropped
> > that to 10 locks. If we assume the max_bucket is 1000, then it takes
> > until row 136 for the running average to drop below the max_bucket
> > count, i.e 13 xacts. There we'd reset there at the init size of 16. If
> > the average went up again, then we'd automatically expand the table as
> > we do now.  To make this work we'd need an additional call to
> > hash_get_num_entries(), before we release the locks, so there is more
> > overhead.
>
> Here's a patch with this implemented. I've left a NOTICE in there to
> make it easier for people to follow along at home and see when the
> lock table is reset.

Here's a more polished version with the debug code removed, complete
with benchmarks.

-- Test 1. Select 1 record from a 140 partitioned table. Tests
creating a large number of locks with a fast query.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
\gexec
create index on hp (b);
insert into hp select x,x from generate_series(1, 140000) x;
analyze hp;

select3.sql: select * from hp where b = 1

-
Master:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2098.628975 (excluding connections establishing)
tps = 2101.797672 (excluding connections establishing)
tps = 2085.317292 (excluding connections establishing)
tps = 2094.931999 (excluding connections establishing)
tps = 2092.328908 (excluding connections establishing)

Patched:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2101.691459 (excluding connections establishing)
tps = 2104.533249 (excluding connections establishing)
tps = 2106.499123 (excluding connections establishing)
tps = 2104.033459 (excluding connections establishing)
tps = 2105.463629 (excluding connections establishing)

(I'm surprised there is not more overhead in the additional tracking
added to calculate the running average)

-- Test 2. Tests a prepared query which will build a generic plan on
the 6th execution then fall back on a custom plan due to it pruning all
but one partition.  Master suffers from the lock table becoming
bloated after locking all partitions when planning the generic plan.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

Master:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 10207.780843 (excluding connections establishing)
tps = 10205.772688 (excluding connections establishing)
tps = 10214.896449 (excluding connections establishing)
tps = 10157.572153 (excluding connections establishing)
tps = 10147.764477 (excluding connections establishing)

Patched:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 14677.636570 (excluding connections establishing)
tps = 14661.437186 (excluding connections establishing)
tps = 14647.202877 (excluding connections establishing)
tps = 14784.165753 (excluding connections establishing)
tps = 14850.355344 (excluding connections establishing)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
David Rowley <david.rowley@2ndquadrant.com> writes:
> Here's a more polished version with the debug code removed, complete
> with benchmarks.

A few gripes:

You're measuring the number of locks held at completion of the
transaction, which fails to account for locks transiently taken and
released, so that the actual peak usage will be more.  I think we mostly
only do that for system catalog accesses, so *maybe* the number of extra
locks involved isn't very large, but that's a very shaky assumption.

I don't especially like the fact that this seems to have a hard-wired
(and undocumented) assumption that buckets == entries, ie that the
fillfactor of the table is set at 1.0.  lock.c has no business knowing
that.  Perhaps instead of returning the raw bucket count, you could have
dynahash.c return bucket count times fillfactor, so that the number is in
the same units as the entry count.
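
In other words (toy types only here, since HASHHDR is private to
dynahash.c), the accessor being suggested would amount to something
like the following; the real version would live inside dynahash.c.

typedef struct ToyHashHeader
{
    long    max_bucket;     /* highest bucket index in use */
    long    ffactor;        /* target fill factor: entries per bucket */
} ToyHashHeader;

long
toy_hash_get_max_entries(const ToyHashHeader *hctl)
{
    /* report capacity in entry units, not raw buckets */
    return (hctl->max_bucket + 1) * hctl->ffactor;
}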

This:
        running_avg_locks -= running_avg_locks / 10.0;
        running_avg_locks += numLocksHeld / 10.0;
seems like a weird way to do the calculation.  Personally I'd write
        running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0;
which is more the way I'm used to seeing exponential moving averages
computed --- at least, it seems clearer to me why this will converge
towards the average value of numLocksHeld over time.  It also makes
it clear that it wouldn't be sane to use two different divisors,
whereas your formulation makes it look like maybe they could be
set independently.

Your compiler might not complain that LockReleaseAll's numLocksHeld
is potentially uninitialized, but other people's compilers will.


On the whole, I don't especially like this approach, because of the
confusion between peak lock count and end-of-xact lock count.  That
seems way too likely to cause problems.

            regards, tom lane



On Thu, Jul 25, 2019 at 5:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > Here's a more polished version with the debug code removed, complete
> > with benchmarks.
>
> A few gripes:
>
> [gripes]

Based on the above, I've set this to "Waiting on Author", in the next CF.

-- 
Thomas Munro
https://enterprisedb.com



On Thu, 25 Jul 2019 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> On the whole, I don't especially like this approach, because of the
> confusion between peak lock count and end-of-xact lock count.  That
> seems way too likely to cause problems.

Thanks for having a look at this.  I've not addressed the points
you've mentioned due to what you mention above.  The only way I can
think of so far to resolve that would be to add something to track
peak lock usage.  The best I can think of to do that, short of adding
something to dynahash.c, is to check how many locks are held each time
we obtain a lock and, if that count is higher than the previous time
we checked, update the maximum locks held (probably in a global
variable).   That seems pretty horrible to me and adds overhead each
time we obtain a lock, which is a pretty performance-critical path.

I've not tested what Andres mentioned about simplehash instead of
dynahash yet. I did a quick scan of simplehash and it looked like
SH_START_ITERATE would suffer the same problems as dynahash's
hash_seq_search(), albeit, perhaps with some more efficient memory
lookups. i.e it still has to skip over empty buckets, which might be
costly in a bloated table.

For now, I'm out of ideas. If anyone else feels like suggesting
something of picking this up, feel free.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



On Wed, Aug 14, 2019 at 07:25:10PM +1200, David Rowley wrote:
>On Thu, 25 Jul 2019 at 05:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> On the whole, I don't especially like this approach, because of the
>> confusion between peak lock count and end-of-xact lock count.  That
>> seems way too likely to cause problems.
>
>Thanks for having a look at this.  I've not addressed the points
>you've mentioned due to what you mention above.  The only way I can
>think of so far to resolve that would be to add something to track
>peak lock usage.  The best I can think of to do that, short of adding
>something to dynahash.c is to check how many locks are held each time
>we obtain a lock, then if that count is higher than the previous time
>we checked, then update the maximum locks held, (probably a global
>variable).   That seems pretty horrible to me and adds overhead each
>time we obtain a lock, which is a pretty performance-critical path.
>

Would it really be a measurable overhead? I mean, we only really need
one int counter, and you don't need to do the check on every lock
acquisition - you just need to recheck on the first lock release. But
maybe I'm underestimating how expensive it is ...
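
As a standalone toy of that bookkeeping (illustrative names only, not
lock.c code), one counter bumped on acquire and the peak sampled on
release would be enough:

static long locks_held = 0;
static long peak_locks_this_xact = 0;

void
toy_lock_acquired(void)
{
    locks_held++;               /* nothing more needed on the hot path */
}

void
toy_lock_released(void)
{
    /*
     * The count can only have grown since the previous release, so
     * checking here (rather than on every acquisition) is enough to
     * catch the peak.
     */
    if (locks_held > peak_locks_this_xact)
        peak_locks_this_xact = locks_held;
    locks_held--;
}

/*
 * At end of transaction the high-water mark is
 * Max(peak_locks_this_xact, locks_held); reset both afterwards.
 */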

Talking about dynahash - doesn't it already track this information?
Maybe not directly but surely it has to track the number of entries in
the hash table, in order to compute fill factor. Can't we piggy-back on
that and track the highest fill-factor for a particular period of time?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



On 2019-Aug-14, David Rowley wrote:

> For now, I'm out of ideas. If anyone else feels like suggesting
> something of picking this up, feel free.

Hmm ... is this patch rejected, or is somebody still trying to get it to
committable state?  David, you're listed as committer.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Speed up transaction completion faster after many relations are accessed in a transaction

From
"Tsunakawa, Takayuki"
Date:
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
> Hmm ... is this patch rejected, or is somebody still trying to get it to
> committable state?  David, you're listed as committer.

I don't think it's rejected.  It would be a pity (mottainai) to refuse this, because it provides significant speedup
despite its simple modification.

Again, I think the v2 patch is OK.  Tom's comment was as follows:


[Tom's comment against v2]
----------------------------------------
FWIW, I tried this patch against current HEAD (959d00e9d).
Using the test case described by Amit at
<be25cadf-982e-3f01-88b4-443a6667e16a(at)lab(dot)ntt(dot)co(dot)jp>
I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case.
(I had to increase max_locks_per_transaction to get it to run
at all.)  We should not be using that sort of edge case to drive
performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192
to something more real-world, like 128, I do still measure a
performance gain, but it's ~ 1.5% which is below what I'd consider
a reproducible win.  I'm accustomed to seeing changes up to 2%
in narrow benchmarks like this one, even when "nothing changes"
except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with
one client and an -s 10 database), it seems that the patch is about
0.5% slower than HEAD.  Again, that's below the noise threshold,
but it's not promising for the net effects of this patch on workloads
that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on
sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
bytes, a 22% increase.  That's a lot if you're considering cases
with many locks.

On the whole I don't think there's an adequate case for committing
this patch.

I'd also point out that this is hardly the only place where we've
seen hash_seq_search on nearly-empty hash tables become a bottleneck.
So I'm not thrilled about attacking that with one-table-at-time patches.
I'd rather see us do something to let hash_seq_search win across
the board.
----------------------------------------


* Extreme test case: 
Not extreme.  Two of our customers, who are considering using PostgreSQL, are using thousands of partitions now.  We
hit this issue -- a point query gets nearly 20% slower after automatically creating a generic plan.  That's the reason
for this proposal.

* 0.5% slowdown with pgbench:
I think it's below the noise, as Tom said.

* sizeof(LOCALLOCK):
As Andres replied to Tom in the immediately following mail, LOCALLOCK was bigger in PG 11.

* Use case is narrow:
No.  The bloated LockMethodLocalHash affects the performance of the items below as well as transaction commit/abort:
  - AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice in PREPARE!
  - LockReleaseSession: advisory lock
  - LockReleaseCurrentOwner: ??
  - LockReassignCurrentOwner: ??


Regards
Takayuki Tsunakawa


On 2019-Sep-03, Tsunakawa, Takayuki wrote:

> From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
> > Hmm ... is this patch rejected, or is somebody still trying to get it to
> > committable state?  David, you're listed as committer.
> 
> I don't think it's rejected.  It would be a pity (mottainai) to refuse
> this, because it provides significant speedup despite its simple
> modification.

I don't necessarily disagree with your argumentation, but Travis is
complaining thusly:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -Wall
-Werror -I../../../../src/include -I/usr/include/x86_64-linux-gnu -D_GNU_SOURCE -c -o lock.o lock.c

lock.c:486:1: error: conflicting types for ‘TryShrinkLocalLockHash’
 TryShrinkLocalLockHash(long numLocksHeld)
 ^
lock.c:351:20: note: previous declaration of ‘TryShrinkLocalLockHash’ was here
 static inline void TryShrinkLocalLockHash(void);
                    ^
<builtin>: recipe for target 'lock.o' failed

Please fix.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Speed up transaction completion faster after many relations are accessed in a transaction

From
"Tsunakawa, Takayuki"
Date:
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
> On 2019-Sep-03, Tsunakawa, Takayuki wrote:
> > I don't think it's rejected.  It would be a pity (mottainai) to refuse
> > this, because it provides significant speedup despite its simple
> > modification.
> 
> I don't necessarily disagree with your argumentation, but Travis is
> complaining thusly:

I tried to revise David's latest patch (v8) and address Tom's comments in his last mail.  But I'm a bit at a loss.

First, to accurately count the maximum number of acquired locks in a transaction, we need to track the maximum entries
in the hash table, and make it available via a new function like hash_get_max_entries().  However, to cover the shared
partitioned hash table (that is not necessary for LockMethodLocalHash), we must add a spin lock in hashhdr and
lock/unlock it when entering and removing entries in the hash table.  It spoils the effort to decrease contention by
hashhdr->freelists[].mutex.  Do we want to track the maximum number of acquired locks in a global variable in lock.c,
not in the hash table?
 

Second, I couldn't understand the comment about the fill factor well.  I can understand that it's not correct to
compare the number of hash buckets and the number of locks.  But what can we do?


I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's approach is based on the database textbook
and seems intuitive.  So I attached the rebased version.


Regards
Takayuki Tsunakawa



Attachment
On Thu, Sep 26, 2019 at 07:11:53AM +0000, Tsunakawa, Takayuki wrote:
> I'm sorry to repeat what I mentioned in my previous mail, but my v2
> patch's approach is based on the database textbook and seems
> intuitive.  So I attached the rebased version.

If you wish to do so, that's fine by me but I have not dived into the
details of the thread much.  Please note anyway that the patch does not
apply anymore and that it needs a rebase.  So for now I have moved the
patch to next CF, waiting on author.
--
Michael

Attachment
Hi, the patch was in WoA since December, waiting for a rebase. I've
marked it as returned with feedback. Feel free to re-submit an updated
version into the next CF.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, 14 Aug 2019 at 19:25, David Rowley <david.rowley@2ndquadrant.com> wrote:
> For now, I'm out of ideas. If anyone else feels like suggesting
> something of picking this up, feel free.

This is a pretty old thread, so we might need a recap:

# Recap

Basically LockReleaseAll() becomes slow after a large number of locks
have all been held at once by a backend.  The problem is that the
LockMethodLocalHash dynahash table must grow to store all the locks
and when later transactions only take a few locks, LockReleaseAll() is
slow due to hash_seq_search() having to skip the sparsely filled
buckets in the bloated hash table.

The following things were tried on this thread. Each one failed:

1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK.
Simply loop over the dlist in LockReleaseAll().
2) Try dropping and recreating the dynahash table if it becomes
bloated using some heuristics to decide what "bloated" means and if
recreating is worthwhile.

#1 failed due to concerns with increasing the size of LOCALLOCK to
store the dlist pointers. Performance regressions were seen too.
Possibly due to size increase or additional overhead from pushing onto
the dlist.
#2 failed because it was difficult to agree on what the heuristics
would be and we had no efficient way to determine the maximum number
of locks that a given transaction held at any one time. We only know
how many were still held at LockReleaseAll().

There were also some suggestions to fix dynahash's hash_seq_search()
slowness, and also a suggestion to try using simplehash.h instead of
dynahash.c. Unfortunately simplehash.h would suffer the same issues as
it too would have to skip over empty buckets in a sparsely populated
hash table.

I'd like to revive this effort as I have a new idea on how to solve the problem.

# Background

Over in [1] I'm trying to improve the performance of smgropen() during
recovery.  The slowness there comes from the dynahash table lookups to
find the correct SMgrRelation. Over there I proposed to use simplehash
instead of dynahash because it's quite a good bit faster and far
lessens the hash lookup overhead during recovery.  One problem on that
thread is that relcache keeps a pointer into the SMgrRelation
(RelationData.rd_smgr) and because simplehash moves things around
during inserts and deletes, then we can't have anything point to
simplehash entries, they're unstable.  I fixed that over on the other
thread by having the simplehash entry point to a palloced
SMgrRelationData... My problem is, I don't really like that idea as it
means we need to palloc() pfree() lots of little chunks of memory.

To fix the above, I really think we need a version of simplehash that
has stable pointers.  Providing that implementation is faster than
dynahash, then it will help solve the smgropen() slowness during
recovery.

# A new hashtable implementation

I ended up thinking of this thread again because the implementation of
the stable pointer hash that I ended up writing for [1] happens to be
lightning fast for hash table sequential scans, even if the table has
become bloated.  The reason the seq scans are so fast is that the
implementation loops over the data arrays, which are tightly packed
and store the actual data rather than pointers to the data. The code
does not need to loop over the bucket array for this at all, so how
large that has become is irrelevant to hash table seq scan
performance.

The patch stores elements in "segments" which is set to some power of
2 value.  When we run out of space to store new items in a segment, we
just allocate another segment.  When we remove items from the table,
new items reuse the first unused item in the first segment with free
space.  This helps to keep the used elements tightly packed.  A
segment keeps a bitmap of used items so that means scanning all used
items is very fast.  If you flip the bits in the used_item bitmap,
then you get a free list for that segment, so it's also very fast to
find a free element when inserting a new item into the table.
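
A standalone toy of that segment layout (one 64-bit bitmap word for
brevity; the names and sizes are illustrative, not the patch's):

#include <stdint.h>

#define TOY_ITEMS_PER_SEGMENT 64

typedef struct ToySegment
{
    uint64_t    used_items;                   /* bit i set => items[i] in use */
    int         items[TOY_ITEMS_PER_SEGMENT]; /* densely packed element data */
} ToySegment;

/* First free slot in the segment, or -1 if the segment is full. */
int
toy_first_free_item(const ToySegment *seg)
{
    uint64_t    free_bits = ~seg->used_items; /* flip: set bits are free slots */

    if (free_bits == 0)
        return -1;
    return __builtin_ctzll(free_bits);        /* GCC/Clang count-trailing-zeros */
}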

I've called the hash table implementation "generichash". It uses the
same preprocessor tricks as simplehash.h does and borrows the same
linear probing code that's used in simplehash.   The bucket array just
stores the hash value and a uint32 index into the segment item that
stores the data. Since segments store a power of 2 items, we can
easily address both the segment number and the item within the segment
from the single uint32 index value. The 'index' field just stores a
special value when the bucket is empty. No need to add another field
for that.  This means the bucket type is just 8 bytes wide.
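
And, in the same toy terms, the 8-byte bucket and the index translation
described above could look like this (again, illustrative only):

#include <stdint.h>

#define TOY_ITEMS_PER_SEGMENT  256          /* must be a power of two */
#define TOY_EMPTY_BUCKET       UINT32_MAX   /* special index value: bucket unused */

typedef struct ToyBucket
{
    uint32_t    hashvalue;                  /* cached hash of the stored element */
    uint32_t    index;                      /* which segment item holds the data */
} ToyBucket;                                /* 8 bytes wide */

/* Translate the single uint32 index into (segment number, item in segment). */
#define TOY_INDEX_SEGMENT(i)   ((i) / TOY_ITEMS_PER_SEGMENT)
#define TOY_INDEX_ITEM(i)      ((i) % TOY_ITEMS_PER_SEGMENT)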

One thing I will mention about the new hash table implementation is
that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means
that's the minimum size for the table.  I could drop this down to 64,
but that's still quite a bit more than the default size of the
dynahash table of 16. I think 16 is a bit on the low side and that it
might be worth making this 64 anyway. I'd just need to lower
GH_ITEMS_PER_SEGMENT.  The current code does not allow it to go
lower as I've done nothing to allow partial bitmap words, they're
64-bits on a 64-bit machine.

I've not done too much benchmarking between it and simplehash.h, but I
think in some cases it will be faster. Since the bucket type is just 8
bytes, moving stuff around during insert/deletes will be cheaper than
with simplehash. Lookups are likely to be a bit slower due to having
to lookup the item within the segment, which is a few pointer
dereferences away.

A quick test shows an improvement when compared to dynahash.

# select count(pg_try_advisory_lock(99999,99999)) from
generate_series(1,1000000);

Master:
Time: 190.257 ms
Time: 189.440 ms
Time: 188.768 ms

Patched:
Time: 182.779 ms
Time: 182.267 ms
Time: 186.902 ms

This is just hitting the local lock table. The advisory lock key is
the same each time, so it remains a lock check. Also, it's a pretty
small gain, but I'm mostly trying to show that the new hash table is
not any slower than dynahash for probing for inserts.

The real wins come from what we're trying to solve in this thread --
the performance of LockReleaseAll().

Benchmarking results measuring the TPS of a simple select from an
empty table after another transaction has bloated the locallock table.

Master:
127544 tps
113971 tps
123255 tps
121615 tps

Patched:
170803 tps
167305 tps
165002 tps
171976 tps

About 38% faster.

The benchmark I used was:

t1.sql:
\set p 1
select a from t1 where a = :p

hp.sql:
select count(*) from hp

"hp" is a hash partitioned table with 10k partitions.

pgbench -j 20 -c 20 -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@100000 postgres

I'm using the query to the hp table to bloat the locallock table. It's
only executed every 1 in 100,000 queries. The tps numbers above are
the ones to run t1.sql

I've not quite looked into why yet, but the hp.sql improved
performance by 58%. It went from an average of 1.061377 in master to
an average of 1.683616 in the patched version. I can't quite see where
this gain is coming from. It's pretty hard to get good stable
performance results out of this AMD machine, so it might be related to
that. That's why I ran 20 threads. It seems slightly better. The
machine seems to have trouble waking up properly for a single thread.

It would be good if someone could repeat the tests to see if the gains
appear on other hardware.

Also, it would be good to hear what people think about solving the
problem this way.

Patch attached.

David

[1] https://www.postgresql.org/message-id/flat/CAApHDvpkWOGLh_bYg7jproXN8B2g2T9dWDcqsmKsXG5+WwZaqw@mail.gmail.com

Attachment


On Sun, Jun 20, 2021 at 6:56 AM David Rowley <dgrowleyml@gmail.com> wrote:

Hi,

+ *   GH_ELEMENT_TYPE defines the data type that the hashtable stores.  Each
+ *   instance of GH_ELEMENT_TYPE which is stored in the hash table is done so
+ *   inside a GH_SEGMENT.

I think the second sentence can be written as (since done means stored, it is redundant):

Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a GH_SEGMENT.

+ * Macros for translating a bucket's index into the segment and another to
+ * determine the item number within the segment.
+ */
+#define GH_INDEX_SEGMENT(i)    (i) / GH_ITEMS_PER_SEGMENT

into the segment -> into the segment number (in the code I see segidx but I wonder if segment index may cause slight confusion).

+   GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item
+                                                * in the items array */

'A 1-bit' -> One bit (A and 1 mean the same)

+   uint32      first_free_segment;

Since the segment may not be totally free, maybe name the field first_segment_with_free_slot

+ * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
+ * each GH_BITMAP_WORD.

It seems the only difference from GH_NEXT_ONEBIT is in this line:

+       GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */

If a 4th parameter is added to signify whether the flipping should be done, these two functions can be unified.

+    * next insert will store in this segment.  If it's already pointing to an
+    * earlier segment, then leave it be.

The last sentence is confusing: first_free_segment cannot point to some segment and an earlier segment at the same time.
Maybe drop the last sentence.

Cheers
Thanks for having a look at this.

On Mon, 21 Jun 2021 at 05:02, Zhihong Yu <zyu@yugabyte.com> wrote:
> + *   GH_ELEMENT_TYPE defines the data type that the hashtable stores.  Each
> + *   instance of GH_ELEMENT_TYPE which is stored in the hash table is done so
> + *   inside a GH_SEGMENT.
>
> I think the second sentence can be written as (since done means stored, it is redundant):

I've reworded this entire paragraph slightly.

> Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a GH_SEGMENT.
>
> + * Macros for translating a bucket's index into the segment and another to
> + * determine the item number within the segment.
> + */
> +#define GH_INDEX_SEGMENT(i)    (i) / GH_ITEMS_PER_SEGMENT
>
> into the segment -> into the segment number (in the code I see segidx but I wonder if segment index may cause slight
confusion).

I've adjusted this comment

> +   GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item
> +                                                * in the items array */
>
> 'A 1-bit' -> One bit (A and 1 mean the same)

I think you might have misread this. We're storing a 1-bit for each
used item rather than a 0-bit.  If I remove the 'A' then it's not
clear what the meaning of each bit's value is.

> +   uint32      first_free_segment;
>
> Since the segment may not be totally free, maybe name the field first_segment_with_free_slot

I don't really like that. It feels too long to me.

> + * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on
> + * each GH_BITMAP_WORD.
>
> It seems the only difference from GH_NEXT_ONEBIT is in this line:
>
> +       GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */
>
> If a 4th parameter is added to signify whether the flipping should be done, these two functions can be unified.

I don't want to do that.  I'd rather have them separate to ensure the
compiler does not create any additional needless branching. Those
functions are pretty hot.

> +    * next insert will store in this segment.  If it's already pointing to an
> +    * earlier segment, then leave it be.
>
> The last sentence is confusing: first_free_segment cannot point to some segment and earlier segment at the same
time.
> Maybe drop the last sentence.

I've adjusted this comment to become:

* Check if we need to lower the first_free_segment.  We want new inserts
* to fill the lower segments first, so only change the first_free_segment
* if the removed entry was stored in an earlier segment.

Thanks for having a look at this.

I'll attach an updated patch soon.

David



On Mon, 21 Jun 2021 at 01:56, David Rowley <dgrowleyml@gmail.com> wrote:
> # A new hashtable implementation

> Also, it would be good to hear what people think about solving the
> problem this way.

Because over on [1] I'm also trying to improve the performance of
smgropen(), I posted the patch for the new hash table over there too.
Between that thread and discussions with Thomas and Andres off-list, I
get the idea that pretty much nobody likes the idea of having another
hash table implementation.  Thomas wants to solve it another way and
Andres has concerns that working with bitmaps is too slow. Andres
suggested freelists would be faster, but I'm not really a fan of the
idea as, unless I have a freelist per array segment then I won't be
able to quickly identify the earliest segment slot to re-fill unused
slots with. That would mean memory would get more fragmented over time
instead of the fragmentation being slowly fixed as new items are added
after deletes. So I've not really tried implementing that to see how
it performs.

Both Andres and Thomas expressed a dislike for the name "generichash" too.

Anyway, since I did make a few small changes to the hash table
implementation before doing all that off-list talking, I thought I
should at least post where I got to here so that anything else that
comes up can get compared to where I got to, instead of where I was
with this.

I did end up renaming the hash table to "densehash" rather than
generichash. Naming is hard, but I went with dense as memory density
was on my mind when I wrote it: it has a compact 8-byte bucket width
and packs the data into arrays in a dense way. The word dense came
up a few times, so I went with that.

I also adjusted the hash seq scan code so that it performs better when
faced with a non-sparsely populated table.  Previously my benchmark for
that case didn't do well [2].

I've attached the benchmark results from running the benchmark that's
included in hashbench.tar.bz2. I ran this 10 times using the included
test.sh with ./test.sh 10. I included the results I got on my AMD
machine in the attached bz2 file in results.csv.

You can see from the attached dense_vs_generic_vs_simple.png that
dense hash is quite comparable to simplehash for inserts/deletes and
lookups. It's not quite as fast as simplehash at iterations when the
table is not bloated, but blows simplehash out of the water when the
hashtables have become bloated due to having once contained a large
number of records but no longer do.

Anyway, unless there is some interest in me taking this idea further
then, due to the general feedback received on [1], I'm not planning on
pushing this any further.  I'll leave the commitfest entry as is for
now to give others a chance to see this.

David

[1] https://postgr.es/m/CAApHDvowgRaQupC%3DL37iZPUzx1z7-N8deD7TxQSm8LR%2Bf4L3-A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAApHDvpuzJTQNKQ_bnAccvi-68xuh%2Bv87B4P6ycU-UiN0dqyTg%40mail.gmail.com

Attachment
On Mon, 12 Jul 2021 at 19:23, David Rowley <dgrowleyml@gmail.com> wrote:
> I also adjusted the hash seq scan code so that it performs better when
> faced a non-sparsely populated table.  Previously my benchmark for
> that case didn't do well [2].

I was running some select only pgbench tests today on an AMD 3990x
machine with a large number of processes.

I saw that LockReleaseAll was coming up on the profile a bit at:

Master: 0.77% postgres [.] LockReleaseAll

I wondered if this patch would help, so I tried it and got:

dense hash lockrelease all: 0.67% postgres [.] LockReleaseAll

It's a very small increase which translated to about a 0.62% gain in
tps. It made me think it might be worth doing something about this,
since LockReleaseAll can show up even when releasing small numbers of locks.

pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared --random-seed=12345 postgres

Units = tps

Sec        master           dense hash      patched/master
10       3758201.2     3713521.5       98.81%
20       3810125.5     3844142.9     100.89%
30       3806505.1     3848458        101.10%
40       3816094.8     3855706.6     101.04%
50       3820317.2     3851717.7     100.82%
60       3827809     3851499.4        100.62%
70       3828757.9     3849312        100.54%
80       3824492.1     3852378.8     100.73%
90       3816502.1     3854793.8     101.00%
100     3819124.1     3860418.6     101.08%
110     3816154.3     3845327.7     100.76%
120     3817070.5     3845842.5     100.75%
130     3815424.7     3847626        100.84%
140     3823631.1     3846760.6     100.60%
150     3820963.8     3840196.6     100.50%
160     3827737        3841149.3     100.35%
170     3827779.2     3840130.9     100.32%
180      3829352       3842814.5     100.35%
190     3825518.3     3841991        100.43%
200     3823477.2     3839390.7     100.42%
210     3809304.3     3836433.5     100.71%
220     3814328.5     3842073.7     100.73%
230     3811399.3     3843780.7     100.85%
avg     3816959.53   3840672.478 100.62%

David



On Tue, Jul 20, 2021 at 05:04:19PM +1200, David Rowley wrote:
> On Mon, 12 Jul 2021 at 19:23, David Rowley <dgrowleyml@gmail.com> wrote:
> > I also adjusted the hash seq scan code so that it performs better when
> > faced a non-sparsely populated table.  Previously my benchmark for
> > that case didn't do well [2].
>
> I was running some select only pgbench tests today on an AMD 3990x
> machine with a large number of processes.
>
> I saw that LockReleaseAll was coming up on the profile a bit at:

This last update was two months ago, and the patch has not moved
since:
https://commitfest.postgresql.org/34/3220/

Do you have plans to work more on that or perhaps the CF entry should
be withdrawn or RwF'd?
--
Michael

Attachment
I've made some remarks in a related thread:
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1FB976EF@G01JPEXMBYT05

The new status of this patch is: Waiting on Author

On Fri, Oct 01, 2021 at 04:03:09PM +0900, Michael Paquier wrote:
> This last update was two months ago, and the patch has not moved
> since:
> https://commitfest.postgresql.org/34/3220/
>
> Do you have plans to work more on that or perhaps the CF entry should
> be withdrawn or RwF'd?

Two months later, this has been switched to RwF.
--
Michael

Attachment
On Fri, 3 Dec 2021 at 20:36, Michael Paquier <michael@paquier.xyz> wrote:
> Two months later, this has been switched to RwF.

I was discussing this patch with Andres. He's not very keen on my
densehash hash table idea and suggested that instead of relying on
trying to make the hash table iteration faster, why don't we just
ditch the lossiness of the resource owner code which only records the
first 16 locks, and just make it have a linked list of all locks.

This is a little more along the lines of the original patch, however,
it does not increase the size of the LOCALLOCK struct.
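
To sketch the shape of the idea in standalone form (toy types; this is
not the attached patch, which works in terms of LOCALLOCK and
resowner.c): the resource owner remembers every local lock it holds,
with no 16-entry cap, so releasing only has to walk that array rather
than seq-scanning the whole hash table.

#include <stdlib.h>

typedef struct ToyResourceOwner
{
    void      **locks;      /* would be LOCALLOCK pointers in the real thing */
    int         nlocks;
    int         maxlocks;
} ToyResourceOwner;

void
toy_remember_lock(ToyResourceOwner *owner, void *locallock)
{
    if (owner->nlocks >= owner->maxlocks)
    {
        owner->maxlocks = owner->maxlocks > 0 ? owner->maxlocks * 2 : 16;
        owner->locks = realloc(owner->locks, owner->maxlocks * sizeof(void *));
        /* error handling omitted in this toy */
    }
    owner->locks[owner->nlocks++] = locallock;
}

void
toy_forget_lock(ToyResourceOwner *owner, void *locallock)
{
    for (int i = 0; i < owner->nlocks; i++)
    {
        if (owner->locks[i] == locallock)
        {
            /* fill the hole with the last entry; order doesn't matter */
            owner->locks[i] = owner->locks[--owner->nlocks];
            return;
        }
    }
}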

I've attached a patch which does this.  This was mostly written
(quickly) by Andres, I just did a cleanup, fixed up a few mistakes and
fixed a few bugs.

I ran the same performance tests on this patch as I did back in [1]:

-- Test 1. Select 1 record from a 140 partitioned table. Tests
creating a large number of locks with a fast query.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
\gexec
create index on hp (b);
insert into hp select x,x from generate_series(1, 140000) x;
analyze hp;

select3.sql: select * from hp where b = 1

select3.sql master

drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2099.708748 (without initial connection time)
tps = 2100.398516 (without initial connection time)
tps = 2094.882341 (without initial connection time)
tps = 2113.218820 (without initial connection time)
tps = 2104.717597 (without initial connection time)

select3.sql patched

drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2010.070738 (without initial connection time)
tps = 1994.963606 (without initial connection time)
tps = 1994.668849 (without initial connection time)
tps = 1995.948168 (without initial connection time)
tps = 1985.650945 (without initial connection time)

You can see that there's a performance regression here. I've not yet
studied why this appears.

-- Test 2. Tests a prepared query which will build a generic plan on
the 6th execution and then fall back to a custom plan, since the custom
plan prunes all but one partition.  Master suffers from the lock table
becoming bloated after locking all partitions when planning the generic plan.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

select.sql master
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 18014.460090 (without initial connection time)
tps = 17973.358889 (without initial connection time)
tps = 17847.480647 (without initial connection time)
tps = 18038.332507 (without initial connection time)
tps = 17776.143206 (without initial connection time)

select.sql patched
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 32393.457106 (without initial connection time)
tps = 32277.204349 (without initial connection time)
tps = 32160.719830 (without initial connection time)
tps = 32530.038130 (without initial connection time)
tps = 32299.019657 (without initial connection time)

You can see that there are some quite good performance gains with this test.

I'm going to add this to the January commitfest.

David

[1] https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com

Attachment


On Fri, Dec 31, 2021 at 5:45 PM David Rowley <dgrowleyml@gmail.com> wrote:
Hi,

+           locallock->nLocks -= locallockowner->nLocks;
+           Assert(locallock->nLocks >= 0);

I think the assertion is not needed since the above code is inside this if block:

+       if (locallockowner->nLocks < locallock->nLocks)

the condition, locallock->nLocks >= 0, would always hold after the subtraction.

Cheers
On Sat, 1 Jan 2022 at 15:40, Zhihong Yu <zyu@yugabyte.com> wrote:
> +           locallock->nLocks -= locallockowner->nLocks;
> +           Assert(locallock->nLocks >= 0);
>
> I think the assertion is not needed since the above code is in if block :
>
> +       if (locallockowner->nLocks < locallock->nLocks)
>
> the condition, locallock->nLocks >= 0, would always hold after the subtraction.

That makes sense. I've removed the Assert in the attached patch.
Thanks for looking at the patch.

I've also spent a bit more time on the patch. There were quite a few
outdated comments remaining. Also, the PROCLOCK releaseMask field
appears to no longer be needed.

I also did a round of benchmarking on the attached patch using a very
recent master. I've attached .sql files and the script I used to
benchmark.

With 1024 partitions, lock1.sql shows about a 4.75% performance
increase. This would become larger with more partitions and less with
fewer partitions.
With the patch, lock2.sql shows about a 10% performance increase over master.
lock3.sql does not seem to have changed much and lock4.sql shows a
small regression with the patch of about 2.5%.

I'm not quite sure how worried we should be about lock4.sql slowing
down slightly. 2.5% is fairly small given how hard I'm exercising the
locking code in that test. There's also nothing much to say that the
slowdown is not just due to code alignment changes.

I also understand that Amit L is working on another patch that will
improve the situation for lock1.sql by not taking the locks on
relations that will be run-time pruned at executor startup.  I think
it's still worth solving this regardless of Amit's patch as with
current master we still have a situation where short fast queries
which access a small number of tables can become slower once the
backend has, at some point, obtained a large number of locks at once and
bloated the local lock table.

As for the patch itself, I feel it's a pretty invasive change to how we
release locks and the resowner code. I'd be quite happy for some
review of it.

Here are the full results as output by the attached script.

drowley@amd3990x:~$ echo master
master
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 38078.011433 (without initial connection time)
tps = 38070.016792 (without initial connection time)
tps = 39223.118769 (without initial connection time)
tps = 37510.105287 (without initial connection time)
tps = 38164.394128 (without initial connection time)
lock2.sql
tps = 247.963797 (without initial connection time)
tps = 247.374174 (without initial connection time)
tps = 248.412744 (without initial connection time)
tps = 248.192629 (without initial connection time)
tps = 248.503728 (without initial connection time)
lock3.sql
tps = 1162.937692 (without initial connection time)
tps = 1160.968689 (without initial connection time)
tps = 1166.908643 (without initial connection time)
tps = 1160.288547 (without initial connection time)
tps = 1160.336572 (without initial connection time)
lock4.sql
tps = 282.173560 (without initial connection time)
tps = 284.470330 (without initial connection time)
tps = 286.089644 (without initial connection time)
tps = 285.548487 (without initial connection time)
tps = 284.313505 (without initial connection time)


drowley@amd3990x:~$ echo Patched
Patched
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 40338.975219 (without initial connection time)
tps = 39803.433365 (without initial connection time)
tps = 39504.824194 (without initial connection time)
tps = 39843.422438 (without initial connection time)
tps = 40624.483013 (without initial connection time)
lock2.sql
tps = 274.413309 (without initial connection time)
tps = 271.978813 (without initial connection time)
tps = 275.795091 (without initial connection time)
tps = 273.628649 (without initial connection time)
tps = 273.049977 (without initial connection time)
lock3.sql
tps = 1168.557054 (without initial connection time)
tps = 1168.139469 (without initial connection time)
tps = 1166.366440 (without initial connection time)
tps = 1165.464214 (without initial connection time)
tps = 1167.250809 (without initial connection time)
lock4.sql
tps = 274.842298 (without initial connection time)
tps = 277.911394 (without initial connection time)
tps = 278.702620 (without initial connection time)
tps = 275.715606 (without initial connection time)
tps = 278.816060 (without initial connection time)

David

Attachment
Good day, David.

I'm looking at the patch and there are a few things I don't get.

`GrantLockLocal` allocates `LOCALLOCKOWNER` and links it into
`locallock->locallockowners`. It links it regardless of whether `owner` is
NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`.
Why should that assertion not fail?

`GrantLockLocal` allocates `LOCALLOCKOWNER` in `TopMemoryContext`.
But there is a single `pfree(locallockowner)` in `LockReassignOwner`.
Looks like there should be more `pfree` calls. Shouldn't there be?

`GrantLockLocal` does `dlist_push_tail`, but isn't it better to
do `dlist_push_head`? Resource owners usually form a stack, so when an
owner searches for itself, it is usually the last one added to the list.
`dlist_foreach` would then find it sooner if it had been added to the head.
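
Roughly what I mean (the node field name below is only my guess at the
patch's naming):

    /* push the newest owner entry at the head rather than the tail */
    dlist_push_head(&locallock->locallockowners,
                    &locallockowner->locallock_node);

Since the most recently added owner is usually the one searched for next,
dlist_foreach over the list would then terminate earlier.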

regards

---------

Yura Sokolov
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com





On Wed, 6 Apr 2022 at 03:40, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
> I'm looking at the patch and there are a few things I don't get.
>
> `GrantLockLocal` allocates `LOCALLOCKOWNER` and links it into
> `locallock->locallockowners`. It links it regardless of whether `owner` is
> NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`.
> Why should that assertion not fail?
>
> `GrantLockLocal` allocates `LOCALLOCKOWNER` in `TopMemoryContext`.
> But there is a single `pfree(locallockowner)` in `LockReassignOwner`.
> Looks like there should be more `pfree` calls. Shouldn't there be?
>
> `GrantLockLocal` does `dlist_push_tail`, but isn't it better to
> do `dlist_push_head`? Resource owners usually form a stack, so when an
> owner searches for itself, it is usually the last one added to the list.
> `dlist_foreach` would then find it sooner if it had been added to the head.

Thanks for having a look at this.  It's a bit unrealistic for me to
get around to addressing these for v15. I've pushed this one out to
the next CF.

David



This entry has been waiting on author input for a while (our current
threshold is roughly two weeks), so I've marked it Returned with
Feedback.

Once you think the patchset is ready for review again, you (or any
interested party) can resurrect the patch entry by visiting

    https://commitfest.postgresql.org/38/3501/

and changing the status to "Needs Review", and then changing the
status again to "Move to next CF". (Don't forget the second step;
hopefully we will have streamlined this in the near future!)

Thanks,
--Jacob



On Wed, 3 Aug 2022 at 07:04, Jacob Champion <jchampion@timescale.com> wrote:
> This entry has been waiting on author input for a while (our current
> threshold is roughly two weeks), so I've marked it Returned with
> Feedback.

Thanks for taking care of this. You dealt with this correctly based on
the fact that I'd failed to rebase before or during the entire July
CF.

I'm still interested in having the LockReleaseAll slowness fixed, so
here's a rebased patch.

David

Attachment
Hi David,

This is review of speed up releasing of locks patch.

Contents & Purpose:
A commit message (subject) is missing from the patch; it would have been easier to understand the purpose had one been included.
The patch includes a change to the README, but no new tests are included.

Initial Run:
The patch applies cleanly to HEAD. The regression tests all pass
successfully against the new patch.

Nitpicking & conclusion:
I don't see any performance improvement in my tests. Lots of comments
were removed which were not fully replaced. The change of log level for the
"ReleaseLockIfHeld: failed" message from WARNING to PANIC is a mystery.
The change in the README doesn't look right either:
`Any subsequent lockers are share lockers wait
waiting for the VXID to terminate via some other method) is for deadlock`.
This sentence could be rewritten.
Also, more comments could be added to explain the newly added functions.

Thanks,
Ankit
Thank you for looking at the patch.

On Fri, 4 Nov 2022 at 04:43, Ankit Kumar Pandey <itsankitkp@gmail.com> wrote:
> I don't see any performance improvement in tests.

Are you able to share what your test was?

In order to see a performance improvement you're likely going to have
to obtain a large number of locks in the session so that the local
lock table becomes bloated, then continue to run some fast query and
observe that LockReleaseAll has become slower as a result of the hash
table becoming bloated.  Try running pgbench with a SELECT on a hash
partitioned table with a good number of partitions, looking up a single
row with -M prepared.  The reason this becomes slow is that the
planner will try a generic plan on the 6th execution which will lock
every partition and bloat the local lock table. From then on it will
use a custom plan which only locks a single leaf partition.

I just tried the following:

$ pgbench -i --partition-method=hash --partitions=1000 postgres

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 21286.172326 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 23034.063261 (without initial connection time)

If I try again with 10,000 partitions, I get:

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 13044.290903 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 22683.545686 (without initial connection time)

David



On Wed, 3 Aug 2022 at 09:04, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 3 Aug 2022 at 07:04, Jacob Champion <jchampion@timescale.com> wrote:
> > This entry has been waiting on author input for a while (our current
> > threshold is roughly two weeks), so I've marked it Returned with
> > Feedback.
>
> Thanks for taking care of this. You dealt with this correctly based on
> the fact that I'd failed to rebase before or during the entire July
> CF.
>
> I'm still interested in having the LockReleaseAll slowness fixed, so
> here's a rebased patch.

CFBot shows some compilation errors as in [1], please post an updated
version for the same:
[15:40:00.287] [1239/1809] Linking target src/backend/postgres
[15:40:00.287] FAILED: src/backend/postgres
[15:40:00.287] cc @src/backend/postgres.rsp
[15:40:00.287] /usr/bin/ld:
src/backend/postgres_lib.a.p/replication_logical_launcher.c.o: in
function `logicalrep_worker_onexit':
[15:40:00.287] /tmp/cirrus-ci-build/build/../src/backend/replication/logical/launcher.c:773:
undefined reference to `LockReleaseAll'
[15:40:00.287] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/4562493863886848

Regards,
Vignesh



On Fri, 20 Jan 2023 at 00:26, vignesh C <vignesh21@gmail.com> wrote:
> CFBot shows some compilation errors as in [1], please post an updated
> version for the same:

I've attached a rebased patch.

While reading over this again, I wondered if instead of allocating the
memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should
create a Slab context as a child of TopMemoryContext and perform the
allocations there.  I feel like slab might be a better option here as
it'll use slightly less memory due to it not rounding up allocations
to the next power of 2. sizeof(LOCALLOCKOWNER) == 56, so it's not a
great deal of memory, but more than nothing. The primary reason I
think this might be a good idea is that slab.c handles fragmentation of
chunks within blocks better than aset.c does. If we have
transactions which create a large number of locks then we may end up
growing the TopMemoryContext and never releasing the AllocBlocks and
just having a high number of 64-byte chunks left on the freelist
that'll maybe never be used again. I'm thinking slab.c might handle
that better as it'll only keep around 10 completely empty SlabBlocks
before it'll start free'ing them. The slab allocator is quite a bit
faster now as a result of d21ded75f.
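
For illustration, the sort of thing I have in mind is roughly this (a
sketch only; the context variable name and the place it gets created are
placeholders rather than what any attached patch necessarily does):

    /*
     * Allocate LOCALLOCKOWNER entries from a slab context so that fully
     * empty blocks can be freed again instead of lingering on
     * TopMemoryContext's freelists after a lock-heavy transaction.
     */
    static MemoryContext LockMethodLocalOwnerContext = NULL;

    if (unlikely(LockMethodLocalOwnerContext == NULL))
        LockMethodLocalOwnerContext =
            SlabContextCreate(TopMemoryContext,
                              "LOCALLOCKOWNER slab",
                              SLAB_DEFAULT_BLOCK_SIZE,
                              sizeof(LOCALLOCKOWNER));

    locallockowner = (LOCALLOCKOWNER *)
        MemoryContextAlloc(LockMethodLocalOwnerContext,
                           sizeof(LOCALLOCKOWNER));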

I would like to get this LockReleaseAll problem finally fixed in PG16,
but I'd feel much better about this patch if it had some review from
someone who has more in-depth knowledge of the locking code.

I've also gone and adjusted all the places that upgraded the
elog(WARNING)s of local table corruption to PANIC and put them back to
use WARNING again. While I think it might be a good idea to do that,
it seems to be adding a bit more resistance to this patch which I
don't think it really needs. Maybe we can consider that in a separate
effort.

David

Attachment
Hi David,

On Tue, Jan 24, 2023 at 12:58 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 20 Jan 2023 at 00:26, vignesh C <vignesh21@gmail.com> wrote:
> > CFBot shows some compilation errors as in [1], please post an updated
> > version for the same:
>
> I've attached a rebased patch.

Thanks for the new patch.

Maybe you're planning to do it once this patch is past the PoC phase
(it is still PoC, isn't it?), but it would be helpful to have commentary on all the new
dlist fields.

Especially, I think the following warrants a bit more explanation than other:

-   /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
-   int         nlocks;         /* number of owned locks */
-   LOCALLOCK  *locks[MAX_RESOWNER_LOCKS];  /* list of owned locks */
+   dlist_head  locks;          /* dlist of owned locks */

This seems to be replacing what is a cache with an upper limit on the
number of cached locks with something that has no limit on how many
per-owner locks are remembered.  I wonder whether we'd be doing
additional work in some cases with the new no-limit implementation
that wasn't being done before (where the owner's locks array is
overflowed) or maybe not much because the new implementation of
ResourceOwner{Remember|Forget}Lock() is simple push/delete of a dlist
node from the owner's dlist?
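
For what it's worth, with the dlist approach I'd expect those two to reduce
to something like this (my sketch, assuming LOCALLOCKOWNER carries a
dlist_node for the owner's list; the patch's actual signatures may differ):

    void
    ResourceOwnerRememberLock(ResourceOwner owner,
                              LOCALLOCKOWNER *locallockowner)
    {
        /* O(1): link the entry into the owner's list of held locks */
        dlist_push_tail(&owner->locks, &locallockowner->resowner_node);
    }

    void
    ResourceOwnerForgetLock(LOCALLOCKOWNER *locallockowner)
    {
        /* O(1): unlink it again; no array scan or overflow path needed */
        dlist_delete(&locallockowner->resowner_node);
    }

so the per-lock bookkeeping shouldn't cost more than the old array append,
even when fewer than 16 locks are taken.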

The following comment is now obsolete:

/*
 * LockReassignCurrentOwner
 *      Reassign all locks belonging to CurrentResourceOwner to belong
 *      to its parent resource owner.
 *
 * If the caller knows what those locks are, it can pass them as an array.
 * That speeds up the call significantly, when a lot of locks are held
 * (e.g pg_dump with a large schema).  Otherwise, pass NULL for locallocks,
 * and we'll traverse through our hash table to find them.
 */

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



Hi,

On 2023-01-24 16:57:37 +1300, David Rowley wrote:
> I've attached a rebased patch.

Looks like there's some issue causing tests to fail probabilistically:

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3501

Several failures are when testing a 32bit build.



> While reading over this again, I wondered if instead of allocating the
> memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should
> create a Slab context as a child of TopMemoryContext and perform the
> allocations there.

Yes, that does make sense.


> I would like to get this LockReleaseAll problem finally fixed in PG16,
> but I'd feel much better about this patch if it had some review from
> someone who has more in-depth knowledge of the locking code.

I feel my review wouldn't be independent, but I'll give it a shot if nobody
else does.

Greetings,

Andres Freund



Thanks for having a look at this.

On Wed, 1 Feb 2023 at 03:07, Amit Langote <amitlangote09@gmail.com> wrote:
> Maybe you're planning to do it once this patch is post the PoC phase
> (isn't it?), but it would be helpful to have commentary on all the new
> dlist fields.

I've added comments on the new fields.  Maybe we can say the patch is "wip".

> This seems to be replacing what is a cache with an upper limit on the
> number of cacheds locks with something that has no limit on how many
> per-owner locks are remembered.  I wonder whether we'd be doing
> additional work in some cases with the new no-limit implementation
> that wasn't being done before (where the owner's locks array is
> overflowed) or maybe not much because the new implementation of
> ResourceOwner{Remember|Forget}Lock() is simple push/delete of a dlist
> node from the owner's dlist?

It's a good question, but I don't really have a good test to find out.
The problem is we'd need to benchmark taking fewer than
16 locks. On trying that, I find that there's just too much
variability in the performance between runs to determine if there's
any slowdown.

$ cat 10_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10) x;

$ pgbench -f 10_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps)"
tps = 47809.306088 (without initial connection time)
tps = 66859.789072 (without initial connection time)
tps = 37885.924616 (without initial connection time)

On trying with more locks, I see there are good wins from the patched version.

$ cat 100_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,100) x;

$ cat 1k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,1000) x;

$ cat 10k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10000) x;

Test 1:  Take 100 locks but periodically take 10k locks to bloat the
local lock table.

master:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10
-n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 2726.197037 (without initial connection time)
SQL script 1: 100_locks.sql
 - 27219 transactions (99.9% of total, tps = 2722.496227)
SQL script 2: 10k_locks.sql
 - 37 transactions (0.1% of total, tps = 3.700810)

patched:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10
-n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34047.297822 (without initial connection time)
SQL script 1: 100_locks.sql
 - 340039 transactions (99.9% of total, tps = 34012.688879)
SQL script 2: 10k_locks.sql
 - 346 transactions (0.1% of total, tps = 34.608943)

patched without slab context:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10
-n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34851.770846 (without initial connection time)
SQL script 1: 100_locks.sql
 - 348097 transactions (99.9% of total, tps = 34818.662324)
SQL script 2: 10k_locks.sql
 - 331 transactions (0.1% of total, tps = 33.108522)

Test 2: Always take just 100 locks and don't bloat the local lock table.

master:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep
-E "(tps|script)"
tps = 32682.491548 (without initial connection time)

patched:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep
-E "(tps|script)"
tps = 35637.241815 (without initial connection time)

patched without slab context:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep
-E "(tps|script)"
tps = 36192.185181 (without initial connection time)

The attached 0003 patch is an experiment to see if using a slab memory
context has any advantages for storing the LOCALLOCKOWNER structs.
There seems to be a small performance hit from doing this.

> The following comment is now obsolete:
>
> /*
>  * LockReassignCurrentOwner
>  *      Reassign all locks belonging to CurrentResourceOwner to belong
>  *      to its parent resource owner.
>  *
>  * If the caller knows what those locks are, it can pass them as an array.
>  * That speeds up the call significantly, when a lot of locks are held
>  * (e.g pg_dump with a large schema).  Otherwise, pass NULL for locallocks,
>  * and we'll traverse through our hash table to find them.
>  */

I've removed the obsolete part.

I've attached another set of patches. I do need to spend longer
looking at this. I'm mainly attaching these as CI seems to be
highlighting a problem that I'm unable to recreate locally and I
wanted to see if the attached fixes it.

David

Attachment
On 10/02/2023 04:51, David Rowley wrote:
> I've attached another set of patches. I do need to spend longer
> looking at this. I'm mainly attaching these as CI seems to be
> highlighting a problem that I'm unable to recreate locally and I
> wanted to see if the attached fixes it.

I like this patch's approach.

> index 296dc82d2ee..edb8b6026e5 100644
> --- a/src/backend/commands/discard.c
> +++ b/src/backend/commands/discard.c
> @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
>         Async_UnlistenAll();
> -       LockReleaseAll(USER_LOCKMETHOD, true);
> +       LockReleaseSession(USER_LOCKMETHOD);
>         ResetPlanCache();

This assumes that there are no transaction-level advisory locks. I think 
that's OK. It took me a while to convince myself of that, though. I 
think we need a high level comment somewhere that explains what 
assumptions we make on which locks can be held in session mode and which 
in transaction mode.

> @@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
>                         Assert(lock->nGranted <= lock->nRequested);
>                         Assert((proclock->holdMask & ~lock->grantMask) == 0);
>  
> -                       /* Ignore it if nothing to release (must be a session lock) */
> -                       if (proclock->releaseMask == 0)
> -                               continue;
> -
> -                       /* Else we should be releasing all locks */
> -                       if (proclock->releaseMask != proclock->holdMask)
> -                               elog(PANIC, "we seem to have dropped a bit somewhere");
> -
>                         /*
>                          * We cannot simply modify proclock->tag.myProc to reassign
>                          * ownership of the lock, because that's part of the hash key and

This looks wrong. If you prepare a transaction that is holding any 
session locks, we will now transfer them to the prepared transaction. 
And its locallock entry will be out of sync. To fix, I think we could 
keep around the hash table that CheckForSessionAndXactLocks() builds, 
and use that here.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Thank you for having a look at this. Apologies for not getting back to
you sooner.

On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 10/02/2023 04:51, David Rowley wrote:
> > I've attached another set of patches. I do need to spend longer
> > looking at this. I'm mainly attaching these as CI seems to be
> > highlighting a problem that I'm unable to recreate locally and I
> > wanted to see if the attached fixes it.
>
> I like this patch's approach.
>
> > index 296dc82d2ee..edb8b6026e5 100644
> > --- a/src/backend/commands/discard.c
> > +++ b/src/backend/commands/discard.c
> > @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
> >         Async_UnlistenAll();
> > -       LockReleaseAll(USER_LOCKMETHOD, true);
> > +       LockReleaseSession(USER_LOCKMETHOD);
> >         ResetPlanCache();
>
> This assumes that there are no transaction-level advisory locks. I think
> that's OK. It took me a while to convince myself of that, though. I
> think we need a high level comment somewhere that explains what
> assumptions we make on which locks can be held in session mode and which
> in transaction mode.

Isn't it ok because DISCARD ALL cannot run inside a transaction block,
so there should be no locks taken apart from possibly session-level
locks?

I've added a call to LockAssertNoneHeld(false) in there.

> > @@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
> >                         Assert(lock->nGranted <= lock->nRequested);
> >                         Assert((proclock->holdMask & ~lock->grantMask) == 0);
> >
> > -                       /* Ignore it if nothing to release (must be a session lock) */
> > -                       if (proclock->releaseMask == 0)
> > -                               continue;
> > -
> > -                       /* Else we should be releasing all locks */
> > -                       if (proclock->releaseMask != proclock->holdMask)
> > -                               elog(PANIC, "we seem to have dropped a bit somewhere");
> > -
> >                         /*
> >                          * We cannot simply modify proclock->tag.myProc to reassign
> >                          * ownership of the lock, because that's part of the hash key and
>
> This looks wrong. If you prepare a transaction that is holding any
> session locks, we will now transfer them to the prepared transaction.
> And its locallock entry will be out of sync. To fix, I think we could
> keep around the hash table that CheckForSessionAndXactLocks() builds,
> and use that here.

Good catch.  I've modified the patch to keep the hash table built in
CheckForSessionAndXactLocks around for longer so that we can check for
session locks.

I've attached an updated patch mainly to get CI checking this. I
suspect something is wrong as subscription/015_stream is timing out.
I've not gotten to the bottom of that yet.

David

Attachment
On 11/09/2023 15:00, David Rowley wrote:
> On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>>> index 296dc82d2ee..edb8b6026e5 100644
>>> --- a/src/backend/commands/discard.c
>>> +++ b/src/backend/commands/discard.c
>>> @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
>>>          Async_UnlistenAll();
>>> -       LockReleaseAll(USER_LOCKMETHOD, true);
>>> +       LockReleaseSession(USER_LOCKMETHOD);
>>>          ResetPlanCache();
>>
>> This assumes that there are no transaction-level advisory locks. I think
>> that's OK. It took me a while to convince myself of that, though. I
>> think we need a high level comment somewhere that explains what
>> assumptions we make on which locks can be held in session mode and which
>> in transaction mode.
> 
> Isn't it ok because DISCARD ALL cannot run inside a transaction block,
> so there should be no locks taken apart from possibly session-level
> locks?

Hmm, sounds valid. I think I convinced myself that it's OK through some 
other reasoning, but I don't remember it now.

> I've added a call to LockAssertNoneHeld(false) in there.

I don't see it in the patch?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I've added a call to LockAssertNoneHeld(false) in there.
>
> I don't see it in the patch?

hmm. I must've run git format-patch before committing that part.

I'll try that again... see attached.

David

Attachment
On 18/09/2023 07:08, David Rowley wrote:
> On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I've added a call to LockAssertNoneHeld(false) in there.
>>
>> I don't see it in the patch?
> 
> hmm. I must've git format-patch before committing that part.
> 
> I'll try that again... see attached.

This needed a rebase after my ResourceOwner refactoring. Attached.

A few quick comments:

- It would be nice to add a test for the issue that you fixed in patch 
v7, i.e. if you prepare a transaction while holding session-level locks.

- GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you 
are out of memory. Is that handled gracefully or is the lock leaked?

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment
On Thu, 9 Nov 2023 at 21:48, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 18/09/2023 07:08, David Rowley wrote:
> > On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>> I've added a call to LockAssertNoneHeld(false) in there.
> >>
> >> I don't see it in the patch?
> >
> > hmm. I must've git format-patch before committing that part.
> >
> > I'll try that again... see attached.
>
> This needed a rebase after my ResourceOwner refactoring. Attached.
>
> A few quick comments:
>
> - It would be nice to add a test for the issue that you fixed in patch
> v7, i.e. if you prepare a transaction while holding session-level locks.
>
> - GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you
> are out of memory. Is that handled gracefully or is the lock leaked?

CFBot shows that one of the tests has aborted at [1] with:
[20:54:28.535] Core was generated by `postgres: subscriber: logical
replication apply worker for subscription 16397 '.
[20:54:28.535] Program terminated with signal SIGABRT, Aborted.
[20:54:28.535] #0  __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
[20:54:28.535] Download failed: Invalid argument.  Continuing without
source file ./signal/../sysdeps/unix/sysv/linux/raise.c.
[20:54:28.627]
[20:54:28.627] Thread 1 (Thread 0x7f0ea02d1a40 (LWP 50984)):
[20:54:28.627] #0  __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
...
...
[20:54:28.627] #2  0x00005618e989d62f in ExceptionalCondition
(conditionName=conditionName@entry=0x5618e9b40f70
"dlist_is_empty(&(MyProc->myProcLocks[i]))",
fileName=fileName@entry=0x5618e9b40ec0
"../src/backend/storage/lmgr/proc.c", lineNumber=lineNumber@entry=856)
at ../src/backend/utils/error/assert.c:66
[20:54:28.627] No locals.
[20:54:28.627] #3  0x00005618e95e6847 in ProcKill (code=<optimized
out>, arg=<optimized out>) at ../src/backend/storage/lmgr/proc.c:856
[20:54:28.627]         i = <optimized out>
[20:54:28.627]         proc = <optimized out>
[20:54:28.627]         procgloballist = <optimized out>
[20:54:28.627]         __func__ = "ProcKill"
[20:54:28.627] #4  0x00005618e959ebcc in shmem_exit
(code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:276
[20:54:28.627]         __func__ = "shmem_exit"
[20:54:28.627] #5  0x00005618e959ecd0 in proc_exit_prepare
(code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:198
[20:54:28.627]         __func__ = "proc_exit_prepare"
[20:54:28.627] #6  0x00005618e959ee8e in proc_exit (code=code@entry=1)
at ../src/backend/storage/ipc/ipc.c:111
[20:54:28.627]         __func__ = "proc_exit"
[20:54:28.627] #7  0x00005618e94aa54d in BackgroundWorkerMain () at
../src/backend/postmaster/bgworker.c:805
[20:54:28.627]         local_sigjmp_buf = {{__jmpbuf =
{94665009627112, -3865857745677845768, 0, 0, 140732736634980, 1,
3865354362587970296, 7379258256398875384}, __mask_was_saved = 1,
__saved_mask = {__val = {18446744066192964099, 94665025527920,
94665025527920, 94665025527920, 0, 94665025528120, 8192, 1,
94664997686410, 94665009627040, 94664997622076, 94665025527920, 1, 0,
0, 140732736634980}}}}
[20:54:28.627]         worker = 0x5618eb37c570
[20:54:28.627]         entrypt = <optimized out>
[20:54:28.627]         __func__ = "BackgroundWorkerMain"
[20:54:28.627] #8  0x00005618e94b495c in do_start_bgworker
(rw=rw@entry=0x5618eb3b73c8) at
../src/backend/postmaster/postmaster.c:5697
[20:54:28.627]         worker_pid = <optimized out>
[20:54:28.627]         __func__ = "do_start_bgworker"
[20:54:28.627] #9  0x00005618e94b4c32 in maybe_start_bgworkers () at
../src/backend/postmaster/postmaster.c:5921
[20:54:28.627]         rw = 0x5618eb3b73c8
[20:54:28.627]         num_launched = 0
[20:54:28.627]         now = 0
[20:54:28.627]         iter = {cur = 0x5618eb3b79a8, next =
0x5618eb382a20, prev = 0x5618ea44a980 <BackgroundWorkerList>}
[20:54:28.627] #10 0x00005618e94b574a in process_pm_pmsignal () at
../src/backend/postmaster/postmaster.c:5073
[20:54:28.627]         __func__ = "process_pm_pmsignal"
[20:54:28.627] #11 0x00005618e94b5f4a in ServerLoop () at
../src/backend/postmaster/postmaster.c:1760

[1] - https://cirrus-ci.com/task/5118173163290624?logs=cores#L51

Regards,
Vignesh