Thread: Postgres is not able to handle more than 4k tables!?

Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

08 July 2020, 20:41:01

I want to explain one bad situation we have encountered with one of our 
customers.
There are ~5000 tables in their database. And what is worse - most of 
them are actively used.
Then several flaws of Postgres make their system almost stuck.

Autovacuum is periodically processing all this 5k relations (because 
them are actively updated).
And as far as most of this tables are small enough autovacuum complete 
processing of them almost in the same time.
As a result autovacuum workers produce ~5k invalidation messages in 
short period of time.

There are several thousand clients, most of which are executing complex 
queries.
So them are not able to process all this invalidation messages and their 
invalidation message buffer is overflown.
Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be 
changed without recompilation of Postgres.
This is problem N1.

As a result resetState is set to true, forcing backends to invalidate 
their caches.
So most of backends loose there cached metadata and have to access 
system catalog trying to reload it.
But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
It is also hardcoded and can't be changed without recompilation:

#define LOG2_NUM_LOCK_PARTITIONS  4
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

Having just 16 LW-Locks greatly increase conflict probability (taken in 
account that there are 5k tables and totally about 25k relations).
It cause huge lw-lock acquisition time for heap_open and planning stage 
of some queries is increased from milliseconds to several minutes!
Koda!

This is problem number 2. But there is one more flaw we have faced with. 
We have increased LOG2_NUM_LOCK_PARTITIONS to 8
and ... clients start to report "too many LWLocks taken" error.
There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200
which relation with NUM_LOCK_PARTITIONS  was not mentioned anywhere.

But there are several places in Postgres where it tries to hold all 
partition locks (for example in deadlock detector).
Definitely if NUM_LOCK_PARTITIONS > MAX_SIMUL_LWLOCKS we get this error.

So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES  constants have to 
be replaced with GUCs.
To avoid division, we can specify log2 of this values, so shift can be 
used instead.
And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + 
NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.

Re: Postgres is not able to handle more than 4k tables!?

From

Tom Lane

Date:

08 July 2020, 21:35:55

Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> There are several thousand clients, most of which are executing complex 
> queries.

So, that's really the core of your problem.  We don't promise that
you can run several thousand backends at once.  Usually it's recommended
that you stick a connection pooler in front of a server with (at most)
a few hundred backends.

> So them are not able to process all this invalidation messages and their 
> invalidation message buffer is overflown.
> Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be 
> changed without recompilation of Postgres.
> This is problem N1.

No, this isn't a problem.  Or at least you haven't shown a reason to
think it is.  Sinval overruns are somewhat routine, and we certainly
test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).

> But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
> It is also hardcoded and can't be changed without recompilation:

> #define LOG2_NUM_LOCK_PARTITIONS  4
> #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

> Having just 16 LW-Locks greatly increase conflict probability (taken in 
> account that there are 5k tables and totally about 25k relations).

> It cause huge lw-lock acquisition time for heap_open and planning stage 
> of some queries is increased from milliseconds to several minutes!

Really?

> This is problem number 2. But there is one more flaw we have faced with. 
> We have increased LOG2_NUM_LOCK_PARTITIONS to 8
> and ... clients start to report "too many LWLocks taken" error.
> There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200
> which relation with NUM_LOCK_PARTITIONS  was not mentioned anywhere.

Seems like self-inflicted damage.  I certainly don't recall anyplace
in the docs where we suggest that you can alter that constant without
worrying about consequences.

> So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES  constants have to 
> be replaced with GUCs.

I seriously doubt we'd do that.

            regards, tom lane

RE: Postgres is not able to handle more than 4k tables!?

From

"tsunakawa.takay@fujitsu.com"

Date:

09 July 2020, 00:49:21

From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
> Autovacuum is periodically processing all this 5k relations (because
> them are actively updated).
> And as far as most of this tables are small enough autovacuum complete
> processing of them almost in the same time.
> As a result autovacuum workers produce ~5k invalidation messages in
> short period of time.

How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)?  It's available since PG 12.  It causes autovacuum to
nottruncate the relation.  It's the relation truncation what produces those shared invalidation messages.
 


> But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
> It is also hardcoded and can't be changed without recompilation:
> 
> #define LOG2_NUM_LOCK_PARTITIONS  4
> #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
> 
> Having just 16 LW-Locks greatly increase conflict probability (taken in
> account that there are 5k tables and totally about 25k relations).
> It cause huge lw-lock acquisition time for heap_open and planning stage
> of some queries is increased from milliseconds to several minutes!
> Koda!

The vacuum's relation truncation is also the culprit here, and it can be eliminated by the above storage parameter.  It
acquiresAccess Exclusive lock on the relation.  Without the strong Access Exclusive lock, just running DML statements
usethe fast path locking, which doesn't acquire the lock manager partition lock.
 

The long lwlock wait is a sad story.  The victim is probably exclusive lockers.  When someone holds a shared lock on a
lwlock,the exclusive locker has to wait.  That's OK.  However, if another share locker comes later, it acquires the
lwlockeven though there're waiting exclusive lockers.  That's unfair, but this is the community decision.
 


Regards
Takayuki Tsunakawa

Re: Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

09 July 2020, 06:31:06


On 09.07.2020 03:49, tsunakawa.takay@fujitsu.com wrote:
> From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
>> Autovacuum is periodically processing all this 5k relations (because
>> them are actively updated).
>> And as far as most of this tables are small enough autovacuum complete
>> processing of them almost in the same time.
>> As a result autovacuum workers produce ~5k invalidation messages in
>> short period of time.
> How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)?  It's available since PG 12.  It causes autovacuum
tonot truncate the relation.  It's the relation truncation what produces those shared invalidation messages.
 

Invalidation messages are also caused by statistic update:

#0  0x000055a85f4f5fd6 in RegisterCatcacheInvalidation (cacheId=49, 
hashValue=715727843, dbId=12443)
     at inval.c:483
#1  0x000055a85f4f4dc2 in PrepareToInvalidateCacheTuple 
(relation=0x7f45a34ce5a0, tuple=0x7ffc75bebc70,
     newtuple=0x7f4598e75ef8, function=0x55a85f4f5fc0 
<RegisterCatcacheInvalidation>) at catcache.c:1830
#2  0x000055a85f4f6b21 in CacheInvalidateHeapTuple 
(relation=0x7f45a34ce5a0, tuple=0x7ffc75bebc70,
     newtuple=0x7f4598e75ef8) at inval.c:1149
#3  0x000055a85f016372 in heap_update (relation=0x7f45a34ce5a0, 
otid=0x7f4598e75efc,
     newtup=0x7f4598e75ef8, cid=0, crosscheck=0x0, wait=1 '\001', 
hufd=0x7ffc75bebcf0,
     lockmode=0x7ffc75bebce8) at heapam.c:4245
#4  0x000055a85f016f98 in simple_heap_update (relation=0x7f45a34ce5a0, 
otid=0x7f4598e75efc,
     tup=0x7f4598e75ef8) at heapam.c:4490
#5  0x000055a85f153ec5 in update_attstats (relid=16384, inh=0 '\000', 
natts=1, vacattrstats=0x55a860f0fba0)
     at analyze.c:1619
#6  0x000055a85f151898 in do_analyze_rel (onerel=0x7f45a3480080, 
options=98, params=0x55a860f0f028,
     va_cols=0x0, acquirefunc=0x55a85f15264e <acquire_sample_rows>, 
relpages=26549, inh=0 '\000',
     in_outer_xact=0 '\000', elevel=13) at analyze.c:562
#7  0x000055a85f150be1 in analyze_rel (relid=16384, 
relation=0x7ffc75bec370, options=98,
     params=0x55a860f0f028, va_cols=0x0, in_outer_xact=0 '\000', 
bstrategy=0x55a860f0f0b8) at analyze.c:257
#8  0x000055a85f1e1589 in vacuum (options=98, relation=0x7ffc75bec370, 
relid=16384, params=0x55a860f0f028,
     va_cols=0x0, bstrategy=0x55a860f0f0b8, isTopLevel=1 '\001') at 
vacuum.c:320
#9  0x000055a85f2fd92a in autovacuum_do_vac_analyze (tab=0x55a860f0f020, 
bstrategy=0x55a860f0f0b8)
     at autovacuum.c:2874
#10 0x000055a85f2fcccb in do_autovacuum () at autovacuum.c:2374
>
>> But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
>> It is also hardcoded and can't be changed without recompilation:
>>
>> #define LOG2_NUM_LOCK_PARTITIONS  4
>> #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
>>
>> Having just 16 LW-Locks greatly increase conflict probability (taken in
>> account that there are 5k tables and totally about 25k relations).
>> It cause huge lw-lock acquisition time for heap_open and planning stage
>> of some queries is increased from milliseconds to several minutes!
>> Koda!
> The vacuum's relation truncation is also the culprit here, and it can be eliminated by the above storage parameter.
Itacquires Access Exclusive lock on the relation.  Without the strong Access Exclusive lock, just running DML
statementsuse the fast path locking, which doesn't acquire the lock manager partition lock.
 

Looks like it is not true (at lest for PG9.6):

#0  0x00007fa6d30da087 in semop () from /lib64/libc.so.6
#1  0x0000000000682241 in PGSemaphoreLock 
(sema=sema@entry=0x7fa66f5655d8) at pg_sema.c:387
#2  0x00000000006ec6eb in LWLockAcquire (lock=lock@entry=0x7f23b544f800, 
mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1338
#3  0x00000000006e5560 in LockAcquireExtended 
(locktag=locktag@entry=0x7ffd94883fa0, lockmode=lockmode@entry=1, 
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0 
'\000', reportMemoryError=reportMemoryError@entry=1 '\001', 
locallockp=locallockp@entry=0x7ffd94883f98) at lock.c:962
#4  0x00000000006e29f6 in LockRelationOid (relid=87103837, lockmode=1) 
at lmgr.c:113
#5  0x00000000004a9f55 in relation_open (relationId=87103837, 
lockmode=lockmode@entry=1) at heapam.c:1131
#6  0x00000000004bdc66 in index_open (relationId=<optimized out>, 
lockmode=lockmode@entry=1) at indexam.c:151
#7  0x000000000067be58 in get_relation_info (root=root@entry=0x3a1a758, 
relationObjectId=72079078, inhparent=<optimized out>, 
rel=rel@entry=0x3a2d460) at plancat.c:183
#8  0x000000000067ef45 in build_simple_rel (root=root@entry=0x3a1a758, 
relid=2, reloptkind=reloptkind@entry=RELOPT_BASEREL) at relnode.c:148

Please notice  lockmode=1 (AccessShareLock)

>
> The long lwlock wait is a sad story.  The victim is probably exclusive lockers.  When someone holds a shared lock on
alwlock, the exclusive locker has to wait.  That's OK.  However, if another share locker comes later, it acquires the
lwlockeven though there're waiting exclusive lockers.  That's unfair, but this is the community decision.
 
Yes, I also think that it is the reason of the problem.
Alexandr Korotokov has implemented fair LW-Locks which eliminate such 
kind of problems in some scenarios.
May it also can help here.

Re: Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

09 July 2020, 07:07:38


On 09.07.2020 00:35, Tom Lane wrote:
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
>> There are several thousand clients, most of which are executing complex
>> queries.
> So, that's really the core of your problem.  We don't promise that
> you can run several thousand backends at once.  Usually it's recommended
> that you stick a connection pooler in front of a server with (at most)
> a few hundred backends.
It is not my problem - it is customer's problem.
Certainly the advice to use connection pooler is the first thing we have 
proposed to the customer when we see such larger number of active backends.
Unfortunately it is not always possible (connection pooler is not 
preseving session semantic).
This is why I have proposed builtin connection pooler for Postgres.
But it is different story.


>> So them are not able to process all this invalidation messages and their
>> invalidation message buffer is overflown.
>> Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be
>> changed without recompilation of Postgres.
>> This is problem N1.
> No, this isn't a problem.  Or at least you haven't shown a reason to
> think it is.  Sinval overruns are somewhat routine, and we certainly
> test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).
Certainly cache overrun is not fatal.
But if most of backends are blocked in heap_open pf pg_attribute 
relation then something is not ok with Postgres, isn't it?

> It cause huge lw-lock acquisition time for heap_open and planning stage
>> of some queries is increased from milliseconds to several minutes!
> Really?

Planning time: 75698.602 ms
Execution time: 0.861 ms

>> This is problem number 2. But there is one more flaw we have faced with.
>> We have increased LOG2_NUM_LOCK_PARTITIONS to 8
>> and ... clients start to report "too many LWLocks taken" error.
>> There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200
>> which relation with NUM_LOCK_PARTITIONS  was not mentioned anywhere.
> Seems like self-inflicted damage.  I certainly don't recall anyplace
> in the docs where we suggest that you can alter that constant without
> worrying about consequences.

Looks like you try to convince me that such practice of hardcoding 
constants in code and
not taken in account relation between them is good design pattern?
>> So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES  constants have to
>> be replaced with GUCs.
> I seriously doubt we'd do that.
It's a pity, because such attention is one of the reasons why Postgres 
is pgbench-oriented database showing good results at notebooks
but not at real systems running at power servers (NUMA, SSD, huge amount 
of memory, large number of cores,...).

RE: Postgres is not able to handle more than 4k tables!?

From

"tsunakawa.takay@fujitsu.com"

Date:

09 July 2020, 07:17:02

From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
> Looks like it is not true (at lest for PG9.6):
> 
> #0  0x00007fa6d30da087 in semop () from /lib64/libc.so.6
> #1  0x0000000000682241 in PGSemaphoreLock
> (sema=sema@entry=0x7fa66f5655d8) at pg_sema.c:387
> #2  0x00000000006ec6eb in LWLockAcquire
> (lock=lock@entry=0x7f23b544f800,
> mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1338
> #3  0x00000000006e5560 in LockAcquireExtended
> (locktag=locktag@entry=0x7ffd94883fa0, lockmode=lockmode@entry=1,
> sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0
> '\000', reportMemoryError=reportMemoryError@entry=1 '\001',
> locallockp=locallockp@entry=0x7ffd94883f98) at lock.c:962
> #4  0x00000000006e29f6 in LockRelationOid (relid=87103837, lockmode=1)
> at lmgr.c:113
> #5  0x00000000004a9f55 in relation_open (relationId=87103837,
> lockmode=lockmode@entry=1) at heapam.c:1131
> #6  0x00000000004bdc66 in index_open (relationId=<optimized out>,
> lockmode=lockmode@entry=1) at indexam.c:151
> #7  0x000000000067be58 in get_relation_info (root=root@entry=0x3a1a758,
> relationObjectId=72079078, inhparent=<optimized out>,
> rel=rel@entry=0x3a2d460) at plancat.c:183
> #8  0x000000000067ef45 in build_simple_rel (root=root@entry=0x3a1a758,
> relid=2, reloptkind=reloptkind@entry=RELOPT_BASEREL) at relnode.c:148
> 
> Please notice  lockmode=1 (AccessShareLock)

Ouch, there exists another sad hardcoded value: the number of maximum locks that can be acquired by the fast-path
mechanism.

[LockAcquireExtended]
    /*
     * Attempt to take lock via fast path, if eligible.  But if we remember
     * having filled up the fast path array, we don't attempt to make any
     * further use of it until we release some locks.  It's possible that some
     * other backend has transferred some of those locks to the shared hash
     * table, leaving space free, but it's not worth acquiring the LWLock just
     * to check.  It's also possible that we're acquiring a second or third
     * lock type on a relation we have already locked using the fast-path, but
     * for now we don't worry about that case either.
     */
    if (EligibleForRelationFastPath(locktag, lockmode) &&
        FastPathLocalUseCount < FP_LOCK_SLOTS_PER_BACKEND)
    {

/*
 * We allow a small number of "weak" relation locks (AccessShareLock,
 * RowShareLock, RowExclusiveLock) to be recorded in the PGPROC structure
 * rather than the main lock table.  This eases contention on the lock
 * manager LWLocks.  See storage/lmgr/README for additional details.
 */
#define     FP_LOCK_SLOTS_PER_BACKEND 16


16 looks easily exceeded even in a not-long OLTP transaction... especially the table is partitioned.  I wonder if we're
caughtin the hell of lock manager partition lock contention without knowing it.  I'm afraid other pitfalls are lurking
whenthere are many relations.
 


Regards
Takayuki Tsunakawa

Re: Postgres is not able to handle more than 4k tables!?

From

Stephen Frost

Date:

09 July 2020, 14:57:00

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> > There are several thousand clients, most of which are executing complex
> > queries.
>
> So, that's really the core of your problem.  We don't promise that
> you can run several thousand backends at once.  Usually it's recommended
> that you stick a connection pooler in front of a server with (at most)
> a few hundred backends.

Sure, but that doesn't mean things should completely fall over when we
do get up to larger numbers of backends, which is definitely pretty
common in larger systems.  I'm pretty sure we all agree that using a
connection pooler is recommended, but if there's things we can do to
make the system work at least a bit better when folks do use lots of
connections, provided we don't materially damage other cases, that's
probably worthwhile.

> > So them are not able to process all this invalidation messages and their
> > invalidation message buffer is overflown.
> > Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be
> > changed without recompilation of Postgres.
> > This is problem N1.
>
> No, this isn't a problem.  Or at least you haven't shown a reason to
> think it is.  Sinval overruns are somewhat routine, and we certainly
> test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).

Testing that it doesn't outright break and having it be decently
performant are two rather different things.  I think we're talking more
about performance and not so much about if the system is outright broken
in this case.

> > But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
> > It is also hardcoded and can't be changed without recompilation:
>
> > #define LOG2_NUM_LOCK_PARTITIONS  4
> > #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
>
> > Having just 16 LW-Locks greatly increase conflict probability (taken in
> > account that there are 5k tables and totally about 25k relations).
>
> > It cause huge lw-lock acquisition time for heap_open and planning stage
> > of some queries is increased from milliseconds to several minutes!
>
> Really?

Apparently, given the response down-thread.

> > This is problem number 2. But there is one more flaw we have faced with.
> > We have increased LOG2_NUM_LOCK_PARTITIONS to 8
> > and ... clients start to report "too many LWLocks taken" error.
> > There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200
> > which relation with NUM_LOCK_PARTITIONS  was not mentioned anywhere.
>
> Seems like self-inflicted damage.  I certainly don't recall anyplace
> in the docs where we suggest that you can alter that constant without
> worrying about consequences.

Perhaps not in the docs, but would be good to make note of it somewhere,
as I don't think it's really appropriate to assume these constants won't
ever change and whomever contemplates changing them would appreciate
knowing about other related values..

> > So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES  constants have to
> > be replaced with GUCs.
>
> I seriously doubt we'd do that.

Making them GUCs does seem like it's a few steps too far... but it'd be
nice if we could arrange to have values that don't result in the system
falling over with large numbers of backends and large numbers of tables.
To get a lot of backends, you'd have to set max_connections up pretty
high to begin with- perhaps we should contemplate allowing these values
to vary based on what max_connections is set to?

Thanks,

Stephen

Attachment

signature.asc

Re: Postgres is not able to handle more than 4k tables!?

From

Tom Lane

Date:

09 July 2020, 15:14:29

Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> So, that's really the core of your problem.  We don't promise that
>> you can run several thousand backends at once.  Usually it's recommended
>> that you stick a connection pooler in front of a server with (at most)
>> a few hundred backends.

> Sure, but that doesn't mean things should completely fall over when we
> do get up to larger numbers of backends, which is definitely pretty
> common in larger systems.

As I understood the report, it was not "things completely fall over",
it was "performance gets bad".  But let's get real.  Unless the OP
has a machine with thousands of CPUs, trying to run this way is
counterproductive.

Perhaps in a decade or two such machines will be common enough that
it'll make sense to try to tune Postgres to run well on them.  Right
now I feel no hesitation about saying "if it hurts, don't do that".

            regards, tom lane

Re: Postgres is not able to handle more than 4k tables!?

From

Stephen Frost

Date:

09 July 2020, 15:29:24

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> So, that's really the core of your problem.  We don't promise that
> >> you can run several thousand backends at once.  Usually it's recommended
> >> that you stick a connection pooler in front of a server with (at most)
> >> a few hundred backends.
>
> > Sure, but that doesn't mean things should completely fall over when we
> > do get up to larger numbers of backends, which is definitely pretty
> > common in larger systems.
>
> As I understood the report, it was not "things completely fall over",
> it was "performance gets bad".  But let's get real.  Unless the OP
> has a machine with thousands of CPUs, trying to run this way is
> counterproductive.

Right, the issue is that performance gets bad (or, really, more like
terrible...), and regardless of if it's ideal or not, lots of folks
actually do run PG with thousands of connections, and we know that at
start-up time because they've set max_connections to a sufficiently high
value to support doing exactly that.

> Perhaps in a decade or two such machines will be common enough that
> it'll make sense to try to tune Postgres to run well on them.  Right
> now I feel no hesitation about saying "if it hurts, don't do that".

I disagree that we should completely ignore these use-cases.

Thanks,

Stephen

Attachment

signature.asc

Re: Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

09 July 2020, 15:57:46

Hi Stephen,

Thank you for supporting an opinion that it is the problems not only of 
client system design (I agree it is not so good
idea to have thousands tables and thousands active backends) but also of 
Postgres.

We have made more investigation and found out one more problem in 
Postgres causing "invalidation storm".
There are some log living transactions which prevent autovacuum to do it 
work and  remove dead tuples.
So autovacuum is started once and once again and each time did no 
progress but updated statistics and so sent invalidation messages.
autovacuum_naptime was set to 30 seconds, so each 30 seconds autovacuum 
proceed huge number of tables and initiated large number of invalidation 
messages which quite soon cause overflow of validation message buffers 
for backends performing long OLAP queries.

It makes me think about two possible optimizations:

1. Provide separate invalidation messages for relation metadata and its 
statistic.
So update of statistic should not invalidate relation cache.
The main problem with this proposal is that pg_class contains relpages 
and reltuples columns which conceptually are \ part of relation statistic
but stored in relation cache. If relation statistic is updated, then 
most likely this fields are also changed. So we have to remove this relation
from relation cache in any case.

2. Remember in relation info XID of oldest active transaction at the 
moment of last autovacuum.
At next autovacuum iteration we first of all compare this stored XID 
with current oldest active transaction XID
and bypass vacuuming this relation if XID is not changed.

Thoughts?

>> So, that's really the core of your problem.  We don't promise that
>> you can run several thousand backends at once.  Usually it's recommended
>> that you stick a connection pooler in front of a server with (at most)
>> a few hundred backends.
> Sure, but that doesn't mean things should completely fall over when we
> do get up to larger numbers of backends, which is definitely pretty
> common in larger systems.  I'm pretty sure we all agree that using a
> connection pooler is recommended, but if there's things we can do to
> make the system work at least a bit better when folks do use lots of
> connections, provided we don't materially damage other cases, that's
> probably worthwhile.

I also think that Postgres performance should degrade gradually with 
increasing number
of active backends. Actually further investigations of this particular 
case shows that such large number of
database connections was caused by ... Postgres slowdown.
During normal workflow number of active backends is few hundreds.
But "invalidation storm" cause hangout of queries, so user application 
has to initiate more and more new connections to perform required actions.
Yes, this may be not the best behavior of application in this case. At 
least it should first terminate current connection using 
pg_terminate_backend. I just want to notice that large number of 
backends was not the core of the problem.

>
> Making them GUCs does seem like it's a few steps too far... but it'd be
> nice if we could arrange to have values that don't result in the system
> falling over with large numbers of backends and large numbers of tables.
> To get a lot of backends, you'd have to set max_connections up pretty
> high to begin with- perhaps we should contemplate allowing these values
> to vary based on what max_connections is set to?

I think that optimal value of number of lock partitions should depend 
not on number of connections
but on number of available CPU cores and so expected level on concurrency.
It is hard to propose some portable way to obtain this number.
This is why I think that GUCs is better solution.
Certainly I realize that it is very dangerous parameter which should be 
changed with special care.
Not only because of  MAX_SIMUL_LWLOCKS.

There are few places in Postgres when it tries to lock all partitions 
(deadlock detector, logical replication,...).
If there very thousands of partitions, then such lock will be too 
expensive and we get yet another
popular Postgres program: "deadlock detection storm" when due to high 
contention between backends lock can not be obtained
in deadlock timeout and so initiate deadlock detection. Simultaneous 
deadlock detection performed by all backends
(which tries to take ALL partitions locks) paralyze the system (TPS 
falls down to 0).
Proposed patch for this problem was also rejected (once again - problem 
can be reproduced only of powerful server with large number of cores).

Re: Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

09 July 2020, 16:16:26

On 09.07.2020 18:14, Tom Lane wrote:
>
> As I understood the report, it was not "things completely fall over",
> it was "performance gets bad".  But let's get real.  Unless the OP
> has a machine with thousands of CPUs, trying to run this way is
> counterproductive.
Sorry, that I was not clear. It is actually case when "things completely 
fall over".
If query planning time takes several minutes and so user response time 
is increased from seconds to hours,
then system becomes unusable, doesn't it?

> Perhaps in a decade or two such machines will be common enough that
> it'll make sense to try to tune Postgres to run well on them.  Right
> now I feel no hesitation about saying "if it hurts, don't do that".

Unfortunately we have not to wait for decade or two.
Postgres is faced with multiple problems at existed multiprocessor 
systems (64, 96,.. cores).
And it is not even necessary to initiate thousands of connections: just 
enough to load all this cores and let them compete for some
resource (LW-lock, buffer,...). Even standard pgbench/YCSB benchmarks 
with zipfian distribution may illustrate this problems.

There were many proposed patches which help to improve this situation.
But as far as this patches increase performance only at huge servers 
with large number of cores and show almost no
improvement  (or even some degradation) at standard 4-cores desktops, 
almost none of them were committed.
Consequently our customers have a lot of troubles trying to replace 
Oracle with Postgres and provide the same performance at same
(quite good and expensive) hardware.

Re: Postgres is not able to handle more than 4k tables!?

From

Nikolay Samokhvalov

Date:

09 July 2020, 16:19:34

Hi Konstantin, a silly question: do you consider the workload you have as well-optimized? Can it be optimized further? Reading this thread I have a strong feeling that a very basic set of regular optimization actions is missing here (or not explained): query analysis and optimization based on pg_stat_statements (and, maybe pg_stat_kcache), some method to analyze the state of the server in general, resource consumption, etc.

Do you have some monitoring that covers pg_stat_statements?

Before looking under the hood, I would use multiple pg_stat_statements snapshots (can be analyzed using, say, postgres-checkup or pgCenter) to understand the workload and identify the heaviest queries -- first of all, in terms of total_time, calls, shared buffers reads/hits, temporary files generation. Which query groups are Top-N in each category, have you looked at it?

You mentioned some crazy numbers for the planning time, but why not to analyze the picture holistically and see the overall numbers? Those queries that have increased planning time, what their part of total_time, on the overall picture, in %? (Unfortunately, we cannot see Top-N by planning time in pg_stat_statements till PG13, but it doesn't mean that we cannot have some good understanding of overall picture today, it just requires more work).

If workload analysis & optimization was done holistically already, or not possible due to some reason — pardon me. But if not and if your primary goal is to improve this particular setup ASAP, then the topic could be started in the -performance mailing list first, discussing the workload and its aspects, and only after it's done, raised in -hackers. No?

On Thu, Jul 9, 2020 at 8:57 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Hi Stephen,

Thank you for supporting an opinion that it is the problems not only of
client system design (I agree it is not so good
idea to have thousands tables and thousands active backends) but also of
Postgres.

We have made more investigation and found out one more problem in
Postgres causing "invalidation storm".
There are some log living transactions which prevent autovacuum to do it
work and remove dead tuples.
So autovacuum is started once and once again and each time did no
progress but updated statistics and so sent invalidation messages.
autovacuum_naptime was set to 30 seconds, so each 30 seconds autovacuum
proceed huge number of tables and initiated large number of invalidation
messages which quite soon cause overflow of validation message buffers
for backends performing long OLAP queries.

It makes me think about two possible optimizations:

1. Provide separate invalidation messages for relation metadata and its
statistic.
So update of statistic should not invalidate relation cache.
The main problem with this proposal is that pg_class contains relpages
and reltuples columns which conceptually are \ part of relation statistic
but stored in relation cache. If relation statistic is updated, then
most likely this fields are also changed. So we have to remove this relation
from relation cache in any case.

2. Remember in relation info XID of oldest active transaction at the
moment of last autovacuum.
At next autovacuum iteration we first of all compare this stored XID
with current oldest active transaction XID
and bypass vacuuming this relation if XID is not changed.

Thoughts?

>> So, that's really the core of your problem. We don't promise that
>> you can run several thousand backends at once. Usually it's recommended
>> that you stick a connection pooler in front of a server with (at most)
>> a few hundred backends.
> Sure, but that doesn't mean things should completely fall over when we
> do get up to larger numbers of backends, which is definitely pretty
> common in larger systems. I'm pretty sure we all agree that using a
> connection pooler is recommended, but if there's things we can do to
> make the system work at least a bit better when folks do use lots of
> connections, provided we don't materially damage other cases, that's
> probably worthwhile.

I also think that Postgres performance should degrade gradually with
increasing number
of active backends. Actually further investigations of this particular
case shows that such large number of
database connections was caused by ... Postgres slowdown.
During normal workflow number of active backends is few hundreds.
But "invalidation storm" cause hangout of queries, so user application
has to initiate more and more new connections to perform required actions.
Yes, this may be not the best behavior of application in this case. At
least it should first terminate current connection using
pg_terminate_backend. I just want to notice that large number of
backends was not the core of the problem.

>
> Making them GUCs does seem like it's a few steps too far... but it'd be
> nice if we could arrange to have values that don't result in the system
> falling over with large numbers of backends and large numbers of tables.
> To get a lot of backends, you'd have to set max_connections up pretty
> high to begin with- perhaps we should contemplate allowing these values
> to vary based on what max_connections is set to?

I think that optimal value of number of lock partitions should depend
not on number of connections
but on number of available CPU cores and so expected level on concurrency.
It is hard to propose some portable way to obtain this number.
This is why I think that GUCs is better solution.
Certainly I realize that it is very dangerous parameter which should be
changed with special care.
Not only because of MAX_SIMUL_LWLOCKS.

There are few places in Postgres when it tries to lock all partitions
(deadlock detector, logical replication,...).
If there very thousands of partitions, then such lock will be too
expensive and we get yet another
popular Postgres program: "deadlock detection storm" when due to high
contention between backends lock can not be obtained
in deadlock timeout and so initiate deadlock detection. Simultaneous
deadlock detection performed by all backends
(which tries to take ALL partitions locks) paralyze the system (TPS
falls down to 0).
Proposed patch for this problem was also rejected (once again - problem
can be reproduced only of powerful server with large number of cores).

Re: Postgres is not able to handle more than 4k tables!?

From

Alexander Korotkov

Date:

09 July 2020, 16:37:46

On Thu, Jul 9, 2020 at 6:57 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> 2. Remember in relation info XID of oldest active transaction at the
> moment of last autovacuum.
> At next autovacuum iteration we first of all compare this stored XID
> with current oldest active transaction XID
> and bypass vacuuming this relation if XID is not changed.

This option looks good for me independently of the use case under
consideration.  Long-running transactions are an old and well-known
problem.  If we can skip some useless work here, let's do this.

------
Regards,
Alexander Korotkov

Re: Postgres is not able to handle more than 4k tables!?

From

Stephen Frost

Date:

09 July 2020, 16:47:35

Greetings,

* Konstantin Knizhnik (k.knizhnik@postgrespro.ru) wrote:
> It makes me think about two possible optimizations:
>
> 1. Provide separate invalidation messages for relation metadata and its
> statistic.
> So update of statistic should not invalidate relation cache.
> The main problem with this proposal is that pg_class contains relpages and
> reltuples columns which conceptually are \ part of relation statistic
> but stored in relation cache. If relation statistic is updated, then most
> likely this fields are also changed. So we have to remove this relation
> from relation cache in any case.

I realize this is likely to go over like a lead balloon, but the churn
in pg_class from updating reltuples/relpages has never seemed all that
great to me when just about everything else is so rarely changed, and
only through some user DDL action- and I agree that it seems like those
particular columns are more 'statistics' type of info and less info
about the definition of the relation.  Other columns that do get changed
regularly are relfrozenxid and relminmxid.  I wonder if it's possible to
move all of those elsewhere- perhaps some to the statistics tables as
you seem to be alluding to, and the others to $somewhereelse that is
dedicated to tracking that information which VACUUM is primarily
concerned with.

> 2. Remember in relation info XID of oldest active transaction at the moment
> of last autovacuum.
> At next autovacuum iteration we first of all compare this stored XID with
> current oldest active transaction XID
> and bypass vacuuming this relation if XID is not changed.
>
> Thoughts?

That sounds like an interesting optimization and I agree it'd be nice to
avoid the re-run of autovacuum when we can tell that there's not going
to be anything more we can do.  As noted above, for my part, I think
it'd be nice to move that kind of ongoing maintenance/updates out of
pg_class, but just in general I agree with the idea to store that info
somewhere and wait until there's actually been progress in the global
xmin before re-running a vacuum on a table.  If we can do that somewhere
outside of pg_class, I think that'd be better, but if no one is up for
that kind of a shake-up, then maybe we just put it in pg_class and deal
with the churn there.

> >>So, that's really the core of your problem.  We don't promise that
> >>you can run several thousand backends at once.  Usually it's recommended
> >>that you stick a connection pooler in front of a server with (at most)
> >>a few hundred backends.
> >Sure, but that doesn't mean things should completely fall over when we
> >do get up to larger numbers of backends, which is definitely pretty
> >common in larger systems.  I'm pretty sure we all agree that using a
> >connection pooler is recommended, but if there's things we can do to
> >make the system work at least a bit better when folks do use lots of
> >connections, provided we don't materially damage other cases, that's
> >probably worthwhile.
>
> I also think that Postgres performance should degrade gradually with
> increasing number
> of active backends. Actually further investigations of this particular case
> shows that such large number of
> database connections was caused by ... Postgres slowdown.
> During normal workflow number of active backends is few hundreds.
> But "invalidation storm" cause hangout of queries, so user application has
> to initiate more and more new connections to perform required actions.
> Yes, this may be not the best behavior of application in this case. At least
> it should first terminate current connection using pg_terminate_backend. I
> just want to notice that large number of backends was not the core of the
> problem.

Yeah, this is all getting back to the fact that we don't have an
acceptance criteria or anything like that, where we'd actually hold off
on new connections/queries being allowed in while other things are
happening.  Of course, a connection pooler would address this (and you
could use one and have it still look exactly like PG, if you use, say,
pgbouncer in session-pooling mode, but then you need to have the
application drop/reconnect and not do its own connection pooling..), but
it'd be nice to have something in core for this.

> >Making them GUCs does seem like it's a few steps too far... but it'd be
> >nice if we could arrange to have values that don't result in the system
> >falling over with large numbers of backends and large numbers of tables.
> >To get a lot of backends, you'd have to set max_connections up pretty
> >high to begin with- perhaps we should contemplate allowing these values
> >to vary based on what max_connections is set to?
>
> I think that optimal value of number of lock partitions should depend not on
> number of connections
> but on number of available CPU cores and so expected level on concurrency.
> It is hard to propose some portable way to obtain this number.
> This is why I think that GUCs is better solution.

A GUC for 'number of CPUs' doesn't seem like a bad option to have.  How
to make that work well may be challenging though.

> Certainly I realize that it is very dangerous parameter which should be
> changed with special care.
> Not only because of  MAX_SIMUL_LWLOCKS.

Sure.

> There are few places in Postgres when it tries to lock all partitions
> (deadlock detector, logical replication,...).
> If there very thousands of partitions, then such lock will be too expensive
> and we get yet another
> popular Postgres program: "deadlock detection storm" when due to high
> contention between backends lock can not be obtained
> in deadlock timeout and so initiate deadlock detection. Simultaneous
> deadlock detection performed by all backends
> (which tries to take ALL partitions locks) paralyze the system (TPS falls
> down to 0).
> Proposed patch for this problem was also rejected (once again - problem can
> be reproduced only of powerful server with large number of cores).

That does sound like something that would be good to improve on, though
I haven't looked at the proposed patch or read the associated thread, so
I'm not sure I can really comment on its rejection specifically.

Thanks,

Stephen

Attachment

signature.asc

Re: Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

09 July 2020, 16:49:26


On 09.07.2020 19:19, Nikolay Samokhvalov wrote:
> Hi Konstantin, a silly question: do you consider the workload you have 
> as well-optimized? Can it be optimized further? Reading this thread I 
> have a strong feeling that a very basic set of regular optimization 
> actions is missing here (or not explained): query analysis and 
> optimization based on pg_stat_statements (and, maybe pg_stat_kcache), 
> some method to analyze the state of the server in general, resource 
> consumption, etc.
>
> Do you have some monitoring that covers pg_stat_statements?
>
> Before looking under the hood, I would use multiple pg_stat_statements 
> snapshots (can be analyzed using, say, postgres-checkup or pgCenter) 
> to understand the workload and identify the heaviest queries -- first 
> of all, in terms of total_time, calls, shared buffers reads/hits, 
> temporary files generation. Which query groups are Top-N in each 
> category, have you looked at it?
>
> You mentioned some crazy numbers for the planning time, but why not to 
> analyze the picture holistically and see the overall numbers? Those 
> queries that have increased planning time, what their part of 
> total_time, on the overall picture, in %? (Unfortunately, we cannot 
> see Top-N by planning time in pg_stat_statements till PG13, but it 
> doesn't mean that we cannot have some good understanding of overall 
> picture today, it just requires more work).
>
> If workload analysis & optimization was done holistically already, or 
> not possible due to some reason — pardon me. But if not and if your 
> primary goal is to improve this particular setup ASAP, then the topic 
> could be started in the -performance mailing list first, discussing 
> the workload and its aspects, and only after it's done, raised in 
> -hackers. No?

Certainly, both we and customer has made workload analysis & optimization.
It is not a problem of particular queries, bad plans, resource 
exhaustion,...

Unfortunately there many scenarios when Postgres demonstrates not 
gradual degrade of performance with increasing workload,
but "snow avalanche" whennegative feedback cause very fastparalysis of 
the system.

This case is just one if this scenarios. It is hard to say for sure what 
triggers the avalanche... Long living transaction, huge number of tables,
aggressive autovacuum settings... But there is cascade of negative 
events which cause system which normally function for months to stop 
working at all.

In this particular case we have the following chain:

- long living transaction cause autovacuum to send a lot of invalidation 
message
- this messages cause overflow of invalidation message queues, forcing 
backens to invalidate their caches and reload from catalog.
- too small value of fastpath lock cache cause many concurrent accesses 
to shared lock hash
- contention for LW-lock caused by small number of lock partition cause 
starvation

Re: Postgres is not able to handle more than 4k tables!?

From

Nikolay Samokhvalov

Date:

09 July 2020, 19:00:02

Great idea.

In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum to clean dead tuples in relations that are not currently used in any transaction and when there are no IN_PROGRESS transactions running at RR or S level (which is a very common case because RC is the default level and this is what is actively used in heavily loaded OLTP systems which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but it always wondered that autovacuum has the global XID "horizon".

With such an optimization, the "hot_standby_feedback=on" mode could be implemented also more gracefully, reporting "min(xid)" for ongoing transactions on standbys separately for RC and RR/S levels.

Without this, we cannot have good performance for HTAP use cases for Postgres – the presence of just a small number of long-running transactions, indeed, is known to kill the performance of OLTP workloads quickly. And leading to much faster bloat growth than it could be.

However, maybe I'm wrong in these considerations, or it's impossible / too difficult to implement.

On Thu, Jul 9, 2020 at 9:38 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Jul 9, 2020 at 6:57 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> 2. Remember in relation info XID of oldest active transaction at the
> moment of last autovacuum.
> At next autovacuum iteration we first of all compare this stored XID
> with current oldest active transaction XID
> and bypass vacuuming this relation if XID is not changed.

This option looks good for me independently of the use case under
consideration. Long-running transactions are an old and well-known
problem. If we can skip some useless work here, let's do this.

------
Regards,
Alexander Korotkov

Re: Postgres is not able to handle more than 4k tables!?

From

Stephen Frost

Date:

09 July 2020, 19:07:10

Greetings,

We generally prefer that you don't top-post on these lists.

* Nikolay Samokhvalov (samokhvalov@gmail.com) wrote:
> In addition to this, it would be good to consider another optimization for
> the default transaction isolation level: making autovacuum to clean dead
> tuples in relations that are not currently used in any transaction and when
> there are no IN_PROGRESS transactions running at RR or S level (which is a
> very common case because RC is the default level and this is what is
> actively used in heavily loaded OLTP systems which often suffer from
> long-running transactions). I don't know the details of how easy it would
> be to implement, but it always wondered that autovacuum has the global XID
> "horizon".

Yeah, I've had thoughts along the same lines, though I had some ideas
that we could actually manage to support it even with RR (at least...
not sure about serializable) by considering what tuples the transactions
in the system could actually see (eg: even with RR, a tuple created
after that transaction started and was then deleted wouldn't ever be
able to be seen and therefore could be cleaned up..).  A further thought
on that was to only spend that kind of effort once a tuple had aged a
certain amount, though it depends a great deal on exactly what would
need to be done for this.

Unfortunately, anything in this area is likely to carry a good bit of
risk associated with it as VACUUM doing the wrong thing would be quite
bad.

Thanks,

Stephen

Attachment

signature.asc

Re: Postgres is not able to handle more than 4k tables!?

From

Grigory Smolkin

Date:

09 July 2020, 19:16:37

On 7/8/20 11:41 PM, Konstantin Knizhnik wrote:
>
> So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES  constants have 
> to be replaced with GUCs.
> To avoid division, we can specify log2 of this values, so shift can be 
> used instead.
> And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + 
> NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.
>
Because I was involved in this particular case and don`t want it to 
became a habit, I`m volunteering to test whatever patch this discussion 
will produce.

-- 
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Postgres is not able to handle more than 4k tables!?

From

Alexander Korotkov

Date:

09 July 2020, 21:38:07

On Thu, Jul 9, 2020 at 10:00 PM Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:
> In addition to this, it would be good to consider another optimization for the default transaction isolation level:
makingautovacuum to clean dead tuples in relations that are not currently used in any transaction and when there are no
IN_PROGRESStransactions running at RR or S level (which is a very common case because RC is the default level and this
iswhat is actively used in heavily loaded OLTP systems which often suffer from long-running transactions). I don't know
thedetails of how easy it would be to implement, but it always wondered that autovacuum has the global XID "horizon". 
>
> With such an optimization, the "hot_standby_feedback=on" mode could be implemented also more gracefully, reporting
"min(xid)"for ongoing transactions on standbys separately for RC and RR/S levels. 

Yes, the current way of calculation of dead tuples is lossy, because
we only rely on the oldest xid.  However, if we would keep the oldest
snapshot instead of oldest xmin, long-running transactions wouldn't be
such a disaster.  I don't think this is feasible with the current
snapshot model, because keeping full snapshots instead of just xmins
would bloat shared-memory structs and complicate computations.  But
CSN can certainly support this optimization.

------
Regards,
Alexander Korotkov

RE: Postgres is not able to handle more than 4k tables!?

From

"tsunakawa.takay@fujitsu.com"

Date:

10 July 2020, 02:10:20

From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
> Unfortunately we have not to wait for decade or two.
> Postgres is faced with multiple problems at existed multiprocessor
> systems (64, 96,.. cores).
> And it is not even necessary to initiate thousands of connections: just
> enough to load all this cores and let them compete for some
> resource (LW-lock, buffer,...). Even standard pgbench/YCSB benchmarks
> with zipfian distribution may illustrate this problems.

I concur with you.  VMs and bare metal machines with 100~200 CPU cores and TBs of RAM are already available even on
publicclouds.  The users easily set max_connections to a high value like 10,000, create thousands or tens of thousands
ofrelations, and expect it to go smoothly.  Although it may be a horror for PG developers who know the internals well,
Postgreshas grown a great database to be relied upon.
 

Besides, I don't want people to think like "Postgres cannot scale up on one machine, so we need scale-out."  I
understandsome form of scale-out is necessary for large-scale analytics and web-scale multitenant OLTP, but it would be
desirableto be able to cover the OLTP workloads for one organization/region with the advances in hardware and Postgres
leveragingthose advances, without something like Oracle RAC.
 


> There were many proposed patches which help to improve this situation.
> But as far as this patches increase performance only at huge servers
> with large number of cores and show almost no
> improvement  (or even some degradation) at standard 4-cores desktops,
> almost none of them were committed.
> Consequently our customers have a lot of troubles trying to replace
> Oracle with Postgres and provide the same performance at same
> (quite good and expensive) hardware.

Yeah, it's a pity that the shiny-looking patches from Postgres Pro (mostly from Konstantin san?) -- autoprepare,
built-inconnection pooling, fair lwlock, and revolutionary multi-threaded backend -- haven't gained hot atension.
 


Regards
Takayuki Tsunakawa

Re: Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

10 July 2020, 07:18:36


On 09.07.2020 22:16, Grigory Smolkin wrote:
>
> On 7/8/20 11:41 PM, Konstantin Knizhnik wrote:
>>
>> So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES  constants have 
>> to be replaced with GUCs.
>> To avoid division, we can specify log2 of this values, so shift can 
>> be used instead.
>> And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + 
>> NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.
>>
> Because I was involved in this particular case and don`t want it to 
> became a habit, I`m volunteering to test whatever patch this 
> discussion will produce.
>
You are welcome:)

Attachment

skip_autovacuum.patch

Re: Postgres is not able to handle more than 4k tables!?

From

Laurenz Albe

Date:

10 July 2020, 07:24:25

On Thu, 2020-07-09 at 12:47 -0400, Stephen Frost wrote:
> I realize this is likely to go over like a lead balloon, but the churn
> in pg_class from updating reltuples/relpages has never seemed all that
> great to me when just about everything else is so rarely changed, and
> only through some user DDL action- and I agree that it seems like those
> particular columns are more 'statistics' type of info and less info
> about the definition of the relation.  Other columns that do get changed
> regularly are relfrozenxid and relminmxid.  I wonder if it's possible to
> move all of those elsewhere- perhaps some to the statistics tables as
> you seem to be alluding to, and the others to $somewhereelse that is
> dedicated to tracking that information which VACUUM is primarily
> concerned with.

Perhaps we could create pg_class with a fillfactor less than 100
so we het HOT updates there.
That would be less invasive.

Yours,
Laurenz Albe

Re: Postgres is not able to handle more than 4k tables!?

From

Bruce Momjian

Date:

14 July 2020, 22:59:03

On Fri, Jul 10, 2020 at 02:10:20AM +0000, tsunakawa.takay@fujitsu.com
wrote:
> > There were many proposed patches which help to improve this
> > situation.  But as far as this patches increase performance only
> > at huge servers with large number of cores and show almost no
> > improvement (or even some degradation) at standard 4-cores desktops,
> > almost none of them were committed.  Consequently our customers have
> > a lot of troubles trying to replace Oracle with Postgres and provide
> > the same performance at same (quite good and expensive) hardware.
>
> Yeah, it's a pity that the shiny-looking patches from Postgres Pro
> (mostly from Konstantin san?) -- autoprepare, built-in connection
> pooling, fair lwlock, and revolutionary multi-threaded backend --
> haven't gained hot atension.

Yeah, it is probably time for us to get access to a current large-scale
machine again and really find the bottlenecks.  We seem to next this
every few years.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Postgres is not able to handle more than 4k tables!?

From

Alvaro Herrera

Date:

14 July 2020, 23:17:10

On 2020-Jul-10, Konstantin Knizhnik wrote:

> @@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
>          instuples = tabentry->inserts_since_vacuum;
>          anltuples = tabentry->changes_since_analyze;
>  
> +        rel = RelationIdGetRelation(relid);
> +        oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
> +        RelationClose(rel);

*cough*

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

15 July 2020, 14:28:07

On 15.07.2020 02:17, Alvaro Herrera wrote:
> On 2020-Jul-10, Konstantin Knizhnik wrote:
>
>> @@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
>>           instuples = tabentry->inserts_since_vacuum;
>>           anltuples = tabentry->changes_since_analyze;
>>   
>> +        rel = RelationIdGetRelation(relid);
>> +        oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
>> +        RelationClose(rel);
> *cough*
>
Sorry, Alvaro.
Can you explain this *cough*
You didn't like that relation is opened  just to call GetOldestXmin?
But this functions requires Relation. Do you suggest to rewrite it so 
that it is possible to pass just Oid of relation?

Or you do you think that such calculation of oldestSmin is obscure and 
at lest requires some comment?
Actually, I have copied it from vacuum.c and there is a large comment 
explaining why it is calculated in this  way.
May be it is enough to add reference to vacuum.c?
Or may be create some special function for it?
I just need to oldestXmin in calculated in the same way in vacuum.c and 
autovacuum.c

Re: Postgres is not able to handle more than 4k tables!?

From

Alvaro Herrera

Date:

15 July 2020, 15:03:33

On 2020-Jul-15, Konstantin Knizhnik wrote:

> 
> 
> On 15.07.2020 02:17, Alvaro Herrera wrote:
> > On 2020-Jul-10, Konstantin Knizhnik wrote:
> > 
> > > @@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
> > >           instuples = tabentry->inserts_since_vacuum;
> > >           anltuples = tabentry->changes_since_analyze;
> > > +        rel = RelationIdGetRelation(relid);
> > > +        oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
> > > +        RelationClose(rel);
> > *cough*
> > 
> Sorry, Alvaro.
> Can you explain this *cough*
> You didn't like that relation is opened  just to call GetOldestXmin?
> But this functions requires Relation. Do you suggest to rewrite it so that
> it is possible to pass just Oid of relation?

At that point of autovacuum, you don't have a lock on the relation; the
only thing you have is a pg_class tuple (and we do it that way on
purpose as I recall).  I think asking relcache for it is dangerous, and
moreover requesting relcache for it directly goes counter our normal
coding pattern.  At the very least you should have a comment explaining
why you do it and why it's okay to do it, and also handle the case when
RelationIdGetRelation returns null.

However, looking at the bigger picture I wonder if it would be better to
test the getoldestxmin much later in the process to avoid this whole
issue.  Just carry forward the relation until the point where vacuum is
called ... that may be cleaner?  And the extra cost is not that much.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Postgres is not able to handle more than 4k tables!?

From

Konstantin Knizhnik

Date:

15 July 2020, 17:10:06

On 15.07.2020 18:03, Alvaro Herrera wrote:
> On 2020-Jul-15, Konstantin Knizhnik wrote:
>
>>
>> On 15.07.2020 02:17, Alvaro Herrera wrote:
>>> On 2020-Jul-10, Konstantin Knizhnik wrote:
>>>
>>>> @@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
>>>>            instuples = tabentry->inserts_since_vacuum;
>>>>            anltuples = tabentry->changes_since_analyze;
>>>> +        rel = RelationIdGetRelation(relid);
>>>> +        oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
>>>> +        RelationClose(rel);
>>> *cough*
>>>
>> Sorry, Alvaro.
>> Can you explain this *cough*
>> You didn't like that relation is opened  just to call GetOldestXmin?
>> But this functions requires Relation. Do you suggest to rewrite it so that
>> it is possible to pass just Oid of relation?
> At that point of autovacuum, you don't have a lock on the relation; the
> only thing you have is a pg_class tuple (and we do it that way on
> purpose as I recall).  I think asking relcache for it is dangerous, and
> moreover requesting relcache for it directly goes counter our normal
> coding pattern.  At the very least you should have a comment explaining
> why you do it and why it's okay to do it, and also handle the case when
> RelationIdGetRelation returns null.
>
> However, looking at the bigger picture I wonder if it would be better to
> test the getoldestxmin much later in the process to avoid this whole
> issue.  Just carry forward the relation until the point where vacuum is
> called ... that may be cleaner?  And the extra cost is not that much.
>
Thank you for explanation.
I have prepared new version of the patch which opens relation with care.
Concerning your suggestion to perform this check later (in vacuum_rel() 
I guess?)
I tried to implement it but faced with another problem.

Right now information about autovacuum_oldest_xmin for relationis stored 
in statistic (PgStat_StatTabEntry)
together with other atovacuum related fields like 
autovac_vacuum_timestamp, autovac_analyze_timestamp,...)
I am not sure that it is right place for storing autovacuum_oldest_xmin 
but I didn't want to organize in shared memory yet another hash table
just for keeping this field. May be you can suggest something better...
But right now it is stored here when vacuum is completed.

PgStat_StatTabEntry is obtained by get_pgstat_tabentry_relid() which 
also needs  pgstat_fetch_stat_dbentry(MyDatabaseId) and 
pgstat_fetch_stat_dbentry(InvalidOid).  I do not want to copy all this 
stuff to vacuum.c.
It seems to me to be easier to open relation in 
relation_needs_vacanalyze(), isn;t it?

Attachment

skip_autovacuum-2.patch