Thread: Postgres is not able to handle more than 4k tables!?
I want to explain a bad situation we have encountered with one of our customers. There are ~5000 tables in their database, and what is worse, most of them are actively used. Several flaws of Postgres then bring their system almost to a standstill.

Autovacuum periodically processes all these 5k relations (because they are actively updated), and since most of these tables are small enough, autovacuum finishes processing them at almost the same time. As a result, the autovacuum workers produce ~5k invalidation messages in a short period of time. There are several thousand clients, most of which are executing complex queries, so they are not able to process all these invalidation messages and their invalidation message buffer overflows. The size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and cannot be changed without recompiling Postgres. This is problem N1.

As a result, resetState is set to true, forcing backends to invalidate their caches. So most backends lose their cached metadata and have to access the system catalog to reload it. But then we come to the next show stopper: NUM_LOCK_PARTITIONS. It is also hardcoded and can't be changed without recompilation:

#define LOG2_NUM_LOCK_PARTITIONS  4
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

Having just 16 LW-locks greatly increases the probability of conflicts (taking into account that there are 5k tables and about 25k relations in total). It causes huge LW-lock acquisition times for heap_open, and the planning stage of some queries is increased from milliseconds to several minutes! Koda! This is problem number 2.

But there is one more flaw we have faced. We increased LOG2_NUM_LOCK_PARTITIONS to 8 and ... clients started to report the "too many LWLocks taken" error. There is yet another hardcoded constant, MAX_SIMUL_LWLOCKS = 200, whose relation to NUM_LOCK_PARTITIONS is not mentioned anywhere. But there are several places in Postgres where it tries to hold all partition locks (for example in the deadlock detector), so if NUM_LOCK_PARTITIONS > MAX_SIMUL_LWLOCKS we definitely get this error.

So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs. To avoid division, we can specify log2 of these values, so a shift can be used instead. And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.
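Just to illustrate how these constants interact, here is a small standalone program (this is not Postgres code, only a demonstration of the arithmetic; the constant values are copied from the current sources):

#include <stdio.h>

/* Values as currently hardcoded in the Postgres sources. */
#define LOG2_NUM_LOCK_PARTITIONS    4
#define NUM_LOCK_PARTITIONS         (1 << LOG2_NUM_LOCK_PARTITIONS)
#define MAX_SIMUL_LWLOCKS           200

int
main(void)
{
    printf("default: %d lock partitions\n", NUM_LOCK_PARTITIONS);

    /* What happens if LOG2_NUM_LOCK_PARTITIONS is raised? */
    for (int log2parts = LOG2_NUM_LOCK_PARTITIONS; log2parts <= 10; log2parts++)
    {
        int nparts = 1 << log2parts;

        /*
         * Any code path that takes all partition locks at once (e.g. the
         * deadlock detector) holds at least nparts LWLocks simultaneously,
         * so it has to stay below MAX_SIMUL_LWLOCKS.
         */
        printf("LOG2=%2d -> %4d partitions: %s\n", log2parts, nparts,
               nparts < MAX_SIMUL_LWLOCKS ? "ok" : "\"too many LWLocks taken\"");
    }
    return 0;
}

With LOG2_NUM_LOCK_PARTITIONS = 8 we get 256 partitions, which is already above MAX_SIMUL_LWLOCKS = 200.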
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes: > There are several thousand clients, most of which are executing complex > queries. So, that's really the core of your problem. We don't promise that you can run several thousand backends at once. Usually it's recommended that you stick a connection pooler in front of a server with (at most) a few hundred backends. > So them are not able to process all this invalidation messages and their > invalidation message buffer is overflown. > Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be > changed without recompilation of Postgres. > This is problem N1. No, this isn't a problem. Or at least you haven't shown a reason to think it is. Sinval overruns are somewhat routine, and we certainly test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals). > But then we come to the next show stopper: NUM_LOCK_PARTITIONS. > It is also hardcoded and can't be changed without recompilation: > #define LOG2_NUM_LOCK_PARTITIONS 4 > #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS) > Having just 16 LW-Locks greatly increase conflict probability (taken in > account that there are 5k tables and totally about 25k relations). > It cause huge lw-lock acquisition time for heap_open and planning stage > of some queries is increased from milliseconds to several minutes! Really? > This is problem number 2. But there is one more flaw we have faced with. > We have increased LOG2_NUM_LOCK_PARTITIONS to 8 > and ... clients start to report "too many LWLocks taken" error. > There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200 > which relation with NUM_LOCK_PARTITIONS was not mentioned anywhere. Seems like self-inflicted damage. I certainly don't recall anyplace in the docs where we suggest that you can alter that constant without worrying about consequences. > So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to > be replaced with GUCs. I seriously doubt we'd do that. regards, tom lane
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
> Autovacuum is periodically processing all this 5k relations (because them are actively updated).
> And as far as most of this tables are small enough autovacuum complete processing of them almost in the same time.
> As a result autovacuum workers produce ~5k invalidation messages in short period of time.

How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)? It's available since PG 12. It causes autovacuum to not truncate the relation. It's the relation truncation that produces those shared invalidation messages.

> But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
> It is also hardcoded and can't be changed without recompilation:
>
> #define LOG2_NUM_LOCK_PARTITIONS 4
> #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
>
> Having just 16 LW-Locks greatly increase conflict probability (taken in account that there are 5k tables and totally about 25k relations).
> It cause huge lw-lock acquisition time for heap_open and planning stage of some queries is increased from milliseconds to several minutes!
> Koda!

The vacuum's relation truncation is also the culprit here, and it can be eliminated by the above storage parameter. It acquires Access Exclusive lock on the relation. Without the strong Access Exclusive lock, just running DML statements use the fast path locking, which doesn't acquire the lock manager partition lock.

The long lwlock wait is a sad story. The victim is probably the exclusive lockers. When someone holds a shared lock on an lwlock, the exclusive locker has to wait. That's OK. However, if another share locker comes later, it acquires the lwlock even though there are waiting exclusive lockers. That's unfair, but this is the community decision.

Regards
Takayuki Tsunakawa
On 09.07.2020 03:49, tsunakawa.takay@fujitsu.com wrote:
> From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
>> Autovacuum is periodically processing all this 5k relations (because them are actively updated).
>> And as far as most of this tables are small enough autovacuum complete processing of them almost in the same time.
>> As a result autovacuum workers produce ~5k invalidation messages in short period of time.
> How about trying CREATE/ALTER TABLE WITH (vacuum_truncate = off)? It's available since PG 12. It causes autovacuum to not truncate the relation. It's the relation truncation that produces those shared invalidation messages.

Invalidation messages are also caused by statistics updates:

#0  0x000055a85f4f5fd6 in RegisterCatcacheInvalidation (cacheId=49, hashValue=715727843, dbId=12443) at inval.c:483
#1  0x000055a85f4f4dc2 in PrepareToInvalidateCacheTuple (relation=0x7f45a34ce5a0, tuple=0x7ffc75bebc70, newtuple=0x7f4598e75ef8, function=0x55a85f4f5fc0 <RegisterCatcacheInvalidation>) at catcache.c:1830
#2  0x000055a85f4f6b21 in CacheInvalidateHeapTuple (relation=0x7f45a34ce5a0, tuple=0x7ffc75bebc70, newtuple=0x7f4598e75ef8) at inval.c:1149
#3  0x000055a85f016372 in heap_update (relation=0x7f45a34ce5a0, otid=0x7f4598e75efc, newtup=0x7f4598e75ef8, cid=0, crosscheck=0x0, wait=1 '\001', hufd=0x7ffc75bebcf0, lockmode=0x7ffc75bebce8) at heapam.c:4245
#4  0x000055a85f016f98 in simple_heap_update (relation=0x7f45a34ce5a0, otid=0x7f4598e75efc, tup=0x7f4598e75ef8) at heapam.c:4490
#5  0x000055a85f153ec5 in update_attstats (relid=16384, inh=0 '\000', natts=1, vacattrstats=0x55a860f0fba0) at analyze.c:1619
#6  0x000055a85f151898 in do_analyze_rel (onerel=0x7f45a3480080, options=98, params=0x55a860f0f028, va_cols=0x0, acquirefunc=0x55a85f15264e <acquire_sample_rows>, relpages=26549, inh=0 '\000', in_outer_xact=0 '\000', elevel=13) at analyze.c:562
#7  0x000055a85f150be1 in analyze_rel (relid=16384, relation=0x7ffc75bec370, options=98, params=0x55a860f0f028, va_cols=0x0, in_outer_xact=0 '\000', bstrategy=0x55a860f0f0b8) at analyze.c:257
#8  0x000055a85f1e1589 in vacuum (options=98, relation=0x7ffc75bec370, relid=16384, params=0x55a860f0f028, va_cols=0x0, bstrategy=0x55a860f0f0b8, isTopLevel=1 '\001') at vacuum.c:320
#9  0x000055a85f2fd92a in autovacuum_do_vac_analyze (tab=0x55a860f0f020, bstrategy=0x55a860f0f0b8) at autovacuum.c:2874
#10 0x000055a85f2fcccb in do_autovacuum () at autovacuum.c:2374

>> But then we come to the next show stopper: NUM_LOCK_PARTITIONS.
>> It is also hardcoded and can't be changed without recompilation:
>>
>> #define LOG2_NUM_LOCK_PARTITIONS 4
>> #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
>>
>> Having just 16 LW-Locks greatly increase conflict probability (taken in account that there are 5k tables and totally about 25k relations).
>> It cause huge lw-lock acquisition time for heap_open and planning stage of some queries is increased from milliseconds to several minutes!
>> Koda!
> The vacuum's relation truncation is also the culprit here, and it can be eliminated by the above storage parameter. It acquires Access Exclusive lock on the relation. Without the strong Access Exclusive lock, just running DML statements use the fast path locking, which doesn't acquire the lock manager partition lock.
Looks like it is not true (at least for PG 9.6):

#0  0x00007fa6d30da087 in semop () from /lib64/libc.so.6
#1  0x0000000000682241 in PGSemaphoreLock (sema=sema@entry=0x7fa66f5655d8) at pg_sema.c:387
#2  0x00000000006ec6eb in LWLockAcquire (lock=lock@entry=0x7f23b544f800, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1338
#3  0x00000000006e5560 in LockAcquireExtended (locktag=locktag@entry=0x7ffd94883fa0, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0 '\000', reportMemoryError=reportMemoryError@entry=1 '\001', locallockp=locallockp@entry=0x7ffd94883f98) at lock.c:962
#4  0x00000000006e29f6 in LockRelationOid (relid=87103837, lockmode=1) at lmgr.c:113
#5  0x00000000004a9f55 in relation_open (relationId=87103837, lockmode=lockmode@entry=1) at heapam.c:1131
#6  0x00000000004bdc66 in index_open (relationId=<optimized out>, lockmode=lockmode@entry=1) at indexam.c:151
#7  0x000000000067be58 in get_relation_info (root=root@entry=0x3a1a758, relationObjectId=72079078, inhparent=<optimized out>, rel=rel@entry=0x3a2d460) at plancat.c:183
#8  0x000000000067ef45 in build_simple_rel (root=root@entry=0x3a1a758, relid=2, reloptkind=reloptkind@entry=RELOPT_BASEREL) at relnode.c:148

Please notice lockmode=1 (AccessShareLock).

> The long lwlock wait is a sad story. The victim is probably the exclusive lockers. When someone holds a shared lock on an lwlock, the exclusive locker has to wait. That's OK. However, if another share locker comes later, it acquires the lwlock even though there're waiting exclusive lockers. That's unfair, but this is the community decision.

Yes, I also think that this is the reason for the problem. Alexander Korotkov has implemented fair LW-locks, which eliminate this kind of problem in some scenarios. Maybe it can also help here.
On 09.07.2020 00:35, Tom Lane wrote:
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
>> There are several thousand clients, most of which are executing complex queries.
> So, that's really the core of your problem. We don't promise that you can run several thousand backends at once. Usually it's recommended that you stick a connection pooler in front of a server with (at most) a few hundred backends.

It is not my problem - it is the customer's problem. Certainly the advice to use a connection pooler was the first thing we proposed to the customer when we saw such a large number of active backends. Unfortunately it is not always possible (a connection pooler does not preserve session semantics). This is why I have proposed a builtin connection pooler for Postgres. But that is a different story.

>> So them are not able to process all this invalidation messages and their invalidation message buffer is overflown.
>> Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be changed without recompilation of Postgres.
>> This is problem N1.
> No, this isn't a problem. Or at least you haven't shown a reason to think it is. Sinval overruns are somewhat routine, and we certainly test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals).

Certainly a cache overrun is not fatal. But if most backends are blocked in heap_open of the pg_attribute relation, then something is not OK with Postgres, isn't it?

>> It cause huge lw-lock acquisition time for heap_open and planning stage of some queries is increased from milliseconds to several minutes!
> Really?

Planning time: 75698.602 ms
Execution time: 0.861 ms

>> This is problem number 2. But there is one more flaw we have faced with.
>> We have increased LOG2_NUM_LOCK_PARTITIONS to 8
>> and ... clients start to report "too many LWLocks taken" error.
>> There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200
>> which relation with NUM_LOCK_PARTITIONS was not mentioned anywhere.
> Seems like self-inflicted damage. I certainly don't recall anyplace in the docs where we suggest that you can alter that constant without worrying about consequences.

It looks like you are trying to convince me that the practice of hardcoding constants in the code, without taking the relations between them into account, is a good design pattern?

>> So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs.
> I seriously doubt we'd do that.

It's a pity, because such an attitude is one of the reasons why Postgres is a pgbench-oriented database showing good results on notebooks but not on real systems running on powerful servers (NUMA, SSD, huge amounts of memory, large numbers of cores, ...).
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
> Looks like it is not true (at least for PG 9.6):
>
> #0  0x00007fa6d30da087 in semop () from /lib64/libc.so.6
> #1  0x0000000000682241 in PGSemaphoreLock (sema=sema@entry=0x7fa66f5655d8) at pg_sema.c:387
> #2  0x00000000006ec6eb in LWLockAcquire (lock=lock@entry=0x7f23b544f800, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1338
> #3  0x00000000006e5560 in LockAcquireExtended (locktag=locktag@entry=0x7ffd94883fa0, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0 '\000', reportMemoryError=reportMemoryError@entry=1 '\001', locallockp=locallockp@entry=0x7ffd94883f98) at lock.c:962
> #4  0x00000000006e29f6 in LockRelationOid (relid=87103837, lockmode=1) at lmgr.c:113
> #5  0x00000000004a9f55 in relation_open (relationId=87103837, lockmode=lockmode@entry=1) at heapam.c:1131
> #6  0x00000000004bdc66 in index_open (relationId=<optimized out>, lockmode=lockmode@entry=1) at indexam.c:151
> #7  0x000000000067be58 in get_relation_info (root=root@entry=0x3a1a758, relationObjectId=72079078, inhparent=<optimized out>, rel=rel@entry=0x3a2d460) at plancat.c:183
> #8  0x000000000067ef45 in build_simple_rel (root=root@entry=0x3a1a758, relid=2, reloptkind=reloptkind@entry=RELOPT_BASEREL) at relnode.c:148
>
> Please notice lockmode=1 (AccessShareLock)

Ouch, there exists another sad hardcoded value: the maximum number of locks that can be acquired via the fast-path mechanism.

[LockAcquireExtended]
    /*
     * Attempt to take lock via fast path, if eligible.  But if we remember
     * having filled up the fast path array, we don't attempt to make any
     * further use of it until we release some locks.  It's possible that some
     * other backend has transferred some of those locks to the shared hash
     * table, leaving space free, but it's not worth acquiring the LWLock just
     * to check.  It's also possible that we're acquiring a second or third
     * lock type on a relation we have already locked using the fast-path, but
     * for now we don't worry about that case either.
     */
    if (EligibleForRelationFastPath(locktag, lockmode) &&
        FastPathLocalUseCount < FP_LOCK_SLOTS_PER_BACKEND)
    {

/*
 * We allow a small number of "weak" relation locks (AccessShareLock,
 * RowShareLock, RowExclusiveLock) to be recorded in the PGPROC structure
 * rather than the main lock table.  This eases contention on the lock
 * manager LWLocks.  See storage/lmgr/README for additional details.
 */
#define FP_LOCK_SLOTS_PER_BACKEND 16

16 looks easily exceeded even in a not-so-long OLTP transaction, especially if the table is partitioned. I wonder if we're caught in the hell of lock manager partition lock contention without knowing it. I'm afraid other pitfalls are lurking when there are many relations.

Regards
Takayuki Tsunakawa
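P.S. As a rough illustration of how quickly the 16 slots are exhausted (a simplified model, assuming one weak relation lock per partition and per index touched by the statement):

#include <stdio.h>

#define FP_LOCK_SLOTS_PER_BACKEND 16

int
main(void)
{
    /*
     * Simplified model: a statement touching a partitioned table takes one
     * "weak" lock (AccessShareLock / RowShareLock / RowExclusiveLock) per
     * partition it touches plus one per index of that partition.  Only the
     * first FP_LOCK_SLOTS_PER_BACKEND of them can use the fast path; the
     * rest go through the shared lock table and its partition LWLocks.
     */
    int indexes_per_partition = 2;

    for (int partitions = 4; partitions <= 64; partitions *= 2)
    {
        int weak_locks = partitions * (1 + indexes_per_partition);

        printf("%2d partitions x (1 heap + %d indexes) = %3d weak locks: %s\n",
               partitions, indexes_per_partition, weak_locks,
               weak_locks <= FP_LOCK_SLOTS_PER_BACKEND
               ? "fits in the fast path"
               : "spills into the shared lock table");
    }
    return 0;
}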
Greetings, * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes: > > There are several thousand clients, most of which are executing complex > > queries. > > So, that's really the core of your problem. We don't promise that > you can run several thousand backends at once. Usually it's recommended > that you stick a connection pooler in front of a server with (at most) > a few hundred backends. Sure, but that doesn't mean things should completely fall over when we do get up to larger numbers of backends, which is definitely pretty common in larger systems. I'm pretty sure we all agree that using a connection pooler is recommended, but if there's things we can do to make the system work at least a bit better when folks do use lots of connections, provided we don't materially damage other cases, that's probably worthwhile. > > So them are not able to process all this invalidation messages and their > > invalidation message buffer is overflown. > > Size of this buffer is hardcoded (MAXNUMMESSAGES = 4096) and can not be > > changed without recompilation of Postgres. > > This is problem N1. > > No, this isn't a problem. Or at least you haven't shown a reason to > think it is. Sinval overruns are somewhat routine, and we certainly > test that code path (see CLOBBER_CACHE_ALWAYS buildfarm animals). Testing that it doesn't outright break and having it be decently performant are two rather different things. I think we're talking more about performance and not so much about if the system is outright broken in this case. > > But then we come to the next show stopper: NUM_LOCK_PARTITIONS. > > It is also hardcoded and can't be changed without recompilation: > > > #define LOG2_NUM_LOCK_PARTITIONS 4 > > #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS) > > > Having just 16 LW-Locks greatly increase conflict probability (taken in > > account that there are 5k tables and totally about 25k relations). > > > It cause huge lw-lock acquisition time for heap_open and planning stage > > of some queries is increased from milliseconds to several minutes! > > Really? Apparently, given the response down-thread. > > This is problem number 2. But there is one more flaw we have faced with. > > We have increased LOG2_NUM_LOCK_PARTITIONS to 8 > > and ... clients start to report "too many LWLocks taken" error. > > There is yet another hardcoded constant MAX_SIMUL_LWLOCKS = 200 > > which relation with NUM_LOCK_PARTITIONS was not mentioned anywhere. > > Seems like self-inflicted damage. I certainly don't recall anyplace > in the docs where we suggest that you can alter that constant without > worrying about consequences. Perhaps not in the docs, but would be good to make note of it somewhere, as I don't think it's really appropriate to assume these constants won't ever change and whomever contemplates changing them would appreciate knowing about other related values.. > > So looks like NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to > > be replaced with GUCs. > > I seriously doubt we'd do that. Making them GUCs does seem like it's a few steps too far... but it'd be nice if we could arrange to have values that don't result in the system falling over with large numbers of backends and large numbers of tables. To get a lot of backends, you'd have to set max_connections up pretty high to begin with- perhaps we should contemplate allowing these values to vary based on what max_connections is set to? Thanks, Stephen
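P.S. To be clear, I'm thinking of something very rough like the scaling below- the thresholds are entirely made up, and it would have to be fixed at postmaster start, before shared memory is sized:

#include <stdio.h>

/*
 * Hypothetical scaling of the lock-partition count from max_connections:
 * keep today's 16 for small settings, grow as a power of two (so the
 * partition hash stays a cheap mask), and cap it well below
 * MAX_SIMUL_LWLOCKS.  All numbers here are illustrative only.
 */
static int
lock_partitions_for(int max_connections)
{
    int parts = 16;

    while (parts < 128 && parts * 16 < max_connections)
        parts <<= 1;

    return parts;
}

int
main(void)
{
    int settings[] = {100, 500, 1000, 5000, 10000};

    for (int i = 0; i < 5; i++)
        printf("max_connections=%5d -> %3d lock partitions\n",
               settings[i], lock_partitions_for(settings[i]));
    return 0;
}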
Stephen Frost <sfrost@snowman.net> writes: > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> So, that's really the core of your problem. We don't promise that >> you can run several thousand backends at once. Usually it's recommended >> that you stick a connection pooler in front of a server with (at most) >> a few hundred backends. > Sure, but that doesn't mean things should completely fall over when we > do get up to larger numbers of backends, which is definitely pretty > common in larger systems. As I understood the report, it was not "things completely fall over", it was "performance gets bad". But let's get real. Unless the OP has a machine with thousands of CPUs, trying to run this way is counterproductive. Perhaps in a decade or two such machines will be common enough that it'll make sense to try to tune Postgres to run well on them. Right now I feel no hesitation about saying "if it hurts, don't do that". regards, tom lane
Greetings, * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Stephen Frost <sfrost@snowman.net> writes: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: > >> So, that's really the core of your problem. We don't promise that > >> you can run several thousand backends at once. Usually it's recommended > >> that you stick a connection pooler in front of a server with (at most) > >> a few hundred backends. > > > Sure, but that doesn't mean things should completely fall over when we > > do get up to larger numbers of backends, which is definitely pretty > > common in larger systems. > > As I understood the report, it was not "things completely fall over", > it was "performance gets bad". But let's get real. Unless the OP > has a machine with thousands of CPUs, trying to run this way is > counterproductive. Right, the issue is that performance gets bad (or, really, more like terrible...), and regardless of if it's ideal or not, lots of folks actually do run PG with thousands of connections, and we know that at start-up time because they've set max_connections to a sufficiently high value to support doing exactly that. > Perhaps in a decade or two such machines will be common enough that > it'll make sense to try to tune Postgres to run well on them. Right > now I feel no hesitation about saying "if it hurts, don't do that". I disagree that we should completely ignore these use-cases. Thanks, Stephen
Hi Stephen,

Thank you for supporting the opinion that this is a problem not only of the client system's design (I agree it is not such a good idea to have thousands of tables and thousands of active backends) but also of Postgres.

We have investigated further and found one more problem in Postgres causing the "invalidation storm". There are some long-living transactions which prevent autovacuum from doing its work and removing dead tuples. So autovacuum is started again and again, each time making no progress but updating statistics and therefore sending invalidation messages. autovacuum_naptime was set to 30 seconds, so every 30 seconds autovacuum processed a huge number of tables and generated a large number of invalidation messages, which quite soon overflowed the invalidation message buffers of backends performing long OLAP queries.

It makes me think about two possible optimizations:

1. Provide separate invalidation messages for relation metadata and its statistics, so that an update of the statistics does not invalidate the relation cache. The main problem with this proposal is that pg_class contains the relpages and reltuples columns, which conceptually are part of the relation statistics but are stored in the relation cache. If the relation statistics are updated, then most likely these fields are also changed, so we have to remove this relation from the relation cache in any case.

2. Remember in the relation info the XID of the oldest active transaction at the moment of the last autovacuum. At the next autovacuum iteration we first of all compare this stored XID with the XID of the current oldest active transaction and bypass vacuuming this relation if the XID has not changed (a rough sketch of this is at the end of this mail).

Thoughts?

>> So, that's really the core of your problem. We don't promise that you can run several thousand backends at once. Usually it's recommended that you stick a connection pooler in front of a server with (at most) a few hundred backends.
> Sure, but that doesn't mean things should completely fall over when we do get up to larger numbers of backends, which is definitely pretty common in larger systems. I'm pretty sure we all agree that using a connection pooler is recommended, but if there's things we can do to make the system work at least a bit better when folks do use lots of connections, provided we don't materially damage other cases, that's probably worthwhile.

I also think that Postgres performance should degrade gradually with an increasing number of active backends. Actually, further investigation of this particular case shows that the large number of database connections was caused by ... the Postgres slowdown itself. During normal workload the number of active backends is a few hundred. But the "invalidation storm" caused queries to hang, so the user application initiated more and more new connections to perform the required actions. Yes, this may not be the best behavior of the application in this case - at least it should first terminate the current connection using pg_terminate_backend. I just want to point out that the large number of backends was not the core of the problem.

> Making them GUCs does seem like it's a few steps too far... but it'd be nice if we could arrange to have values that don't result in the system falling over with large numbers of backends and large numbers of tables. To get a lot of backends, you'd have to set max_connections up pretty high to begin with- perhaps we should contemplate allowing these values to vary based on what max_connections is set to?
I think that the optimal number of lock partitions should depend not on the number of connections but on the number of available CPU cores and thus the expected level of concurrency. It is hard to propose a portable way to obtain this number, which is why I think a GUC is the better solution. Certainly I realize that it is a very dangerous parameter which should be changed with special care - and not only because of MAX_SIMUL_LWLOCKS.

There are a few places in Postgres where it tries to lock all partitions (deadlock detector, logical replication, ...). If there are thousands of partitions, such a lock becomes too expensive and we get yet another popular Postgres problem: the "deadlock detection storm", when due to high contention between backends a lock cannot be obtained within deadlock_timeout and so deadlock detection is initiated. Simultaneous deadlock detection performed by all backends (each of which tries to take ALL partition locks) paralyzes the system (TPS falls to 0). A proposed patch for this problem was also rejected (once again, the problem can be reproduced only on a powerful server with a large number of cores).
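Coming back to idea (2) above, here is a rough sketch of the check I have in mind (standalone, with simplified types; autovac_oldest_xmin is a hypothetical new field in the per-table statistics entry):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

/*
 * Simplified stand-in for the per-table statistics entry; in Postgres this
 * would be a new field in PgStat_StatTabEntry (the name is made up) filled
 * with the value GetOldestXmin() returned when autovacuum last processed
 * the table.
 */
typedef struct TabStats
{
    TransactionId autovac_oldest_xmin;  /* hypothetical new field */
    long          dead_tuples;
} TabStats;

/*
 * If the xmin horizon has not advanced since the previous (useless) vacuum
 * of this table, no additional dead tuples can have become removable, so
 * vacuuming it again is pointless.
 */
static bool
autovacuum_can_make_progress(const TabStats *tab, TransactionId oldest_xmin)
{
    return tab->autovac_oldest_xmin != oldest_xmin;
}

int
main(void)
{
    TabStats tab = {.autovac_oldest_xmin = 1000, .dead_tuples = 50000};

    printf("horizon stuck at 1000:    %s\n",
           autovacuum_can_make_progress(&tab, 1000) ? "vacuum" : "skip");
    printf("horizon advanced to 1200: %s\n",
           autovacuum_can_make_progress(&tab, 1200) ? "vacuum" : "skip");
    return 0;
}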
On 09.07.2020 18:14, Tom Lane wrote:
> As I understood the report, it was not "things completely fall over", it was "performance gets bad". But let's get real. Unless the OP has a machine with thousands of CPUs, trying to run this way is counterproductive.

Sorry that I was not clear. It is actually a case where "things completely fall over". If query planning takes several minutes and the user response time is increased from seconds to hours, then the system becomes unusable, doesn't it?

> Perhaps in a decade or two such machines will be common enough that it'll make sense to try to tune Postgres to run well on them. Right now I feel no hesitation about saying "if it hurts, don't do that".

Unfortunately we do not have to wait a decade or two. Postgres already faces multiple problems on existing multiprocessor systems (64, 96, ... cores). And it is not even necessary to initiate thousands of connections: it is enough to load all these cores and let them compete for some resource (an LW-lock, a buffer, ...). Even standard pgbench/YCSB benchmarks with a zipfian distribution can illustrate these problems.

There were many proposed patches which help to improve this situation. But since these patches increase performance only on huge servers with a large number of cores and show almost no improvement (or even some degradation) on standard 4-core desktops, almost none of them were committed. Consequently our customers have a lot of trouble trying to replace Oracle with Postgres and provide the same performance on the same (quite good and expensive) hardware.
Hi Konstantin, a silly question: do you consider the workload you have as well-optimized? Can it be optimized further? Reading this thread I have a strong feeling that a very basic set of regular optimization actions is missing here (or not explained): query analysis and optimization based on pg_stat_statements (and, maybe pg_stat_kcache), some method to analyze the state of the server in general, resource consumption, etc.
Do you have some monitoring that covers pg_stat_statements?
Before looking under the hood, I would use multiple pg_stat_statements snapshots (can be analyzed using, say, postgres-checkup or pgCenter) to understand the workload and identify the heaviest queries -- first of all, in terms of total_time, calls, shared buffers reads/hits, temporary files generation. Which query groups are Top-N in each category, have you looked at it?
You mentioned some crazy numbers for the planning time, but why not analyze the picture holistically and see the overall numbers? Those queries that have increased planning time - what is their share of total_time in the overall picture, in %? (Unfortunately, we cannot see Top-N by planning time in pg_stat_statements till PG13, but that doesn't mean we cannot have a good understanding of the overall picture today; it just requires more work.)
If workload analysis & optimization was done holistically already, or not possible due to some reason — pardon me. But if not and if your primary goal is to improve this particular setup ASAP, then the topic could be started in the -performance mailing list first, discussing the workload and its aspects, and only after it's done, raised in -hackers. No?
On Thu, Jul 9, 2020 at 6:57 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > 2. Remember in relation info XID of oldest active transaction at the > moment of last autovacuum. > At next autovacuum iteration we first of all compare this stored XID > with current oldest active transaction XID > and bypass vacuuming this relation if XID is not changed. This option looks good for me independently of the use case under consideration. Long-running transactions are an old and well-known problem. If we can skip some useless work here, let's do this. ------ Regards, Alexander Korotkov
Greetings, * Konstantin Knizhnik (k.knizhnik@postgrespro.ru) wrote: > It makes me think about two possible optimizations: > > 1. Provide separate invalidation messages for relation metadata and its > statistic. > So update of statistic should not invalidate relation cache. > The main problem with this proposal is that pg_class contains relpages and > reltuples columns which conceptually are \ part of relation statistic > but stored in relation cache. If relation statistic is updated, then most > likely this fields are also changed. So we have to remove this relation > from relation cache in any case. I realize this is likely to go over like a lead balloon, but the churn in pg_class from updating reltuples/relpages has never seemed all that great to me when just about everything else is so rarely changed, and only through some user DDL action- and I agree that it seems like those particular columns are more 'statistics' type of info and less info about the definition of the relation. Other columns that do get changed regularly are relfrozenxid and relminmxid. I wonder if it's possible to move all of those elsewhere- perhaps some to the statistics tables as you seem to be alluding to, and the others to $somewhereelse that is dedicated to tracking that information which VACUUM is primarily concerned with. > 2. Remember in relation info XID of oldest active transaction at the moment > of last autovacuum. > At next autovacuum iteration we first of all compare this stored XID with > current oldest active transaction XID > and bypass vacuuming this relation if XID is not changed. > > Thoughts? That sounds like an interesting optimization and I agree it'd be nice to avoid the re-run of autovacuum when we can tell that there's not going to be anything more we can do. As noted above, for my part, I think it'd be nice to move that kind of ongoing maintenance/updates out of pg_class, but just in general I agree with the idea to store that info somewhere and wait until there's actually been progress in the global xmin before re-running a vacuum on a table. If we can do that somewhere outside of pg_class, I think that'd be better, but if no one is up for that kind of a shake-up, then maybe we just put it in pg_class and deal with the churn there. > >>So, that's really the core of your problem. We don't promise that > >>you can run several thousand backends at once. Usually it's recommended > >>that you stick a connection pooler in front of a server with (at most) > >>a few hundred backends. > >Sure, but that doesn't mean things should completely fall over when we > >do get up to larger numbers of backends, which is definitely pretty > >common in larger systems. I'm pretty sure we all agree that using a > >connection pooler is recommended, but if there's things we can do to > >make the system work at least a bit better when folks do use lots of > >connections, provided we don't materially damage other cases, that's > >probably worthwhile. > > I also think that Postgres performance should degrade gradually with > increasing number > of active backends. Actually further investigations of this particular case > shows that such large number of > database connections was caused by ... Postgres slowdown. > During normal workflow number of active backends is few hundreds. > But "invalidation storm" cause hangout of queries, so user application has > to initiate more and more new connections to perform required actions. > Yes, this may be not the best behavior of application in this case. 
At least > it should first terminate current connection using pg_terminate_backend. I > just want to notice that large number of backends was not the core of the > problem. Yeah, this is all getting back to the fact that we don't have an acceptance criteria or anything like that, where we'd actually hold off on new connections/queries being allowed in while other things are happening. Of course, a connection pooler would address this (and you could use one and have it still look exactly like PG, if you use, say, pgbouncer in session-pooling mode, but then you need to have the application drop/reconnect and not do its own connection pooling..), but it'd be nice to have something in core for this. > >Making them GUCs does seem like it's a few steps too far... but it'd be > >nice if we could arrange to have values that don't result in the system > >falling over with large numbers of backends and large numbers of tables. > >To get a lot of backends, you'd have to set max_connections up pretty > >high to begin with- perhaps we should contemplate allowing these values > >to vary based on what max_connections is set to? > > I think that optimal value of number of lock partitions should depend not on > number of connections > but on number of available CPU cores and so expected level on concurrency. > It is hard to propose some portable way to obtain this number. > This is why I think that GUCs is better solution. A GUC for 'number of CPUs' doesn't seem like a bad option to have. How to make that work well may be challenging though. > Certainly I realize that it is very dangerous parameter which should be > changed with special care. > Not only because of MAX_SIMUL_LWLOCKS. Sure. > There are few places in Postgres when it tries to lock all partitions > (deadlock detector, logical replication,...). > If there very thousands of partitions, then such lock will be too expensive > and we get yet another > popular Postgres program: "deadlock detection storm" when due to high > contention between backends lock can not be obtained > in deadlock timeout and so initiate deadlock detection. Simultaneous > deadlock detection performed by all backends > (which tries to take ALL partitions locks) paralyze the system (TPS falls > down to 0). > Proposed patch for this problem was also rejected (once again - problem can > be reproduced only of powerful server with large number of cores). That does sound like something that would be good to improve on, though I haven't looked at the proposed patch or read the associated thread, so I'm not sure I can really comment on its rejection specifically. Thanks, Stephen
On 09.07.2020 19:19, Nikolay Samokhvalov wrote:
> Hi Konstantin, a silly question: do you consider the workload you have as well-optimized? Can it be optimized further? Reading this thread I have a strong feeling that a very basic set of regular optimization actions is missing here (or not explained): query analysis and optimization based on pg_stat_statements (and, maybe pg_stat_kcache), some method to analyze the state of the server in general, resource consumption, etc.
>
> Do you have some monitoring that covers pg_stat_statements?
>
> Before looking under the hood, I would use multiple pg_stat_statements snapshots (can be analyzed using, say, postgres-checkup or pgCenter) to understand the workload and identify the heaviest queries -- first of all, in terms of total_time, calls, shared buffers reads/hits, temporary files generation. Which query groups are Top-N in each category, have you looked at it?
>
> You mentioned some crazy numbers for the planning time, but why not analyze the picture holistically and see the overall numbers? Those queries that have increased planning time - what is their share of total_time in the overall picture, in %?
>
> If workload analysis & optimization was done holistically already, or not possible due to some reason — pardon me. But if not and if your primary goal is to improve this particular setup ASAP, then the topic could be started in the -performance mailing list first, discussing the workload and its aspects, and only after it's done, raised in -hackers. No?

Certainly, both we and the customer have done workload analysis & optimization. It is not a problem of particular queries, bad plans, resource exhaustion, ...

Unfortunately there are many scenarios in which Postgres demonstrates not a gradual degradation of performance with increasing workload, but a "snow avalanche", where negative feedback causes very fast paralysis of the system. This case is just one of those scenarios. It is hard to say for sure what triggers the avalanche: a long-living transaction, a huge number of tables, aggressive autovacuum settings... But there is a cascade of negative events which causes a system that normally functions for months to stop working at all. In this particular case we have the following chain:

- a long-living transaction causes autovacuum to send a lot of invalidation messages;
- these messages overflow the invalidation message queues, forcing backends to invalidate their caches and reload from the catalog;
- the too-small fast-path lock cache causes many concurrent accesses to the shared lock hash;
- contention for the LW-locks, caused by the small number of lock partitions, leads to starvation.
Great idea.
In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum clean dead tuples in relations that are not currently used in any transaction, when there are no IN_PROGRESS transactions running at the RR or S level (a very common case, because RC is the default level and is what is actively used in heavily loaded OLTP systems, which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but I have always wondered why autovacuum has a global XID "horizon".
With such an optimization, the "hot_standby_feedback=on" mode could also be implemented more gracefully, reporting "min(xid)" for ongoing transactions on standbys separately for the RC and RR/S levels.
Without this, we cannot have good performance for HTAP use cases in Postgres – the presence of just a small number of long-running transactions is indeed known to quickly kill the performance of OLTP workloads, and it leads to much faster bloat growth than necessary.
However, maybe I'm wrong in these considerations, or it's impossible / too difficult to implement.
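To make the idea a bit more concrete, here is a toy model (the real snapshot bookkeeping is of course much more involved; the xids and levels below are made up):

#include <stdio.h>

typedef unsigned int TransactionId;

typedef enum {READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE} IsoLevel;

typedef struct Xact
{
    TransactionId xid;
    IsoLevel      level;
} Xact;

/*
 * Toy model only: today dead tuples are removable only below the oldest
 * xmin of *any* running transaction.  The idea above is that, for relations
 * not touched by the open statements of RC transactions, only RR/S
 * transactions (which keep one snapshot for their whole lifetime) would
 * need to hold the horizon back.
 */
int
main(void)
{
    Xact running[] = {
        {1000, READ_COMMITTED},     /* long-running RC batch job */
        {1500, REPEATABLE_READ},
        {1600, READ_COMMITTED},
    };
    TransactionId global = ~0u;
    TransactionId rr_s_only = ~0u;

    for (int i = 0; i < 3; i++)
    {
        if (running[i].xid < global)
            global = running[i].xid;
        if (running[i].level != READ_COMMITTED && running[i].xid < rr_s_only)
            rr_s_only = running[i].xid;
    }

    printf("horizon today: %u, horizon ignoring RC sessions: %u\n",
           global, rr_s_only);
    return 0;
}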
Greetings, We generally prefer that you don't top-post on these lists. * Nikolay Samokhvalov (samokhvalov@gmail.com) wrote: > In addition to this, it would be good to consider another optimization for > the default transaction isolation level: making autovacuum to clean dead > tuples in relations that are not currently used in any transaction and when > there are no IN_PROGRESS transactions running at RR or S level (which is a > very common case because RC is the default level and this is what is > actively used in heavily loaded OLTP systems which often suffer from > long-running transactions). I don't know the details of how easy it would > be to implement, but it always wondered that autovacuum has the global XID > "horizon". Yeah, I've had thoughts along the same lines, though I had some ideas that we could actually manage to support it even with RR (at least... not sure about serializable) by considering what tuples the transactions in the system could actually see (eg: even with RR, a tuple created after that transaction started and was then deleted wouldn't ever be able to be seen and therefore could be cleaned up..). A further thought on that was to only spend that kind of effort once a tuple had aged a certain amount, though it depends a great deal on exactly what would need to be done for this. Unfortunately, anything in this area is likely to carry a good bit of risk associated with it as VACUUM doing the wrong thing would be quite bad. Thanks, Stephen
On 7/8/20 11:41 PM, Konstantin Knizhnik wrote:
> So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs. To avoid division, we can specify log2 of these values, so a shift can be used instead. And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.

Because I was involved in this particular case and don't want it to become a habit, I'm volunteering to test whatever patch this discussion will produce.

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Jul 9, 2020 at 10:00 PM Nikolay Samokhvalov <samokhvalov@gmail.com> wrote:
> In addition to this, it would be good to consider another optimization for the default transaction isolation level: making autovacuum clean dead tuples in relations that are not currently used in any transaction, when there are no IN_PROGRESS transactions running at the RR or S level (a very common case, because RC is the default level and is what is actively used in heavily loaded OLTP systems, which often suffer from long-running transactions). I don't know the details of how easy it would be to implement, but I have always wondered why autovacuum has a global XID "horizon".
>
> With such an optimization, the "hot_standby_feedback=on" mode could also be implemented more gracefully, reporting "min(xid)" for ongoing transactions on standbys separately for the RC and RR/S levels.

Yes, the current way of calculating dead tuples is lossy, because we only rely on the oldest xid. However, if we kept the oldest snapshot instead of the oldest xmin, long-running transactions wouldn't be such a disaster. I don't think this is feasible with the current snapshot model, because keeping full snapshots instead of just xmins would bloat shared-memory structs and complicate computations. But CSN can certainly support this optimization.

------
Regards,
Alexander Korotkov
From: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>
> Unfortunately we do not have to wait a decade or two.
> Postgres already faces multiple problems on existing multiprocessor systems (64, 96, ... cores).
> And it is not even necessary to initiate thousands of connections: it is enough to load all these cores and let them compete for some resource (an LW-lock, a buffer, ...). Even standard pgbench/YCSB benchmarks with a zipfian distribution can illustrate these problems.

I concur with you. VMs and bare metal machines with 100~200 CPU cores and TBs of RAM are already available even on public clouds. Users easily set max_connections to a high value like 10,000, create thousands or tens of thousands of relations, and expect it to go smoothly. Although it may be a horror for PG developers who know the internals well, Postgres has grown into a great database to be relied upon.

Besides, I don't want people to think like "Postgres cannot scale up on one machine, so we need scale-out." I understand some form of scale-out is necessary for large-scale analytics and web-scale multitenant OLTP, but it would be desirable to be able to cover the OLTP workloads of one organization/region with the advances in hardware and with Postgres leveraging those advances, without something like Oracle RAC.

> There were many proposed patches which help to improve this situation. But since these patches increase performance only on huge servers with a large number of cores and show almost no improvement (or even some degradation) on standard 4-core desktops, almost none of them were committed. Consequently our customers have a lot of trouble trying to replace Oracle with Postgres and provide the same performance on the same (quite good and expensive) hardware.

Yeah, it's a pity that the shiny-looking patches from Postgres Pro (mostly from Konstantin-san?) -- autoprepare, built-in connection pooling, fair lwlock, and the revolutionary multi-threaded backend -- haven't gained hot attention.

Regards
Takayuki Tsunakawa
On 09.07.2020 22:16, Grigory Smolkin wrote:
> On 7/8/20 11:41 PM, Konstantin Knizhnik wrote:
>> So it looks like the NUM_LOCK_PARTITIONS and MAXNUMMESSAGES constants have to be replaced with GUCs. To avoid division, we can specify log2 of these values, so a shift can be used instead. And MAX_SIMUL_LWLOCKS should be defined as NUM_LOCK_PARTITIONS + NUM_INDIVIDUAL_LWLOCKS + NAMED_LWLOCK_RESERVE.
> Because I was involved in this particular case and don't want it to become a habit, I'm volunteering to test whatever patch this discussion will produce.

You are welcome :)
On Thu, 2020-07-09 at 12:47 -0400, Stephen Frost wrote:
> I realize this is likely to go over like a lead balloon, but the churn in pg_class from updating reltuples/relpages has never seemed all that great to me when just about everything else is so rarely changed, and only through some user DDL action- and I agree that it seems like those particular columns are more 'statistics' type of info and less info about the definition of the relation. Other columns that do get changed regularly are relfrozenxid and relminmxid. I wonder if it's possible to move all of those elsewhere- perhaps some to the statistics tables as you seem to be alluding to, and the others to $somewhereelse that is dedicated to tracking that information which VACUUM is primarily concerned with.

Perhaps we could create pg_class with a fillfactor less than 100 so we get HOT updates there. That would be less invasive.

Yours,
Laurenz Albe
On Fri, Jul 10, 2020 at 02:10:20AM +0000, tsunakawa.takay@fujitsu.com wrote:
> > There were many proposed patches which help to improve this situation. But since these patches increase performance only on huge servers with a large number of cores and show almost no improvement (or even some degradation) on standard 4-core desktops, almost none of them were committed. Consequently our customers have a lot of trouble trying to replace Oracle with Postgres and provide the same performance on the same (quite good and expensive) hardware.
>
> Yeah, it's a pity that the shiny-looking patches from Postgres Pro (mostly from Konstantin-san?) -- autoprepare, built-in connection pooling, fair lwlock, and the revolutionary multi-threaded backend -- haven't gained hot attention.

Yeah, it is probably time for us to get access to a current large-scale machine again and really find the bottlenecks. We seem to need this every few years.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee
On 2020-Jul-10, Konstantin Knizhnik wrote: > @@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid, > instuples = tabentry->inserts_since_vacuum; > anltuples = tabentry->changes_since_analyze; > > + rel = RelationIdGetRelation(relid); > + oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel); > + RelationClose(rel); *cough* -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 15.07.2020 02:17, Alvaro Herrera wrote:
> On 2020-Jul-10, Konstantin Knizhnik wrote:
>
>> @@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
>>   	instuples = tabentry->inserts_since_vacuum;
>>   	anltuples = tabentry->changes_since_analyze;
>>
>> +	rel = RelationIdGetRelation(relid);
>> +	oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
>> +	RelationClose(rel);
> *cough*

Sorry, Alvaro. Can you explain this *cough*? You didn't like that the relation is opened just to call GetOldestXmin? But this function requires a Relation. Do you suggest rewriting it so that it is possible to pass just the Oid of the relation? Or do you think that such a calculation of oldestXmin is obscure and at least requires a comment? Actually, I copied it from vacuum.c, where there is a large comment explaining why it is calculated this way. Maybe it is enough to add a reference to vacuum.c? Or maybe create a special function for it? I just need oldestXmin to be calculated in the same way in vacuum.c and autovacuum.c.
On 2020-Jul-15, Konstantin Knizhnik wrote: > > > On 15.07.2020 02:17, Alvaro Herrera wrote: > > On 2020-Jul-10, Konstantin Knizhnik wrote: > > > > > @@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid, > > > instuples = tabentry->inserts_since_vacuum; > > > anltuples = tabentry->changes_since_analyze; > > > + rel = RelationIdGetRelation(relid); > > > + oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel); > > > + RelationClose(rel); > > *cough* > > > Sorry, Alvaro. > Can you explain this *cough* > You didn't like that relation is opened just to call GetOldestXmin? > But this functions requires Relation. Do you suggest to rewrite it so that > it is possible to pass just Oid of relation? At that point of autovacuum, you don't have a lock on the relation; the only thing you have is a pg_class tuple (and we do it that way on purpose as I recall). I think asking relcache for it is dangerous, and moreover requesting relcache for it directly goes counter our normal coding pattern. At the very least you should have a comment explaining why you do it and why it's okay to do it, and also handle the case when RelationIdGetRelation returns null. However, looking at the bigger picture I wonder if it would be better to test the getoldestxmin much later in the process to avoid this whole issue. Just carry forward the relation until the point where vacuum is called ... that may be cleaner? And the extra cost is not that much. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 15.07.2020 18:03, Alvaro Herrera wrote:
> At that point of autovacuum, you don't have a lock on the relation; the only thing you have is a pg_class tuple (and we do it that way on purpose as I recall). I think asking relcache for it is dangerous, and moreover requesting relcache for it directly goes counter our normal coding pattern. At the very least you should have a comment explaining why you do it and why it's okay to do it, and also handle the case when RelationIdGetRelation returns null.
>
> However, looking at the bigger picture I wonder if it would be better to test the getoldestxmin much later in the process to avoid this whole issue. Just carry forward the relation until the point where vacuum is called ... that may be cleaner? And the extra cost is not that much.

Thank you for the explanation. I have prepared a new version of the patch which opens the relation with care.

Concerning your suggestion to perform this check later (in vacuum_rel(), I guess?): I tried to implement it but faced another problem. Right now the information about autovacuum_oldest_xmin for a relation is stored in the statistics (PgStat_StatTabEntry), together with the other autovacuum-related fields (autovac_vacuum_timestamp, autovac_analyze_timestamp, ...). I am not sure that this is the right place for storing autovacuum_oldest_xmin, but I didn't want to organize yet another hash table in shared memory just to keep this field. Maybe you can suggest something better... But right now it is stored there when vacuum is completed. The PgStat_StatTabEntry is obtained by get_pgstat_tabentry_relid(), which also needs pgstat_fetch_stat_dbentry(MyDatabaseId) and pgstat_fetch_stat_dbentry(InvalidOid). I do not want to copy all this stuff to vacuum.c. It seems easier to open the relation in relation_needs_vacanalyze(), isn't it?