Thread: poor performance with Context Switch Storm at TPC-W.
Hi, all.

One of my customers has run into poor performance caused by a
context switch storm (CSStorm) with the configuration below.
Normally CS is about 5000 and WIPS = 360; when the storm occurs,
CS is about 100000 and WIPS drops to 60 or less.
(WIPS = number of web interactions per second.)

We are investigating with a patch that collects LWLock statistics.
I suspected contention on BufMappingLock, but in the collected results
the occurrence of the CSStorm does not correspond to an increase in
BufMappingLock counts. Instead, SubtransControlLock and SubTrans counts
were increasing. I do not yet understand what causes the CSStorm.

[DB server] * 1
  Intel Xeon 3.0GHz * 4 (2 CPUs, H/T on)
  4GB memory
  Red Hat Enterprise Linux ES release 4 (Nahant Update 3)
  Linux version 2.6.9-34.ELsmp
  PostgreSQL 8.1.3 (version 8.2, HEAD as of 6/15, shows the same behavior)
  shared_buffers = 131072
  temp_buffers = 1000
  max_connections = 300
[AP server] * 2
  200 pooled connections, TPC-W model workload
[Client] * 4
  TPC-W model workload

(1) I have read the following discussion:
http://archives.postgresql.org/pgsql-hackers/2006-05/msg01003.php
  From: Tom Lane <tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us>
  To: josh ( at ) agliodbs ( dot ) com
  Subject: Re: Further reduction of bufmgr lock contention
  Date: Wed, 24 May 2006 15:25:26 -0400
If there is a patch or an investigation technique for this, could
someone point me to it?

(2) Many sequential scans seem to run while the CSStorm occurs.
When reading tuples, the visibility check appears to consult the
subtransaction machinery for every transaction. How likely is it
that LWLock contention on the subtransaction structures is the
cause of the CSStorm?

best regards.

--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
"Katsuhiko Okano" <okano.katsuhiko@oss.ntt.co.jp> wrote > > The problem has occurred in my customer. > poor performance with Context Switch Storm occurred > with the following composition. > Usually, CS is about 5000, WIPS=360. > when CSStorm occurrence, CS is about 100000, WIPS=60 or less. > > Intel Xeon 3.0GHz*4(2CPU * H/T ON) > 4GB Memory Do you have bgwriter on and what's the parameters? I read a theory somewhere that bgwriter scan a large portion of memory and cause L1/L2 thrushing, so with HT on, the other backends sharing the physical processor with it also get thrashed ... So try to turn bgwriter off or turn HT off see what's the difference. Regards, Qingqing
Katsuhiko Okano wrote:

> I suspected contention on BufMappingLock, but in the collected results
> the occurrence of the CSStorm does not correspond to an increase in
> BufMappingLock counts. Instead, SubtransControlLock and SubTrans counts
> were increasing. I do not yet understand what causes the CSStorm.

Please see this thread:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg01547.php
(actually it's a single message AFAICT)

This was applied on the 8.2dev code, so I'm surprised that 8.2dev
behaves the same as 8.1.  Does your problem have any relationship to
what's described there?

I also wondered whether the problem may be that the number of SLRU
buffers we use for subtrans is too low.  But the number was increased
from the default 8 to 32 in 8.2dev as well.  Maybe you could try
increasing that even further, say to 128, and see if the problem is
still there.  (src/include/access/subtrans.h, NUM_SUBTRANS_BUFFERS)

-- 
Alvaro Herrera                    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
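For reference, the experiment Alvaro suggests is just a compile-time
constant change followed by a rebuild. A minimal sketch against the
header he names; the surrounding comment text in subtrans.h may differ
from what is shown here:

    /* src/include/access/subtrans.h */
    /* Number of SLRU buffers to use for subtrans (8 in 8.1, 32 in 8.2dev) */
    #define NUM_SUBTRANS_BUFFERS    128     /* experimental value */

After editing the header, a full rebuild and server restart are needed,
since the value is compiled into the backend.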
Katsuhiko,

Have you tried turning HT off?  HT is not generally considered (even by
Intel) a good idea for database applications.

--Josh
hello.

> Do you have bgwriter on, and what are its parameters? I read a theory
> somewhere that the bgwriter scans a large portion of memory and causes
> L1/L2 thrashing, so with HT on, the other backends sharing the physical
> processor with it also get thrashed ... So try turning bgwriter off, or
> turning HT off, and see what difference that makes.

bgwriter is ON. From postgresql.conf:

> # - Background writer -
>
> bgwriter_delay = 200            # 10-10000 milliseconds between rounds
> bgwriter_lru_percent = 1.0      # 0-100% of LRU buffers scanned/round
> bgwriter_lru_maxpages = 5       # 0-1000 buffers max written/round
> bgwriter_all_percent = 0.333    # 0-100% of all buffers scanned/round
> bgwriter_all_maxpages = 5       # 0-1000 buffers max written/round

I also tried turning H/T off, but the CSStorm still occurred.
Normally CS is about 5000; when the storm occurs, CS is about 70000.
(CS is lower than with H/T on; I think that is simply because CPU
throughput dropped.)

Regards
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Hi.

Alvaro Herrera wrote:
> Katsuhiko Okano wrote:
>
> > I suspected contention on BufMappingLock, but in the collected results
> > the occurrence of the CSStorm does not correspond to an increase in
> > BufMappingLock counts. Instead, SubtransControlLock and SubTrans
> > counts were increasing. I do not yet understand what causes the
> > CSStorm.
>
> Please see this thread:
> http://archives.postgresql.org/pgsql-hackers/2005-11/msg01547.php
> (actually it's a single message AFAICT)
>
> This was applied on the 8.2dev code, so I'm surprised that 8.2dev
> behaves the same as 8.1.
>
> Does your problem have any relationship to what's described there?

It is probably related. I cannot yet tell whether the locking scheme
itself is bad or whether the number of SLRU buffers is simply too small.

> I also wondered whether the problem may be that the number of SLRU
> buffers we use for subtrans is too low.  But the number was increased
> from the default 8 to 32 in 8.2dev as well.  Maybe you could try
> increasing that even further, say to 128, and see if the problem is
> still there.  (src/include/access/subtrans.h, NUM_SUBTRANS_BUFFERS)

With PostgreSQL 8.2, I changed NUM_SUBTRANS_BUFFERS to 128, recompiled,
and measured again. The CSStorm did not occur. WIPS was about 400
(but WIPS dropped to about 320 at intervals of 4 to 6 minutes).

If the number of SLRU buffers is the real issue, then increasing the
number of buffers in PostgreSQL 8.1.4 should give the same result.
(The CLOG and multixact buffers also grow, but I think that effect is
small.) I have now set NUM_SLRU_BUFFERS to 128 in PostgreSQL 8.1.4 and
that measurement is in progress.

regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Subject: CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)
From: Katsuhiko Okano
Katsuhiko Okano wrote:
> With PostgreSQL 8.2, I changed NUM_SUBTRANS_BUFFERS to 128, recompiled,
> and measured again. The CSStorm did not occur. WIPS was about 400.

I measured again. The storm did not occur in a 30-minute run, but in a
3-hour run it occurred after about 1 hour and 10 minutes had passed.
Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
postpones it.

> If the number of SLRU buffers is the real issue, then increasing the
> number of buffers in PostgreSQL 8.1.4 should give the same result.
> (The CLOG and multixact buffers also grow, but I think that effect is
> small.) I have now set NUM_SLRU_BUFFERS to 128 in PostgreSQL 8.1.4 and
> that measurement is in progress.

With 8.1.4 the CSStorm likewise occurred after about 1 hour and 10
minutes.

One strange point: the LWLock counts for the SLRU (LRU) buffers stay at
zero until the CSStorm starts. Once it starts, both shared and exclusive
locks are requested and most of them are kept waiting (exclusive locks
on SubtransControlLock increase rapidly after the storm starts). Does
the code take a different path depending on whether the CSStorm has
started or not?

regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Subject: Re: CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)
From: Tom Lane
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
> Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
> postpones it.

Can you provide a reproducible test case for this?

			regards, tom lane
Subject: Re: CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)
From: Katsuhiko Okano
"Tom Lane <tgl@sss.pgh.pa.us>" wrote: > Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes: > > It does not solve, even if it increases the number of NUM_SUBTRANS_BUFFERS. > > The problem was only postponed. > > Can you provide a reproducible test case for this? Seven machines are required in order to perform measurement. (DB*1,AP*2,CLient*4) Enough work load was not able to be given in two machines. (DB*1,{AP+CL}*1) It was not able to reappear to a multiplex run of pgbench or a simple SELECT query. TPC-W of a work load tool used this time is a full scratch. Regrettably it cannot open to the public. If there is a work load tool of a free license, I would like to try. I will show if there is information required for others. The patch which outputs the number of times of LWLock was used this time. The following is old example output. FYI. # SELECT * FROM pg_stat_lwlocks;kind | pg_stat_get_lwlock_name | sh_call | sh_wait | ex_call | ex_wait | sleep ------+----------------------------+------------+-----------+-----------+-----------+------- 0 | BufMappingLock | 559375542 | 33542 | 320092 | 24025 | 0 1 | BufFreelistLock | 0 | 0 | 370709 | 47 | 0 2 | LockMgrLock | 0 | 0 | 41718885 | 734502 | 0 3 | OidGenLock | 33 | 0 | 0 | 0 | 0 4 | XidGenLock | 12572279 | 10095 | 11299469 | 20089 | 0 5 | ProcArrayLock | 8371330 | 72052 | 16965667 | 603294 | 0 6 | SInvalLock | 38822428 | 435 | 25917 | 128 | 0 7 | FreeSpaceLock | 0 | 0 | 16787 | 4 | 0 8 | WALInsertLock | 0 | 0 | 1239911 | 885 | 0 9 | WALWriteLock | 0 | 0 | 69907 | 5589 | 0 10 | ControlFileLock | 0 | 0 | 16686 | 1 | 0 11 | CheckpointLock | 0 | 0 | 34 | 0 | 0 12 | CheckpointStartLock | 69509 | 0 | 34 | 1 | 0 13 | CLogControlLock | 0 | 0 | 236763 | 183 | 0 14 | SubtransControlLock | 0 | 0 | 753773945 | 205273395 | 0 15 | MultiXactGenLock | 66 | 0 | 0 | 0 | 0 16 | MultiXactOffsetControlLock | 0 | 0 | 35 | 0 | 0 17 | MultiXactMemberControlLock | 0 | 0 | 34 | 0 | 0 18 | RelCacheInitLock | 0 | 0 | 0 | 0 | 0 19 | BgWriterCommLock | 0 | 0 | 61457 | 1 | 0 20 | TwoPhaseStateLock | 33 | 0 | 0 | 0 | 0 21 | TablespaceCreateLock | 0 | 0 | 0 | 0 | 0 22 | BufferIO | 0 | 0 | 695627 | 16 | 0 23 | BufferContent | 3568231805 | 1897 | 1361394 | 829 | 0 24 | CLog | 0 | 0 | 0 | 0 | 0 25 | SubTrans | 138571621 | 143208883 | 8122181 | 8132646 | 0 26 | MultiXactOffset | 0 | 0 | 0 | 0 | 0 27 | MultiXactMember | 0 | 0 | 0 | 0 | 0 (28 rows) I am pleased if interested. regards, -------- Katsuhiko Okano okano katsuhiko _at_ oss ntt co jp
Katsuhiko Okano wrote:
> "Tom Lane <tgl@sss.pgh.pa.us>" wrote:
>> Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
>>> Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
>>> postpones it.
>> Can you provide a reproducible test case for this?
>
> Seven machines are needed to run the measurement (DB*1, AP*2, Client*4).
> We could not generate enough load with only two machines (DB*1, {AP+CL}*1).
>
> I could not reproduce the problem with concurrent pgbench runs or with
> a simple SELECT query. The TPC-W workload tool we used was written from
> scratch in-house; regrettably, we cannot release it. If there is a
> freely licensed workload tool, I would like to try it.

FYI: there is a free tpc-w implementation done by Jan available at:
http://pgfoundry.org/projects/tpc-w-php/


Stefan
Hi folks,

From: Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>
Subject: Re: CSStorm occurred again by postgreSQL8.2. (Re: [HACKERS] poor
Date: Wed, 19 Jul 2006 12:53:53 +0200

> Katsuhiko Okano wrote:
> > "Tom Lane <tgl@sss.pgh.pa.us>" wrote:
> >> Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
> >>> Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
> >>> postpones it.
> >> Can you provide a reproducible test case for this?
> >
> > Seven machines are needed to run the measurement (DB*1, AP*2, Client*4).
> > We could not generate enough load with only two machines (DB*1, {AP+CL}*1).
> >
> > I could not reproduce the problem with concurrent pgbench runs or with
> > a simple SELECT query. The TPC-W workload tool we used was written
> > from scratch in-house; regrettably, we cannot release it. If there is
> > a freely licensed workload tool, I would like to try it.
>
> FYI: there is a free tpc-w implementation done by Jan available at:
> http://pgfoundry.org/projects/tpc-w-php/

FYI(2): There is one more (pseudo) TPC-W implementation, by OSDL:

http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/

One more comment: Katsuhiko's team is using their own TPC-W-like
benchmark suite, and he cannot make it public. Also, his point is that
he tried to reproduce the CSS phenomenon using pgbench and a program
issuing many concurrent SELECT queries on a single table, but those did
not reproduce the CSS.

Regards,
Masanori

---
Masanori ITOH
NTT OSS Center, Nippon Telegraph and Telephone Corporation
e-mail: ito.masanori@oss.ntt.co.jp
phone : +81-3-5860-5015
On Fri, Jul 14, 2006 at 02:58:36PM +0900, Katsuhiko Okano wrote:
> The CSStorm did not occur. WIPS was about 400
> (but WIPS dropped to about 320 at intervals of 4 to 6 minutes).

If you haven't changed checkpoint timeout, this drop-off every 4-6
minutes indicates that you need to make the bgwriter more aggressive.

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
> > > If there is a freely licensed workload tool, I would like to try it.
> >
> > FYI: there is a free tpc-w implementation done by Jan available at:
> > http://pgfoundry.org/projects/tpc-w-php/
>
> FYI(2): There is one more (pseudo) TPC-W implementation, by OSDL:
>
> http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/

Thank you for the information. I'll try it.

Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
"Jim C. Nasby" wrote: > If you haven't changed checkpoint timeout, this drop-off every 4-6 > minutes indicates that you need to make the bgwriter more aggressive. I'll say to a customer when proposing and explaining. Thank you for the information. Regards, -------- Katsuhiko Okano okano katsuhiko _at_ oss ntt co jp
Subject: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)
From: ITAGAKI Takahiro
Hi hackers,

I tackled the performance problem in the SUBTRANS module together with
Okano. We have reached the conclusion that the SubTrans log is read
heavily under a specific access pattern in my TPC-W implementation.
There seems to be awful traffic on SUBTRANS when checking the visibility
of tuples in HeapTupleSatisfiesSnapshot(). I'll report more details
later.

BTW, I wrote a patch that collects statistics on light-weight locks for
this analysis. We already have the trace_lwlocks option, but this patch
can collect statistics with less impact. The following is an output of
the patch (on 8.1). Are you interested in the feature? If so, I'll port
it to HEAD and post it.

> # SELECT * FROM pg_stat_lwlocks;
>  kind | pg_stat_get_lwlock_name    | sh_call    | sh_wait   | ex_call   | ex_wait   |
> ------+----------------------------+------------+-----------+-----------+-----------+-
>     0 | BufMappingLock             |  559375542 |     33542 |    320092 |     24025 |
>     1 | BufFreelistLock            |          0 |         0 |    370709 |        47 |
>     2 | LockMgrLock                |          0 |         0 |  41718885 |    734502 |
>     3 | OidGenLock                 |         33 |         0 |         0 |         0 |
>     4 | XidGenLock                 |   12572279 |     10095 |  11299469 |     20089 |
>     5 | ProcArrayLock              |    8371330 |     72052 |  16965667 |    603294 |
>     6 | SInvalLock                 |   38822428 |       435 |     25917 |       128 |
>     7 | FreeSpaceLock              |          0 |         0 |     16787 |         4 |
>     8 | WALInsertLock              |          0 |         0 |   1239911 |       885 |
>     9 | WALWriteLock               |          0 |         0 |     69907 |      5589 |
>    10 | ControlFileLock            |          0 |         0 |     16686 |         1 |
>    11 | CheckpointLock             |          0 |         0 |        34 |         0 |
>    12 | CheckpointStartLock        |      69509 |         0 |        34 |         1 |
>    13 | CLogControlLock            |          0 |         0 |    236763 |       183 |
>    14 | SubtransControlLock        |          0 |         0 | 753773945 | 205273395 |
>    15 | MultiXactGenLock           |         66 |         0 |         0 |         0 |
>    16 | MultiXactOffsetControlLock |          0 |         0 |        35 |         0 |
>    17 | MultiXactMemberControlLock |          0 |         0 |        34 |         0 |
>    18 | RelCacheInitLock           |          0 |         0 |         0 |         0 |
>    19 | BgWriterCommLock           |          0 |         0 |     61457 |         1 |
>    20 | TwoPhaseStateLock          |         33 |         0 |         0 |         0 |
>    21 | TablespaceCreateLock       |          0 |         0 |         0 |         0 |
>    22 | BufferIO                   |          0 |         0 |    695627 |        16 |
>    23 | BufferContent              | 3568231805 |      1897 |   1361394 |       829 |
>    24 | CLog                       |          0 |         0 |         0 |         0 |
>    25 | SubTrans                   |  138571621 | 143208883 |   8122181 |   8132646 |
>    26 | MultiXactOffset            |          0 |         0 |         0 |         0 |
>    27 | MultiXactMember            |          0 |         0 |         0 |         0 |
> (28 rows)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Subject: Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)
From: Katsuhiko Okano
Hi, all.

We have found the cause of the CSStorm problem discussed in the
previous mails and worked around it with a provisional patch, so here
is a report.

> Subject: [HACKERS] poor performance with Context Switch Storm at TPC-W.
> Date: Tue, 11 Jul 2006 20:09:24 +0900
> From: Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp>
>
> poor performance with Context Switch Storm occurred
> with the following composition.

Background:
PostgreSQL 8.0 added SAVEPOINT support, so every transaction can
contain subtransactions. When judging the visibility of a tuple, we
need to know whether the XID that inserted it belongs to a top
transaction or to a subtransaction (e.g. when checking whether XMIN has
committed). To make that judgment we have to access SubTrans (the data
structure that maps a transaction ID to its parent), and SubTrans is
accessed through an LRU buffer.

Conditions under which the phenomenon occurs:
- There are transactions that read tuples in large numbers and at high
  frequency (typically sequential scans).
- There are updating transactions at a moderate rate.
- There is a sufficiently long-lived transaction.

Points of view:
(A) The buffer replacement algorithm is poor. The page being swapped
    out does not get a fresh timestamp until the swap-out completes.
    If other pages are accessed while the swap is in progress, the
    page being swapped out looks like the least recently used one and
    is chosen for swap-out again (when the workload is high).
(B) Consulting SubTrans on every tuple visibility check is too
    frequent. If many processes wait on the LWLock using semop, a
    CSStorm results.

Results:
For (A), I created a patch that excludes pages whose I/O (read or
write) is in progress from the replacement candidates. (It keeps a
"betterslot" for the case where every page is I/O-in-progress.) With
that patch applied, the problem still recurred, so it is not a
fundamental fix.

For (B), a patch was created that treats every transaction as a top
transaction for the purpose of the check (thank you, ITAGAKI). With
that patch applied we measured for 8 hours and the CSStorm did not
occur.

Discussion:
(1) Since this workload uses neither SAVEPOINT nor PL/pgSQL error
    traps, subtransactions are unnecessary here. Would it be better to
    implement a mode that does not use subtransactions?
(2) It would be better still if the lookups could be cached with a
    structure like CLOG's, so that the SLRU buffer does not have to be
    searched on every visibility check.

Are there problems with these ideas, or other suggestions?

--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
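To make point (B) concrete, here is a minimal sketch, in C, of the kind
of short-circuit described above. It is not the actual provisional
patch (whose code was not posted); "subtransactions_disabled" is a
hypothetical flag standing in for however the mode suggested in (1)
would be selected:

    /*
     * Sketch only: if we know no SAVEPOINTs (and hence no subtransactions)
     * are in use, every XID is already a top-level XID and the pg_subtrans
     * lookup can be skipped entirely.
     */
    static TransactionId
    GetTopmostXidForVisibility(TransactionId xid)
    {
        if (subtransactions_disabled)
            return xid;     /* no subxacts: xid is already the topmost */

        /*
         * Normal path: consult pg_subtrans through the SLRU buffers,
         * which takes SubtransControlLock.
         */
        return SubTransGetTopmostTransaction(xid);
    }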
Subject: Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)
From: Katsuhiko Okano
Katsuhiko Okano wrote:
> We have found the cause of the CSStorm problem discussed in the
> previous mails and worked around it with a provisional patch, so here
> is a report.
(snip)
> (A) The buffer replacement algorithm is poor. The page being swapped
>     out does not get a fresh timestamp until the swap-out completes.
>     If other pages are accessed while the swap is in progress, the
>     page being swapped out looks like the least recently used one and
>     is chosen for swap-out again (when the workload is high).

The following is the patch.

diff -cpr postgresql-8.1.4-orig/src/backend/access/transam/slru.c postgresql-8.1.4-SlruSelectLRUPage-fix/src/backend/access/transam/slru.c
*** postgresql-8.1.4-orig/src/backend/access/transam/slru.c	2006-01-21 13:38:27.000000000 +0900
--- postgresql-8.1.4-SlruSelectLRUPage-fix/src/backend/access/transam/slru.c	2006-07-25 18:02:49.000000000 +0900
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 703,710 ****
  	for (;;)
  	{
  		int			slotno;
! 		int			bestslot = 0;
  		unsigned int bestcount = 0;
  
  		/* See if page already has a buffer assigned */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
--- 703,712 ----
  	for (;;)
  	{
  		int			slotno;
! 		int			bestslot = -1;
! 		int			betterslot = -1;
  		unsigned int bestcount = 0;
+ 		unsigned int bettercount = 0;
  
  		/* See if page already has a buffer assigned */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 720,732 ****
  		 */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
  		{
! 			if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
! 				return slotno;
! 			if (shared->page_lru_count[slotno] > bestcount &&
! 				shared->page_number[slotno] != shared->latest_page_number)
! 			{
! 				bestslot = slotno;
! 				bestcount = shared->page_lru_count[slotno];
  			}
  		}
  
--- 722,746 ----
  		 */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
  		{
! 			switch (shared->page_status[slotno])
! 			{
! 				case SLRU_PAGE_EMPTY:
! 					return slotno;
! 				case SLRU_PAGE_READ_IN_PROGRESS:
! 				case SLRU_PAGE_WRITE_IN_PROGRESS:
! 					if (shared->page_lru_count[slotno] > bettercount &&
! 						shared->page_number[slotno] != shared->latest_page_number)
! 					{
! 						betterslot = slotno;
! 						bettercount = shared->page_lru_count[slotno];
! 					}
! 				default:	/* SLRU_PAGE_CLEAN, SLRU_PAGE_DIRTY */
! 					if (shared->page_lru_count[slotno] > bestcount &&
! 						shared->page_number[slotno] != shared->latest_page_number)
! 					{
! 						bestslot = slotno;
! 						bestcount = shared->page_lru_count[slotno];
! 					}
  			}
  		}
  
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 736,741 ****
--- 750,758 ----
  		if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
  			return bestslot;
  
+ 		if (bestslot == -1)
+ 			bestslot = betterslot;
+ 
  		/*
  		 * We need to do I/O.  Normal case is that we have to write it out,
  		 * but it's possible in the worst case to have selected a read-busy

Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Here is a patch to collect statistics of LWLocks.

This seems fairly invasive, as well as confused about whether it's an
#ifdef'able thing or not.  You can't have system views and pg_proc
entries conditional on a compile-time #ifdef, so in a default build
we would have a lot of nonfunctional cruft exposed to users.

Do we really need this compared to the simplistic dump-to-stderr
counting support that's in there now?  That stuff doesn't leave any
cruft behind when not enabled, and it has at least one significant
advantage over your proposal, which is that it's possible to get
per-process statistics when needed.

If I thought that average users would have a need for LWLock statistics,
I'd be more sympathetic to expending effort on a nice frontend for
viewing the statistics, but this is and always will be just a concern
for hardcore hackers ...

			regards, tom lane
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
>> (A) The buffer replacement algorithm is poor. The page being swapped
>>     out does not get a fresh timestamp until the swap-out completes.
>>     If other pages are accessed while the swap is in progress, the
>>     page being swapped out looks like the least recently used one and
>>     is chosen for swap-out again (when the workload is high).

> The following is the patch.

I'm confused ... is this patch being proposed for inclusion?  I
understood your previous message to say that it didn't help much.

The patch is buggy as posted, because it will try to do this:

	if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
		return bestslot;

while bestslot could still be -1.

I see your concern about multiple processes selecting the same buffer
for replacement, but what will actually happen is that all but the first
will block for the first one's I/O to complete using SimpleLruWaitIO,
and then all of them will repeat the outer loop and recheck what to do.
If they were all trying to swap in the same page this is actually
optimal.  If they were trying to swap in different pages then the losing
processes will again try to initiate I/O on a different buffer.  (They
will pick a different buffer, because the guy who got the buffer will
have done SlruRecentlyUsed on it before releasing the control lock ---
so I don't believe the worry that we get a buffer thrash scenario here.
Look at the callers of SlruSelectLRUPage, not just the function itself.)

It's possible that letting different processes initiate I/O on different
buffers would be a win, but it might just result in excess writes,
depending on the relative probability of requests for the same page vs.
requests for different pages.  Also, I think the patch as posted would
still cause processes to gang up on the same buffer, it would just be a
different one from before.

The right thing would be to locate the overall-oldest buffer and return
it if clean; otherwise to initiate I/O on the oldest buffer that isn't
either clean or write-busy, if there is one; otherwise just do WaitIO on
the oldest buffer.  This would ensure that different processes try to
push different buffers to disk.  They'd still go back and make their
decisions from the top after doing their I/O.  Whether this is a win or
not is not clear to me, but at least it would attack the guessed-at
problem correctly.

			regards, tom lane
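A minimal sketch of the selection policy described in the last
paragraph, written against the same 8.1 slru.c data structures that
appear in the patch above. It is an illustration of the policy, not
tested code: read-busy pages are treated as not writable, and the
SimpleLruWritePage call and its signature are assumptions that should
be checked against the source.

    /*
     * Sketch only: return the overall-oldest buffer if it is clean;
     * otherwise start I/O on the oldest buffer that is neither clean nor
     * busy; otherwise wait for the oldest buffer's I/O to finish.
     */
    for (;;)
    {
        int         slotno;
        int         bestslot = -1;      /* overall-oldest replaceable page */
        int         bestwritable = -1;  /* oldest page we could start a write on */
        unsigned int bestcount = 0;
        unsigned int bestwritecount = 0;

        for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
        {
            int     status = shared->page_status[slotno];

            if (status == SLRU_PAGE_EMPTY)
                return slotno;          /* free slot: use it immediately */
            if (shared->page_number[slotno] == shared->latest_page_number)
                continue;               /* never evict the latest page */

            if (shared->page_lru_count[slotno] > bestcount)
            {
                bestslot = slotno;
                bestcount = shared->page_lru_count[slotno];
            }
            if (status != SLRU_PAGE_CLEAN &&
                status != SLRU_PAGE_WRITE_IN_PROGRESS &&
                status != SLRU_PAGE_READ_IN_PROGRESS &&
                shared->page_lru_count[slotno] > bestwritecount)
            {
                bestwritable = slotno;  /* i.e. an oldest dirty page */
                bestwritecount = shared->page_lru_count[slotno];
            }
        }

        Assert(bestslot >= 0);          /* some non-latest page must exist */

        if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
            return bestslot;            /* oldest page is clean: evict it */

        if (bestwritable >= 0)
            SimpleLruWritePage(ctl, bestwritable, NULL);    /* push it out */
        else
            SimpleLruWaitIO(ctl, bestslot);     /* everything busy: wait */

        /* loop back and re-evaluate now that some I/O has completed */
    }

This way, concurrent processes that lose the race tend to start writes
on different dirty buffers rather than ganging up on one, which is the
behavior Tom asks for above.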
Tom Lane wrote:
> If I thought that average users would have a need for LWLock statistics,
> I'd be more sympathetic to expending effort on a nice frontend for
> viewing the statistics, but this is and always will be just a concern
> for hardcore hackers ...

That may be true of the output, but that's not a very strong argument
against making it much easier to gather and display the LWLock
statistics.  I can easily imagine the patch being a useful performance
troubleshooting tool in a high load environment.

Depends on how easy/intrusive it is to enable/use the stderr method on a
production system, though, as well as how much of a performance impact
the measurements have on overall operation...

-- 
Kevin Brown                                       kevin@sysexperts.com
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This seems fairly invasive, as well as confused about whether it's an
> #ifdef'able thing or not.  You can't have system views and pg_proc
> entries conditional on a compile-time #ifdef, so in a default build
> we would have a lot of nonfunctional cruft exposed to users.

Is it acceptable if the pg_stat_lwlocks view and the other functions are
not installed, and therefore invisible, when LWLOCK_STAT is not defined?
We don't have such a feature now, but we could.

> Do we really need this compared to the simplistic dump-to-stderr
> counting support that's in there now?  That stuff doesn't leave any
> cruft behind when not enabled, and it has at least one significant
> advantage over your proposal, which is that it's possible to get
> per-process statistics when needed.

There is one advantage in my proposal: we can easily watch the
time-varying activity of LWLocks.

Are per-process statistics actually needed?  If we use connection
pooling, we lose that advantage, and I think connection pooling is
commonly used in exactly the kind of high-load cases where we encounter
LWLock contention.

> If I thought that average users would have a need for LWLock statistics,
> I'd be more sympathetic to expending effort on a nice frontend for
> viewing the statistics, but this is and always will be just a concern
> for hardcore hackers ...

I assume the ones who analyze the output of the patch are hardcore
hackers, but the ones who actually run it and collect the information
are average users.  The dump-to-stderr method is hard to use because it
inflates the logs and requires re-parsing effort.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Subject: Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)
From: Katsuhiko Okano
Tom Lane wrote:
> I'm confused ... is this patch being proposed for inclusion?  I
> understood your previous message to say that it didn't help much.

No; the patch was only meant to help isolate where the problem is.

> The patch is buggy as posted, because it will try to do this:
>	if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
>		return bestslot;
> while bestslot could still be -1.

Understood; a check is required there.

> (They
> will pick a different buffer, because the guy who got the buffer will
> have done SlruRecentlyUsed on it before releasing the control lock ---
> so I don't believe the worry that we get a buffer thrash scenario here.
> Look at the callers of SlruSelectLRUPage, not just the function itself.)

Hmm, I will read the code again.

> otherwise to initiate I/O on the oldest buffer that isn't
> either clean or write-busy, if there is one;

That point is difficult for me to follow, but I understand it is the
important one.

--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This seems fairly invasive, as well as confused about whether it's an
>> #ifdef'able thing or not.  You can't have system views and pg_proc
>> entries conditional on a compile-time #ifdef, so in a default build
>> we would have a lot of nonfunctional cruft exposed to users.

> Is it acceptable if the pg_stat_lwlocks view and the other functions are
> not installed, and therefore invisible, when LWLOCK_STAT is not defined?
> We don't have such a feature now, but we could.

Then you'd need an initdb to go between having stats and not, which
carries its own downsides.  If I thought there were a wide market for
this then I'd say sure, let's just have it there ... but I don't.

I think the actual wave of the future for analyzing behavior at the
LWLock level is going to be DTrace.  It seems way more flexible than
an aggregate-statistics view can be.

			regards, tom lane
Tom Lane wrote:
> I think the actual wave of the future for analyzing behavior at the
> LWLock level is going to be DTrace.  It seems way more flexible than
> an aggregate-statistics view can be.

CVS head now has the following LWLock probes, and more can easily be
added.  These probes can be enabled using the sample DTrace scripts at
http://pgfoundry.org/projects/dtrace/

lwlock-acquire
lwlock-release
lwlock-startwait
lwlock-endwait
lwlock-condacquire
lwlock-condacquire-fail

If anyone wants to have access to a Solaris machine to play with DTrace,
let me know.

Regards,
-Robert
Robert Lor <Robert.Lor@Sun.COM> wrote:
> CVS head now has the following LWLock probes, and more can easily be
> added.  These probes can be enabled using the sample DTrace scripts at
> http://pgfoundry.org/projects/dtrace/
>
> lwlock-acquire
> lwlock-release
> lwlock-startwait
> lwlock-endwait
> lwlock-condacquire
> lwlock-condacquire-fail
>
> If anyone wants to have access to a Solaris machine to play with DTrace,
> let me know.

I grant that DTrace is a powerful tool, but can we gather LWLock
statistics well with it?  There is a "color coding" problem in both the
present dump-to-stderr method and the DTrace method: I assume we want to
gather the statistics per resource (represented by LWLockKind in my
patch), not per LWLockId.  Even if we use DTrace, don't we need some
support for this coloring of lwlocks?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I assume we want to gather the statistics per resource (represented by
> LWLockKind in my patch), not per LWLockId.

Why?  I haven't yet seen any problem where that looked useful.  The
named LWLocks are certainly sui generis, and as for things like
per-buffer locks I haven't seen any need for aggregate statistics ---
I'd rather know about "hot spots" if there are any buffers that are not
like the rest.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
> > postpones it.
>
> Can you provide a reproducible test case for this?

This is a reproducible test case:

- Occurs on both 8.1.4 and HEAD.
- On an SMP machine. I used dual Opterons; the CS storm becomes worse on
  a dual Xeon with hyper-threading.
- Tuning parameters are default. The whole data set is cached in shared
  buffers (shared_buffers=32MB; the pgbench data at scale=1 is less than
  15MB).
- Using custom pgbench scripts: one client doing UPDATEs with index
  scans, and multiple clients doing SELECTs with seqscans/indexscans.

$ pgbench -i
$ pgbench -n -c 1 -t 100000 -f cs_update.sql &
$ pgbench -n -c 50 -t 100000 -f cs_indexscan.sql &
$ pgbench -n -c 35 -t 100000 -f cs_seqscan.sql &

(The scripts are attached at the end of this message.)

With the above workload, context switches are 2000-10000/sec and CPU
usage is user=100%. Then start a long open transaction on another
connection:

$ psql
# begin;  -- Long open transaction

After a lapse of 30-60 seconds, context switches exceed 50000/sec
(120000/sec on the Xeons) and CPU usage is user=66% / sys=21% /
idle=13%. If we increase the frequency of UPDATEs, the delay before
onset becomes shorter.

This is an artificial workload, but I can see the same condition in
TPC-W, even though it is a benchmark. TPC-W requires full-text search,
which is implemented as "LIKE %foo%" in my implementation (and in DBT-1,
too). It also requires periodic aggregations. Both of these can behave
as long transactions.

The cause seems to be lock contention. The number of locks on
SubtransControlLock and SubTransBuffer increases dramatically compared
with BufMappingLock:

# Before starting a long transaction
 kind | lwlock              | sh_call  | sh_wait | ex_call | ex_wait
------+---------------------+----------+---------+---------+---------
   13 | SubtransControlLock |    28716 |       2 |      54 |       0
   22 | BufMappingLock      | 11637884 |       0 |    2492 |       0
   27 | SubTransBuffer      |        0 |       0 |      11 |       0

# After
 kind | lwlock              | sh_call  | sh_wait | ex_call | ex_wait
------+---------------------+----------+---------+---------+---------
   13 | SubtransControlLock |  4139111 |   65059 | 3926691 |  390838
   22 | BufMappingLock      | 32348073 |       0 |    2509 |       0
   27 | SubTransBuffer      |   939646 |  960341 | 1419152 |      61

The callers into the SubTrans module are the two
SubTransGetTopmostTransaction() calls in HeapTupleSatisfiesSnapshot().
When I disabled those calls, the CSStorm did not occur.
SubTransGetTopmostTransaction returns its argument unchanged when we
don't use SAVEPOINTs. If we optimize for the no-subtransaction case, we
can avoid locking SubTrans to check the visibility of tuples inserted by
top-level transactions.

If we want to resolve the problem fundamentally, we might have to
improve SubTrans with a better buffer management algorithm or the like.
Do you have any ideas for avoiding this problem?

-- cs_update.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
\setrandom delta -5000 5000
UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT pg_sleep(0.1);

-- cs_seqscan.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid::int8 = :aid; -- cast to force seqscan

-- cs_indexscan.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid = :aid;

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> The callers into the SubTrans module are the two
> SubTransGetTopmostTransaction() calls in HeapTupleSatisfiesSnapshot().
> When I disabled those calls, the CSStorm did not occur.
> SubTransGetTopmostTransaction returns its argument unchanged when we
> don't use SAVEPOINTs. If we optimize for the no-subtransaction case, we
> can avoid locking SubTrans to check the visibility of tuples inserted
> by top-level transactions.

Only for top transactions still in progress, so I doubt that would help
much.

I'm wondering about doing something similar to what
TransactionIdIsInProgress does, ie, make use of the PGPROC lists of live
subtransactions.  Suppose that GetSnapshotData copies not only top xids
but live subxids into the snapshot, and adds a flag indicating whether
the subxids are complete (ie, none of the subxid lists have overflowed).
Then if the flag is set, tqual.c doesn't need to do
SubTransGetTopmostTransaction() before searching the list.

			regards, tom lane
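A rough sketch, in C, of how the snapshot-side check Tom describes could
look. The field names subxip, subxcnt, and suboverflowed are
illustrative and may not match the names the eventual patch used, and
the xmin/xmax handling of the real tqual.c code is omitted:

    /*
     * Sketch only: decide whether xid was still running according to the
     * snapshot, consulting pg_subtrans only when we must.
     */
    static bool
    XidInSnapshot(TransactionId xid, Snapshot snapshot)
    {
        int     i;

        if (snapshot->suboverflowed)
        {
            /*
             * Some backend's subxid cache overflowed when the snapshot was
             * taken, so the subxip list may be incomplete: fall back to the
             * pg_subtrans lookup and compare against the top xids only.
             */
            xid = SubTransGetTopmostTransaction(xid);
        }
        else
        {
            /* The complete list of running subxacts is in the snapshot. */
            for (i = 0; i < snapshot->subxcnt; i++)
            {
                if (TransactionIdEquals(xid, snapshot->subxip[i]))
                    return true;
            }
        }

        for (i = 0; i < snapshot->xcnt; i++)
        {
            if (TransactionIdEquals(xid, snapshot->xip[i]))
                return true;
        }
        return false;
    }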
Tom Lane wrote:
> I'm wondering about doing something similar to what
> TransactionIdIsInProgress does, ie, make use of the PGPROC lists
> of live subtransactions.  Suppose that GetSnapshotData copies not
> only top xids but live subxids into the snapshot, and adds a flag
> indicating whether the subxids are complete (ie, none of the subxid
> lists have overflowed).  Then if the flag is set, tqual.c doesn't
> need to do SubTransGetTopmostTransaction() before searching the
> list.

Well, that sounds awfully more expensive than setting
local-to-my-database Xmins as well as global (all databases) Xmins :-)

On the other hand, ISTM as soon as one cache overflows, you have to go
check pg_subtrans, which means the entire optimization buys nothing in
that case.  It would be nice if the optimization degraded more
gracefully.  I don't have any concrete suggestion though.  The changes
proposed in the other CS-storm thread by the NTT person may help.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I'm wondering about doing something similar to what
>> TransactionIdIsInProgress does, ie, make use of the PGPROC lists
>> of live subtransactions.

> Well, that sounds awfully more expensive than setting
> local-to-my-database Xmins as well as global (all databases) Xmins :-)

Only when you've got a lot of subtransactions.  The point of this
discussion is to optimize for the few-or-none case.  In any case,
the problem with the local/global bit was that you'd be expending
foreground-process cycles without any foreground-process return.
Being able to use a snapshot without consulting pg_subtrans will
certainly buy back some cycles...

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> I'm wondering about doing something similar to what
> >> TransactionIdIsInProgress does, ie, make use of the PGPROC lists
> >> of live subtransactions.
>
> > Well, that sounds awfully more expensive than setting
> > local-to-my-database Xmins as well as global (all databases) Xmins :-)
>
> Only when you've got a lot of subtransactions.  The point of this
> discussion is to optimize for the few-or-none case.  In any case,
> the problem with the local/global bit was that you'd be expending
> foreground-process cycles without any foreground-process return.
> Being able to use a snapshot without consulting pg_subtrans will
> certainly buy back some cycles...

I can buy that.  Some idle thoughts:

I was thinking about what time is the most appropriate to insert or
remove an Xid from the cache.  We can do without any excl-locking
because 1) we already assume the storing of an Xid to be atomic, and
2) no one can be interested in querying for an Xid before the
originating transaction has had the chance to write a tuple with that
Xid anyway.

OTOH I think we only need to store live Xids and those finished ones
that wrote a WAL record; we can drop subaborted and subcommitted ones if
they didn't.

On the third hand, are we going to sh-acquire the ProcArray lock while
GetSnapshotData copies all the subxact Xids of all running transactions?
ProcArrayLock will become more of a contention point than it already is.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I was thinking about what time is the most appropriate to insert or
> remove an Xid from the cache.  We can do without any excl-locking
> because 1) we already assume the storing of an Xid to be atomic, and
> 2) no one can be interested in querying for an Xid before the
> originating transaction has had the chance to write a tuple with that
> Xid anyway.

Actually ... that fails if GetSnapshotData is going to copy subtrans
XIDs.  So this area needs more thought.

> On the third hand, are we going to sh-acquire the ProcArray lock while
> GetSnapshotData copies all the subxact Xids of all running transactions?
> ProcArrayLock will become more of a contention point than it already is.

Yeah, but sharelock is better than exclusive lock ...

			regards, tom lane
This is additional information.

I wrote:
> If we want to resolve the problem fundamentally, we might have to
> improve SubTrans with a better buffer management algorithm or the like.

The above is probably wrong. I checked each lwlock of pg_subtrans's
buffers. All of the lwlocks are acquired uniformly and I could not see
any differences among buffers. So the cause seems not to be the buffer
management algorithm, but simply a lack of SLRU buffer pages.

NUM_SUBTRANS_BUFFERS is defined as 32 in HEAD. If we increase it, we can
avoid the old-transaction problem for a certain time. However, it
doesn't help much under high load -- for example, with a workload of
2000 tps, we will use up 1000 pg_subtrans pages in 15 minutes. I suppose
that is not enough for a mix of online and batch/maintenance work.

Also, the simple scanning in SLRU will likely cause another performance
issue if we greatly increase the number of buffers. SLRU uses a
sequential scan over its buffers, so it will not work well with many of
them.

I hope for some help in an upper layer -- snapshots, hint bits, or
something along the lines being discussed in the recent thread.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Is there anything to do for 8.2 here?

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> This is additional information.
>
> I wrote:
> > If we want to resolve the problem fundamentally, we might have to
> > improve SubTrans with a better buffer management algorithm or the like.
>
> The above is probably wrong. I checked each lwlock of pg_subtrans's
> buffers. All of the lwlocks are acquired uniformly and I could not see
> any differences among buffers. So the cause seems not to be the buffer
> management algorithm, but simply a lack of SLRU buffer pages.
>
> NUM_SUBTRANS_BUFFERS is defined as 32 in HEAD. If we increase it, we can
> avoid the old-transaction problem for a certain time. However, it
> doesn't help much under high load -- for example, with a workload of
> 2000 tps, we will use up 1000 pg_subtrans pages in 15 minutes. I suppose
> that is not enough for a mix of online and batch/maintenance work.
>
> Also, the simple scanning in SLRU will likely cause another performance
> issue if we greatly increase the number of buffers. SLRU uses a
> sequential scan over its buffers, so it will not work well with many of
> them.
>
> I hope for some help in an upper layer -- snapshots, hint bits, or
> something along the lines being discussed in the recent thread.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center

-- 
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> wrote:
> Is there anything to do for 8.2 here?

I'm working on Tom's idea. It is not a feature and does not change the
on-disk structures, so I hope it can meet the 8.2 deadline...

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm wondering about doing something similar to what
> TransactionIdIsInProgress does, ie, make use of the PGPROC lists
> of live subtransactions.  Suppose that GetSnapshotData copies not
> only top xids but live subxids into the snapshot, and adds a flag
> indicating whether the subxids are complete (ie, none of the subxid
> lists have overflowed).  Then if the flag is set, tqual.c doesn't
> need to do SubTransGetTopmostTransaction() before searching the
> list.

I added a subxid array to Snapshot; the running subxids are gathered
from the PGPROC->subxids caches. There are two overflowed cases: any of
the PGPROC->subxids caches has overflowed, or the number of total
subxids exceeds the pre-allocated buffer. If we have overflowed, we
cannot avoid calling SubTransGetTopmostTransaction.

I tested the patch and did not see any context switch storm coming from
pg_subtrans, but there may be bugs in the visibility checking. It would
be very nice if you could review or test the patch.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
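A sketch of the gathering side Itagaki describes, to make the "two
overflowed cases" concrete. The PGPROC subxid-cache field names follow
what is visible elsewhere in this thread (subxids, nxids, overflowed),
but treat them, the snapshot field names, and the hypothetical
maxsubxcnt field as illustrative rather than the exact code in the
patch:

    /*
     * Sketch only: collect running subxids from every PGPROC into the
     * snapshot.  The caller is assumed to hold ProcArrayLock in shared
     * mode.  "maxsubxcnt" is a hypothetical field recording how many
     * entries were pre-allocated for snapshot->subxip.
     */
    static void
    CollectRunningSubXids(ProcArrayStruct *arrayP, Snapshot snapshot)
    {
        int     index;

        snapshot->subxcnt = 0;
        snapshot->suboverflowed = false;

        for (index = 0; index < arrayP->numProcs; index++)
        {
            PGPROC *proc = arrayP->procs[index];
            int     nsub = proc->subxids.nxids;

            if (proc->subxids.overflowed)
                snapshot->suboverflowed = true;     /* case 1: backend cache overflowed */
            else if (snapshot->subxcnt + nsub > snapshot->maxsubxcnt)
                snapshot->suboverflowed = true;     /* case 2: snapshot array is full */
            else if (nsub > 0)
            {
                memcpy(snapshot->subxip + snapshot->subxcnt,
                       (void *) proc->subxids.xids,
                       nsub * sizeof(TransactionId));
                snapshot->subxcnt += nsub;
            }
        }
    }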
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I added a subxid array to Snapshot; the running subxids are gathered
> from the PGPROC->subxids caches. There are two overflowed cases: any of
> the PGPROC->subxids caches has overflowed, or the number of total
> subxids exceeds the pre-allocated buffer. If we have overflowed, we
> cannot avoid calling SubTransGetTopmostTransaction.

Applied after some editorialization (you really need to pay more
attention to keeping comments in sync with code ;-))

I cannot measure any consistent speed difference in plain pgbench
scenarios with and without the patch, so at least as a rough
approximation the extra cycles in GetSnapshotData aren't hurting.
And I confirm that the test case you posted before no longer exhibits
a context-swap storm.

This change makes it even more obvious than before that we really want
to stay out of the subxids-overflowed regime.  I don't especially want
to make those arrays even bigger, but I wonder if there isn't more we
can do to use them efficiently.  In particular, it seems like in the
case where RecordSubTransactionCommit detects that the subxact has not
stored its XID anywhere, we could immediately remove the XID from the
PGPROC array, just as if it had aborted.  This would avoid chewing
subxid slots for cases such as exception blocks in plpgsql that are not
modifying the database, but just catching computational errors.

Comments?

			regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> OTOH I think we only need to store live Xids and those finished ones
> that wrote a WAL record; we can drop subaborted and subcommitted ones
> if they didn't.

While reviewing this thread, I see Alvaro already had the idea I just
came to...

			regards, tom lane
I wrote:
> ... it seems like in the
> case where RecordSubTransactionCommit detects that the subxact has not
> stored its XID anywhere, we could immediately remove the XID from
> the PGPROC array, just as if it had aborted.  This would avoid chewing
> subxid slots for cases such as exception blocks in plpgsql that are
> not modifying the database, but just catching computational errors.

(and later realized that Alvaro had had the same idea awhile back, but
I don't have his message at hand).

I looked into this a bit more; it seems like basically it should only
take addition of

	else
		XidCacheRemoveRunningXids(xid, 0, NULL);

to the bottom of RecordSubTransactionCommit(), plus suitable adjustment
of the comments in both routines.  However, there's a problem: if we
delete a second-level subxact's XID from PGPROC, and later its upper
subtransaction aborts, XidCacheRemoveRunningXids will emit scary
warnings when it doesn't find the sub-subxact in PGPROC.  This could
doubtless be fixed with sufficient jiggery-pokery --- simply removing
the debug warnings would be a brute-force answer, but I'd like to find
something a bit less brute-force.  Maybe drop the sub-subxact from its
parent's list immediately, instead of carrying it forward?

Anyway, given that there's this one nonobvious gotcha, there might be
others.  My recommendation is that we take this off the open-items list
for 8.2 and revisit it in the 8.3 cycle when there's more time.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Anyway, given that there's this one nonobvious gotcha, there might be
> others.  My recommendation is that we take this off the open-items list
> for 8.2 and revisit it in the 8.3 cycle when there's more time.

I wonder if Theo's recent reported problem with 4.3M child xids changes
the calculus on this.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Anyway, given that there's this one nonobvious gotcha, there might be
>> others.  My recommendation is that we take this off the open-items list
>> for 8.2 and revisit it in the 8.3 cycle when there's more time.

> I wonder if Theo's recent reported problem with 4.3M child xids changes
> the calculus on this.

Yeah, I was just looking at that.  Removing useless entries from the
child-xid list would presumably help him.  Considering we're not even
formally in beta yet, I'm probably being too conservative to recommend
we not touch it.

			regards, tom lane
I wrote:
> Yeah, I was just looking at that.  Removing useless entries from the
> child-xid list would presumably help him.  Considering we're not even
> formally in beta yet, I'm probably being too conservative to recommend
> we not touch it.

Actually ... wait a minute.  We do not assign an XID to a subtransaction
at all unless it writes a tuple to disk (see GetCurrentTransactionId and
its callers).  So this whole "optimization" idea is redundant.

I see a bug though, which is that RecordSubTransactionAbort() calls
GetCurrentTransactionId() before having verified that it needs to do
anything.  This means that we'll generate and then discard an XID
uselessly in a failed subxact that didn't touch disk.  Worth fixing, but
it doesn't look like this is Theo's problem.

Unless I'm missing something, Theo's problem must involve having done
tuple updates in 4.6M different subtransactions.

			regards, tom lane
I wrote:
> I see a bug though, which is that RecordSubTransactionAbort() calls
> GetCurrentTransactionId() before having verified that it needs to do
> anything.  This means that we'll generate and then discard an XID
> uselessly in a failed subxact that didn't touch disk.

Well, it would be a bug except that RecordSubTransactionAbort isn't
called unless the current subxact has an XID.  Perhaps a comment would
be appropriate, but there's nothing to fix here.

I think Theo's problem is probably somewhere else, too --- apparently
it's not so much that TransactionIdIsCurrentTransactionId takes a long
time as that something is calling it lots of times with no check for
interrupt.

			regards, tom lane
Tom Lane wrote:
> I wrote:
> > I see a bug though, which is that RecordSubTransactionAbort() calls
> > GetCurrentTransactionId() before having verified that it needs to do
> > anything.  This means that we'll generate and then discard an XID
> > uselessly in a failed subxact that didn't touch disk.
>
> Well, it would be a bug except that RecordSubTransactionAbort isn't
> called unless the current subxact has an XID.  Perhaps a comment would
> be appropriate, but there's nothing to fix here.
>
> I think Theo's problem is probably somewhere else, too --- apparently
> it's not so much that TransactionIdIsCurrentTransactionId takes a long
> time as that something is calling it lots of times with no check for
> interrupt.

Could it be something like heap_lock_tuple?  It calls MultiXactIdWait,
which calls GetMultiXactIdMembers and TransactionIdIsCurrentTransactionId
on each member.  (heap_update and heap_delete do the same thing.)  I
must admit I didn't read Theo's description of his scenario, though.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I think Theo's problem is probably somewhere else, too --- apparently
>> it's not so much that TransactionIdIsCurrentTransactionId takes a long
>> time as that something is calling it lots of times with no check for
>> interrupt.

> Could it be something like heap_lock_tuple?  It calls MultiXactIdWait,
> which calls GetMultiXactIdMembers and TransactionIdIsCurrentTransactionId
> on each member.  (heap_update and heap_delete do the same thing.)  I
> must admit I didn't read Theo's description of his scenario, though.

He shows HeapTupleSatisfiesSnapshot as the next thing down the call
stack, so those scenarios don't seem quite right.  I'm wondering about a
CHECK_FOR_INTERRUPTS-free loop in either plperl or trigger handling,
myself.

Anyway, I was thinking some more about Theo's original suggestion that
the linked-list representation of childXids was too inefficient.  I'm
disinclined to use a hash as he suggests, but it strikes me that we
could very easily change the list into a dynamically extended array ---
and because the entries are surely added in increasing XID order, such
an array could be binary-searched.  This wouldn't be a win for very
small numbers of child XIDs, but for large numbers it would.

OTOH, there are probably enough other inefficiencies in handling large
numbers of subxact XIDs that speeding up
TransactionIdIsCurrentTransactionId might be a useless exercise.  It
would be good to profile a test case before spending much effort here.

			regards, tom lane
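For illustration, a small sketch of the dynamically extended,
binary-searchable array Tom describes; the type and function names here
are made up for the example and are not the xact.c code:

    /* Sketch only: a sorted, growable array of child XIDs. */
    typedef struct
    {
        TransactionId *xids;        /* ascending order */
        int            nxids;
        int            maxxids;
    } ChildXidArray;

    static void
    child_xid_append(ChildXidArray *a, TransactionId xid)
    {
        if (a->nxids >= a->maxxids)
        {
            int     newmax = (a->maxxids > 0) ? a->maxxids * 2 : 8;

            a->xids = (a->xids != NULL)
                ? repalloc(a->xids, newmax * sizeof(TransactionId))
                : MemoryContextAlloc(TopTransactionContext,
                                     newmax * sizeof(TransactionId));
            a->maxxids = newmax;
        }
        /* children subcommit in increasing XID order, so appending keeps
         * the array sorted */
        a->xids[a->nxids++] = xid;
    }

    static bool
    child_xid_member(const ChildXidArray *a, TransactionId xid)
    {
        int     lo = 0;
        int     hi = a->nxids - 1;

        while (lo <= hi)
        {
            int     mid = (lo + hi) / 2;

            if (TransactionIdEquals(a->xids[mid], xid))
                return true;
            if (TransactionIdPrecedes(xid, a->xids[mid]))
                hi = mid - 1;
            else
                lo = mid + 1;
        }
        return false;
    }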
Tom Lane <tgl@sss.pgh.pa.us> writes:
> --- and because the entries are surely added in increasing XID order,
> such an array could be binary-searched.

If they're only added if they write to disk then isn't it possible to
add them out of order?  Start a child transaction, start a child of that
one and write to disk, then exit the grandchild and write to disk in the
first child?

I'm just going on your description, I'm not familiar with this part of
the code at all.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> --- and because the entries are surely added in increasing XID order,
>> such an array could be binary-searched.

> If they're only added if they write to disk then isn't it possible to
> add them out of order?  Start a child transaction, start a child of
> that one and write to disk, then exit the grandchild and write to disk
> in the first child?

No, because we enforce child XID > parent XID.  In the case above, the
child xact would be given an XID when the grandchild needs one --- see
recursion in AssignSubTransactionId().

The actually slightly shaky assumption above is that children of the
same parent xact must subcommit in numerical order ... but as long as we
have strict nesting of subxacts I think this must be so.

			regards, tom lane
On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:

> Anyway, given that there's this one nonobvious gotcha, there might be
> others.  My recommendation is that we take this off the open-items list
> for 8.2 and revisit it in the 8.3 cycle when there's more time.

Well, it's still 8.3, just...

As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
and Scalability", XidCacheRemoveRunningXids() is now the only holder of
an X lock during normal processing, so I would like to remove it.
Here's how:

Currently, we take the lock, remove the subxact and then shuffle down
all the other subxactIds so that the subxact cache is contiguous.

I propose that we simply zero out the subxact entry without re-arranging
the cache; this will be atomic, so we need not acquire an X lock.  We
then increment ndeletedxids.  When we enter a new subxact into the
cache, if ndeletedxids > 0 we scan the cache to find an
InvalidTransactionId that we can use, then decrement ndeletedxids.  So
ndeletedxids is just a hint, not an absolute requirement.  nxids then
becomes the number of cache entries and never goes down until EOXact.
The subxact cache is no longer in order, but then it doesn't need to be
either.

When we take a snapshot we will end up taking a copy of zeroed cache
entries, so the snapshots will be slightly larger than previously,
though still no larger than the max.  The size reduction was not so
large as to make a significant difference across the whole array, so
scalability is the main issue to resolve.

The snapshots will be valid with no change, since InvalidTransactionId
will never match against any recorded Xid.

I would also like to make the size of the subxact cache configurable
with a parameter such as subtransaction_cache_size = 64 (default), valid
range 4-256.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
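A minimal sketch of the cache manipulation Simon proposes, purely as an
illustration of the idea: the structure fields follow the PGPROC subxid
cache seen earlier in the thread, ndeletedxids is his proposed new
field, and the functions shown are not existing code.

    /* On subxact abort: invalidate the slot in place -- no shuffling,
     * no exclusive ProcArrayLock, just an atomic store. */
    static void
    subxid_cache_remove(PGPROC *proc, TransactionId xid)
    {
        int     i;

        for (i = 0; i < proc->subxids.nxids; i++)
        {
            if (TransactionIdEquals(proc->subxids.xids[i], xid))
            {
                proc->subxids.xids[i] = InvalidTransactionId;
                proc->subxids.ndeletedxids++;   /* only a hint */
                return;
            }
        }
    }

    /* On assigning a new subxact XID: reuse a hole if the hint says one
     * may exist, otherwise append; nxids never shrinks before EOXact. */
    static void
    subxid_cache_add(PGPROC *proc, TransactionId xid)
    {
        int     i;

        if (proc->subxids.ndeletedxids > 0)
        {
            for (i = 0; i < proc->subxids.nxids; i++)
            {
                if (!TransactionIdIsValid(proc->subxids.xids[i]))
                {
                    proc->subxids.xids[i] = xid;
                    proc->subxids.ndeletedxids--;
                    return;
                }
            }
        }
        if (proc->subxids.nxids < PGPROC_MAX_CACHED_SUBXIDS)
            proc->subxids.xids[proc->subxids.nxids++] = xid;
        else
            proc->subxids.overflowed = true;    /* fall back to pg_subtrans */
    }

Snapshot readers then simply copy or skip the InvalidTransactionId
entries, which can never match a real XID.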
Simon Riggs <simon@2ndquadrant.com> writes:
> As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
> and Scalability", XidCacheRemoveRunningXids() is now the only holder of
> an X lock during normal processing,

Nonsense.  Main transaction exit also takes an exclusive lock, and is
far more likely to be exercised in typical workloads than a
subtransaction abort.

In any case: there has still not been any evidence presented by anyone
that optimizing XidCacheRemoveRunningXids will help one bit.  Given the
difficulty of measuring any benefit from the last couple of
optimizations in this general area, I'm thinking that such evidence will
be hard to come by.  And we have got way more than enough on our plates
already.  Can we let go of this for 8.3, please?

			regards, tom lane
On Tue, 2007-09-11 at 09:58 -0400, Tom Lane wrote:
> Can we let go of this for 8.3, please?

OK, we've moved forward, so it's a good place to break.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
Is this a TODO?  Tom's reply was:

> Nonsense.  Main transaction exit also takes an exclusive lock, and is
> far more likely to be exercised in typical workloads than a
> subtransaction abort.
>
> In any case: there has still not been any evidence presented by anyone
> that optimizing XidCacheRemoveRunningXids will help one bit.  Given the
> difficulty of measuring any benefit from the last couple of
> optimizations in this general area, I'm thinking that such evidence
> will be hard to come by.  And we have got way more than enough on our
> plates already.  Can we let go of this for 8.3, please?

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:
>
> > Anyway, given that there's this one nonobvious gotcha, there might be
> > others.  My recommendation is that we take this off the open-items
> > list for 8.2 and revisit it in the 8.3 cycle when there's more time.
>
> Well, it's still 8.3, just...
>
> As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
> and Scalability", XidCacheRemoveRunningXids() is now the only holder of
> an X lock during normal processing, so I would like to remove it.
> Here's how:
>
> Currently, we take the lock, remove the subxact and then shuffle down
> all the other subxactIds so that the subxact cache is contiguous.
>
> I propose that we simply zero out the subxact entry without
> re-arranging the cache; this will be atomic, so we need not acquire an
> X lock.  We then increment ndeletedxids.  When we enter a new subxact
> into the cache, if ndeletedxids > 0 we scan the cache to find an
> InvalidTransactionId that we can use, then decrement ndeletedxids.  So
> ndeletedxids is just a hint, not an absolute requirement.  nxids then
> becomes the number of cache entries and never goes down until EOXact.
> The subxact cache is no longer in order, but then it doesn't need to
> be either.
>
> When we take a snapshot we will end up taking a copy of zeroed cache
> entries, so the snapshots will be slightly larger than previously,
> though still no larger than the max.  The size reduction was not so
> large as to make a significant difference across the whole array, so
> scalability is the main issue to resolve.
>
> The snapshots will be valid with no change, since InvalidTransactionId
> will never match against any recorded Xid.
>
> I would also like to make the size of the subxact cache configurable
> with a parameter such as subtransaction_cache_size = 64 (default),
> valid range 4-256.
>
> -- 
>   Simon Riggs
>   2ndQuadrant  http://www.2ndQuadrant.com

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On Wed, 2008-03-12 at 20:13 -0400, Bruce Momjian wrote:
> Is this a TODO?  Tom's reply was:

The general topic, yes. The caveats still apply.

> > Nonsense.  Main transaction exit also takes an exclusive lock, and is
> > far more likely to be exercised in typical workloads than a
> > subtransaction abort.
> >
> > In any case: there has still not been any evidence presented by anyone
> > that optimizing XidCacheRemoveRunningXids will help one bit.  Given the
> > difficulty of measuring any benefit from the last couple of
> > optimizations in this general area, I'm thinking that such evidence
> > will be hard to come by.  And we have got way more than enough on our
> > plates already.  Can we let go of this for 8.3, please?
>
> ---------------------------------------------------------------------------
>
> Simon Riggs wrote:
> > On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:
> >
> > > Anyway, given that there's this one nonobvious gotcha, there might
> > > be others.  My recommendation is that we take this off the
> > > open-items list for 8.2 and revisit it in the 8.3 cycle when
> > > there's more time.
> >
> > Well, it's still 8.3, just...
> >
> > As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
> > and Scalability", XidCacheRemoveRunningXids() is now the only holder
> > of an X lock during normal processing, so I would like to remove it.
> > Here's how:
> >
> > Currently, we take the lock, remove the subxact and then shuffle down
> > all the other subxactIds so that the subxact cache is contiguous.
> >
> > I propose that we simply zero out the subxact entry without
> > re-arranging the cache; this will be atomic, so we need not acquire
> > an X lock.  We then increment ndeletedxids.  When we enter a new
> > subxact into the cache, if ndeletedxids > 0 we scan the cache to find
> > an InvalidTransactionId that we can use, then decrement ndeletedxids.
> > So ndeletedxids is just a hint, not an absolute requirement.  nxids
> > then becomes the number of cache entries and never goes down until
> > EOXact.  The subxact cache is no longer in order, but then it doesn't
> > need to be either.
> >
> > When we take a snapshot we will end up taking a copy of zeroed cache
> > entries, so the snapshots will be slightly larger than previously,
> > though still no larger than the max.  The size reduction was not so
> > large as to make a significant difference across the whole array, so
> > scalability is the main issue to resolve.
> >
> > The snapshots will be valid with no change, since
> > InvalidTransactionId will never match against any recorded Xid.
> >
> > I would also like to make the size of the subxact cache configurable
> > with a parameter such as subtransaction_cache_size = 64 (default),
> > valid range 4-256.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk