Thread: poor performance with Context Switch Storm at TPC-W.
Hi, all.

One of my customers has run into poor performance caused by a
context switch storm (CSStorm) with the configuration below.
Normally CS is about 5000 and WIPS = 360; when the storm occurs,
CS is about 100000 and WIPS drops to 60 or less.
(WIPS = number of web interactions per second.)

We are investigating with a patch that collects LWLock statistics.
I suspected contention on BufMappingLock, but in the collected results
the occurrence of the CSStorm does not correspond to an increase in
BufMappingLock counts. Instead, SubtransControlLock and SubTrans counts
were increasing. I do not yet understand what causes the CSStorm.

[DB server] * 1
  Intel Xeon 3.0GHz * 4 (2 CPUs, H/T on)
  4GB memory
  Red Hat Enterprise Linux ES release 4 (Nahant Update 3)
  Linux version 2.6.9-34.ELsmp
  PostgreSQL 8.1.3 (version 8.2, HEAD as of 6/15, shows the same behavior)
  shared_buffers = 131072
  temp_buffers = 1000
  max_connections = 300
[AP server] * 2
  200 pooled connections, TPC-W model workload
[Client] * 4
  TPC-W model workload

(1) I have read the following discussion:
http://archives.postgresql.org/pgsql-hackers/2006-05/msg01003.php
  From: Tom Lane <tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us>
  To: josh ( at ) agliodbs ( dot ) com
  Subject: Re: Further reduction of bufmgr lock contention
  Date: Wed, 24 May 2006 15:25:26 -0400
If there is a patch or an investigation technique for this, could
someone point me to it?

(2) Many sequential scans seem to run while the CSStorm occurs.
When reading tuples, the visibility check appears to consult the
subtransaction machinery for every transaction. How likely is it
that LWLock contention on the subtransaction structures is the
cause of the CSStorm?

best regards.

--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
"Katsuhiko Okano" <okano.katsuhiko@oss.ntt.co.jp> wrote > > The problem has occurred in my customer. > poor performance with Context Switch Storm occurred > with the following composition. > Usually, CS is about 5000, WIPS=360. > when CSStorm occurrence, CS is about 100000, WIPS=60 or less. > > Intel Xeon 3.0GHz*4(2CPU * H/T ON) > 4GB Memory Do you have bgwriter on and what's the parameters? I read a theory somewhere that bgwriter scan a large portion of memory and cause L1/L2 thrushing, so with HT on, the other backends sharing the physical processor with it also get thrashed ... So try to turn bgwriter off or turn HT off see what's the difference. Regards, Qingqing
Katsuhiko Okano wrote:

> I suspected contention on BufMappingLock, but in the collected results
> the occurrence of the CSStorm does not correspond to an increase in
> BufMappingLock counts. Instead, SubtransControlLock and SubTrans counts
> were increasing. I do not yet understand what causes the CSStorm.

Please see this thread:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg01547.php
(actually it's a single message AFAICT)

This was applied on the 8.2dev code, so I'm surprised that 8.2dev
behaves the same as 8.1.  Does your problem have any relationship to
what's described there?

I also wondered whether the problem may be that the number of SLRU
buffers we use for subtrans is too low.  But the number was increased
from the default 8 to 32 in 8.2dev as well.  Maybe you could try
increasing that even further, say to 128, and see if the problem is
still there.  (src/include/access/subtrans.h, NUM_SUBTRANS_BUFFERS)

-- 
Alvaro Herrera                    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
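For reference, the experiment Alvaro suggests is just a compile-time
constant change followed by a rebuild. A minimal sketch against the
header he names; the surrounding comment text in subtrans.h may differ
from what is shown here:

    /* src/include/access/subtrans.h */
    /* Number of SLRU buffers to use for subtrans (8 in 8.1, 32 in 8.2dev) */
    #define NUM_SUBTRANS_BUFFERS    128     /* experimental value */

After editing the header, a full rebuild and server restart are needed,
since the value is compiled into the backend.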
Katsuhiko,

Have you tried turning HT off?  HT is not generally considered (even by
Intel) a good idea for database applications.

--Josh
hello.

> Do you have bgwriter on, and what are its parameters? I read a theory
> somewhere that the bgwriter scans a large portion of memory and causes
> L1/L2 thrashing, so with HT on, the other backends sharing the physical
> processor with it also get thrashed ... So try turning bgwriter off, or
> turning HT off, and see what difference that makes.

bgwriter is ON. From postgresql.conf:

> # - Background writer -
>
> bgwriter_delay = 200            # 10-10000 milliseconds between rounds
> bgwriter_lru_percent = 1.0      # 0-100% of LRU buffers scanned/round
> bgwriter_lru_maxpages = 5       # 0-1000 buffers max written/round
> bgwriter_all_percent = 0.333    # 0-100% of all buffers scanned/round
> bgwriter_all_maxpages = 5       # 0-1000 buffers max written/round

I also tried turning H/T off, but the CSStorm still occurred.
Normally CS is about 5000; when the storm occurs, CS is about 70000.
(CS is lower than with H/T on; I think that is simply because CPU
throughput dropped.)

Regards
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Hi.

Alvaro Herrera wrote:
> Katsuhiko Okano wrote:
>
> > I suspected contention on BufMappingLock, but in the collected results
> > the occurrence of the CSStorm does not correspond to an increase in
> > BufMappingLock counts. Instead, SubtransControlLock and SubTrans
> > counts were increasing. I do not yet understand what causes the
> > CSStorm.
>
> Please see this thread:
> http://archives.postgresql.org/pgsql-hackers/2005-11/msg01547.php
> (actually it's a single message AFAICT)
>
> This was applied on the 8.2dev code, so I'm surprised that 8.2dev
> behaves the same as 8.1.
>
> Does your problem have any relationship to what's described there?

It is probably related. I cannot yet tell whether the locking scheme
itself is bad or whether the number of SLRU buffers is simply too small.

> I also wondered whether the problem may be that the number of SLRU
> buffers we use for subtrans is too low.  But the number was increased
> from the default 8 to 32 in 8.2dev as well.  Maybe you could try
> increasing that even further, say to 128, and see if the problem is
> still there.  (src/include/access/subtrans.h, NUM_SUBTRANS_BUFFERS)

With PostgreSQL 8.2, I changed NUM_SUBTRANS_BUFFERS to 128, recompiled,
and measured again. The CSStorm did not occur. WIPS was about 400
(but WIPS dropped to about 320 at intervals of 4 to 6 minutes).

If the number of SLRU buffers is the real issue, then increasing the
number of buffers in PostgreSQL 8.1.4 should give the same result.
(The CLOG and multixact buffers also grow, but I think that effect is
small.) I have now set NUM_SLRU_BUFFERS to 128 in PostgreSQL 8.1.4 and
that measurement is in progress.

regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Subject: CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)
From: Katsuhiko Okano
Katsuhiko Okano wrote:
> With PostgreSQL 8.2, I changed NUM_SUBTRANS_BUFFERS to 128, recompiled,
> and measured again. The CSStorm did not occur. WIPS was about 400.

I measured again. The storm did not occur in a 30-minute run, but in a
3-hour run it occurred after about 1 hour and 10 minutes had passed.
Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
postpones it.

> If the number of SLRU buffers is the real issue, then increasing the
> number of buffers in PostgreSQL 8.1.4 should give the same result.
> (The CLOG and multixact buffers also grow, but I think that effect is
> small.) I have now set NUM_SLRU_BUFFERS to 128 in PostgreSQL 8.1.4 and
> that measurement is in progress.

With 8.1.4 the CSStorm likewise occurred after about 1 hour and 10
minutes.

One strange point: the LWLock counts for the SLRU (LRU) buffers stay at
zero until the CSStorm starts. Once it starts, both shared and exclusive
locks are requested and most of them are kept waiting (exclusive locks
on SubtransControlLock increase rapidly after the storm starts). Does
the code take a different path depending on whether the CSStorm has
started or not?

regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Subject: Re: CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)
From: Tom Lane
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
> Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
> postpones it.

Can you provide a reproducible test case for this?

			regards, tom lane
Subject: Re: CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)
From: Katsuhiko Okano
"Tom Lane <tgl@sss.pgh.pa.us>" wrote: > Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes: > > It does not solve, even if it increases the number of NUM_SUBTRANS_BUFFERS. > > The problem was only postponed. > > Can you provide a reproducible test case for this? Seven machines are required in order to perform measurement. (DB*1,AP*2,CLient*4) Enough work load was not able to be given in two machines. (DB*1,{AP+CL}*1) It was not able to reappear to a multiplex run of pgbench or a simple SELECT query. TPC-W of a work load tool used this time is a full scratch. Regrettably it cannot open to the public. If there is a work load tool of a free license, I would like to try. I will show if there is information required for others. The patch which outputs the number of times of LWLock was used this time. The following is old example output. FYI. # SELECT * FROM pg_stat_lwlocks;kind | pg_stat_get_lwlock_name | sh_call | sh_wait | ex_call | ex_wait | sleep ------+----------------------------+------------+-----------+-----------+-----------+------- 0 | BufMappingLock | 559375542 | 33542 | 320092 | 24025 | 0 1 | BufFreelistLock | 0 | 0 | 370709 | 47 | 0 2 | LockMgrLock | 0 | 0 | 41718885 | 734502 | 0 3 | OidGenLock | 33 | 0 | 0 | 0 | 0 4 | XidGenLock | 12572279 | 10095 | 11299469 | 20089 | 0 5 | ProcArrayLock | 8371330 | 72052 | 16965667 | 603294 | 0 6 | SInvalLock | 38822428 | 435 | 25917 | 128 | 0 7 | FreeSpaceLock | 0 | 0 | 16787 | 4 | 0 8 | WALInsertLock | 0 | 0 | 1239911 | 885 | 0 9 | WALWriteLock | 0 | 0 | 69907 | 5589 | 0 10 | ControlFileLock | 0 | 0 | 16686 | 1 | 0 11 | CheckpointLock | 0 | 0 | 34 | 0 | 0 12 | CheckpointStartLock | 69509 | 0 | 34 | 1 | 0 13 | CLogControlLock | 0 | 0 | 236763 | 183 | 0 14 | SubtransControlLock | 0 | 0 | 753773945 | 205273395 | 0 15 | MultiXactGenLock | 66 | 0 | 0 | 0 | 0 16 | MultiXactOffsetControlLock | 0 | 0 | 35 | 0 | 0 17 | MultiXactMemberControlLock | 0 | 0 | 34 | 0 | 0 18 | RelCacheInitLock | 0 | 0 | 0 | 0 | 0 19 | BgWriterCommLock | 0 | 0 | 61457 | 1 | 0 20 | TwoPhaseStateLock | 33 | 0 | 0 | 0 | 0 21 | TablespaceCreateLock | 0 | 0 | 0 | 0 | 0 22 | BufferIO | 0 | 0 | 695627 | 16 | 0 23 | BufferContent | 3568231805 | 1897 | 1361394 | 829 | 0 24 | CLog | 0 | 0 | 0 | 0 | 0 25 | SubTrans | 138571621 | 143208883 | 8122181 | 8132646 | 0 26 | MultiXactOffset | 0 | 0 | 0 | 0 | 0 27 | MultiXactMember | 0 | 0 | 0 | 0 | 0 (28 rows) I am pleased if interested. regards, -------- Katsuhiko Okano okano katsuhiko _at_ oss ntt co jp
Katsuhiko Okano wrote:
> "Tom Lane <tgl@sss.pgh.pa.us>" wrote:
>> Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
>>> Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
>>> postpones it.
>> Can you provide a reproducible test case for this?
>
> Seven machines are needed to run the measurement (DB*1, AP*2, Client*4).
> We could not generate enough load with only two machines (DB*1, {AP+CL}*1).
>
> I could not reproduce the problem with concurrent pgbench runs or with
> a simple SELECT query. The TPC-W workload tool we used was written from
> scratch in-house; regrettably, we cannot release it. If there is a
> freely licensed workload tool, I would like to try it.

FYI: there is a free tpc-w implementation done by Jan available at:
http://pgfoundry.org/projects/tpc-w-php/


Stefan
Hi folks,

From: Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>
Subject: Re: CSStorm occurred again by postgreSQL8.2. (Re: [HACKERS] poor
Date: Wed, 19 Jul 2006 12:53:53 +0200

> Katsuhiko Okano wrote:
> > "Tom Lane <tgl@sss.pgh.pa.us>" wrote:
> >> Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
> >>> Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
> >>> postpones it.
> >> Can you provide a reproducible test case for this?
> >
> > Seven machines are needed to run the measurement (DB*1, AP*2, Client*4).
> > We could not generate enough load with only two machines (DB*1, {AP+CL}*1).
> >
> > I could not reproduce the problem with concurrent pgbench runs or with
> > a simple SELECT query. The TPC-W workload tool we used was written
> > from scratch in-house; regrettably, we cannot release it. If there is
> > a freely licensed workload tool, I would like to try it.
>
> FYI: there is a free tpc-w implementation done by Jan available at:
> http://pgfoundry.org/projects/tpc-w-php/

FYI(2): There is one more (pseudo) TPC-W implementation, by OSDL:

http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/

One more comment: Katsuhiko's team is using their own TPC-W-like
benchmark suite, and he cannot make it public. Also, his point is that
he tried to reproduce the CSS phenomenon using pgbench and a program
issuing many concurrent SELECT queries on a single table, but those did
not reproduce the CSS.

Regards,
Masanori

---
Masanori ITOH
NTT OSS Center, Nippon Telegraph and Telephone Corporation
e-mail: ito.masanori@oss.ntt.co.jp
phone : +81-3-5860-5015
On Fri, Jul 14, 2006 at 02:58:36PM +0900, Katsuhiko Okano wrote:
> The CSStorm did not occur. WIPS was about 400
> (but WIPS dropped to about 320 at intervals of 4 to 6 minutes).

If you haven't changed checkpoint timeout, this drop-off every 4-6
minutes indicates that you need to make the bgwriter more aggressive.

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
> > > If there is a freely licensed workload tool, I would like to try it.
> >
> > FYI: there is a free tpc-w implementation done by Jan available at:
> > http://pgfoundry.org/projects/tpc-w-php/
>
> FYI(2): There is one more (pseudo) TPC-W implementation, by OSDL:
>
> http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/

Thank you for the information. I'll try it.

Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
"Jim C. Nasby" wrote: > If you haven't changed checkpoint timeout, this drop-off every 4-6 > minutes indicates that you need to make the bgwriter more aggressive. I'll say to a customer when proposing and explaining. Thank you for the information. Regards, -------- Katsuhiko Okano okano katsuhiko _at_ oss ntt co jp
Subject: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)
From: ITAGAKI Takahiro
Hi hackers,

I tackled the performance problem in the SUBTRANS module together with
Okano. We have reached the conclusion that the SubTrans log is read
heavily under a specific access pattern in my TPC-W implementation.
There seems to be awful traffic on SUBTRANS when checking the visibility
of tuples in HeapTupleSatisfiesSnapshot(). I'll report more details
later.

BTW, I wrote a patch that collects statistics on light-weight locks for
this analysis. We already have the trace_lwlocks option, but this patch
can collect statistics with less impact. The following is an output of
the patch (on 8.1). Are you interested in the feature? If so, I'll port
it to HEAD and post it.

> # SELECT * FROM pg_stat_lwlocks;
>  kind | pg_stat_get_lwlock_name    | sh_call    | sh_wait   | ex_call   | ex_wait   |
> ------+----------------------------+------------+-----------+-----------+-----------+-
>     0 | BufMappingLock             |  559375542 |     33542 |    320092 |     24025 |
>     1 | BufFreelistLock            |          0 |         0 |    370709 |        47 |
>     2 | LockMgrLock                |          0 |         0 |  41718885 |    734502 |
>     3 | OidGenLock                 |         33 |         0 |         0 |         0 |
>     4 | XidGenLock                 |   12572279 |     10095 |  11299469 |     20089 |
>     5 | ProcArrayLock              |    8371330 |     72052 |  16965667 |    603294 |
>     6 | SInvalLock                 |   38822428 |       435 |     25917 |       128 |
>     7 | FreeSpaceLock              |          0 |         0 |     16787 |         4 |
>     8 | WALInsertLock              |          0 |         0 |   1239911 |       885 |
>     9 | WALWriteLock               |          0 |         0 |     69907 |      5589 |
>    10 | ControlFileLock            |          0 |         0 |     16686 |         1 |
>    11 | CheckpointLock             |          0 |         0 |        34 |         0 |
>    12 | CheckpointStartLock        |      69509 |         0 |        34 |         1 |
>    13 | CLogControlLock            |          0 |         0 |    236763 |       183 |
>    14 | SubtransControlLock        |          0 |         0 | 753773945 | 205273395 |
>    15 | MultiXactGenLock           |         66 |         0 |         0 |         0 |
>    16 | MultiXactOffsetControlLock |          0 |         0 |        35 |         0 |
>    17 | MultiXactMemberControlLock |          0 |         0 |        34 |         0 |
>    18 | RelCacheInitLock           |          0 |         0 |         0 |         0 |
>    19 | BgWriterCommLock           |          0 |         0 |     61457 |         1 |
>    20 | TwoPhaseStateLock          |         33 |         0 |         0 |         0 |
>    21 | TablespaceCreateLock       |          0 |         0 |         0 |         0 |
>    22 | BufferIO                   |          0 |         0 |    695627 |        16 |
>    23 | BufferContent              | 3568231805 |      1897 |   1361394 |       829 |
>    24 | CLog                       |          0 |         0 |         0 |         0 |
>    25 | SubTrans                   |  138571621 | 143208883 |   8122181 |   8132646 |
>    26 | MultiXactOffset            |          0 |         0 |         0 |         0 |
>    27 | MultiXactMember            |          0 |         0 |         0 |         0 |
> (28 rows)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Subject: Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)
From: Katsuhiko Okano
Hi, all.

We have found the cause of the CSStorm problem discussed in the
previous mails and worked around it with a provisional patch, so here
is a report.

> Subject: [HACKERS] poor performance with Context Switch Storm at TPC-W.
> Date: Tue, 11 Jul 2006 20:09:24 +0900
> From: Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp>
>
> poor performance with Context Switch Storm occurred
> with the following composition.

Background:
PostgreSQL 8.0 added SAVEPOINT support, so every transaction can
contain subtransactions. When judging the visibility of a tuple, we
need to know whether the XID that inserted it belongs to a top
transaction or to a subtransaction (e.g. when checking whether XMIN has
committed). To make that judgment we have to access SubTrans (the data
structure that maps a transaction ID to its parent), and SubTrans is
accessed through an LRU buffer.

Conditions under which the phenomenon occurs:
- There are transactions that read tuples in large numbers and at high
  frequency (typically sequential scans).
- There are updating transactions at a moderate rate.
- There is a sufficiently long-lived transaction.

Points of view:
(A) The buffer replacement algorithm is poor. The page being swapped
    out does not get a fresh timestamp until the swap-out completes.
    If other pages are accessed while the swap is in progress, the
    page being swapped out looks like the least recently used one and
    is chosen for swap-out again (when the workload is high).
(B) Consulting SubTrans on every tuple visibility check is too
    frequent. If many processes wait on the LWLock using semop, a
    CSStorm results.

Results:
For (A), I created a patch that excludes pages whose I/O (read or
write) is in progress from the replacement candidates. (It keeps a
"betterslot" for the case where every page is I/O-in-progress.) With
that patch applied, the problem still recurred, so it is not a
fundamental fix.

For (B), a patch was created that treats every transaction as a top
transaction for the purpose of the check (thank you, ITAGAKI). With
that patch applied we measured for 8 hours and the CSStorm did not
occur.

Discussion:
(1) Since this workload uses neither SAVEPOINT nor PL/pgSQL error
    traps, subtransactions are unnecessary here. Would it be better to
    implement a mode that does not use subtransactions?
(2) It would be better still if the lookups could be cached with a
    structure like CLOG's, so that the SLRU buffer does not have to be
    searched on every visibility check.

Are there problems with these ideas, or other suggestions?

--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
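To make point (B) concrete, here is a minimal sketch, in C, of the kind
of short-circuit described above. It is not the actual provisional
patch (whose code was not posted); "subtransactions_disabled" is a
hypothetical flag standing in for however the mode suggested in (1)
would be selected:

    /*
     * Sketch only: if we know no SAVEPOINTs (and hence no subtransactions)
     * are in use, every XID is already a top-level XID and the pg_subtrans
     * lookup can be skipped entirely.
     */
    static TransactionId
    GetTopmostXidForVisibility(TransactionId xid)
    {
        if (subtransactions_disabled)
            return xid;     /* no subxacts: xid is already the topmost */

        /*
         * Normal path: consult pg_subtrans through the SLRU buffers,
         * which takes SubtransControlLock.
         */
        return SubTransGetTopmostTransaction(xid);
    }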
Subject: Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)
From: Katsuhiko Okano
Katsuhiko Okano wrote:
> We have found the cause of the CSStorm problem discussed in the
> previous mails and worked around it with a provisional patch, so here
> is a report.
(snip)
> (A) The buffer replacement algorithm is poor. The page being swapped
>     out does not get a fresh timestamp until the swap-out completes.
>     If other pages are accessed while the swap is in progress, the
>     page being swapped out looks like the least recently used one and
>     is chosen for swap-out again (when the workload is high).

The following is the patch.

diff -cpr postgresql-8.1.4-orig/src/backend/access/transam/slru.c postgresql-8.1.4-SlruSelectLRUPage-fix/src/backend/access/transam/slru.c
*** postgresql-8.1.4-orig/src/backend/access/transam/slru.c	2006-01-21 13:38:27.000000000 +0900
--- postgresql-8.1.4-SlruSelectLRUPage-fix/src/backend/access/transam/slru.c	2006-07-25 18:02:49.000000000 +0900
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 703,710 ****
  	for (;;)
  	{
  		int			slotno;
! 		int			bestslot = 0;
  		unsigned int bestcount = 0;
  
  		/* See if page already has a buffer assigned */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
--- 703,712 ----
  	for (;;)
  	{
  		int			slotno;
! 		int			bestslot = -1;
! 		int			betterslot = -1;
  		unsigned int bestcount = 0;
+ 		unsigned int bettercount = 0;
  
  		/* See if page already has a buffer assigned */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 720,732 ****
  		 */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
  		{
! 			if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
! 				return slotno;
! 			if (shared->page_lru_count[slotno] > bestcount &&
! 				shared->page_number[slotno] != shared->latest_page_number)
! 			{
! 				bestslot = slotno;
! 				bestcount = shared->page_lru_count[slotno];
  			}
  		}
  
--- 722,746 ----
  		 */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
  		{
! 			switch (shared->page_status[slotno])
! 			{
! 				case SLRU_PAGE_EMPTY:
! 					return slotno;
! 				case SLRU_PAGE_READ_IN_PROGRESS:
! 				case SLRU_PAGE_WRITE_IN_PROGRESS:
! 					if (shared->page_lru_count[slotno] > bettercount &&
! 						shared->page_number[slotno] != shared->latest_page_number)
! 					{
! 						betterslot = slotno;
! 						bettercount = shared->page_lru_count[slotno];
! 					}
! 				default:	/* SLRU_PAGE_CLEAN, SLRU_PAGE_DIRTY */
! 					if (shared->page_lru_count[slotno] > bestcount &&
! 						shared->page_number[slotno] != shared->latest_page_number)
! 					{
! 						bestslot = slotno;
! 						bestcount = shared->page_lru_count[slotno];
! 					}
  			}
  		}
  
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 736,741 ****
--- 750,758 ----
  		if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
  			return bestslot;
  
+ 		if (bestslot == -1)
+ 			bestslot = betterslot;
+ 
  		/*
  		 * We need to do I/O.  Normal case is that we have to write it out,
  		 * but it's possible in the worst case to have selected a read-busy

Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Here is a patch to collect statistics of LWLocks.

This seems fairly invasive, as well as confused about whether it's an
#ifdef'able thing or not.  You can't have system views and pg_proc
entries conditional on a compile-time #ifdef, so in a default build
we would have a lot of nonfunctional cruft exposed to users.

Do we really need this compared to the simplistic dump-to-stderr
counting support that's in there now?  That stuff doesn't leave any
cruft behind when not enabled, and it has at least one significant
advantage over your proposal, which is that it's possible to get
per-process statistics when needed.

If I thought that average users would have a need for LWLock statistics,
I'd be more sympathetic to expending effort on a nice frontend for
viewing the statistics, but this is and always will be just a concern
for hardcore hackers ...

			regards, tom lane
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
>> (A) The buffer replacement algorithm is poor. The page being swapped
>>     out does not get a fresh timestamp until the swap-out completes.
>>     If other pages are accessed while the swap is in progress, the
>>     page being swapped out looks like the least recently used one and
>>     is chosen for swap-out again (when the workload is high).

> The following is the patch.

I'm confused ... is this patch being proposed for inclusion?  I
understood your previous message to say that it didn't help much.

The patch is buggy as posted, because it will try to do this:

	if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
		return bestslot;

while bestslot could still be -1.

I see your concern about multiple processes selecting the same buffer
for replacement, but what will actually happen is that all but the first
will block for the first one's I/O to complete using SimpleLruWaitIO,
and then all of them will repeat the outer loop and recheck what to do.
If they were all trying to swap in the same page this is actually
optimal.  If they were trying to swap in different pages then the losing
processes will again try to initiate I/O on a different buffer.  (They
will pick a different buffer, because the guy who got the buffer will
have done SlruRecentlyUsed on it before releasing the control lock ---
so I don't believe the worry that we get a buffer thrash scenario here.
Look at the callers of SlruSelectLRUPage, not just the function itself.)

It's possible that letting different processes initiate I/O on different
buffers would be a win, but it might just result in excess writes,
depending on the relative probability of requests for the same page vs.
requests for different pages.  Also, I think the patch as posted would
still cause processes to gang up on the same buffer, it would just be a
different one from before.

The right thing would be to locate the overall-oldest buffer and return
it if clean; otherwise to initiate I/O on the oldest buffer that isn't
either clean or write-busy, if there is one; otherwise just do WaitIO on
the oldest buffer.  This would ensure that different processes try to
push different buffers to disk.  They'd still go back and make their
decisions from the top after doing their I/O.  Whether this is a win or
not is not clear to me, but at least it would attack the guessed-at
problem correctly.

			regards, tom lane
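A minimal sketch of the selection policy described in the last
paragraph, written against the same 8.1 slru.c data structures that
appear in the patch above. It is an illustration of the policy, not
tested code: read-busy pages are treated as not writable, and the
SimpleLruWritePage call and its signature are assumptions that should
be checked against the source.

    /*
     * Sketch only: return the overall-oldest buffer if it is clean;
     * otherwise start I/O on the oldest buffer that is neither clean nor
     * busy; otherwise wait for the oldest buffer's I/O to finish.
     */
    for (;;)
    {
        int         slotno;
        int         bestslot = -1;      /* overall-oldest replaceable page */
        int         bestwritable = -1;  /* oldest page we could start a write on */
        unsigned int bestcount = 0;
        unsigned int bestwritecount = 0;

        for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
        {
            int     status = shared->page_status[slotno];

            if (status == SLRU_PAGE_EMPTY)
                return slotno;          /* free slot: use it immediately */
            if (shared->page_number[slotno] == shared->latest_page_number)
                continue;               /* never evict the latest page */

            if (shared->page_lru_count[slotno] > bestcount)
            {
                bestslot = slotno;
                bestcount = shared->page_lru_count[slotno];
            }
            if (status != SLRU_PAGE_CLEAN &&
                status != SLRU_PAGE_WRITE_IN_PROGRESS &&
                status != SLRU_PAGE_READ_IN_PROGRESS &&
                shared->page_lru_count[slotno] > bestwritecount)
            {
                bestwritable = slotno;  /* i.e. an oldest dirty page */
                bestwritecount = shared->page_lru_count[slotno];
            }
        }

        Assert(bestslot >= 0);          /* some non-latest page must exist */

        if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
            return bestslot;            /* oldest page is clean: evict it */

        if (bestwritable >= 0)
            SimpleLruWritePage(ctl, bestwritable, NULL);    /* push it out */
        else
            SimpleLruWaitIO(ctl, bestslot);     /* everything busy: wait */

        /* loop back and re-evaluate now that some I/O has completed */
    }

This way, concurrent processes that lose the race tend to start writes
on different dirty buffers rather than ganging up on one, which is the
behavior Tom asks for above.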
Tom Lane wrote:
> If I thought that average users would have a need for LWLock statistics,
> I'd be more sympathetic to expending effort on a nice frontend for
> viewing the statistics, but this is and always will be just a concern
> for hardcore hackers ...

That may be true of the output, but that's not a very strong argument
against making it much easier to gather and display the LWLock
statistics.  I can easily imagine the patch being a useful performance
troubleshooting tool in a high load environment.

Depends on how easy/intrusive it is to enable/use the stderr method on a
production system, though, as well as how much of a performance impact
the measurements have on overall operation...

-- 
Kevin Brown                                       kevin@sysexperts.com
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This seems fairly invasive, as well as confused about whether it's an
> #ifdef'able thing or not.  You can't have system views and pg_proc
> entries conditional on a compile-time #ifdef, so in a default build
> we would have a lot of nonfunctional cruft exposed to users.

Is it acceptable if the pg_stat_lwlocks view and the other functions are
not installed, and therefore invisible, when LWLOCK_STAT is not defined?
We don't have such a feature now, but we could.

> Do we really need this compared to the simplistic dump-to-stderr
> counting support that's in there now?  That stuff doesn't leave any
> cruft behind when not enabled, and it has at least one significant
> advantage over your proposal, which is that it's possible to get
> per-process statistics when needed.

There is one advantage in my proposal: we can easily watch the
time-varying activity of LWLocks.

Are per-process statistics actually needed?  If we use connection
pooling, we lose that advantage, and I think connection pooling is
commonly used in exactly the kind of high-load cases where we encounter
LWLock contention.

> If I thought that average users would have a need for LWLock statistics,
> I'd be more sympathetic to expending effort on a nice frontend for
> viewing the statistics, but this is and always will be just a concern
> for hardcore hackers ...

I assume the ones who analyze the output of the patch are hardcore
hackers, but the ones who actually run it and collect the information
are average users.  The dump-to-stderr method is hard to use because it
inflates the logs and requires re-parsing effort.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Subject: Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)
From: Katsuhiko Okano
Tom Lane wrote:
> I'm confused ... is this patch being proposed for inclusion?  I
> understood your previous message to say that it didn't help much.

No; the patch was only meant to help isolate where the problem is.

> The patch is buggy as posted, because it will try to do this:
>	if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
>		return bestslot;
> while bestslot could still be -1.

Understood; a check is required there.

> (They
> will pick a different buffer, because the guy who got the buffer will
> have done SlruRecentlyUsed on it before releasing the control lock ---
> so I don't believe the worry that we get a buffer thrash scenario here.
> Look at the callers of SlruSelectLRUPage, not just the function itself.)

Hmm, I will read the code again.

> otherwise to initiate I/O on the oldest buffer that isn't
> either clean or write-busy, if there is one;

That point is difficult for me to follow, but I understand it is the
important one.

--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This seems fairly invasive, as well as confused about whether it's an
>> #ifdef'able thing or not.  You can't have system views and pg_proc
>> entries conditional on a compile-time #ifdef, so in a default build
>> we would have a lot of nonfunctional cruft exposed to users.

> Is it acceptable if the pg_stat_lwlocks view and the other functions are
> not installed, and therefore invisible, when LWLOCK_STAT is not defined?
> We don't have such a feature now, but we could.

Then you'd need an initdb to go between having stats and not, which
carries its own downsides.  If I thought there were a wide market for
this then I'd say sure, let's just have it there ... but I don't.

I think the actual wave of the future for analyzing behavior at the
LWLock level is going to be DTrace.  It seems way more flexible than
an aggregate-statistics view can be.

			regards, tom lane
Tom Lane wrote:
> I think the actual wave of the future for analyzing behavior at the
> LWLock level is going to be DTrace.  It seems way more flexible than
> an aggregate-statistics view can be.

CVS head now has the following LWLock probes, and more can easily be
added.  These probes can be enabled using the sample DTrace scripts at
http://pgfoundry.org/projects/dtrace/

lwlock-acquire
lwlock-release
lwlock-startwait
lwlock-endwait
lwlock-condacquire
lwlock-condacquire-fail

If anyone wants to have access to a Solaris machine to play with DTrace,
let me know.

Regards,
-Robert
Robert Lor <Robert.Lor@Sun.COM> wrote:
> CVS head now has the following LWLock probes, and more can easily be
> added.  These probes can be enabled using the sample DTrace scripts at
> http://pgfoundry.org/projects/dtrace/
>
> lwlock-acquire
> lwlock-release
> lwlock-startwait
> lwlock-endwait
> lwlock-condacquire
> lwlock-condacquire-fail
>
> If anyone wants to have access to a Solaris machine to play with DTrace,
> let me know.

I grant that DTrace is a powerful tool, but can we gather LWLock
statistics well with it?  There is a "color coding" problem in both the
present dump-to-stderr method and the DTrace method: I assume we want to
gather the statistics per resource (represented by LWLockKind in my
patch), not per LWLockId.  Even if we use DTrace, don't we need some
support for this coloring of lwlocks?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I assume we want to gather the statistics per resource (represented by
> LWLockKind in my patch), not per LWLockId.

Why?  I haven't yet seen any problem where that looked useful.  The
named LWLocks are certainly sui generis, and as for things like
per-buffer locks I haven't seen any need for aggregate statistics ---
I'd rather know about "hot spots" if there are any buffers that are not
like the rest.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
> > postpones it.
>
> Can you provide a reproducible test case for this?

This is a reproducible test case:

- Occurs on both 8.1.4 and HEAD.
- On an SMP machine. I used dual Opterons; the CS storm becomes worse on
  a dual Xeon with hyper-threading.
- Tuning parameters are default. The whole data set is cached in shared
  buffers (shared_buffers=32MB; the pgbench data at scale=1 is less than
  15MB).
- Using custom pgbench scripts: one client doing UPDATEs with index
  scans, and multiple clients doing SELECTs with seqscans/indexscans.

$ pgbench -i
$ pgbench -n -c 1 -t 100000 -f cs_update.sql &
$ pgbench -n -c 50 -t 100000 -f cs_indexscan.sql &
$ pgbench -n -c 35 -t 100000 -f cs_seqscan.sql &

(The scripts are attached at the end of this message.)

With the above workload, context switches are 2000-10000/sec and CPU
usage is user=100%. Then start a long open transaction on another
connection:

$ psql
# begin;  -- Long open transaction

After a lapse of 30-60 seconds, context switches exceed 50000/sec
(120000/sec on the Xeons) and CPU usage is user=66% / sys=21% /
idle=13%. If we increase the frequency of UPDATEs, the delay before
onset becomes shorter.

This is an artificial workload, but I can see the same condition in
TPC-W, even though it is a benchmark. TPC-W requires full-text search,
which is implemented as "LIKE %foo%" in my implementation (and in DBT-1,
too). It also requires periodic aggregations. Both of these can behave
as long transactions.

The cause seems to be lock contention. The number of locks on
SubtransControlLock and SubTransBuffer increases dramatically compared
with BufMappingLock:

# Before starting a long transaction
 kind | lwlock              | sh_call  | sh_wait | ex_call | ex_wait
------+---------------------+----------+---------+---------+---------
   13 | SubtransControlLock |    28716 |       2 |      54 |       0
   22 | BufMappingLock      | 11637884 |       0 |    2492 |       0
   27 | SubTransBuffer      |        0 |       0 |      11 |       0

# After
 kind | lwlock              | sh_call  | sh_wait | ex_call | ex_wait
------+---------------------+----------+---------+---------+---------
   13 | SubtransControlLock |  4139111 |   65059 | 3926691 |  390838
   22 | BufMappingLock      | 32348073 |       0 |    2509 |       0
   27 | SubTransBuffer      |   939646 |  960341 | 1419152 |      61

The callers into the SubTrans module are the two
SubTransGetTopmostTransaction() calls in HeapTupleSatisfiesSnapshot().
When I disabled those calls, the CSStorm did not occur.
SubTransGetTopmostTransaction returns its argument unchanged when we
don't use SAVEPOINTs. If we optimize for the no-subtransaction case, we
can avoid locking SubTrans to check the visibility of tuples inserted by
top-level transactions.

If we want to resolve the problem fundamentally, we might have to
improve SubTrans with a better buffer management algorithm or the like.
Do you have any ideas for avoiding this problem?

-- cs_update.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
\setrandom delta -5000 5000
UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT pg_sleep(0.1);

-- cs_seqscan.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid::int8 = :aid; -- cast to force seqscan

-- cs_indexscan.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid = :aid;

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> The callers into the SubTrans module are the two
> SubTransGetTopmostTransaction() calls in HeapTupleSatisfiesSnapshot().
> When I disabled those calls, the CSStorm did not occur.
> SubTransGetTopmostTransaction returns its argument unchanged when we
> don't use SAVEPOINTs. If we optimize for the no-subtransaction case, we
> can avoid locking SubTrans to check the visibility of tuples inserted
> by top-level transactions.

Only for top transactions still in progress, so I doubt that would help
much.

I'm wondering about doing something similar to what
TransactionIdIsInProgress does, ie, make use of the PGPROC lists of live
subtransactions.  Suppose that GetSnapshotData copies not only top xids
but live subxids into the snapshot, and adds a flag indicating whether
the subxids are complete (ie, none of the subxid lists have overflowed).
Then if the flag is set, tqual.c doesn't need to do
SubTransGetTopmostTransaction() before searching the list.

			regards, tom lane
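A rough sketch, in C, of how the snapshot-side check Tom describes could
look. The field names subxip, subxcnt, and suboverflowed are
illustrative and may not match the names the eventual patch used, and
the xmin/xmax handling of the real tqual.c code is omitted:

    /*
     * Sketch only: decide whether xid was still running according to the
     * snapshot, consulting pg_subtrans only when we must.
     */
    static bool
    XidInSnapshot(TransactionId xid, Snapshot snapshot)
    {
        int     i;

        if (snapshot->suboverflowed)
        {
            /*
             * Some backend's subxid cache overflowed when the snapshot was
             * taken, so the subxip list may be incomplete: fall back to the
             * pg_subtrans lookup and compare against the top xids only.
             */
            xid = SubTransGetTopmostTransaction(xid);
        }
        else
        {
            /* The complete list of running subxacts is in the snapshot. */
            for (i = 0; i < snapshot->subxcnt; i++)
            {
                if (TransactionIdEquals(xid, snapshot->subxip[i]))
                    return true;
            }
        }

        for (i = 0; i < snapshot->xcnt; i++)
        {
            if (TransactionIdEquals(xid, snapshot->xip[i]))
                return true;
        }
        return false;
    }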
Tom Lane wrote:
> I'm wondering about doing something similar to what
> TransactionIdIsInProgress does, ie, make use of the PGPROC lists
> of live subtransactions.  Suppose that GetSnapshotData copies not
> only top xids but live subxids into the snapshot, and adds a flag
> indicating whether the subxids are complete (ie, none of the subxid
> lists have overflowed).  Then if the flag is set, tqual.c doesn't
> need to do SubTransGetTopmostTransaction() before searching the
> list.

Well, that sounds awfully more expensive than setting
local-to-my-database Xmins as well as global (all databases) Xmins :-)

On the other hand, ISTM as soon as one cache overflows, you have to go
check pg_subtrans, which means the entire optimization buys nothing in
that case.  It would be nice if the optimization degraded more
gracefully.  I don't have any concrete suggestion though.  The changes
proposed in the other CS-storm thread by the NTT person may help.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I'm wondering about doing something similar to what
>> TransactionIdIsInProgress does, ie, make use of the PGPROC lists
>> of live subtransactions.

> Well, that sounds awfully more expensive than setting
> local-to-my-database Xmins as well as global (all databases) Xmins :-)

Only when you've got a lot of subtransactions.  The point of this
discussion is to optimize for the few-or-none case.  In any case,
the problem with the local/global bit was that you'd be expending
foreground-process cycles without any foreground-process return.
Being able to use a snapshot without consulting pg_subtrans will
certainly buy back some cycles...

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> I'm wondering about doing something similar to what
> >> TransactionIdIsInProgress does, ie, make use of the PGPROC lists
> >> of live subtransactions.
>
> > Well, that sounds awfully more expensive than setting
> > local-to-my-database Xmins as well as global (all databases) Xmins :-)
>
> Only when you've got a lot of subtransactions.  The point of this
> discussion is to optimize for the few-or-none case.  In any case,
> the problem with the local/global bit was that you'd be expending
> foreground-process cycles without any foreground-process return.
> Being able to use a snapshot without consulting pg_subtrans will
> certainly buy back some cycles...

I can buy that.  Some idle thoughts:

I was thinking about what time is the most appropriate to insert or
remove an Xid from the cache.  We can do without any excl-locking
because 1) we already assume the storing of an Xid to be atomic, and
2) no one can be interested in querying for an Xid before the
originating transaction has had the chance to write a tuple with that
Xid anyway.

OTOH I think we only need to store live Xids and those finished ones
that wrote a WAL record; we can drop subaborted and subcommitted ones if
they didn't.

On the third hand, are we going to sh-acquire the ProcArray lock while
GetSnapshotData copies all the subxact Xids of all running transactions?
ProcArrayLock will become more of a contention point than it already is.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I was thinking about what time is the most appropriate to insert or
> remove an Xid from the cache.  We can do without any excl-locking
> because 1) we already assume the storing of an Xid to be atomic, and
> 2) no one can be interested in querying for an Xid before the
> originating transaction has had the chance to write a tuple with that
> Xid anyway.

Actually ... that fails if GetSnapshotData is going to copy subtrans
XIDs.  So this area needs more thought.

> On the third hand, are we going to sh-acquire the ProcArray lock while
> GetSnapshotData copies all the subxact Xids of all running transactions?
> ProcArrayLock will become more of a contention point than it already is.

Yeah, but sharelock is better than exclusive lock ...

			regards, tom lane
This is additional information.

I wrote:
> If we want to resolve the problem fundamentally, we might have to
> improve SubTrans with a better buffer management algorithm or the like.

The above is probably wrong. I checked each lwlock of pg_subtrans's
buffers. All of the lwlocks are acquired uniformly and I could not see
any differences among buffers. So the cause seems not to be the buffer
management algorithm, but simply a lack of SLRU buffer pages.

NUM_SUBTRANS_BUFFERS is defined as 32 in HEAD. If we increase it, we can
avoid the old-transaction problem for a certain time. However, it
doesn't help much under high load -- for example, with a workload of
2000 tps, we will use up 1000 pg_subtrans pages in 15 minutes. I suppose
that is not enough for a mix of online and batch/maintenance work.

Also, the simple scanning in SLRU will likely cause another performance
issue if we greatly increase the number of buffers. SLRU uses a
sequential scan over its buffers, so it will not work well with many of
them.

I hope for some help in an upper layer -- snapshots, hint bits, or
something along the lines being discussed in the recent thread.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Is there anything to do for 8.2 here?

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> This is additional information.
>
> I wrote:
> > If we want to resolve the problem fundamentally, we might have to
> > improve SubTrans with a better buffer management algorithm or the like.
>
> The above is probably wrong. I checked each lwlock of pg_subtrans's
> buffers. All of the lwlocks are acquired uniformly and I could not see
> any differences among buffers. So the cause seems not to be the buffer
> management algorithm, but simply a lack of SLRU buffer pages.
>
> NUM_SUBTRANS_BUFFERS is defined as 32 in HEAD. If we increase it, we can
> avoid the old-transaction problem for a certain time. However, it
> doesn't help much under high load -- for example, with a workload of
> 2000 tps, we will use up 1000 pg_subtrans pages in 15 minutes. I suppose
> that is not enough for a mix of online and batch/maintenance work.
>
> Also, the simple scanning in SLRU will likely cause another performance
> issue if we greatly increase the number of buffers. SLRU uses a
> sequential scan over its buffers, so it will not work well with many of
> them.
>
> I hope for some help in an upper layer -- snapshots, hint bits, or
> something along the lines being discussed in the recent thread.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center

-- 
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> wrote:
> Is there anything to do for 8.2 here?

I'm working on Tom's idea. It is not a feature and does not change the
on-disk structures, so I hope it can meet the 8.2 deadline...

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm wondering about doing something similar to what
> TransactionIdIsInProgress does, ie, make use of the PGPROC lists
> of live subtransactions.  Suppose that GetSnapshotData copies not
> only top xids but live subxids into the snapshot, and adds a flag
> indicating whether the subxids are complete (ie, none of the subxid
> lists have overflowed).  Then if the flag is set, tqual.c doesn't
> need to do SubTransGetTopmostTransaction() before searching the
> list.

I added a subxid array to Snapshot; the running subxids are gathered
from the PGPROC->subxids caches. There are two overflowed cases: any of
the PGPROC->subxids caches has overflowed, or the number of total
subxids exceeds the pre-allocated buffer. If we have overflowed, we
cannot avoid calling SubTransGetTopmostTransaction.

I tested the patch and did not see any context switch storm coming from
pg_subtrans, but there may be bugs in the visibility checking. It would
be very nice if you could review or test the patch.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
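A sketch of the gathering side Itagaki describes, to make the "two
overflowed cases" concrete. The PGPROC subxid-cache field names follow
what is visible elsewhere in this thread (subxids, nxids, overflowed),
but treat them, the snapshot field names, and the hypothetical
maxsubxcnt field as illustrative rather than the exact code in the
patch:

    /*
     * Sketch only: collect running subxids from every PGPROC into the
     * snapshot.  The caller is assumed to hold ProcArrayLock in shared
     * mode.  "maxsubxcnt" is a hypothetical field recording how many
     * entries were pre-allocated for snapshot->subxip.
     */
    static void
    CollectRunningSubXids(ProcArrayStruct *arrayP, Snapshot snapshot)
    {
        int     index;

        snapshot->subxcnt = 0;
        snapshot->suboverflowed = false;

        for (index = 0; index < arrayP->numProcs; index++)
        {
            PGPROC *proc = arrayP->procs[index];
            int     nsub = proc->subxids.nxids;

            if (proc->subxids.overflowed)
                snapshot->suboverflowed = true;     /* case 1: backend cache overflowed */
            else if (snapshot->subxcnt + nsub > snapshot->maxsubxcnt)
                snapshot->suboverflowed = true;     /* case 2: snapshot array is full */
            else if (nsub > 0)
            {
                memcpy(snapshot->subxip + snapshot->subxcnt,
                       (void *) proc->subxids.xids,
                       nsub * sizeof(TransactionId));
                snapshot->subxcnt += nsub;
            }
        }
    }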
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I added a subxid array to Snapshot; the running subxids are gathered
> from the PGPROC->subxids caches. There are two overflowed cases: any of
> the PGPROC->subxids caches has overflowed, or the number of total
> subxids exceeds the pre-allocated buffer. If we have overflowed, we
> cannot avoid calling SubTransGetTopmostTransaction.

Applied after some editorialization (you really need to pay more
attention to keeping comments in sync with code ;-))

I cannot measure any consistent speed difference in plain pgbench
scenarios with and without the patch, so at least as a rough
approximation the extra cycles in GetSnapshotData aren't hurting.
And I confirm that the test case you posted before no longer exhibits
a context-swap storm.

This change makes it even more obvious than before that we really want
to stay out of the subxids-overflowed regime.  I don't especially want
to make those arrays even bigger, but I wonder if there isn't more we
can do to use them efficiently.  In particular, it seems like in the
case where RecordSubTransactionCommit detects that the subxact has not
stored its XID anywhere, we could immediately remove the XID from the
PGPROC array, just as if it had aborted.  This would avoid chewing
subxid slots for cases such as exception blocks in plpgsql that are not
modifying the database, but just catching computational errors.

Comments?

			regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> OTOH I think we only need to store live Xids and those finished ones
> that wrote a WAL record; we can drop subaborted and subcommitted ones
> if they didn't.

While reviewing this thread, I see Alvaro already had the idea I just
came to...

			regards, tom lane
I wrote:
> ... it seems like in the
> case where RecordSubTransactionCommit detects that the subxact has not
> stored its XID anywhere, we could immediately remove the XID from
> the PGPROC array, just as if it had aborted.  This would avoid chewing
> subxid slots for cases such as exception blocks in plpgsql that are
> not modifying the database, but just catching computational errors.

(and later realized that Alvaro had had the same idea awhile back, but
I don't have his message at hand).

I looked into this a bit more; it seems like basically it should only
take addition of

	else
		XidCacheRemoveRunningXids(xid, 0, NULL);

to the bottom of RecordSubTransactionCommit(), plus suitable adjustment
of the comments in both routines.  However, there's a problem: if we
delete a second-level subxact's XID from PGPROC, and later its upper
subtransaction aborts, XidCacheRemoveRunningXids will emit scary
warnings when it doesn't find the sub-subxact in PGPROC.  This could
doubtless be fixed with sufficient jiggery-pokery --- simply removing
the debug warnings would be a brute-force answer, but I'd like to find
something a bit less brute-force.  Maybe drop the sub-subxact from its
parent's list immediately, instead of carrying it forward?

Anyway, given that there's this one nonobvious gotcha, there might be
others.  My recommendation is that we take this off the open-items list
for 8.2 and revisit it in the 8.3 cycle when there's more time.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Anyway, given that there's this one nonobvious gotcha, there might be
> others.  My recommendation is that we take this off the open-items list
> for 8.2 and revisit it in the 8.3 cycle when there's more time.

I wonder if Theo's recent reported problem with 4.3M child xids changes
the calculus on this.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Anyway, given that there's this one nonobvious gotcha, there might be
>> others.  My recommendation is that we take this off the open-items list
>> for 8.2 and revisit it in the 8.3 cycle when there's more time.

> I wonder if Theo's recent reported problem with 4.3M child xids changes
> the calculus on this.

Yeah, I was just looking at that.  Removing useless entries from the
child-xid list would presumably help him.  Considering we're not even
formally in beta yet, I'm probably being too conservative to recommend
we not touch it.

			regards, tom lane
I wrote:
> Yeah, I was just looking at that.  Removing useless entries from the
> child-xid list would presumably help him.  Considering we're not even
> formally in beta yet, I'm probably being too conservative to recommend
> we not touch it.

Actually ... wait a minute.  We do not assign an XID to a subtransaction
at all unless it writes a tuple to disk (see GetCurrentTransactionId and
its callers).  So this whole "optimization" idea is redundant.

I see a bug though, which is that RecordSubTransactionAbort() calls
GetCurrentTransactionId() before having verified that it needs to do
anything.  This means that we'll generate and then discard an XID
uselessly in a failed subxact that didn't touch disk.  Worth fixing, but
it doesn't look like this is Theo's problem.

Unless I'm missing something, Theo's problem must involve having done
tuple updates in 4.6M different subtransactions.

			regards, tom lane
I wrote:
> I see a bug though, which is that RecordSubTransactionAbort() calls
> GetCurrentTransactionId() before having verified that it needs to do
> anything.  This means that we'll generate and then discard an XID
> uselessly in a failed subxact that didn't touch disk.

Well, it would be a bug except that RecordSubTransactionAbort isn't
called unless the current subxact has an XID.  Perhaps a comment would
be appropriate, but there's nothing to fix here.

I think Theo's problem is probably somewhere else, too --- apparently
it's not so much that TransactionIdIsCurrentTransactionId takes a long
time as that something is calling it lots of times with no check for
interrupt.

			regards, tom lane
Tom Lane wrote:
> I wrote:
> > I see a bug though, which is that RecordSubTransactionAbort() calls
> > GetCurrentTransactionId() before having verified that it needs to do
> > anything.  This means that we'll generate and then discard an XID
> > uselessly in a failed subxact that didn't touch disk.
>
> Well, it would be a bug except that RecordSubTransactionAbort isn't
> called unless the current subxact has an XID.  Perhaps a comment would
> be appropriate, but there's nothing to fix here.
>
> I think Theo's problem is probably somewhere else, too --- apparently
> it's not so much that TransactionIdIsCurrentTransactionId takes a long
> time as that something is calling it lots of times with no check for
> interrupt.

Could it be something like heap_lock_tuple?  It calls MultiXactIdWait,
which calls GetMultiXactIdMembers and TransactionIdIsCurrentTransactionId
on each member.  (heap_update and heap_delete do the same thing.)  I
must admit I didn't read Theo's description of his scenario, though.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I think Theo's problem is probably somewhere else, too --- apparently
>> it's not so much that TransactionIdIsCurrentTransactionId takes a long
>> time as that something is calling it lots of times with no check for
>> interrupt.

> Could it be something like heap_lock_tuple?  It calls MultiXactIdWait,
> which calls GetMultiXactIdMembers and TransactionIdIsCurrentTransactionId
> on each member.  (heap_update and heap_delete do the same thing.)  I
> must admit I didn't read Theo's description of his scenario, though.

He shows HeapTupleSatisfiesSnapshot as the next thing down the call
stack, so those scenarios don't seem quite right.  I'm wondering about a
CHECK_FOR_INTERRUPTS-free loop in either plperl or trigger handling,
myself.

Anyway, I was thinking some more about Theo's original suggestion that
the linked-list representation of childXids was too inefficient.  I'm
disinclined to use a hash as he suggests, but it strikes me that we
could very easily change the list into a dynamically extended array ---
and because the entries are surely added in increasing XID order, such
an array could be binary-searched.  This wouldn't be a win for very
small numbers of child XIDs, but for large numbers it would.

OTOH, there are probably enough other inefficiencies in handling large
numbers of subxact XIDs that speeding up
TransactionIdIsCurrentTransactionId might be a useless exercise.  It
would be good to profile a test case before spending much effort here.

			regards, tom lane
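For illustration, a small sketch of the dynamically extended,
binary-searchable array Tom describes; the type and function names here
are made up for the example and are not the xact.c code:

    /* Sketch only: a sorted, growable array of child XIDs. */
    typedef struct
    {
        TransactionId *xids;        /* ascending order */
        int            nxids;
        int            maxxids;
    } ChildXidArray;

    static void
    child_xid_append(ChildXidArray *a, TransactionId xid)
    {
        if (a->nxids >= a->maxxids)
        {
            int     newmax = (a->maxxids > 0) ? a->maxxids * 2 : 8;

            a->xids = (a->xids != NULL)
                ? repalloc(a->xids, newmax * sizeof(TransactionId))
                : MemoryContextAlloc(TopTransactionContext,
                                     newmax * sizeof(TransactionId));
            a->maxxids = newmax;
        }
        /* children subcommit in increasing XID order, so appending keeps
         * the array sorted */
        a->xids[a->nxids++] = xid;
    }

    static bool
    child_xid_member(const ChildXidArray *a, TransactionId xid)
    {
        int     lo = 0;
        int     hi = a->nxids - 1;

        while (lo <= hi)
        {
            int     mid = (lo + hi) / 2;

            if (TransactionIdEquals(a->xids[mid], xid))
                return true;
            if (TransactionIdPrecedes(xid, a->xids[mid]))
                hi = mid - 1;
            else
                lo = mid + 1;
        }
        return false;
    }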
Tom Lane <tgl@sss.pgh.pa.us> writes:
> --- and because the entries are surely added in increasing XID order,
> such an array could be binary-searched.

If they're only added if they write to disk then isn't it possible to
add them out of order?  Start a child transaction, start a child of that
one and write to disk, then exit the grandchild and write to disk in the
first child?

I'm just going on your description, I'm not familiar with this part of
the code at all.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> --- and because the entries are surely added in increasing XID order,
>> such an array could be binary-searched.

> If they're only added if they write to disk then isn't it possible to
> add them out of order?  Start a child transaction, start a child of
> that one and write to disk, then exit the grandchild and write to disk
> in the first child?

No, because we enforce child XID > parent XID.  In the case above, the
child xact would be given an XID when the grandchild needs one --- see
recursion in AssignSubTransactionId().

The actually slightly shaky assumption above is that children of the
same parent xact must subcommit in numerical order ... but as long as we
have strict nesting of subxacts I think this must be so.

			regards, tom lane
On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:

> Anyway, given that there's this one nonobvious gotcha, there might be
> others.  My recommendation is that we take this off the open-items list
> for 8.2 and revisit it in the 8.3 cycle when there's more time.

Well, it's still 8.3, just...

As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
and Scalability", XidCacheRemoveRunningXids() is now the only holder of
an X lock during normal processing, so I would like to remove it.
Here's how:

Currently, we take the lock, remove the subxact and then shuffle down
all the other subxactIds so that the subxact cache is contiguous.

I propose that we simply zero out the subxact entry without re-arranging
the cache; this will be atomic, so we need not acquire an X lock.  We
then increment ndeletedxids.  When we enter a new subxact into the
cache, if ndeletedxids > 0 we scan the cache to find an
InvalidTransactionId that we can use, then decrement ndeletedxids.  So
ndeletedxids is just a hint, not an absolute requirement.  nxids then
becomes the number of cache entries and never goes down until EOXact.
The subxact cache is no longer in order, but then it doesn't need to be
either.

When we take a snapshot we will end up taking a copy of zeroed cache
entries, so the snapshots will be slightly larger than previously,
though still no larger than the max.  The size reduction was not so
large as to make a significant difference across the whole array, so
scalability is the main issue to resolve.

The snapshots will be valid with no change, since InvalidTransactionId
will never match against any recorded Xid.

I would also like to make the size of the subxact cache configurable
with a parameter such as subtransaction_cache_size = 64 (default), valid
range 4-256.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
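A minimal sketch of the cache manipulation Simon proposes, purely as an
illustration of the idea: the structure fields follow the PGPROC subxid
cache seen earlier in the thread, ndeletedxids is his proposed new
field, and the functions shown are not existing code.

    /* On subxact abort: invalidate the slot in place -- no shuffling,
     * no exclusive ProcArrayLock, just an atomic store. */
    static void
    subxid_cache_remove(PGPROC *proc, TransactionId xid)
    {
        int     i;

        for (i = 0; i < proc->subxids.nxids; i++)
        {
            if (TransactionIdEquals(proc->subxids.xids[i], xid))
            {
                proc->subxids.xids[i] = InvalidTransactionId;
                proc->subxids.ndeletedxids++;   /* only a hint */
                return;
            }
        }
    }

    /* On assigning a new subxact XID: reuse a hole if the hint says one
     * may exist, otherwise append; nxids never shrinks before EOXact. */
    static void
    subxid_cache_add(PGPROC *proc, TransactionId xid)
    {
        int     i;

        if (proc->subxids.ndeletedxids > 0)
        {
            for (i = 0; i < proc->subxids.nxids; i++)
            {
                if (!TransactionIdIsValid(proc->subxids.xids[i]))
                {
                    proc->subxids.xids[i] = xid;
                    proc->subxids.ndeletedxids--;
                    return;
                }
            }
        }
        if (proc->subxids.nxids < PGPROC_MAX_CACHED_SUBXIDS)
            proc->subxids.xids[proc->subxids.nxids++] = xid;
        else
            proc->subxids.overflowed = true;    /* fall back to pg_subtrans */
    }

Snapshot readers then simply copy or skip the InvalidTransactionId
entries, which can never match a real XID.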
Simon Riggs <simon@2ndquadrant.com> writes:
> As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
> and Scalability", XidCacheRemoveRunningXids() is now the only holder of
> an X lock during normal processing,

Nonsense.  Main transaction exit also takes an exclusive lock, and is
far more likely to be exercised in typical workloads than a
subtransaction abort.

In any case: there has still not been any evidence presented by anyone
that optimizing XidCacheRemoveRunningXids will help one bit.  Given the
difficulty of measuring any benefit from the last couple of
optimizations in this general area, I'm thinking that such evidence will
be hard to come by.  And we have got way more than enough on our plates
already.  Can we let go of this for 8.3, please?

			regards, tom lane
On Tue, 2007-09-11 at 09:58 -0400, Tom Lane wrote:
> Can we let go of this for 8.3, please?

OK, we've moved forward, so it's a good place to break.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
Is this a TODO?  Tom's reply was:

> Nonsense.  Main transaction exit also takes an exclusive lock, and is
> far more likely to be exercised in typical workloads than a
> subtransaction abort.
>
> In any case: there has still not been any evidence presented by anyone
> that optimizing XidCacheRemoveRunningXids will help one bit.  Given the
> difficulty of measuring any benefit from the last couple of
> optimizations in this general area, I'm thinking that such evidence
> will be hard to come by.  And we have got way more than enough on our
> plates already.  Can we let go of this for 8.3, please?

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:
>
> > Anyway, given that there's this one nonobvious gotcha, there might be
> > others.  My recommendation is that we take this off the open-items
> > list for 8.2 and revisit it in the 8.3 cycle when there's more time.
>
> Well, it's still 8.3, just...
>
> As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
> and Scalability", XidCacheRemoveRunningXids() is now the only holder of
> an X lock during normal processing, so I would like to remove it.
> Here's how:
>
> Currently, we take the lock, remove the subxact and then shuffle down
> all the other subxactIds so that the subxact cache is contiguous.
>
> I propose that we simply zero out the subxact entry without
> re-arranging the cache; this will be atomic, so we need not acquire an
> X lock.  We then increment ndeletedxids.  When we enter a new subxact
> into the cache, if ndeletedxids > 0 we scan the cache to find an
> InvalidTransactionId that we can use, then decrement ndeletedxids.  So
> ndeletedxids is just a hint, not an absolute requirement.  nxids then
> becomes the number of cache entries and never goes down until EOXact.
> The subxact cache is no longer in order, but then it doesn't need to
> be either.
>
> When we take a snapshot we will end up taking a copy of zeroed cache
> entries, so the snapshots will be slightly larger than previously,
> though still no larger than the max.  The size reduction was not so
> large as to make a significant difference across the whole array, so
> scalability is the main issue to resolve.
>
> The snapshots will be valid with no change, since InvalidTransactionId
> will never match against any recorded Xid.
>
> I would also like to make the size of the subxact cache configurable
> with a parameter such as subtransaction_cache_size = 64 (default),
> valid range 4-256.
>
> -- 
>   Simon Riggs
>   2ndQuadrant  http://www.2ndQuadrant.com

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On Wed, 2008-03-12 at 20:13 -0400, Bruce Momjian wrote:
> Is this a TODO?  Tom's reply was:

The general topic, yes. The caveats still apply.

> > Nonsense.  Main transaction exit also takes an exclusive lock, and is
> > far more likely to be exercised in typical workloads than a
> > subtransaction abort.
> >
> > In any case: there has still not been any evidence presented by anyone
> > that optimizing XidCacheRemoveRunningXids will help one bit.  Given the
> > difficulty of measuring any benefit from the last couple of
> > optimizations in this general area, I'm thinking that such evidence
> > will be hard to come by.  And we have got way more than enough on our
> > plates already.  Can we let go of this for 8.3, please?
>
> ---------------------------------------------------------------------------
>
> Simon Riggs wrote:
> > On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:
> >
> > > Anyway, given that there's this one nonobvious gotcha, there might
> > > be others.  My recommendation is that we take this off the
> > > open-items list for 8.2 and revisit it in the 8.3 cycle when
> > > there's more time.
> >
> > Well, it's still 8.3, just...
> >
> > As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
> > and Scalability", XidCacheRemoveRunningXids() is now the only holder
> > of an X lock during normal processing, so I would like to remove it.
> > Here's how:
> >
> > Currently, we take the lock, remove the subxact and then shuffle down
> > all the other subxactIds so that the subxact cache is contiguous.
> >
> > I propose that we simply zero out the subxact entry without
> > re-arranging the cache; this will be atomic, so we need not acquire
> > an X lock.  We then increment ndeletedxids.  When we enter a new
> > subxact into the cache, if ndeletedxids > 0 we scan the cache to find
> > an InvalidTransactionId that we can use, then decrement ndeletedxids.
> > So ndeletedxids is just a hint, not an absolute requirement.  nxids
> > then becomes the number of cache entries and never goes down until
> > EOXact.  The subxact cache is no longer in order, but then it doesn't
> > need to be either.
> >
> > When we take a snapshot we will end up taking a copy of zeroed cache
> > entries, so the snapshots will be slightly larger than previously,
> > though still no larger than the max.  The size reduction was not so
> > large as to make a significant difference across the whole array, so
> > scalability is the main issue to resolve.
> >
> > The snapshots will be valid with no change, since
> > InvalidTransactionId will never match against any recorded Xid.
> >
> > I would also like to make the size of the subxact cache configurable
> > with a parameter such as subtransaction_cache_size = 64 (default),
> > valid range 4-256.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk