Thread: pg locking problem
When running a test with multiple (>=2) users pg 'seizes up' after a few transactions. Cpu goes 100% and disk goes to 0%. This lasts 'forever' (overnight)On the same test all other tested databases don't have this problem. The error occurs with higher tx rate, when transactions bump into each other more frequently. Some 'deadlock detected' messages appear around the hang up time, but _not_ always. Occasionally - but rarely - the seizure looks differently. CPU goes to 0%, disk goes to 0% and, after about one minute, the processing is resumed. The current suspicion is that it is due to difference in lock and deadlock handling between pg and most other dbs. In general I found that I can't set transaction isolation level to READ_UNCOMMITTED (as for other databases), the lowest level is READ_COMMITEED. Any ideas how to reduce this problem? I really want to prove pg perf with 20-100 users and have a problem running 2... It wouldn't be fair (to other db's) to rewrite the test with some pg-only LOCK command etc. I suspect I'm missing something simple, like one of .conf parameters. P.S. Using WinNT/Win2K system, pg 7.1.3 (current cygwin), jdbc driver is jdbc7.1-1.3, cygipc is 1.10-1, java is 1.3.1_01a (current jdk). Default pg installation, except for bumped up memory and 8 wal files.
czl@iname.com (charles) writes: > When running a test with multiple (>=2) users pg 'seizes up' after a > few transactions. Since you haven't told us a thing about what this test is, it's hard to see how you expect to get any useful help ... regards, tom lane
> When running a test with multiple (>=2) users pg 'seizes up' after a > few transactions. Cpu goes 100% and disk goes to 0%. This lasts > 'forever' (overnight)On the same test all other tested databases don't > have this problem. > > The error occurs with higher tx rate, when transactions bump into each > other more frequently. Some 'deadlock detected' messages appear around > the hang up time, but _not_ always. > > Occasionally - but rarely - the seizure looks differently. CPU goes to > 0%, disk goes to 0% and, after about one minute, the processing is > resumed. > > The current suspicion is that it is due to difference in lock and > deadlock handling between pg and most other dbs. In general I found > that I can't set transaction isolation level to READ_UNCOMMITTED (as > for other databases), the lowest level is READ_COMMITEED. > > Any ideas how to reduce this problem? I really want to prove pg perf > with 20-100 users and have a problem running 2... Have you ever tried the test on UNIX (or UNIX like systems)? I have never seen such a problem with PostgreSQL running on UNIX. Also I think PostgreSQL on Win is not ready for practical use... > It wouldn't be fair (to other db's) to rewrite the test with some > pg-only LOCK command etc. I suspect I'm missing something simple, like > one of .conf parameters. > > > P.S. Using WinNT/Win2K system, pg 7.1.3 (current cygwin), jdbc driver > is jdbc7.1-1.3, cygipc is 1.10-1, java is 1.3.1_01a (current jdk). > Default pg installation, except for bumped up memory and 8 wal files. > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org >
For various reasons, not wholly dependent on me, the test should show good perf on Windows. Otherwise Sorry about that. Can't change the platform for this one. I did scan the archives, without finding anything similar - though maybe my search was not thorough enough. I managed to isolate the bug further. 1. Running with read-only transactions the bug does not occur. This means that the bug is not directly related to the number of users (as long as there's more than one). 2. Running with read-only transactions _and_ just _one_ type of a read-write transaction the bug occurs. This means that the bug is not caused by a deadlock - single transaction type always requests the tables in the same order. (Am I right here? i'm sleepy so my thinking is not up to scratch). Anyway, regardless which one of read-write tx types is chosen, the problem occurs. 3. Overall this suggests that, in crude terms, the problem is triggered when reading updated but uncommitted records. Possibly even by one user reading updated uncommitted records of another (since this happens with only two users) 4. The seizure problem manifests itself in high (100%) cpu utilization. Also, about 80% of that cpu utilization is system state. All pg processes (for all users) use about the same amount of cpu time - that is the situation is not caused by one process/user getting out of whack. t-ishii@sra.co.jp (Tatsuo Ishii) wrote in message news:<20011114100112O.t-ishii@sra.co.jp>... > > When running a test with multiple (>=2) users pg 'seizes up' after a > > few transactions. Cpu goes 100% and disk goes to 0%. This lasts > > 'forever' (overnight)On the same test all other tested databases don't > > have this problem. > > > > The error occurs with higher tx rate, when transactions bump into each > > other more frequently. Some 'deadlock detected' messages appear around > > the hang up time, but _not_ always. > > > > Occasionally - but rarely - the seizure looks differently. CPU goes to > > 0%, disk goes to 0% and, after about one minute, the processing is > > resumed. > > > > The current suspicion is that it is due to difference in lock and > > deadlock handling between pg and most other dbs. In general I found > > that I can't set transaction isolation level to READ_UNCOMMITTED (as > > for other databases), the lowest level is READ_COMMITEED. > > > > Any ideas how to reduce this problem? I really want to prove pg perf > > with 20-100 users and have a problem running 2... > > Have you ever tried the test on UNIX (or UNIX like systems)? I have > never seen such a problem with PostgreSQL running on UNIX. Also I > think PostgreSQL on Win is not ready for practical use... > > > It wouldn't be fair (to other db's) to rewrite the test with some > > pg-only LOCK command etc. I suspect I'm missing something simple, like > > one of .conf parameters. > > > > > > P.S. Using WinNT/Win2K system, pg 7.1.3 (current cygwin), jdbc driver > > is jdbc7.1-1.3, cygipc is 1.10-1, java is 1.3.1_01a (current jdk). > > Default pg installation, except for bumped up memory and 8 wal files. > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 6: Have you searched our list archives? > > > > http://archives.postgresql.org > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
On Wed, Nov 14, 2001 at 02:14:33PM -0800, charles wrote: > For various reasons, not wholly dependent on me, the test should show > good perf on Windows. Otherwise Sorry about that. Can't change the > platform for this one. I did scan the archives, without finding > anything similar - though maybe my search was not thorough enough. > > I managed to isolate the bug further. <minor rant mode> Based on your descriptions, I don't think this can really qualify as a bug. Asking PostgreSQL to adapt to the locking scheme optimized for a different RDBMS, when it's underlying locking mechanisms are not only different, but fundamentally _better_ in many contexts, is just silly, not to mention not "fair" to PostgreSQL. I think you would find, (I am, of course, guessing, since you've given few actual details of the tests) that the PG specific rewriting would be to _remove_ LOCK calls. </minor rant mode> Given all that, it is still probably bad behavior for PG to use so much CPU. We're interested in fixing that, but not just in solving your problem. > 1. Running with read-only transactions the bug does not occur. This > means that the bug is not directly related to the number of users (as > long as there's more than one). > > 2. Running with read-only transactions _and_ just _one_ type of a > read-write transaction the bug occurs. This means that the bug is not > caused by a deadlock - single transaction type always requests the > tables in the same order. (Am I right here? i'm sleepy so my thinking > is not up to scratch). Anyway, regardless which one of read-write tx > types is chosen, the problem occurs. > > 3. Overall this suggests that, in crude terms, the problem is > triggered when reading updated but uncommitted records. Possibly even > by one user reading updated uncommitted records of another (since this > happens with only two users) > > 4. The seizure problem manifests itself in high (100%) cpu > utilization. Also, about 80% of that cpu utilization is system state. > All pg processes (for all users) use about the same amount of cpu time > - that is the situation is not caused by one process/user getting out > of whack. > Several people have suggested tests you could run to help isolate the problem: running the identical code against an identical PG database hosted on Unix would tell us if it's a problem with the NT compatibility layer: i.e. cygwin. As Tatsuo Ishii pointed out, this is a likely cause, since _many_ people use PG for heavy duty service under Unix, and NT isn't a primary platform for any of the core developers. But, if your application can trigger the same behavior on a Unix hosted server, you will get a _lot_ of attention, trust me. > > > P.S. Using WinNT/Win2K system, pg 7.1.3 (current cygwin), jdbc driver > > > is jdbc7.1-1.3, cygipc is 1.10-1, java is 1.3.1_01a (current jdk). > > > Default pg installation, except for bumped up memory and 8 wal files. Even if your application _does_ have the same behavior under Unix, the next thing you'd be asked to do is try the latest version, 7.2b2, which would be a good idea anyway, though I don't know if whoever builds the NT binaries has built one yet (hint hint): there's been a lot of bug fixes and code rework since 7.1. Ross
All, Thanks for all your comments. After looking more at the problem and reading your mail I suspect that it may be somewhere at the boundary between cyg-ipc and postgres. Firstly, as you point out, pg is too heavily used on UNIX, hence a bug like that couldn't have gone unnoticed. Secondly, nothing in the testing done so far contradicts that hypothesis. Unfortunately, end of the year is near so I can't put much more into this pg bug - hopefully for the time being. And testing it on UNIX is beyond my time, money and logistics budget on this one, unfortunately. with many thanks and hope to revisit it soon (got some promising numbers on single user test) charles reedstrm@rice.edu ("Ross J. Reedstrom") wrote in message news:<20011115134442.A3811@rice.edu>... > On Wed, Nov 14, 2001 at 02:14:33PM -0800, charles wrote: > > For various reasons, not wholly dependent on me, the test should show > > good perf on Windows. Otherwise Sorry about that. Can't change the > > platform for this one. I did scan the archives, without finding > > anything similar - though maybe my search was not thorough enough. > > > > I managed to isolate the bug further. > > <minor rant mode> > Based on your descriptions, I don't think this can really qualify as a > bug. Asking PostgreSQL to adapt to the locking scheme optimized for a > different RDBMS, when it's underlying locking mechanisms are not only > different, but fundamentally _better_ in many contexts, is just silly, > not to mention not "fair" to PostgreSQL. I think you would find, (I am, > of course, guessing, since you've given few actual details of the tests) > that the PG specific rewriting would be to _remove_ LOCK calls. > </minor rant mode> > > Given all that, it is still probably bad behavior for PG to use so > much CPU. We're interested in fixing that, but not just in solving your > problem. > > > 1. Running with read-only transactions the bug does not occur. This > > means that the bug is not directly related to the number of users (as > > long as there's more than one). > > > > 2. Running with read-only transactions _and_ just _one_ type of a > > read-write transaction the bug occurs. This means that the bug is not > > caused by a deadlock - single transaction type always requests the > > tables in the same order. (Am I right here? i'm sleepy so my thinking > > is not up to scratch). Anyway, regardless which one of read-write tx > > types is chosen, the problem occurs. > > > > 3. Overall this suggests that, in crude terms, the problem is > > triggered when reading updated but uncommitted records. Possibly even > > by one user reading updated uncommitted records of another (since this > > happens with only two users) > > > > 4. The seizure problem manifests itself in high (100%) cpu > > utilization. Also, about 80% of that cpu utilization is system state. > > All pg processes (for all users) use about the same amount of cpu time > > - that is the situation is not caused by one process/user getting out > > of whack. > > > > Several people have suggested tests you could run to help isolate the > problem: running the identical code against an identical PG database > hosted on Unix would tell us if it's a problem with the NT compatibility > layer: i.e. cygwin. As Tatsuo Ishii pointed out, this is a likely cause, > since _many_ people use PG for heavy duty service under Unix, and NT > isn't a primary platform for any of the core developers. But, if your > application can trigger the same behavior on a Unix hosted server, > you will get a _lot_ of attention, trust me. > > > > > P.S. Using WinNT/Win2K system, pg 7.1.3 (current cygwin), jdbc driver > > > > is jdbc7.1-1.3, cygipc is 1.10-1, java is 1.3.1_01a (current jdk). > > > > Default pg installation, except for bumped up memory and 8 wal files. > > Even if your application _does_ have the same behavior under Unix, the > next thing you'd be asked to do is try the latest version, 7.2b2, which > would be a good idea anyway, though I don't know if whoever builds the > NT binaries has built one yet (hint hint): there's been a lot of bug > fixes and code rework since 7.1. > > > Ross > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html