Thread: Seeing context switch storm with 10/13 snapshot of 8.1beta3
I've been having this problem since trying to upgrade from 7.4.1 to
8.0.3, and now 8.1.  It's a dual Xeon machine:

Linux annette.stortek.com 2.4.22-26mdkenterprise #1 SMP Wed Jan 7 07:10:39 MST 2004 i686 unknown unknown GNU/Linux

PostgreSQL version is:

PostgreSQL 8.1beta3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1 (Mandrake Linux 9.2 3.3.1-2mdk)

[creagrs@annette creagrs]$ /usr/local/pgsql810/bin/pg_config
BINDIR = /usr/local/pgsql810/bin
DOCDIR = /usr/local/pgsql810/doc
INCLUDEDIR = /usr/local/pgsql810/include
PKGINCLUDEDIR = /usr/local/pgsql810/include
INCLUDEDIR-SERVER = /usr/local/pgsql810/include/server
LIBDIR = /usr/local/pgsql810/lib
PKGLIBDIR = /usr/local/pgsql810/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql810/man
SHAREDIR = /usr/local/pgsql810/share
SYSCONFDIR = /usr/local/pgsql810/etc
PGXS = /usr/local/pgsql810/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--enable-syslog' '--prefix=/usr/local/pgsql810' '--enable-debug'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wendif-labels -fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,/usr/local/pgsql810/lib
LDFLAGS_SL =
LIBS = -lpgport -lz -lreadline -lncurses -lcrypt -lresolv -lnsl -ldl -lm -lbsd
VERSION = PostgreSQL 8.1beta3

I can reproduce this problem quite quickly with 20 concurrent queries
(< 20 minutes with 5 sets of my app at 4 queries apiece, probably
quicker).  The queries nominally take < .1 second, but when all 5 sets
run, the queries are up around 5-6 seconds, and CS hits over 100k.
Below is a vmstat over a few seconds which shows the problem; it
subsides, then it hits again.  Note that when my application gets up to
5 sets of 8, it never comes down again, and the queries take over 40
seconds to complete.  When run against 7.4.1, the CS problem does not
occur (stays < 10k).
[creagrs@annette creagrs]$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache  si   so   bi   bo   in     cs us sy id wa
10  0   1032  63120     48 1608472   0    0    5   18    3      6 92  8  0  0
 8  0   1032  62584     48 1608480   0    0    0   24  173  97793 39 61  0  0
 9  0   1032  62332     48 1608500   0    0    0   32  197  83089 43 57  0  0
 4  0   1032  61216     48 1608508   0    0    0   16  158  79967 43 57  0  0
 6  0   1032  65208     48 1608556   0    0    0   88  262  91266 52 48  0  0
 5  0   1032  67368     48 1608556   0    0    0  292  235  14234 87 13  0  0
 4  0   1032  65780     48 1608580   0    0    0   40  173    325 95  5  0  0
 4  0   1032  64764     48 1608580   0    0    0    0  144     86 98  2  0  0
 6  0   1032  65800     48 1608580   0    0    0   24  197    397 96  4  0  0
 6  0   1032  66284     48 1608604   0    0    0   64  234   9284 83 17  0  0
 6  0   1032  66104     48 1608604   0    0    0  116  253  87658 56 44  0  0
 5  0   1032  66276     48 1608612   0    0    0   56  245  20648 81 19  0  0
 6  0   1032  65736     48 1608644   0    0    0   88  250  19699 82 18  0  0
 5  0   1032  65316     48 1608652   0    0    0   40  215  73709 70 30  0  0
 5  0   1032  67664     48 1608700   0    0    0   96  272  22222 81 19  0  0
 4  0   1032  66700     48 1608700   0    0    0   84  188    351 93  7  0  0
 4  0   1032  67696     48 1608732   0    0    0   64  210    505 91  9  0  0
 4  0   1032  67696     48 1608732   0    0    0    0  139     86 99  1  0  0
 4  0   1032  64612     48 1608732   0    0    0    0  134     74 99  1  0  0
 5  0   1032  65344     48 1608756   0    0    0   48  206    421 94  6  0  0
 4  0   1032  66636     48 1608764   0    0    0   16  171    232 96  4  0  0
 4  0   1032  67404     48 1608764   0    0    0    0  136     71 100 0  0  0
 7  0   1032  63544     48 1608780   0    0    0   56  239  67250 65 35  0  0
 6  0   1032  65168     48 1608796   0    0    0   56  230  87782 59 41  0  0
 7  0   1032  66588     48 1608820   0    0    0   72  236 106591 42 58  0  0
 8  0   1032  62544     48 1608836   0    0    0  184  244 129696 48 52  0  0
 4  0   1032  63504     48 1608860   0    0    0   72  245 110887 48 52  0  0
 8  0   1032  63400     48 1608868   0    0    0   16  156 101638 35 65  0  0
 8  0   1032  62744     48 1608868   0    0    0   24  193  91159 40 60  0  0
 8  0   1032  63700     48 1608868   0    0    0   16  172  85165 42 58  0  0
 7  0   1032  62888     48 1608876   0    0    0   48  228  94902 42 58  0  0
 4  0   1032  64184     48 1608908   0    0    0   72  234 117585 46 54  0  0
 7  0   1032  61832     48 1608948   0    0    0   72  209 104287 46 54  0  0
 8  0   1032  64468     48 1608956   0    0    0   56  237 112181 44 56  0  0
 7  0   1032  63032     48 1608972   0    0    0   56  236 114977 47 53  0  0
 8  0   1032  63204     48 1608980   0    0    0   40  205 107433 45 55  0  0
 7  0   1032  63832     48 1608996   0    0    0   56  225 113536 40 60  0  0

Help?

Thanks, Rob

--
Robert Creager
Advisory Software Engineer
Data Management Group
Sun Microsystems
Robert.Creager@Sun.com
303.673.2365 Office
888.912.4458 Pager
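A cheap way to catch the storm windows without staring at the terminal
is to filter vmstat's cs column (column 12 in the layout above).  A
minimal sketch, assuming GNU awk for strftime() and a threshold picked
from the quiet-state numbers above:

    # flag any second where context switches exceed 20k/sec
    vmstat 1 | awk 'NR > 2 && $12+0 > 20000 { print strftime("%H:%M:%S"), $0; fflush() }'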
Robert Creager <Robert.Creager@Sun.com> writes:
> I've been having this problem since trying to upgrade from 7.4.1 to
> 8.0.3, and now 8.1.

Can you put together a test case that other people could use to
reproduce it?

			regards, tom lane
When grilled further on (Thu, 13 Oct 2005 22:44:54 -0400),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

> Robert Creager <Robert.Creager@Sun.com> writes:
> > I've been having this problem since trying to upgrade from 7.4.1 to
> > 8.0.3, and now 8.1.
>
> Can you put together a test case that other people could use to
> reproduce it?

I can try.  The data size isn't big, but the scripts that run against
it are, and they are dependent on our development environment.  What
I'll have to do is pull out the db portion of the app and see if I can
simplify it - might work.  The app is a test system that runs against
our big storage libraries.

Is there anything I might be able to do (without the test case) that
would help figure out what's happening?

Cheers,
Rob

--
21:09:29 up 11 days, 12:44, 7 users, load average: 3.87, 4.27, 5.04
Linux 2.6.5-02 #8 SMP Mon Jul 12 21:34:44 MDT 2004
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> Is there anything I might be able to do (without the test case) that
> would help figure out what's happening?

oprofile stats would be enlightening, perhaps.  I'm particularly
interested in why 7.4 is behaving better than newer versions --- that
does not compute ;-).  Have you checked that the same query plan is
being used by 7.4 and the newer versions?  Could you get oprofile
results for both versions?

			regards, tom lane
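Collecting the profile Tom asks for might look like the following - a
sketch assuming oprofile 0.9's opcontrol interface and the install path
shown at the top of the thread; run as root while the storm is
underway, once per server version:

    opcontrol --init                 # load the oprofile kernel module
    opcontrol --setup --no-vmlinux   # sample without kernel symbols
    opcontrol --start
    # ... drive the workload until the CS storm appears in vmstat ...
    opcontrol --dump                 # flush samples to disk
    opreport -l /usr/local/pgsql810/bin/postgres | head -30
    opcontrol --shutdown

Comparing plans between versions is then just a matter of running
EXPLAIN ANALYZE on one of the slow queries under psql against each
server.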
On Thu, 2005-10-13 at 21:20 -0600, Robert Creager wrote:
> When grilled further on (Thu, 13 Oct 2005 22:44:54 -0400),
> Tom Lane <tgl@sss.pgh.pa.us> confessed:
>
> > Can you put together a test case that other people could use to
> > reproduce it?
>
> I can try.  The data size isn't big, but the scripts that run against
> it are, and they are dependent on our development environment.  What
> I'll have to do is pull out the db portion of the app and see if I
> can simplify it - might work.  The app is a test system that runs
> against our big storage libraries.
>
> Is there anything I might be able to do (without the test case) that
> would help figure out what's happening?

Please try this patch and see if it reduces the CS storm:
http://archives.postgresql.org/pgsql-patches/2005-10/msg00091.php

Do you have access to another similar machine to do comparative
testing?  Do you have access to another machine with a different CPU
arch?  It would be good to firmly isolate this to a CPU architecture
interaction issue.

Are the Xeons particularly old models?  How new is the server?

Thanks,

Best Regards, Simon Riggs
On Tue, 18 Oct 2005 00:25:25 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

> Please try this patch and see if it reduces the CS storm:
>
> http://archives.postgresql.org/pgsql-patches/2005-10/msg00091.php

Yes, I will.  I'd been trying to figure out what triggered it, as I was
unable to reproduce it for a while, but it gets there eventually.  I
had just re-started with autovacuum off, as that might be the trigger?
I'll apply the patch, re-build and re-start with autovacuum back on,
the way I'm running when I know I'll see it.

> Do you have access to another similar machine to do comparative
> testing?  Do you have access to another machine with a different CPU
> arch?  It would be good to firmly isolate this to a CPU architecture
> interaction issue.

I do have access to a Windows version of the DL-380, but it's only a
single CPU.  I've a dual AMD server at home, but I cannot run against
the db there (reverse VPN?).  I'd need the reproducibility Tom wanted,
which I have not gotten to yet.

> Are the Xeons particularly old models?  How new is the server?

It is an HP ProLiant (post Compaq merger) DL-380, a couple of years
old.

Thanks,
Rob
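For the record, applying a posted patch and rebuilding with the
configure options shown earlier would go roughly like this (a sketch;
the patch file name is hypothetical and the -p level depends on how the
diff was generated):

    cd postgresql-8.1beta3
    patch -p1 < cs-storm-fix.patch       # or -p0, per the diff's path prefix
    ./configure --enable-syslog --prefix=/usr/local/pgsql810 --enable-debug
    make && make install
    /usr/local/pgsql810/bin/pg_ctl -D "$PGDATA" restart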
On Tue, 18 Oct 2005 10:29:43 -0600
Robert Creager <Robert.Creager@sun.com> wrote:

> On Tue, 18 Oct 2005 00:25:25 +0100
> Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > Please try this patch and see if it reduces the CS storm:

Sorry, didn't work.  Took about an hour, and it's now in the CS storm
(averaging 94k).  I've now disabled autovacuum, just to see if that is
a trigger, and am re-running.

Thanks,
Rob
On Tue, 18 Oct 2005 12:24:03 -0600
Robert Creager <Robert.Creager@sun.com> wrote:

> Sorry, didn't work.  Took about an hour, and it's now in the CS storm
> (averaging 94k).  I've now disabled autovacuum, just to see if that
> is a trigger, and am re-running.

Vacuum (cron or auto) has no impact on the problem for 8.1beta3.

Tom had mentioned running oprofile, but I'm unable to get either the
0.9.1 or 0.9.0 version of oprofile working, so it looks like I'll have
to get a reproducible test case...

Anyone know of a script that can replay a PostgreSQL log file?  Then I
could log all queries, wait till the problem hits, and then replay to
see if that reproduces it...

Cheers,
Rob
Robert,

> Anyone know of a script that can replay a PostgreSQL log file?  Then
> I could log all queries, wait till the problem hits, and then replay
> to see if that reproduces it...

log_statement = 'all' in your postgresql.conf file (the boolean
log_statement = true was the 7.4 spelling).  The trick is weeding out
all the other non-query stuff.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
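Weeding the statements back out of the log and replaying them could
look something like this sketch - it assumes 8.1's log line format with
an empty log_line_prefix, single-line statements, and a hypothetical
database name:

    # with log_statement = 'all' set in postgresql.conf
    grep 'LOG:  statement:' postmaster.log \
        | sed 's/.*LOG:  statement: //; s/$/;/' \
        > replay.sql
    psql -f replay.sql testdb        # replay against a scratch copy of the db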
On Tue, 2005-10-18 at 12:24 -0600, Robert Creager wrote:
> > Simon Riggs <simon@2ndquadrant.com> wrote:
> > > Please try this patch and see if it reduces the CS storm:
>
> Sorry, didn't work.  Took about an hour, and it's now in the CS storm
> (averaging 94k).  I've now disabled autovacuum, just to see if that
> is a trigger, and am re-running.

OK, thanks.  There was a possibility that there was some issue with the
spinlock code inadvertently grabbing a cache line with another lock in
it.

Try this to recreate the problem:
http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php

Best Regards, Simon Riggs
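The linked message carries the authoritative recipe; the general shape
of that kind of test - a sketch, not the linked script itself; table
name, row count and query are illustrative, and generate_series assumes
8.0+ - is a small table hammered by concurrent SELECT loops:

    psql -c "CREATE TABLE test_data (f1 int);
             INSERT INTO test_data SELECT * FROM generate_series(1, 10000);"
    for i in 1 2 3; do               # one loop per client; add clients to raise CS
        ( while true; do
              psql -Atq -c "SELECT count(*) FROM test_data" > /dev/null
          done ) &
    done
    vmstat 1                         # watch the cs column as clients are added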
On Thu, 20 Oct 2005 21:19:18 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

> Try this to recreate the problem:
> http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php

Yup, that does it.  Three clients hit the level I see with my
application, ~100k.  Two hit about 50k; one does nothing (< 1k).

Interesting.  7.4.1 is worse for this test, as two clients jump up to
130k.  But my app runs fine against 7.4.1...

Would it still be helpful to try and pull together a test case from my
app against 8.1beta3?  I have not worked on it, and don't know when I'd
get time to (since I have a workaround).

Cheers,
Rob
Robert Creager <Robert.Creager@Sun.com> writes:
> Interesting.  7.4.1 is worse for this test, as two clients jump up to
> 130k.  But my app runs fine against 7.4.1...

> Would it still be helpful to try and pull together a test case from
> my app against 8.1beta3?

Yes, if you can show a case where 8.1 is much worse than 7.4 then we
need to know about it yesterday.

			regards, tom lane
On Thu, 2005-10-20 at 14:59 -0600, Robert Creager wrote:
> On Thu, 20 Oct 2005 21:19:18 +0100
> Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > Try this to recreate the problem:
> > http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php
>
> Yup, that does it.  Three clients hit the level I see with my
> application, ~100k.  Two hit about 50k; one does nothing (< 1k).

OK, good.  IYKWIM

Can you try a slight modification?

Run 3 threads, but against 3 different otherwise identical test tables
created using a name-only mod of the test script, e.g. test_data1, 2
and 3.  This will hit a different pattern of lwlocks.

If the CS is the same, then it will tell us that the issue is not data
dependent.  If the CS drops, it tells us that it is an activity
performed on the precise data blocks, rather than on the shared data
structures, that is the issue.  That would then account for why the
effect appears to come and go in your own application, because the
effect is actually dependent on the data distribution (which presumably
varies in your tables).

Just trying to more tightly bracket the failure-case behaviour....

Best Regards, Simon Riggs
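Once the three tables exist, driving Simon's variant might look like
the following sketch, reusing the loop shape from the earlier test with
one client pinned to each table:

    for n in 1 2 3; do
        ( while true; do
              psql -Atq -c "SELECT count(*) FROM test_data$n" > /dev/null
          done ) &
    done
    vmstat 1    # same CS level => not data dependent; lower => block-level effect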
On Thu, 20 Oct 2005 23:28:21 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

> Run 3 threads, but against 3 different otherwise identical test
> tables created using a name-only mod of the test script, e.g.
> test_data1, 2 and 3.

I modified the setup to create the first table, then selected it into
the second two tables, all in the same database.  I created three
unique indexes (when I didn't, there was no CS issue).

> If the CS is the same, then it will tell us that the issue is not
> data dependent.  If the CS drops, it tells us that it is an activity
> performed on the precise data blocks, rather than on the shared data
> structures, that is the issue.  That would then account for why the
> effect appears to come and go in your own application, because the
> effect is actually dependent on the data distribution (which
> presumably varies in your tables).

The CS is the same on both 7.4.1 and 8.1b3 (with 8.1b3 showing 100k and
7.4.1 showing 170k using three clients).  I double-checked the scripts
to make sure...

Cheers,
Rob
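In SQL terms, the setup Robert describes comes out roughly as follows
(a sketch; the populated test_data1 and its f1 column come from the
name-modified test script):

    psql <<'EOF'
    -- clone the populated table twice within the same database
    SELECT * INTO test_data2 FROM test_data1;
    SELECT * INTO test_data3 FROM test_data1;
    -- without these unique indexes, no CS storm was observed
    CREATE UNIQUE INDEX test_data1_f1 ON test_data1 (f1);
    CREATE UNIQUE INDEX test_data2_f1 ON test_data2 (f1);
    CREATE UNIQUE INDEX test_data3_f1 ON test_data3 (f1);
    EOF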
On Thu, 2005-10-20 at 16:54 -0600, Robert Creager wrote:
> The CS is the same on both 7.4.1 and 8.1b3 (with 8.1b3 showing 100k
> and 7.4.1 showing 170k using three clients).  I double-checked the
> scripts to make sure...

So that means it is either a generic issue, or likely an lwlock issue
on the main named locks.  The drop in CS between releases is consistent
with the overall drop in contention resulting from the redesign of the
buffer manager.  It does definitely look like spinlock issues on that
platform.

It would be good to test this on a much newer system.  Tuning for older
hardware doesn't seem that worthwhile (no shouting, please!).

It would be good right now to have a multi-process test harness that
would allow us to test out different spinlock code without the rest of
PostgreSQL getting in the way of testing.  If we can isolate the issue
outside of PostgreSQL it will be much easier to fix.

Any hackers out there looking to set up a spinlock test harness?

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> It would be good right now to have a multi-process test harness that
> would allow us to test out different spinlock code without the rest
> of PostgreSQL getting in the way of testing.  If we can isolate the
> issue outside of PostgreSQL it will be much easier to fix.

Actually I disagree with that, on three grounds:

1. Optimizing for an artificial environment may result in the wrong
optimization.

2. I don't think you'll be able to prove anything except that all SMP
designs ultimately suck.  There is no hardware on the planet that can
trade cache lines back and forth at instruction-dispatch rates.

3. The problem right now is not lack of ability to reproduce the
problem, it is lack of ideas how to fix it.  Building an artificial
testbed is just directing effort into make-work rather than towards
solving the problem.

			regards, tom lane
On Fri, 2005-10-21 at 09:52 -0400, Tom Lane wrote:
> Actually I disagree with that, on three grounds:
>
> 1. Optimizing for an artificial environment may result in the wrong
> optimization.
>
> 2. I don't think you'll be able to prove anything except that all SMP
> designs ultimately suck.  There is no hardware on the planet that can
> trade cache lines back and forth at instruction-dispatch rates.
>
> 3. The problem right now is not lack of ability to reproduce the
> problem, it is lack of ideas how to fix it.  Building an artificial
> testbed is just directing effort into make-work rather than towards
> solving the problem.

If we think spinlocks are the problem, building a spinlock test harness
will prove that and also simplify the testing of a solution.  Isolating
the spinlocks in that way is not artificial but actually a very pure
test, so although I agree with (1) as a general statement, it does not
apply to the test harness proposal.

I was seeing the problem as likely to rear its head again over time.
We are entering a stage of increased importance of SMP code, since
within a few short years all CPUs will be dual core/HT or something
similar.  We may fix it for one platform, but other similar problems
may re-emerge.  The easiest way to test spinlock code on any platform
is to get an isolated test case that is runnable outside of the context
of PostgreSQL, yet uses the pg spinlock code.  That would allow us to
bring in other people and their many eyeballs to look at the issue; we
on this list are not experts at everything.

Certainly there is a lack of ideas as to how to fix it, as you mention
in (3).  This shows me that the solution lies in one of two areas:
a) the solution has not yet been considered, or b) the solution has
already been thought of and for whatever reason disregarded.  You may
be certain that the solution lies in a), though I am not.  Rejecting
ideas quickly may simply increase the chances of finding the solution
in a b) case.

Forgive me, but (2) seems spurious.  Nobody said anything about trading
cache lines at instruction-dispatch rates.  The objective of the test
harness would be to check whether negative effects such as CS storms
exist on a given platform.  Actual optimization of the spinlock
mechanisms in the context of the PostgreSQL server should certainly be
done within the code.

In a more general sense: what is the best next action to make progress
on the CS issue that exists for certain CPUs?

Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Certainly there is a lack of ideas as to how to fix it, as you
> mention in (3).  This shows me that the solution lies in one of two
> areas: a) the solution has not yet been considered, or b) the
> solution has already been thought of and for whatever reason
> disregarded.  You may be certain that the solution lies in a), though
> I am not.  Rejecting ideas quickly may simply increase the chances of
> finding the solution in a b) case.

However, building a spinlock test harness presupposes that the solution
lies in the spinlock code itself, and not in (say) changing our usage
patterns of it.  So I'd throw the "rejecting ideas too quickly"
challenge right back at you.  What we need to optimize is the behavior
in the real context of Postgres, not in a test harness.

			regards, tom lane
Tom Lane wrote:
> However, building a spinlock test harness presupposes that the
> solution lies in the spinlock code itself, and not in (say) changing
> our usage patterns of it.  So I'd throw the "rejecting ideas too
> quickly" challenge right back at you.  What we need to optimize is
> the behavior in the real context of Postgres, not in a test harness.

How do other databases deal with this?  I can't imagine we are the only
ones.  Are we doing something different than them?

--
Bruce Momjian                     |  http://candle.pha.pa.us
pgman@candle.pha.pa.us            |  (610) 359-1001
+  If your life is a hard drive,  |  13 Roberts Road
+  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
On Fri, 2005-10-21 at 17:17 -0400, Tom Lane wrote:
> However, building a spinlock test harness presupposes that the
> solution lies in the spinlock code itself

(I thought I had said) that was actually the main test, rather than a
presupposition.  The solution may lie there or not, and once the
harness is built we would be able to determine where it is.  If the
issue is with the spinlock code itself, it will then help us find the
solution.  If it is not, then we will know to look elsewhere.  It is
important that we isolate the problem in such a way that others can
help fix it.  If we do it only within the database, then the special
context will force many to say they cannot help us.

> So I'd throw the "rejecting ideas too quickly" challenge right back
> at you.

My words were inspired by a desire to leave the Doldrums only, not to
cause other issues.  I apologise for any offence given.

> What we need to optimize is the behavior in the real context of
> Postgres, not in a test harness.

That is the end goal, yes.  My sub-goal on that path is problem
isolation (only).

Best Regards, Simon Riggs
Robert Creager <Robert.Creager@Sun.com> writes:
> I have a Perl script, a Perl module and a 1Mb database (from pg_dump
> -F c).  Are you interested at this time in receiving this?

Sure.  Please send it off-list, of course.

> An upgrade to RC2 might occur when RC2 comes out, unless there would
> be great benefit in running my tests against 8.1RC1.

There isn't going to be an RC2, barring catastrophic problems being
identified *very* shortly ...

			regards, tom lane
On Thu, 20 Oct 2005 17:35:31 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > Would it still be helpful to try and pull together a test case from
> > my app against 8.1beta3?
>
> Yes, if you can show a case where 8.1 is much worse than 7.4 then we
> need to know about it yesterday.

Ok.  I finally have a stand-alone test case, and it does not show that
8.1 is worse than 7.4 - it shows the opposite quite nicely ;-).  In the
tests I'm running, 7.4.1 bogs down immediately and never recovers.  8.1
behaves itself for about 20 minutes before it goes out to lunch.  So in
those terms it is worse, in that it looks fine at the start but then
gets progressively worse; overall, though, it's much better.

I have a Perl script, a Perl module and a 1Mb database (from pg_dump
-F c).  Are you interested at this time in receiving this?

I plan on taking it home and running against my AMD system to see what
it does, but that will be against 8.0 if I remember correctly, maybe
8.0.3.  An upgrade to RC2 might occur when RC2 comes out, unless there
would be great benefit in running my tests against 8.1RC1.

Cheers,
Rob