Thread: data on devel code perf dip
I have an example of runs that illustrate a performance problem that occurred between installing the 7/18 and 8/1 development release codes.

I'm running on a PPC64 8-way system with 16GB of memory (on a 15GB virtual machine), using Linux 2.6.11/gentoo. The workload is DBT2 configured with 16 warehouses and 16 db connections, no think time.

Overview:

Before picture:
    run 42, code updated from CVS July 18 12:18pm Pacific
    metric: 8374.71 NOTPM

After picture (same config params, db size, etc):
    run 59, code downloaded from CVS (all of it) Aug 1 9:58am Pacific
    metric: 4787.61 NOTPM

Oprofile data is available for 59 (top part copied at the end of this message) but not for 42 (although the binary is still around and I could in theory rerun and get it).

I'd like to first get back to the throughput of run 42, and then remove whatever is keeping me from getting higher than 45% user with the original July 18 drop.

************************************************************
Here are more details:

(RUN 42) The previous performance: 8374.71 new-order transactions per minute (NOTPM) and %user around 45% on 8 processors. Even this level of user utilization is not good, as there is still some bandwidth left in the log device (not much, but some).
http://developer.osdl.org/maryedie/DBT2_PGSQL/42/

We got this development code on July 18 12:18pm Pacific, tuned it, and got to run 42 on July 26.

(RUN 59) This is an example of a run with code downloaded (whole) on Aug 1 9:58am Pacific: 4787.61 new-order transactions per minute (NOTPM) and ~22% user over 8 cpus. This has the same config parameters as 42, with autovacuum off. However, I turned on oprofile for this run (runs without oprofile are equivalent).
http://developer.osdl.org/maryedie/DBT2_PGSQL/59/

Top entries for PostgreSQL in the oprofile are copied below.

CPU: ppc64 POWER5, speed 1656.38 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00 (No unit mask) count 100000

samples    %        app name         symbol name
164623113  70.5372  kernel-2.6.11.3  .shared_idle
6641569     2.8458  libc-2.3.4.so    (no symbols)
4011874     1.7190  postgres         .AllocSetAlloc
2798355     1.1990  postgres         .CreateLWLocks
1731687     0.7420  postgres         .SearchCatCache
1701936     0.7292  postgres         .MemoryContextAllocZeroAligned
1509903     0.6470  postgres         .FunctionCall2
1320467     0.5658  postgres         .hash_search
1275278     0.5464  oprofiled        (no symbols)
1271477     0.5448  postgres         .yyparse
1099396     0.4711  postgres         .OpernameGetCandidates
794802      0.3406  postgres         .expression_tree_walker
731969      0.3136  postgres         .MemoryContextAlloc
730266      0.3129  postgres         ._bt_compare
709793      0.3041  postgres         .DirectFunctionCall1
689405      0.2954  postgres         .base_yylex
683776      0.2930  postgres         .hash_any
635680      0.2724  postgres         .DLMoveToFront
632098      0.2708  kernel-2.6.11.3  .___dst_free
562111      0.2409  kernel-2.6.11.3  .sys_setrlimit
508608      0.2179  postgres         .AtEOXact_CatCache
499318      0.2139  postgres         .heap_release_fetch
498136      0.2134  kernel-2.6.11.3  .tty_write
494020      0.2117  postgres         .XLogInsert

--
Mary Edie Meredith
Initiative Manager
Open Source Development Labs
maryedie@hotmail.com
503-906-1942
Mary Edie Meredith <maryedie@osdl.org> writes:
> I have an example of runs that illustrate a performance
> problem that occurred between installing the 7/18 and 8/1
> development release codes.

I dug through the CVS logs to see what had changed, and I'm afraid there is just one plausible-looking candidate:

2005-07-28 23:22  momjian

	* src/backend/access/transam/xlog.c: Use O_DIRECT if available
	when using O_SYNC for wal_sync_method.  Also, write multiple WAL
	buffers out in one write() operation.  ITAGAKI Takahiro

Most of the CVS activity in that time period had to do with stuff like roles and the interval datatype. It's conceivable that these things had some marginal performance cost, but if so I'd have expected it to show up as extra CPU effort (more time checking permissions, say). This figure:

> samples    %        app name         symbol name
> 164623113  70.5372  kernel-2.6.11.3  .shared_idle

says pretty clearly that your problem is all I/O wait, and there are no other commits that might have increased our tendency to wait for I/O.

I am sure I will get some pushback if I propose reverting the O_DIRECT patch, so could you try to get some more-specific evidence? Like pull the CVS tree from just before and just after this patch and compare performance?

BTW I did check that both runs are using wal_sync_method = fdatasync and wal_buffers = 1000, so it's not a problem of those parameters having been changed by the patch.

			regards, tom lane
On Thu, Aug 11, 2005 at 09:02:03PM -0400, Tom Lane wrote:

> I am sure I will get some pushback if I propose reverting the O_DIRECT
> patch, so could you try to get some more-specific evidence?  Like pull
> the CVS tree from just before and just after this patch and compare
> performance?

Quoth the open(2) manpage:

       O_DIRECT
              Try to minimize cache effects of the I/O to and from this
              file.  In general this will degrade performance, but it is
              useful in special situations, such as when applications do
              their own caching.  File I/O is done directly to/from user
              space buffers.  The I/O is synchronous, i.e., at the
              completion of the read(2) or write(2) system call, data is
              guaranteed to have been transferred.

In light of this, may I ask whether it makes sense to compare the performance of two runs with similar shared_buffers settings? With O_DIRECT, I understand from this manpage that the OS is going to do little or no page caching, so shared_buffers should be increased to account for this fact.

Am I missing something?

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"One must remember that existence in the cosmos, and particularly the
development of civilizations within it, are, unfortunately, far from
idyllic" (Ijon Tichy)
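For context, here is a minimal sketch of what O_DIRECT asks of the caller on Linux, per the manpage quoted above: the flag is passed at open() time, and the user buffer (and usually the offset and transfer length) must be block-aligned, which is why the xlog.c patch forces alignment of the WAL buffers. The file name, alignment, and 8K transfer size below are illustrative assumptions, not values taken from the patch.

    /* Illustrative only: an O_DIRECT write with an aligned buffer (not PostgreSQL code). */
    #define _GNU_SOURCE             /* needed on Linux for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define XFER_SIZE 8192          /* assumed WAL block size for this sketch */

    int
    main(void)
    {
        void   *buf;
        int     fd;

        /* O_DIRECT typically requires the user buffer to be block-aligned. */
        if (posix_memalign(&buf, 4096, XFER_SIZE) != 0)
            return 1;
        memset(buf, 0, XFER_SIZE);

        /* Hypothetical file name; O_DIRECT bypasses the kernel page cache. */
        fd = open("walfile.test", O_WRONLY | O_CREAT | O_DIRECT, 0600);
        if (fd < 0)
            return 1;

        /* The write goes straight to the device; offset and length must be aligned too. */
        if (write(fd, buf, XFER_SIZE) != XFER_SIZE)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }

Because O_DIRECT bypasses the kernel page cache entirely, it is mainly attractive for files the server syncs itself, which is why the replies below stress that it was wired up only for the WAL.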
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> In light of this, may I ask whether it makes sense to compare the
> performance of two runs with similar shared_buffers settings?  With
> O_DIRECT, I understand from this manpage that the OS is going to do
> little or no page caching, so shared_buffers should be increased to
> account for this fact.

> Am I missing something?

O_DIRECT is only being used for WAL page writes (or I sure hope so anyway), so shared_buffers should be irrelevant.

			regards, tom lane
Tom Lane wrote:
> Most of the CVS activity in that time period had to do with stuff like
> roles and the interval datatype.  It's conceivable that these things
> had some marginal performance cost, but if so I'd have expected it to
> show up as extra CPU effort (more time checking permissions, say).
> This figure:
>
> > samples    %        app name         symbol name
> > 164623113  70.5372  kernel-2.6.11.3  .shared_idle
>
> says pretty clearly that your problem is all I/O wait, and there are
> no other commits that might have increased our tendency to wait for I/O.
>
> I am sure I will get some pushback if I propose reverting the O_DIRECT
> patch, so could you try to get some more-specific evidence?  Like pull
> the CVS tree from just before and just after this patch and compare
> performance?
>
> BTW I did check that both runs are using wal_sync_method = fdatasync
> and wal_buffers = 1000, so it's not a problem of those parameters having
> been changed by the patch.

We can supply a patch with just the O_DIRECT change for you to test. The O_DIRECT patch also had grouped WAL writes, so that might be an issue too. Also, O_DIRECT is only used for the open_* wal_sync_method settings, so I don't see how it would affect this, but the grouped WAL writes might.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
On 2005-08-12, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> In light of this, may I ask whether it makes sense to compare the
> performance of two runs with similar shared_buffers settings?  With
> O_DIRECT, I understand from this manpage that the OS is going to do
> little or no page caching, so shared_buffers should be increased to
> account for this fact.
>
> Am I missing something?

O_DIRECT should only be used for the WAL, not for buffer I/O.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > In light of this, may I ask whether it makes sense to compare the
> > performance of two runs with similar shared_buffers settings?  With
> > O_DIRECT, I understand from this manpage that the OS is going to do
> > little or no page caching, so shared_buffers should be increased to
> > account for this fact.
>
> > Am I missing something?
>
> O_DIRECT is only being used for WAL page writes (or I sure hope so
> anyway), so shared_buffers should be irrelevant.

Uh, O_DIRECT is really only enabled when open_sync is used, and I assume that is not used for writing dirty buffers during a checkpoint.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Andrew - Supernews wrote:
> On 2005-08-12, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > In light of this, may I ask whether it makes sense to compare the
> > performance of two runs with similar shared_buffers settings?  With
> > O_DIRECT, I understand from this manpage that the OS is going to do
> > little or no page caching, so shared_buffers should be increased to
> > account for this fact.
> >
> > Am I missing something?
>
> O_DIRECT should only be used for the WAL, not for buffer I/O.

And only when open_sync or open_datasync is used. The grouped-write part of the patch, however, is used in all cases, though again only for WAL.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> O_DIRECT is only being used for WAL page writes (or I sure hope so
>> anyway), so shared_buffers should be irrelevant.

> Uh, O_DIRECT is really only enabled when open_sync is used, and I assume
> that is not used for writing dirty buffers during a checkpoint.

I double-checked that O_DIRECT is really just used for WAL, and only when the sync mode is open_sync or open_datasync. So it seems impossible that it affected a run with mode fdatasync. What seems the best theory at the moment is that the grouped-WAL-write part of the patch doesn't work so well as we thought.

			regards, tom lane
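To make the point above concrete, here is a hedged sketch of the flag-selection behavior being described: O_DIRECT would only be added to the WAL file's open flags for the open_sync and open_datasync methods, so a run with wal_sync_method = fdatasync never opens the WAL with O_DIRECT. The enum and function names are illustrative assumptions, not the actual xlog.c symbols.

    /*
     * Illustrative sketch of the behavior described in this thread: O_DIRECT
     * is added to the WAL file's open flags only for the open_* sync methods.
     * Names here are assumptions for illustration, not xlog.c's own.
     */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>

    typedef enum
    {
        WAL_SYNC_FSYNC,
        WAL_SYNC_FDATASYNC,
        WAL_SYNC_OPEN_SYNC,         /* wal_sync_method = open_sync     -> O_SYNC  */
        WAL_SYNC_OPEN_DATASYNC      /* wal_sync_method = open_datasync -> O_DSYNC */
    } WalSyncMethod;

    int
    wal_open_flags(WalSyncMethod method)
    {
        int     flags = O_RDWR;

        switch (method)
        {
            case WAL_SYNC_OPEN_SYNC:
                flags |= O_SYNC;
    #ifdef O_DIRECT
                flags |= O_DIRECT;  /* only the open_* methods get O_DIRECT */
    #endif
                break;
            case WAL_SYNC_OPEN_DATASYNC:
    #ifdef O_DSYNC
                flags |= O_DSYNC;
    #endif
    #ifdef O_DIRECT
                flags |= O_DIRECT;
    #endif
                break;
            default:
                /* fsync/fdatasync methods: plain open(); syncing happens later */
                break;
        }
        return flags;
    }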
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> >> anyway), so shared_buffers should be irrelevant.
>
> > Uh, O_DIRECT is really only enabled when open_sync is used, and I assume
> > that is not used for writing dirty buffers during a checkpoint.
>
> I double-checked that O_DIRECT is really just used for WAL, and only
> when the sync mode is open_sync or open_datasync.  So it seems
> impossible that it affected a run with mode fdatasync.  What seems the
> best theory at the moment is that the grouped-WAL-write part of the
> patch doesn't work so well as we thought.

Yes, that's my only guess. Let us know if you want the patch to test, rather than pulling CVS before and after the patch was applied.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
Bruce Momjian <pgman@candle.pha.pa.us> wrote:

> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > >> anyway), so shared_buffers should be irrelevant.
> >
> > > Uh, O_DIRECT is really only enabled when open_sync is used, and I assume
> > > that is not used for writing dirty buffers during a checkpoint.
> >
> > I double-checked that O_DIRECT is really just used for WAL, and only
> > when the sync mode is open_sync or open_datasync.  So it seems
> > impossible that it affected a run with mode fdatasync.  What seems the
> > best theory at the moment is that the grouped-WAL-write part of the
> > patch doesn't work so well as we thought.
>
> Yes, that's my only guess. Let us know if you want the patch to test,
> rather than pulling CVS before and after the patch was applied.

Yeah, a patch would be a little easier. :)

Thanks,
Mark
Mary Edie Meredith <maryedie@osdl.org> writes:
> I'm still very concerned about what I'm seeing in the oprofile:
> namely: .CreateLWLocks is the second highest entry for postgres.
> http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt

This says there's something wrong with your oprofile setup. CreateLWLocks is run *once* at postmaster startup, and it doesn't take all that long. Check to see if maybe your executables are out of sync with what oprofile is looking at?

			regards, tom lane
On Thu, 2005-08-11 at 22:11 -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > >> anyway), so shared_buffers should be irrelevant.
> >
> > > Uh, O_DIRECT is really only enabled when open_sync is used, and I assume
> > > that is not used for writing dirty buffers during a checkpoint.
> >
> > I double-checked that O_DIRECT is really just used for WAL, and only
> > when the sync mode is open_sync or open_datasync.  So it seems
> > impossible that it affected a run with mode fdatasync.  What seems the
> > best theory at the moment is that the grouped-WAL-write part of the
> > patch doesn't work so well as we thought.
>
> Yes, that's my only guess. Let us know if you want the patch to test,
> rather than pulling CVS before and after the patch was applied.

I tried the pull we have from 8/28/05, which should not have the WAL O_DIRECT patch or the WAL grouping patch (I believe markw checked this for me). Unfortunately the performance is just as bad as run 59. So it looks like those patches are not my immediate problem.

While you think about that, I'm going to pull the development code from July 18 and run it again with oprofile.

I'm still very concerned about what I'm seeing in the oprofile: namely, .CreateLWLocks is the second highest entry for postgres.
http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt

It is still the second highest entry in the run I just did. Why are there so many "Creates"? Shouldn't there just be mostly calls that grab a LW lock and release it? Mark and I looked through the code at what the CreateLWLocks subroutine does and found the note inserted below. Since the connections are established at the very beginning, and this run went on for 30 more minutes, I don't understand why so many creates happened. Can you explain this?

 * This is called by the postmaster or by a standalone backend.
 * It is also called by a backend forked from the postmaster in the
 * EXEC_BACKEND case.  In the latter case, the shared memory segment
 * already exists and has been physically attached to, but we have to
 * initialize pointers in local memory that reference the shared structures,
 * because we didn't inherit the correct pointer values from the postmaster
 * as we do in the fork() scenario.  The easiest way to do that is to run
 * through the same code as before.  (Note that the called routines mostly
 * check IsUnderPostmaster, rather than EXEC_BACKEND, to detect this case.
 * This is a bit code-wasteful and could be cleaned up.)
 *
 * If "makePrivate" is true then we only need private memory, not shared
 * memory.  This is true for a standalone backend, false for a postmaster.

TIA.

--
Mary Edie Meredith
Initiative Manager
Open Source Development Labs
maryedie@hotmail.com
503-906-1942
On Tue, 2005-08-16 at 18:53 -0400, Tom Lane wrote:
> Mary Edie Meredith <maryedie@osdl.org> writes:
> > I'm still very concerned about what I'm seeing in the oprofile:
> > namely: .CreateLWLocks is the second highest entry for postgres.
> > http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt
>
> This says there's something wrong with your oprofile setup.
> CreateLWLocks is run *once* at postmaster startup, and it doesn't
> take all that long.  Check to see if maybe your executables are
> out of sync with what oprofile is looking at?
>
> 			regards, tom lane

OK, Tom, sounds correct. Thanks.
On Tue, 16 Aug 2005 18:53:55 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Mary Edie Meredith <maryedie@osdl.org> writes:
> > I'm still very concerned about what I'm seeing in the oprofile:
> > namely: .CreateLWLocks is the second highest entry for postgres.
> > http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt
>
> This says there's something wrong with your oprofile setup.
> CreateLWLocks is run *once* at postmaster startup, and it doesn't
> take all that long.  Check to see if maybe your executables are
> out of sync with what oprofile is looking at?

It is a POWER5 platform, if that has anything to do with it. It certainly doesn't look sane, but for user apps oprofile is supposed to be getting the symbols directly from the binaries it's using. I'm inclined to blame the platform support. ;)

Mark
On Tue, 2005-08-16 at 16:07 -0700, Mark Wong wrote:
> On Tue, 16 Aug 2005 18:53:55 -0400
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Mary Edie Meredith <maryedie@osdl.org> writes:
> > > I'm still very concerned about what I'm seeing in the oprofile:
> > > namely: .CreateLWLocks is the second highest entry for postgres.
> > > http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt
> >
> > This says there's something wrong with your oprofile setup.
> > CreateLWLocks is run *once* at postmaster startup, and it doesn't
> > take all that long.  Check to see if maybe your executables are
> > out of sync with what oprofile is looking at?
>
> It is a POWER5 platform, if that has anything to do with it.  It
> certainly doesn't look sane, but for user apps oprofile is supposed to
> be getting the symbols directly from the binaries it's using.  I'm
> inclined to blame the platform support. ;)

Could it have to do with the fact that I'm running in a Virtual Machine? Maybe they forgot to change the symbols to reflect the VM's "virtual physical address", if you get my drift. Who do you know on the IBM side, Mark, who could help us sort this out?

>
> Mark
> On Tue, 2005-08-16 at 16:07 -0700, Mark Wong wrote:
>> On Tue, 16 Aug 2005 18:53:55 -0400
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> > Mary Edie Meredith <maryedie@osdl.org> writes:
>> > > I'm still very concerned about what I'm seeing in the oprofile:
>> > > namely: .CreateLWLocks is the second highest entry for postgres.
>> > > http://developer.osdl.org/maryedie/DBT2_PGSQL/59/oprofile.txt
>> >
>> > This says there's something wrong with your oprofile setup.
>> > CreateLWLocks is run *once* at postmaster startup, and it doesn't
>> > take all that long.  Check to see if maybe your executables are
>> > out of sync with what oprofile is looking at?
>>
>> It is a POWER5 platform, if that has anything to do with it.  It
>> certainly doesn't look sane, but for user apps oprofile is supposed to
>> be getting the symbols directly from the binaries it's using.  I'm
>> inclined to blame the platform support. ;)
>
> Could it have to do with the fact that I'm running in a Virtual Machine?
> Maybe they forgot to change the symbols to reflect the VM's "virtual
> physical address", if you get my drift.  Who do you know on the IBM side,
> Mark, who could help us sort this out?

Not likely; that would imply the system knew it was running in a virtual machine, and you're not supposed to know you're in a virtual machine. I don't know who at IBM to ask about the hypervisor. As for oprofile, the oprofile list would be a good place to start.

Mark
I wrote:
> Mary Edie Meredith <maryedie@osdl.org> writes:
>> I have an example of runs that illustrate a performance
>> problem that occurred between installing the 7/18 and 8/1
>> development release codes.

> I dug through the CVS logs to see what had changed, and I'm afraid there
> is just one plausible-looking candidate:

> * src/backend/access/transam/xlog.c:
> Use O_DIRECT if available when using O_SYNC for wal_sync_method.
> Also, write multiple WAL buffers out in one write() operation.

I've been sniffing around that patch and not really finding any smoking gun about why it would make things slower when you're not using O_DIRECT.

I noticed that it forces 8K alignment of the WAL buffers on any machine that has O_DIRECT defined, whether you use O_DIRECT or not --- but it's pretty hard to see why that would make things slower. (Indeed, the older code only guaranteed MAXALIGN alignment of the WAL buffers, and we *know* that's wrong --- on Intel machines you want cache-line alignment to make kernel-to-userspace transfers fast. BTW, does anyone know if the current definition of BUFFERALIGN == 32 bytes needs to be increased for newer Intel machines?)

I've got a bunch of minor gripes about the coding style of the gather-write part of the patch, but I really can't see that it's causing any performance issue. AFAICS it just replaces several closely spaced write() syscalls with one larger one. It's not possible that the kernel you're using is less efficient for larger transfers than smaller ones, is it?

The whole thing's pretty bizarre.

			regards, tom lane
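As a reference point for the gather-write discussion, here is a minimal sketch of the idea Tom describes: several closely spaced page-sized write() calls replaced by a single larger write() over the same contiguous buffer. It is illustrative only (assumed 8K page size, hypothetical function names), not the code from the patch.

    /*
     * Sketch of the gather-write idea: instead of issuing one write() per
     * dirty WAL page, contiguous pages are written with a single larger
     * write().  Illustrative only; not the xlog.c implementation.
     */
    #include <sys/types.h>
    #include <unistd.h>

    #define WAL_BLCKSZ 8192         /* assumed WAL page size for this sketch */

    /* Before: one syscall per page. */
    ssize_t
    write_pages_individually(int fd, const char *pages, int npages)
    {
        ssize_t total = 0;

        for (int i = 0; i < npages; i++)
        {
            ssize_t n = write(fd, pages + (size_t) i * WAL_BLCKSZ, WAL_BLCKSZ);

            if (n < 0)
                return -1;
            total += n;
        }
        return total;
    }

    /* After: the pages are contiguous in the WAL buffer, so one write() suffices. */
    ssize_t
    write_pages_grouped(int fd, const char *pages, int npages)
    {
        return write(fd, pages, (size_t) npages * WAL_BLCKSZ);
    }

Either way the same bytes reach the kernel; the grouped form just trades several syscalls for one larger transfer, which is why Tom finds it hard to see how it alone could cost throughput.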
On Sun, 2005-08-21 at 20:37 -0400, Tom Lane wrote:
> The whole thing's pretty bizarre.

I hate to sound obvious, but does the missing performance return if you back the patch out? It seemed to have been decided on Tue, 16 Aug 2005 15:45:30 -0700 that the performance was the same before and after. However, there also seems to be version confusion, since maryedie@osdl.org also claimed to be testing a tree from the future. Basically I think there's more confusion here than evidence.

On an admittedly smaller x86 configuration, source trees with this patch are faster than without, not slower.

And finally, not to start a flamewar, but the initial report was on a machine running gentoo, the distribution that should be renamed Irreproducible Benchmarks Linux.

-jwb
I wrote:
> I've been sniffing around that patch and not really finding any smoking
> gun about why it would make things slower when you're not using O_DIRECT.

While rewriting the patch to fit more naturally into xlog.c, I found a bug that might possibly be related to your performance issue. The if-test in this fragment is wrong:

    /*
     * If we just wrote the whole last page of a logfile segment,
     * fsync the segment immediately.  This avoids having to go back
     * and re-open prior segments when an fsync request comes along
     * later.  Doing it here ensures that one and only one backend will
     * perform this fsync.
     *
     * This is also the right place to notify the Archiver that the
     * segment is ready to copy to archival storage.
     */
    if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
    {
        XLogPageFlush(&pages, currentIndex);
        issue_xlog_fsync();
        ...

Because of the patch's logic changes, openLogOff is not the right thing to be examining here --- we have not yet done a seek to ensure that it matches the write start point. (pages.offset + pages.size would have worked instead.)

The implication of this is that the code might either fail to do a write and fsync when it needs to, or do extra fsyncs that it doesn't need to do. I am wondering if the latter could explain your issue. Right offhand it doesn't seem like this would lead to enough extra fsyncs to account for the amount of slowdown you report --- but there just doesn't seem to be anything else in there that could account for it.

I've committed a patch that fixes this and makes some other minor improvements (which you probably won't notice given that you're using such a large wal_buffers setting). You might like to update to CVS tip (make sure you get xlog.c 1.218 or later) and see if things are any better or not.

			regards, tom lane
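Read against the fragment quoted above, the corrected test Tom describes would key off the gather-write buffer's own start offset rather than openLogOff, which may not yet match the write start point. The following is only a sketch of that description, reusing the symbols from the quoted fragment; it is not the change actually committed in xlog.c 1.218.

    /*
     * Sketch only: the condition as described above, using the buffer's own
     * offset (pages.offset) instead of openLogOff.  Symbols are those of the
     * quoted fragment; this is not the committed xlog.c change.
     */
    if (pages.offset + pages.size >= XLogSegSize && !ispartialpage)
    {
        XLogPageFlush(&pages, currentIndex);
        issue_xlog_fsync();
        ...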
Thanks, Tom, for keeping on this. I was on vacation last week. Will try the latest ASAP.

On Mon, 2005-08-22 at 20:20 -0400, Tom Lane wrote:
> I wrote:
> > I've been sniffing around that patch and not really finding any smoking
> > gun about why it would make things slower when you're not using O_DIRECT.
>
> While rewriting the patch to fit more naturally into xlog.c, I found a
> bug that might possibly be related to your performance issue.  The
> if-test in this fragment is wrong:
>
>     /*
>      * If we just wrote the whole last page of a logfile segment,
>      * fsync the segment immediately.  This avoids having to go back
>      * and re-open prior segments when an fsync request comes along
>      * later.  Doing it here ensures that one and only one backend will
>      * perform this fsync.
>      *
>      * This is also the right place to notify the Archiver that the
>      * segment is ready to copy to archival storage.
>      */
>     if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
>     {
>         XLogPageFlush(&pages, currentIndex);
>         issue_xlog_fsync();
>         ...
>
> Because of the patch's logic changes, openLogOff is not the right thing
> to be examining here --- we have not yet done a seek to ensure that it
> matches the write start point.  (pages.offset + pages.size would have
> worked instead.)
>
> The implication of this is that the code might either fail to do a write
> and fsync when it needs to, or do extra fsyncs that it doesn't need to
> do.  I am wondering if the latter could explain your issue.  Right
> offhand it doesn't seem like this would lead to enough extra fsyncs
> to account for the amount of slowdown you report --- but there just
> doesn't seem to be anything else in there that could account for it.
>
> I've committed a patch that fixes this and makes some other minor
> improvements (which you probably won't notice given that you're using
> such a large wal_buffers setting).  You might like to update to CVS
> tip (make sure you get xlog.c 1.218 or later) and see if things are
> any better or not.
>
> 			regards, tom lane