Thread: some longer, larger pgbench tests with various performance-related patches
Early yesterday morning, I was able to use Nate Boley's test machine to do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here:

http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com

By doing this, I hoped to get a better understanding of (1) the effects of a scale factor too large to fit in shared_buffers, (2) what happens on a longer test run, and (3) how response time varies throughout the test. First, here are the raw tps numbers:

background-clean-slru-v2: tps = 2027.282539 (including connections establishing)
buffreelistlock-reduction-v1: tps = 2625.155348 (including connections establishing)
buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149 (including connections establishing)
freelist-ok-v2: tps = 2467.065010 (including connections establishing)
group-commit-2012-01-21: tps = 2205.128609 (including connections establishing)
master: tps = 2200.848350 (including connections establishing)
removebufmgrfreelist-v1: tps = 2679.453056 (including connections establishing)
xloginsert-scale-6: tps = 3675.312202 (including connections establishing)

Obviously these numbers are fairly noisy, especially since this is just one run, so the increases and decreases might not be all that meaningful. Time permitting, I'll try to run some more tests to get my hands around that situation a little better.

Graphs are here:

http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

There are two graphs for each branch. The first is a scatter plot of latency vs. transaction time. I found that graph hard to understand, though; I couldn't really tell what I was looking at. So I made a second set of graphs which plot the number of transactions completed in each second of the test against time. The results are also included on the previous page, below the latency graphs, and I find them much more informative.

A couple of things stand out at me from these graphs. First, some of these transactions had really long latency. Second, there are a remarkable number of seconds, all through the test, during which no transactions at all manage to complete, sometimes several seconds in a row. I'm not sure why. Third, all of the tests initially start off processing transactions very quickly, and get slammed down very hard, probably because the very high rate of transaction processing early on causes a checkpoint to occur around 200 s. I didn't actually log when the checkpoints were occurring, but it seems like a good guess. It's also interesting to wonder whether the checkpoint I/O itself causes the drop-off, or the ensuing full page writes. Fourth, xloginsert-scale-6 helps quite a bit; in fact, it's the only patch that actually changes the whole shape of the tps graph. I'm speculating here, but that may be because it blunts the impact of full page writes by allowing backends to copy their full page images into the write-ahead log in parallel.

One thing I also noticed while running the tests is that the system was really not using much CPU time. It was mostly idle. That could be because waiting for I/O leads to waiting for locks, or it could be fundamental lock contention. I don't know which.
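For reference, a minimal sketch of the kind of invocation described above; the exact flags, thread count, and database name here are illustrative guesses, not the configuration actually used:

    # one-time setup of a scale-factor-300 database
    pgbench -i -s 300 pgbench

    # 30-minute run, 32 clients, per-transaction latency logging via -l
    pgbench -c 32 -j 32 -T 1800 -l pgbench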
A couple of obvious further tests suggest themselves: (1) rerun some of the tests with full_page_writes=off, and (2) repeat this test with the remaining performance-related patches. It would be especially interesting, I think, to see what effect the checkpoint-related patches have on these graphs. But I plan to drop buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs based on Simon's comments elsewhere. I'm including the results here just because these tests were already running when he made those comments.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: some longer, larger pgbench tests with various performance-related patches
From: Simon Riggs
On Tue, Jan 24, 2012 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> do a single 30-minute pgbench run at scale factor 300 using a variety

Nice

A minor but necessary point: repeated testing of the Group Commit patch when you have synchronous commit off is clearly pointless, so publishing numbers for that without saying very clearly that's what's happening doesn't really help.

> Graphs are here:
>
> http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

Nicer

> There are two graphs for each branch. The first is a scatter plot of latency vs. transaction time. I found that graph hard to understand, though; I couldn't really tell what I was looking at. So I made a second set of graphs which plot the number of transactions completed in each second of the test against time. The results are also included on the previous page, below the latency graphs, and I find them much more informative.
>
> A couple of things stand out at me from these graphs. First, some of these transactions had really long latency. Second, there are a remarkable number of seconds, all through the test, during which no transactions at all manage to complete, sometimes several seconds in a row. I'm not sure why. Third, all of the tests initially start off processing transactions very quickly, and get slammed down very hard, probably because the very high rate of transaction processing early on causes a checkpoint to occur around 200 s.

Check

I'm happy that this exposes characteristics I've seen and have been discussing for a while now.

It would be useful to calculate the slow-down contribution of the longer txns: what % of total time is given over to the slowest 10% of transactions? Looking at that, I think you'll see why we should care about sorting out what happens in the worst cases.

> I didn't actually log when the checkpoints were occurring, but it seems like a good guess. It's also interesting to wonder whether the checkpoint I/O itself causes the drop-off, or the ensuing full page writes. Fourth, xloginsert-scale-6 helps quite a bit; in fact, it's the only patch that actually changes the whole shape of the tps graph. I'm speculating here, but that may be because it blunts the impact of full page writes by allowing backends to copy their full page images into the write-ahead log in parallel.

I think we should be working to commit XLogInsert and then Group Commit, then come back to the discussion. There's no way we're committing other patches but not those, so everything needs to be viewed with those two in, i.e. commit and then re-baseline. So I'd say no more tests just yet, but then lots of testing next week++.

> But I plan to drop buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs based on Simon's comments elsewhere. I'm including the results here just because these tests were already running when he made those comments.

Yep

One patch you aren't testing is clog_history, which is designed to work in conjunction with background_clean_slru. That last one clearly needs a little tuning though...

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
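A rough sketch of the slowest-10% calculation suggested above, assuming the 9.2-era pgbench per-transaction log format in which the third field is the transaction latency in microseconds (the log file name is illustrative):

    # share of total transaction time taken by the slowest 10% of transactions
    sort -k3,3n pgbench_log.* |
    awk '{ lat[NR] = $3; total += $3 }
         END { cutoff = int(NR * 0.9)        # 90th-percentile boundary
               for (i = cutoff + 1; i <= NR; i++) slow += lat[i]
               printf "slowest 10%% of txns: %.1f%% of total time\n", 100 * slow / total }'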
Re: some longer, larger pgbench tests with various performance-related patches
From: Pavan Deolasee
On Wed, Jan 25, 2012 at 2:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> Early yesterday morning, I was able to use Nate Boley's test machine to do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here:
>
> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com
>
> By doing this, I hoped to get a better understanding of (1) the effects of a scale factor too large to fit in shared_buffers, (2) what happens on a longer test run, and (3) how response time varies throughout the test. First, here are the raw tps numbers:
>
> background-clean-slru-v2: tps = 2027.282539 (including connections establishing)
> buffreelistlock-reduction-v1: tps = 2625.155348 (including connections establishing)
> buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149 (including connections establishing)
> freelist-ok-v2: tps = 2467.065010 (including connections establishing)
> group-commit-2012-01-21: tps = 2205.128609 (including connections establishing)
> master: tps = 2200.848350 (including connections establishing)
> removebufmgrfreelist-v1: tps = 2679.453056 (including connections establishing)
> xloginsert-scale-6: tps = 3675.312202 (including connections establishing)
>
> Obviously these numbers are fairly noisy, especially since this is just one run, so the increases and decreases might not be all that meaningful. Time permitting, I'll try to run some more tests to get my hands around that situation a little better.

This is nice. I am sure long running tests will point out many more issues. If we are doing these tests, it might be more effective if we run even longer runs, such as to get at least 3-4 checkpoint/vacuum/analyze cycles (and other such events which can impact the final numbers either way) per test. Otherwise, one patch may stand out simply because it avoids, say, one checkpoint.

It would definitely help to log the checkpoint and auto-vacuum/auto-analyze details and plot them on the graph to see if the drop in performance has anything to do with these activities. It might also be a good idea to collect pg statistics such as relation sizes at the end of the run.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
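A hedged sketch of the extra instrumentation suggested above; the settings are standard GUCs, but whether this is exactly what was meant is an assumption, and the database name is illustrative:

    # postgresql.conf additions for the test runs
    log_checkpoints = on
    log_autovacuum_min_duration = 0     # log every autovacuum/autoanalyze

    # relation sizes at the end of a run
    psql -d pgbench -c "SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
                        FROM pg_class
                        WHERE relname LIKE 'pgbench%' AND relkind = 'r'
                        ORDER BY pg_total_relation_size(oid) DESC;"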
Re: some longer, larger pgbench tests with various performance-related patches
From: Robert Haas
On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> I think we should be working to commit XLogInsert and then Group Commit, then come back to the discussion.

I definitely agree that those two have way more promise than anything else on the table. However, now that I understand how badly we're getting screwed by checkpoints, I'm curious to do some more investigation of what's going on there. I can't help thinking that in these particular cases the full page writes may be a bigger issue than the checkpoint itself. If that turns out to be the case it's not likely to be something we're able to address for 9.2, but I'd like to at least characterize it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: some longer, larger pgbench tests with various performance-related patches
From: Jeff Janes
On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> Early yesterday morning, I was able to use Nate Boley's test machine to do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here:
>
> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com

Previously we mostly used this machine for CPU benchmarking. Have you previously described the IO subsystem? That is becoming relevant for these types of benchmarks. For example, is WAL separated from data?

> By doing this, I hoped to get a better understanding of (1) the effects of a scale factor too large to fit in shared_buffers,

In my hands, the active part of the data at scale 300 fits very comfortably into 8GB of shared_buffers. Indeed, until you have run for a very long time so that pgbench_history gets large, everything fits in 8GB.

Cheers,

Jeff
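A quick way to check that size claim on the machine in question might be (a sketch; the database name is assumed):

    # total on-disk size of the scale-300 pgbench database
    psql -d pgbench -c "SELECT pg_size_pretty(pg_database_size('pgbench'));"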
Re: some longer, larger pgbench tests with various performance-related patches
From: Robert Haas
On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Early yesterday morning, I was able to use Nate Boley's test machine to do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here:
>>
>> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com
>
> Previously we mostly used this machine for CPU benchmarking. Have you previously described the IO subsystem? That is becoming relevant for these types of benchmarks. For example, is WAL separated from data?

I actually don't know much about the I/O subsystem, but, no, WAL is not separated from data. I believe $PGDATA is on a SAN, but I don't know anything about its characteristics.

>> By doing this, I hoped to get a better understanding of (1) the effects of a scale factor too large to fit in shared_buffers,
>
> In my hands, the active part of the data at scale 300 fits very comfortably into 8GB of shared_buffers.
>
> Indeed, until you have run for a very long time so that pgbench_history gets large, everything fits in 8GB.

Hmm, my mistake. Maybe I need to crank up the scale factor some more. This is why good benchmarking is hard.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: some longer, larger pgbench tests with various performance-related patches
From: Jeff Janes
On Wed, Jan 25, 2012 at 9:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Early yesterday morning, I was able to use Nate Boley's test machine to do a single 30-minute pgbench run at scale factor 300 using a variety of trees built with various patches, and with the -l option added to track latency on a per-transaction basis. All tests were done using 32 clients and permanent tables. The configuration was otherwise identical to that described here:
>>>
>>> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com
>>
>> Previously we mostly used this machine for CPU benchmarking. Have you previously described the IO subsystem? That is becoming relevant for these types of benchmarks. For example, is WAL separated from data?
>
> I actually don't know much about the I/O subsystem, but, no, WAL is not separated from data. I believe $PGDATA is on a SAN, but I don't know anything about its characteristics.

I think the checkpointing issues that Greg is exploring are important, and I'm pretty sure that that is what is limiting your TPS in this case. However, I don't think we can make much progress on that front using a machine whose IO subsystem is largely unknown, and not tweakable.

So I think the best use of this machine would be to explore the purely CPU-based scaling issues, like freelist, CLOG, and XLogInsert. To do that, I'd set the scale factor small enough so that the entire data set is willing to be cached dirty by the kernel, based on the kernel parameters:

egrep . /proc/sys/vm/dirty_*

Then set shared_buffers to be less than the needs for that scale factor, so the freelist gets exercised. And neutralize checkpoints, by setting fsync=off, so they don't generate physical IO that we can't control for, given the current constraints on the machine setup.

Cheers,

Jeff
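A sketch of that setup; the concrete values below (scale factor, shared_buffers) are placeholders that would have to be sized against the machine's RAM and dirty_* thresholds:

    # how much dirty data the kernel will hold before forcing writeback
    egrep . /proc/sys/vm/dirty_*

    # in postgresql.conf, for a CPU-bound run (illustrative values):
    shared_buffers = 2GB     # deliberately smaller than the data set, so the freelist is exercised
    fsync = off              # neutralize checkpoint I/O on this hardware

    # pick a scale factor whose data set fits inside the kernel's dirty-cache allowance
    pgbench -i -s 100 pgbench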
Re: some longer, larger pgbench tests with various performance-related patches
From: Nathan Boley
> I actually don't know much about the I/O subsystem, but, no, WAL is not separated from data. I believe $PGDATA is on a SAN, but I don't know anything about its characteristics.

The computer is here:

http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm

$PGDATA is on a 5 disk SATA software raid 5.

Is there anything else that would be useful to know? (Sorry, this isn't a subject that I'm very familiar with.)

-Nathan
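For what it's worth, a few commands that would answer the usual follow-up questions about a setup like this (just a suggestion; the device name is a placeholder):

    cat /proc/mdstat          # RAID level, member disks, chunk size, rebuild state
    df -h "$PGDATA"           # which md device and filesystem $PGDATA lives on
    hdparm -W /dev/sda        # whether the drive write caches are enabled (repeat per member disk)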
Re: some longer, larger pgbench tests with various performance-related patches
From: Robert Haas
On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I think we should be working to commit XLogInsert and then Group Commit, then come back to the discussion.
>
> I definitely agree that those two have way more promise than anything else on the table. However, now that I understand how badly we're getting screwed by checkpoints, I'm curious to do some more investigation of what's going on there. I can't help thinking that in these particular cases the full page writes may be a bigger issue than the checkpoint itself. If that turns out to be the case it's not likely to be something we're able to address for 9.2, but I'd like to at least characterize it.

I did another run with full_page_writes=off:

http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png

This is run with the master branch as of commit 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests, so that the results are comparable.

The graph is pretty dramatically different from the earlier one:

http://wiki.postgresql.org/wiki/File:Tps-master.png

So are the TPS numbers:

full page writes on: tps = 2200.848350 (including connections establishing)
full page writes off: tps = 9494.757916 (including connections establishing)

It seems pretty clear that, even with full_page_writes=off, the checkpoint is affecting things negatively: the first 700 seconds or so show much better and more consistent performance than the remaining portion of the test. I'm not sure why that is, but I'm guessing there was a checkpoint around that time. But the effect is much worse with full_page_writes=on: the distinctive parabolic shape of those graphs is apparently caused by the gradually decreasing frequency of full page writes as the number of transactions processed since the last checkpoint grows.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
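A minimal sketch of the comparison run, assuming the setting was flipped in postgresql.conf and the server reloaded between runs (full_page_writes only needs a reload, not a restart; the pgbench flags repeat the earlier illustrative invocation):

    # postgresql.conf for the comparison run
    full_page_writes = off

    # pick up the change, then repeat the identical 30-minute test
    pg_ctl reload -D "$PGDATA"
    pgbench -c 32 -j 32 -T 1800 -l pgbench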
Re: some longer, larger pgbench tests with various performance-related patches
From: Simon Riggs
On Wed, Feb 1, 2012 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> I think we should be working to commit XLogInsert and then Group Commit, then come back to the discussion.
>>
>> I definitely agree that those two have way more promise than anything else on the table. However, now that I understand how badly we're getting screwed by checkpoints, I'm curious to do some more investigation of what's going on there. I can't help thinking that in these particular cases the full page writes may be a bigger issue than the checkpoint itself. If that turns out to be the case it's not likely to be something we're able to address for 9.2, but I'd like to at least characterize it.
>
> I did another run with full_page_writes=off:
>
> http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png
>
> This is run with the master branch as of commit 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests, so that the results are comparable.
>
> The graph is pretty dramatically different from the earlier one:
>
> http://wiki.postgresql.org/wiki/File:Tps-master.png
>
> So are the TPS numbers:
>
> full page writes on: tps = 2200.848350 (including connections establishing)
> full page writes off: tps = 9494.757916 (including connections establishing)
>
> It seems pretty clear that, even with full_page_writes=off, the checkpoint is affecting things negatively: the first 700 seconds or so show much better and more consistent performance than the remaining portion of the test. I'm not sure why that is, but I'm guessing there was a checkpoint around that time. But the effect is much worse with full_page_writes=on: the distinctive parabolic shape of those graphs is apparently caused by the gradually decreasing frequency of full page writes as the number of transactions processed since the last checkpoint grows.

Sounds like time to test the checkpoint smoothing patch.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: some longer, larger pgbench tests with various performance-related patches
From: Jeff Janes
On Wed, Feb 1, 2012 at 9:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> I think we should be working to commit XLogInsert and then Group Commit, then come back to the discussion.
>>
>> I definitely agree that those two have way more promise than anything else on the table. However, now that I understand how badly we're getting screwed by checkpoints, I'm curious to do some more investigation of what's going on there. I can't help thinking that in these particular cases the full page writes may be a bigger issue than the checkpoint itself. If that turns out to be the case it's not likely to be something we're able to address for 9.2, but I'd like to at least characterize it.
>
> I did another run with full_page_writes=off:
>
> http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png
>
> This is run with the master branch as of commit 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests, so that the results are comparable.
>
> The graph is pretty dramatically different from the earlier one:
>
> http://wiki.postgresql.org/wiki/File:Tps-master.png
>
> So are the TPS numbers:
>
> full page writes on: tps = 2200.848350 (including connections establishing)
> full page writes off: tps = 9494.757916 (including connections establishing)
>
> It seems pretty clear that, even with full_page_writes=off, the checkpoint is affecting things negatively: the first 700 seconds or so show much better and more consistent performance than the remaining portion of the test. I'm not sure why that is, but I'm guessing there was a checkpoint around that time.

We really need to nail that down. Could you post the scripts (on the wiki) you use for running the benchmark and making the graph? I'd like to see how much work it would be for me to change it to detect checkpoints and do something like color the markers blue during checkpoints and red elsewhen.

Also, I'm not sure how bad that graph really is. The overall throughput is more variable, and there are a few latency spikes, but they are few. The dominant feature is simply that the long-term average is less than the initial burst. Of course the goal is to have a high level of throughput with a smooth latency under sustained conditions. But to expect that that long-sustained smooth level of throughput be identical to the "initial burst throughput" sounds like more of a fantasy than a goal.

If we want to accept the lowered throughput and work on what variability/spikes are there, I think a good approach would be to take the long-term TPS average, and dial the number of clients back until the initial burst TPS matches that long-term average. Then see if the spikes still exist over the long term using that dialed-back number of clients.

> But the effect is much worse with full_page_writes=on: the distinctive parabolic shape of those graphs is apparently caused by the gradually decreasing frequency of full page writes as the number of transactions processed since the last checkpoint grows.

The dilution out of not-yet-full-page-written buffers as the checkpoint ages is probably the ultimate cause of that pattern, but I suspect there are intermediate causes that we could tackle.

I don't think the full-page-writes are leading to WALInsert contention, for example, because that would probably lead to a smooth throughput decline, but not those latency spikes in which entire seconds go by without transactions. I doubt it is leading to general IO compaction, as the IO at that point should be pretty much sequential (the checkpoint has not yet reached the sync stage, and the WAL is sequential). So I bet that that is caused by fsyncs occurring at xlog segment switches, and the locking that that entails.

If I recall, we can have a segment which is completely written to the OS and in the process of being fsynced, and we can have another segment which is in some state of being partially in wal_buffers and partly written out to OS cache, but we can't start reusing the wal_buffers that were already written to the OS for that segment (and therefore are theoretically available for reuse by the upcoming 3rd segment) until the previous segment's fsync has completed. So all WALInserts freeze. Or something like that. This code has changed a bit since the last time I studied it.

Cheers,

Jeff
Re: some longer, larger pgbench tests with various performance-related patches
From: Robert Haas
On Sat, Feb 4, 2012 at 2:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> We really need to nail that down. Could you post the scripts (on the wiki) you use for running the benchmark and making the graph? I'd like to see how much work it would be for me to change it to detect checkpoints and do something like color the markers blue during checkpoints and red elsewhen.

They're pretty crude - I've attached them here.

> Also, I'm not sure how bad that graph really is. The overall throughput is more variable, and there are a few latency spikes, but they are few. The dominant feature is simply that the long-term average is less than the initial burst. Of course the goal is to have a high level of throughput with a smooth latency under sustained conditions. But to expect that that long-sustained smooth level of throughput be identical to the "initial burst throughput" sounds like more of a fantasy than a goal.

That's probably true, but the drop-off is currently quite extreme. The fact that disabling full_page_writes causes throughput to increase by >4x is dismaying, at least to me.

> If we want to accept the lowered throughput and work on what variability/spikes are there, I think a good approach would be to take the long-term TPS average, and dial the number of clients back until the initial burst TPS matches that long-term average. Then see if the spikes still exist over the long term using that dialed-back number of clients.

Hmm, I might be able to do that.

> I don't think the full-page-writes are leading to WALInsert contention, for example, because that would probably lead to a smooth throughput decline, but not those latency spikes in which entire seconds go by without transactions.

Right.

> I doubt it is leading to general IO compaction, as the IO at that point should be pretty much sequential (the checkpoint has not yet reached the sync stage, and the WAL is sequential). So I bet that that is caused by fsyncs occurring at xlog segment switches, and the locking that that entails.

That's definitely possible.

> If I recall, we can have a segment which is completely written to the OS and in the process of being fsynced, and we can have another segment which is in some state of being partially in wal_buffers and partly written out to OS cache, but we can't start reusing the wal_buffers that were already written to the OS for that segment (and therefore are theoretically available for reuse by the upcoming 3rd segment) until the previous segment's fsync has completed. So all WALInserts freeze. Or something like that. This code has changed a bit since the last time I studied it.

Yeah, I need to better characterize where the pauses are coming from, but I'm reluctant to invest too much effort in it until Heikki's xlog scaling patch goes in, because I think that's going to change things enough that any work done now will mostly be wasted. It might be worth trying a run with wal_buffers=32MB or something like that, just to see whether that mitigates any of the locking pile-ups.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
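On the checkpoint-marking idea, a low-tech way to get timestamps to overlay on the per-second graphs, assuming log_checkpoints is on and log_line_prefix includes a timestamp (both are assumptions about the test configuration, as is the log location), plus the wal_buffers change suggested above:

    # checkpoint start/complete times, for annotating the TPS-vs-time graphs
    grep -E "checkpoint (starting|complete)" "$PGDATA"/pg_log/*.log

    # for the suggested follow-up run, in postgresql.conf (requires a restart):
    wal_buffers = 32MB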
Re: some longer, larger pgbench tests with various performance-related patches
From: Greg Smith
On 01/24/2012 03:53 PM, Robert Haas wrote:

> There are two graphs for each branch. The first is a scatter plot of latency vs. transaction time. I found that graph hard to understand, though; I couldn't really tell what I was looking at. So I made a second set of graphs which plot the number of transactions completed in each second of the test against time.

Note that you're now reinventing parts of pgbench-tools; the main two graphs it gives are the latency scatter plot and TPS per second. The things you're likely to find interesting next are maximum latency, 90th percentile latency, and a delta for what changed in pg_stat_bgwriter during the test; those are the other things I track in that program.

I'm working toward publishing my own tests of the performance patches still considered useful by the end of the week. Murphy's Law has been active on that project since it started though--the server crashed the day I left on a week-long trip, and I've been sick ever since getting back.

> First, some of these transactions had really long latency. Second, there are a remarkable number of seconds, all through the test, during which no transactions at all manage to complete, sometimes several seconds in a row.

These periods have in my tests always been associated with Linux turning aggressive about cleaning out its write cache, either due to an fsync request or simply crossing one of its thresholds for doing so. My current record is an 80 second pause with no transactions completing. One of the things I expect to add to pgbench-tools within the next week is tracking how much dirty memory is accumulating during each test. Seeing that graph overlaid on top of the rest makes a lot of what's happening at any time more obvious.

Noting when the checkpoints happen is a bit less interesting, because once the first one happens, they happen almost continuously. You really need to track when the write and sync phases are happening for that to be really useful. This circles back to why I proposed exposing those timing bits in pg_stat_bgwriter. pgbench-tools already grabs data from it, which avoids all the mess around log file parsing. If I could do that more often and extract checkpoint timing from that data, it would make labelling graphs like these much easier to do, from the client that's running the benchmark even.

> Third, all of the tests initially start off processing transactions very quickly, and get slammed down very hard, probably because the very high rate of transaction processing early on causes a checkpoint to occur around 200 s.

At the beginning of a write-heavy pgbench run, the rate is high until one of these two things happens:

1) A checkpoint begins
2) Linux's write cache threshold (typically /proc/sys/vm/dirty_background_ratio) worth of dirty memory accumulates.

Note that (1) on its own isn't necessarily the problem; it's sometimes the case that it just makes (2) happen much faster than it otherwise would. Basically, the first 30 to 150 seconds of any write-heavy test always have an inflated speed. You're writing into the OS cache at maximum speed, and none of those writes are making it to physical disk--except perhaps for the WAL, which is all fast and sequential.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
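A sketch of the dirty-memory tracking described above, assuming a simple once-a-second polling loop alongside the benchmark is good enough (the output file name and database name are arbitrary):

    # sample the kernel's dirty-page total once a second during the run
    while sleep 1; do
        echo "$(date +%s) $(grep '^Dirty:' /proc/meminfo)"
    done > dirty_memory.log &

    # pg_stat_bgwriter snapshot before and after the run; the delta shows
    # how much was written by checkpoints vs. backends vs. the bgwriter
    psql -d pgbench -c "SELECT now(), * FROM pg_stat_bgwriter;"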
Re: some longer, larger pgbench tests with various performance-related patches
From: Jeff Janes
On Mon, Feb 6, 2012 at 6:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Sat, Feb 4, 2012 at 2:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> We really need to nail that down. Could you post the scripts (on the wiki) you use for running the benchmark and making the graph? I'd like to see how much work it would be for me to change it to detect checkpoints and do something like color the markers blue during checkpoints and red elsewhen.
>
> They're pretty crude - I've attached them here.

Thanks. But given Greg's comments about pgbench-tools, I'll probably work on learning that instead.

>> Also, I'm not sure how bad that graph really is. The overall throughput is more variable, and there are a few latency spikes, but they are few. The dominant feature is simply that the long-term average is less than the initial burst. Of course the goal is to have a high level of throughput with a smooth latency under sustained conditions. But to expect that that long-sustained smooth level of throughput be identical to the "initial burst throughput" sounds like more of a fantasy than a goal.
>
> That's probably true, but the drop-off is currently quite extreme. The fact that disabling full_page_writes causes throughput to increase by >4x is dismaying, at least to me.

I just meant that the latest graph, with full page writes off, is not that bad. The ones with fpw on are definitely bad, more due to the latency spikes than the throughput.

>> If we want to accept the lowered throughput and work on what variability/spikes are there, I think a good approach would be to take the long-term TPS average, and dial the number of clients back until the initial burst TPS matches that long-term average. Then see if the spikes still exist over the long term using that dialed-back number of clients.
>
> Hmm, I might be able to do that.
>
>> I don't think the full-page-writes are leading to WALInsert contention, for example, because that would probably lead to a smooth throughput decline, but not those latency spikes in which entire seconds go by without transactions.
>
> Right.
>
>> I doubt it is leading to general IO compaction, as the IO at that point should be pretty much sequential (the checkpoint has not yet reached the sync stage, and the WAL is sequential). So I bet that that is caused by fsyncs occurring at xlog segment switches, and the locking that that entails.
>
> That's definitely possible.
>
>> If I recall, we can have a segment which is completely written to the OS and in the process of being fsynced, and we can have another segment which is in some state of being partially in wal_buffers and partly written out to OS cache, but we can't start reusing the wal_buffers that were already written to the OS for that segment (and therefore are theoretically available for reuse by the upcoming 3rd segment) until the previous segment's fsync has completed. So all WALInserts freeze. Or something like that. This code has changed a bit since the last time I studied it.
>
> Yeah, I need to better characterize where the pauses are coming from, but I'm reluctant to invest too much effort in it until Heikki's xlog scaling patch goes in, because I think that's going to change things enough that any work done now will mostly be wasted.

OK, you've scared me off from digging into the locking at WAL switch for now. So instead I've spent some time today trying to unbreak the xlog scaling patch, and haven't had any luck.
Does anyone know if any combination of that patch's history + git master history has been tested and verified to produce a recoverable WAL stream? It is a shame that "make check" doesn't test for that, but I don't know how to make it do so.

Cheers,

Jeff
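For the recoverability question, one crude manual check (a sketch, not an existing test harness; it assumes a disposable pgbench database on the patched server):

    # generate WAL with the patched server, then crash it hard
    pgbench -c 32 -T 60 pgbench
    pg_ctl stop -D "$PGDATA" -m immediate

    # restart; crash recovery has to replay the WAL the patched code wrote
    pg_ctl start -D "$PGDATA" -w

    # basic consistency check after replay
    psql -d pgbench -c "SELECT count(*), sum(abalance) FROM pgbench_accounts;"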