Thread: Review: Revise parallel pg_restore's scheduling heuristic
Rebased to correct for pg_indent changes. Applies cleanly. Compiles cleanly. Passes regression tests. Comments and format look good. No documentation changes needed. No regression test changes needed. Performance tests to follow in a day or two. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > Performance tests to follow in a day or two. I'm looking to beg another week or so on this to run more tests. What I can have by the end of today is pretty limited, mostly because I decided it made the most sense to test this with big complex databases, and it just takes a fair amount of time to throw around that much data. (This patch didn't seem likely to make a significant difference on smaller databases.) My current plan is to test this on a web server class machine and a distributed application class machine. Both database types have over 300 tables with tables with widely ranging row counts, widths, and index counts. It would be hard to schedule the requisite time on our biggest web machines, but I assume an 8 core 64GB machine would give meaningful results. Any sense what numbers of parallel jobs I should use for tests? I would be tempted to try 1 (with the -1 switch), 8, 12, and 16 -- maybe keep going if 16 beats 12. My plan here would be to have the dump on one machine, and run pg_restore there, and push it to a database on another machine through the LAN on a 1Gb connection. (This seems most likely to be what we'd be doing in real life.) I would run each test with the CVS trunk tip with and without the patch applied. The database is currently 1.1TB. The application machine would have 2 cores and about 4GB RAM. I'm tempted to use Milwaukee County's database there, as it has the most rows per table, even though some of the counties doing a lot of document scanning now have bigger databases in terms of disk space. It's 89GB. I'd probably try job counts starting at one and going up by one until performance starts to drop off. (At one I would use the -1 switch.) In all cases I was planning on using a "conversion" postgresql.conf file, turning off fsync, archiving, statistics, etc. Does this sound like a sane approach to testing whether this patch actually improves performance? Any suggestions before I start this, to ensure most meaningful results? -Kevin
On Sat, Jul 18, 2009 at 4:41 PM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > >> Performance tests to follow in a day or two. > > I'm looking to beg another week or so on this to run more tests. What > I can have by the end of today is pretty limited, mostly because I > decided it made the most sense to test this with big complex > databases, and it just takes a fair amount of time to throw around > that much data. (This patch didn't seem likely to make a significant > difference on smaller databases.) No worries. We have a limited number of people who can test these kinds of things and who have volunteered to serve as reviewers (two that I'm aware of, and one of those I haven't heard from lately...). So we'll be patient. :-) > My current plan is to test this on a web server class machine and a > distributed application class machine. Both database types have over > 300 tables with tables with widely ranging row counts, widths, and > index counts. > > It would be hard to schedule the requisite time on our biggest web > machines, but I assume an 8 core 64GB machine would give meaningful > results. Any sense what numbers of parallel jobs I should use for > tests? I would be tempted to try 1 (with the -1 switch), 8, 12, and > 16 -- maybe keep going if 16 beats 12. My plan here would be to have > the dump on one machine, and run pg_restore there, and push it to a > database on another machine through the LAN on a 1Gb connection. > (This seems most likely to be what we'd be doing in real life.) I > would run each test with the CVS trunk tip with and without the patch > applied. The database is currently 1.1TB. > > The application machine would have 2 cores and about 4GB RAM. I'm > tempted to use Milwaukee County's database there, as it has the most > rows per table, even though some of the counties doing a lot of > document scanning now have bigger databases in terms of disk space. > It's 89GB. I'd probably try job counts starting at one and going up by > one until performance starts to drop off. (At one I would use the -1 > switch.) > > In all cases I was planning on using a "conversion" postgresql.conf > file, turning off fsync, archiving, statistics, etc. > > Does this sound like a sane approach to testing whether this patch > actually improves performance? Any suggestions before I start this, > to ensure most meaningful results? This all sounds reasonable to me, although it might be worth testing with default settings too. ...Robert
Kevin Grittner wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > >> Performance tests to follow in a day or two. > > I'm looking to beg another week or so on this to run more tests. What > I can have by the end of today is pretty limited, mostly because I > decided it made the most sense to test this with big complex > databases, and it just takes a fair amount of time to throw around > that much data. (This patch didn't seem likely to make a significant > difference on smaller databases.) > > My current plan is to test this on a web server class machine and a > distributed application class machine. Both database types have over > 300 tables with tables with widely ranging row counts, widths, and > index counts. > > It would be hard to schedule the requisite time on our biggest web > machines, but I assume an 8 core 64GB machine would give meaningful > results. Any sense what numbers of parallel jobs I should use for > tests? I would be tempted to try 1 (with the -1 switch), 8, 12, and > 16 -- maybe keep going if 16 beats 12. My plan here would be to have > the dump on one machine, and run pg_restore there, and push it to a > database on another machine through the LAN on a 1Gb connection. > (This seems most likely to be what we'd be doing in real life.) I > would run each test with the CVS trunk tip with and without the patch > applied. The database is currently 1.1TB.

You need to be careful here - in my latest round of benchmarking I actually had to test with the workload generator on the same box, because on fast boxes we can easily achieve >100MB/s total load rate these days. At these load rates you are very close to, or over, the practical limits of GigE... Stefan
Robert Haas wrote: > On Sat, Jul 18, 2009 at 4:41 PM, Kevin > Grittner<Kevin.Grittner@wicourts.gov> wrote: > >> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: >> >> >>> Performance tests to follow in a day or two. >>> >> I'm looking to beg another week or so on this to run more tests. What >> I can have by the end of today is pretty limited, mostly because I >> decided it made the most sense to test this with big complex >> databases, and it just takes a fair amount of time to throw around >> that much data. (This patch didn't seem likely to make a significant >> difference on smaller databases.) >> > > No worries. We have a limited number of people who can test these > kinds of things and who have volunteered to serve as reviewers (two > that I'm aware of, and one of those I haven't heard from lately...). > So we'll be patient. :-) > > >> My current plan is to test this on a web server class machine and a >> distributed application class machine. Both database types have over >> 300 tables with tables with widely ranging row counts, widths, and >> index counts. >> >> It would be hard to schedule the requisite time on our biggest web >> machines, but I assume an 8 core 64GB machine would give meaningful >> results. Any sense what numbers of parallel jobs I should use for >> tests? I would be tempted to try 1 (with the -1 switch), 8, 12, and >> 16 -- maybe keep going if 16 beats 12. My plan here would be to have >> the dump on one machine, and run pg_restore there, and push it to a >> database on another machine through the LAN on a 1Gb connection. >> (This seems most likely to be what we'd be doing in real life.) I >> would run each test with the CVS trunk tip with and without the patch >> applied. The database is currently 1.1TB. >> >> The application machine would have 2 cores and about 4GB RAM. I'm >> tempted to use Milwaukee County's database there, as it has the most >> rows per table, even though some of the counties doing a lot of >> document scanning now have bigger databases in terms of disk space. >> It's 89GB. I'd probably try job counts starting at one and going up by >> one until performance starts to drop off. (At one I would use the -1 >> switch.) >> >> In all cases I was planning on using a "conversion" postgresql.conf >> file, turning off fsync, archiving, statistics, etc. >> >> Does this sound like a sane approach to testing whether this patch >> actually improves performance? Any suggestions before I start this, >> to ensure most meaningful results? >> > > This all sounds reasonable to me, although it might be worth testing > with default settings too. > > > > To performance test this properly you might need to devise a test that will use a sufficiently different order of queueing items to show the difference. One thing I am particularly interested in is to see if queuing FK items for a table as soon as they become available, which is most likely to be when the referred to index is created, rather than possibly doing them all together (assuming they are named with the table name as a prefix) as TOC order would do, has a better performance or not. cheers andrew
Kevin, > It would be hard to schedule the requisite time on our biggest web > machines, but I assume an 8 core 64GB machine would give meaningful > results. Any sense what numbers of parallel jobs I should use for > tests? I would be tempted to try 1 (with the -1 switch), 8, 12, and > 16 -- maybe keep going if 16 beats 12. Personally, I wouldn't go over the number of cores. But if you do find some gain that way, I'd be very interested to know it. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Andrew Dunstan <andrew@dunslane.net> wrote: > To performance test this properly you might need to devise a test > that will use a sufficiently different order of queueing items to > show the difference. > > One thing I am particularly interested in is to see if queuing FK > items for a table as soon as they become available, which is most > likely to be when the referred to index is created, rather than > possibly doing them all together (assuming they are named with the > table name as a prefix) as TOC order would do, has a better > performance or not. Hmmm.... I'm reevaluating my database choice. The 1.1TB database does not have foreign key constraints as a matter of policy. It is a replica of the county databases, and we want to replicate whatever we can of the county data -- failure for some reason of part of the data to replicate should not block replication of something else, to minimize differences. Is there still value in using such a database at all, or should I focus on databases in the 50GB to 90GB range with FK constraints defined? When you suggest devising a test to show a difference, in what way would it be likely that I would need to modify the real-life database to get such a test? Our FKs do start with "<TableName>_". We don't have underscores in our table names, although we use similar naming for our indexes. -Kevin
Robert Haas <robertmhaas@gmail.com> wrote: > it might be worth testing with default settings too. OK. I'll do that too, if time allows. -Kevin
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: >> My plan here would be to have >> the dump on one machine, and run pg_restore there, and push it to a >> database on another machine through the LAN on a 1Gb connection. >> (This seems most likely to be what we'd be doing in real life.) > you need to be careful here - in my latest round of benchmarking I > had actually test with the workload generator on the same box > because on fast boxes we can easily achive >100MB/s total load rate > these days. At these load rates you are very close or over the > pratical limits of GigE... Yeah, I was concerned about that. Problem was, with the 1.1TB database it is hard to fit a dump and an extra copy of the database onto the drives along with the production data for which they exist. We would likely face the same constraints with real data if using the parallel restore, since it requires an interim backup file (i.e., you can't stream directly from the source database). There's also the issue of reading from the same RAID you're targeting with the restore, which is sometimes not optimal. If I'm dropping down an order of magnitude or more in the databases I will use, I could put the backup file on a separate RAID on the same machine. This leaves a lot of options. I'm not sure which combinations of configuration, file placement, and job count yield the most useful results. -Kevin
Andrew Dunstan <andrew@dunslane.net> wrote: > To performance test this properly you might need to devise a test > that will use a sufficiently different order of queueing items to > show the difference. It would appear that I need help with devising a proper test. So far, all tests have shown no difference in performance based on the patch; I get almost twice the speed as a single job running in one database transaction either way. Can someone explain what I should try to set up to get a "best case" and a "worst case" for the patch? Our production databases don't expose any difference, but I'm willing to try to use them to "seed" an artificial case which will. -Kevin
Kevin Grittner wrote: > Andrew Dunstan <andrew@dunslane.net> wrote: > > >> To performance test this properly you might need to devise a test >> that will use a sufficiently different order of queueing items to >> show the difference. >> > > It would appear that I need help with devising a proper test. So far, > all tests have shown no difference in performance based on the patch; > I get almost twice the speed as a single job running in one database > transaction either way. Can someone explain what I should try to set > up to get a "best case" and a "worst case" for the patch? Our > production databases don't expose any difference, but I'm willing to > try to use them to "seed" an artificial case which will. > > Does your test case have lots of foreign keys? cheers andrew
Andrew Dunstan <andrew@dunslane.net> wrote: > Does your test case have lots of foreign keys? 488 of them. There is some variation on individual tests, but the results look to be "in the noise." When I add them all up, the patch comes out 0.0036% slower -- but that is so far into the noise as to be considered "no difference" in my book. -Kevin
I wrote: > So far, all tests have shown no difference in performance based on > the patch; My testing to that point had been on a "big" machine with 16 CPUs and 128 GB RAM and dozens of spindles. Last night I tried with a dual core machine with 4 GB RAM and 5 spindles in RAID 5. Still no difference with the patch. Any suggestions besides the foreign keys? Should 488 FKs be enough to matter here? (Barring better suggestions, I'll try the small machine again tonight with the default configuration, rather than the optimized one.) -Kevin
On Tue, Jul 28, 2009 at 10:28 AM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: > I wrote: > >> So far, all tests have shown no difference in performance based on >> the patch; > > My testing to that point had been on a "big" machine with 16 CPUs and > 128 GB RAM and dozens of spindles. Last night I tried with a dual > core machine with 4 GB RAM and 5 spindles in RAID 5. Still no > difference with the patch. > > Any suggestions besides the foreign keys? Should 488 FKs be enough to > matter here? (Barring better suggestions, I'll try the small machine > again tonight with the default configuration, rather than the > optimized one.) The other possibility here is that this just doesn't work. :-) ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > The other possibility here is that this just doesn't work. :-) That's why we wanted to test it ;-). I don't have time to look right now, but ISTM the original discussion that led to making that patch had ideas about scenarios where it would be faster. It'd be worth digging that up and seeing if the current tests covered the case or not. regards, tom lane
On Tue, Jul 28, 2009 at 9:52 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> The other possibility here is that this just doesn't work. :-) > > That's why we wanted to test it ;-). > > I don't have time to look right now, but ISTM the original discussion > that led to making that patch had ideas about scenarios where it would > be faster. It'd be worth digging that up and seeing if the current > tests covered the case or not. This is what I've been able to find on a quick look: http://archives.postgresql.org/pgsql-hackers/2009-05/msg00678.php Sounds like Kevin may want to try renaming some of his indices to produce intermingling... ...Robert
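For reference, one rough, purely illustrative way to try that is to generate rename statements for every index and feed the generated script back through psql. The public schema, the 'ix_' plus md5 naming scheme, and the idea that decoupling index names from table names is enough to change how the entries land in the dump's TOC are all assumptions here, not anything taken from the patch or the linked message:

    -- Emit ALTER INDEX statements that give every index a name unrelated to
    -- its table's name; run the generated output with psql afterwards.
    SELECT 'ALTER INDEX ' || quote_ident(schemaname) || '.' || quote_ident(indexname)
           || ' RENAME TO ' || quote_ident('ix_' || md5(indexname)) || ';'
      FROM pg_indexes
     WHERE schemaname = 'public';

A fresh pg_dump would then be needed, of course, since the ordering being exercised is the one recorded in the archive's TOC.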
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jul 28, 2009 at 9:52 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> I don't have time to look right now, but ISTM the original discussion >> that led to making that patch had ideas about scenarios where it would >> be faster. > This is what I've been able to find on a quick look: > http://archives.postgresql.org/pgsql-hackers/2009-05/msg00678.php > Sounds like Kevin may want to try renaming some of his indices to > produce intermingling... Also, the followup to that message points out that the 8.4.0 code has a potential O(N^2) dependency on the total number of TOC items in the dump. So it might be interesting to check the behavior with very large numbers of tables/indexes. regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote: > This is what I've been able to find on a quick look: > > http://archives.postgresql.org/pgsql-hackers/2009-05/msg00678.php > > Sounds like Kevin may want to try renaming some of his indices to > produce intermingling...

Thanks, I'll give that a try. Renaming them is one thing, getting a new dump is another, though. I probably won't be able to test that theory until tomorrow night.

Last night's test yielded a couple of interesting results. For one thing, while the "optimized" postgresql.conf was 2.5% faster than the default file for a single job in one database transaction, it was 10% *slower* than the default for multi-job restores. I'll check on that more later, to see what might be helping and what is hurting. For another thing, with the default settings, the patched version ran an additional 1% faster than the unpatched, although I don't have enough samples to have a high degree of confidence it wasn't noise. I'll run another slew of tests tonight with the existing dump file to confirm or debunk that result, while I create a new dump file to test with name intermingling on later nights.

For the record, the "default" postgresql.conf:

port = 5678
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
escape_string_warning = off
standard_conforming_strings = on

The "optimized" file adds these:

max_connections = 100
shared_buffers = 256MB
work_mem = 50MB
maintenance_work_mem = 500MB
bgwriter_lru_maxpages = 600
bgwriter_lru_multiplier = 10.0
fsync = off
full_page_writes = off
wal_buffers = 4MB
random_page_cost = 2.0
effective_cache_size = 3GB
logging_collector = on
log_line_prefix = '[%m] %p %q<%u %d %r> '
track_counts = off
autovacuum = off
sql_inheritance = off

I'm sure that there is a wealth of opinion on which of these are slowing things down, but I'm going to withhold any guesses in favor of testing them. (They all proved themselves neutral or beneficial in objective testing for the single-job restores under 8.3.)

-Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, the followup to that message points out that the 8.4.0 code > has a potential O(N^2) dependency on the total number of TOC items > in the dump. So it might be interesting to check the behavior with > very large numbers of tables/indexes. I've got 431 user tables with 578 indexes. How high should I push this? Can I just create a bunch of randomly named empty tables with primary keys to provoke this effect? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, the followup to that message points out that the 8.4.0 code >> has a potential O(N^2) dependency on the total number of TOC items >> in the dump. So it might be interesting to check the behavior with >> very large numbers of tables/indexes. > I've got 431 user tables with 578 indexes. How high should I push > this? Can I just create a bunch of randomly named empty tables with > primary keys to provoke this effect? Yeah, just add a bunch of empty tables. Ten thousand or so, perhaps. regards, tom lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > with the default settings, the patched version ran an additional 1% > faster than the unpatched; although I don't have enough samples to > have a high degree of confidence it wasn't noise. I'll run another > slew of tests tonight with the existing dump file to confirm to > debunk that result The timings vary by up to 2.5% between runs, so that's the noise level. Five runs of each (alternating between the two) last night give an average performance of 1.89% faster for the patched version. Combining that with yesterday's results starts to give me pretty good confidence that the patch is beneficial for this database with this configuration. I haven't found any database or configuration where it hurts. (For most tests, adding up the results gave a net difference measured in thousandths of a percent.) Is that good enough, or is it still worth the effort of constructing the artificial case where it might *really* shine? Or should I keep running with the "real" database a few more nights to get a big enough sample to further increase the confidence level with this test? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > The timings vary by up to 2.5% between runs, so that's the noise > level. Five runs of each (alternating between the two) last night > give an average performance of 1.89% faster for the patched version. > Combining that with yesterday's results starts to give me pretty good > confidence that the patch is beneficial for this database with this > configuration. I haven't found any database or configuration where it > hurts. (For most tests, adding up the results gave a net difference > measured in thousandths of a percent.) > Is that good enough, or is it still worth the effort of constructing > the artificial case where it might *really* shine? Or should I keep > running with the "real" database a few more nights to get a big enough > sample to further increase the confidence level with this test? I think we've pretty much established that it doesn't make things *worse*, so I'm sort of inclined to go ahead and apply it. The theoretical advantage of eliminating O(N^2) search behavior seems like reason enough, even if it takes a ridiculous number of tables for that to become significant. regards, tom lane
On Thu, Jul 30, 2009 at 1:24 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> The timings vary by up to 2.5% between runs, so that's the noise >> level. Five runs of each (alternating between the two) last night >> give an average performance of 1.89% faster for the patched version. >> Combining that with yesterday's results starts to give me pretty good >> confidence that the patch is beneficial for this database with this >> configuration. I haven't found any database or configuration where it >> hurts. (For most tests, adding up the results gave a net difference >> measured in thousandths of a percent.) > >> Is that good enough, or is it still worth the effort of constructing >> the artificial case where it might *really* shine? Or should I keep >> running with the "real" database a few more nights to get a big enough >> sample to further increase the confidence level with this test? > > I think we've pretty much established that it doesn't make things > *worse*, so I'm sort of inclined to go ahead and apply it. The > theoretical advantage of eliminating O(N^2) search behavior seems > like reason enough, even if it takes a ridiculous number of tables > for that to become significant.

That makes sense to me, but OTOH if Kevin's willing to do more testing on some artificial cases, particularly the interleaved-index-names case, I think those results would be interesting too. We already know that the slowness of dump + restore is a big issue, so any data we can gather to understand it better (and perhaps eventually design further improvements) seems like it would be time well spent.

...Robert
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think we've pretty much established that it doesn't make things > *worse*, so I'm sort of inclined to go ahead and apply it. The > theoretical advantage of eliminating O(N^2) search behavior seems > like reason enough, even if it takes a ridiculous number of tables > for that to become significant. Agreed, although I'm having some concerns about whether this should proceed based exclusively on my benchmarks. On a thread on the performance list, people are talking about restores which go several times faster with parallel restore (compared to a single job). On my hardware, I haven't even gotten it to run twice as fast. This means that parallel restore is not a good fit for servers like we have, at least with databases like we have, which means it's probably a poor environment to get benchmarks for this patch. :-( Can we get someone who has benchmarks showing parallel restore to be eight times the speed of a single job to benchmark with this patch, just for confirmation? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Rebased to correct for pg_indent changes. Thanks for doing that. Attached is a further small improvement that gets rid of the find_ready_items() scans. After re-reading the patch I realized that it wasn't *really* avoiding O(N^2) behavior ... but this version does. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> Rebased to correct for pg_indent changes. > > Thanks for doing that. No problem. I think I still owe you a few. :-) > Attached is a further small improvement that gets rid of the > find_ready_items() scans. After re-reading the patch I realized > that it wasn't *really* avoiding O(N^2) behavior ... but this > version does. I'll run a fresh set of benchmarks. By the way, I muffed the setup of last night's benchmarks, so no new information there, except that in the process of reviewing the attempt I discovered I was guilty of making a false assertion yesterday, based on remembering incorrectly. The logs show that the six hour dump to custom format was over the LAN. :-( I hope I didn't waste too much of people's time by saying otherwise. I'll try to get some numbers on the same-box dump soon. (Note to self: never, ever trust your memory; always confirm.) -Kevin
On Thu, Jul 30, 2009 at 12:29:34PM -0500, Kevin Grittner wrote: > Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > I think we've pretty much established that it doesn't make things > > *worse*, so I'm sort of inclined to go ahead and apply it. The > > theoretical advantage of eliminating O(N^2) search behavior seems > > like reason enough, even if it takes a ridiculous number of tables > > for that to become significant. > > Agreed, although I'm having some concerns about whether this should > proceed based exclusively on my benchmarks. On a thread on the > performance list, people are talking about restores which go several > times faster with parallel restore (compared to a single job). On my > hardware, I haven't even gotten it to run twice as fast. This means > that parallel restore is not a good fit for servers like we have, at > least with databases like we have, which means it's probably a poor > environment to get benchmarks for this patch. :-( > > Can we get someone who has benchmarks showing parallel restore to be > eight times the speed of a single job to benchmark with this patch, > just for confirmation? I have a couple "spare" 32GB 4 core and 64GB 8 core servers with 15 scsi/sas drives and dumps of production dbs in the 100GB to 500 GB range. These have several hundred tables most with an index or few and an fkey or too many. It will take a couple days to run a variety of tests I suppose, and I will be away starting mid next week, but maybe I could get some done before I go. Will the patch apply to a vanilla 8.4.0? -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
daveg <daveg@sonic.net> writes: > Will the patch apply to a vanilla 8.4.0? Yeah, it should. The line numbers in the version I just posted might be off a little bit for 8.4.0, but patch should cope. Be sure to "make clean" and recompile all of src/bin/pg_dump, else you might have some issues. regards, tom lane
I wrote: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Attached is a further small improvement that gets rid of the >> find_ready_items() scans. After re-reading the patch I realized >> that it wasn't *really* avoiding O(N^2) behavior ... but this >> version does. > > I'll run a fresh set of benchmarks.

Over the weekend I ran 40 restores of Milwaukee County's production data using Friday's snapshot with and without the patch. I alternated between patched and unpatched. It appears that this latest version is slightly slower for our production database on the same machine and configuration where the previous patch appeared to be 1% to 2% faster than unpatched (although I had fewer samples of that). Although the actual runs were interleaved, I've separated them for this email, because it seems easier to look them over this way:

8.5devel pg_restore -j2

real 77m13.737s
real 78m21.701s
real 78m21.536s
real 77m37.541s
real 79m10.068s
real 81m37.111s
real 77m52.252s
real 80m49.110s
real 76m50.093s
real 78m0.701s
real 77m30.674s
real 77m22.875s
real 80m43.914s
real 78m51.525s
real 80m46.329s
real 76m56.703s
real 78m44.960s
real 82m37.757s
real 84m12.938s
real 82m27.591s

8.5devel-alt pg_restore -j2

real 78m10.846s
real 78m5.584s
real 78m13.673s
real 79m43.939s
real 84m39.593s
real 80m25.240s
real 78m10.045s
real 82m34.320s
real 78m43.528s
real 77m12.211s
real 77m39.171s
real 79m58.421s
real 84m50.816s
real 85m56.248s
real 79m17.361s
real 79m0.778s
real 77m3.525s
real 77m22.750s
real 78m22.882s
real 78m51.617s

That's about 0.52% slower with the patch. Because there was over 10% variation in the numbers with the patch, I tried leaving out the four highest outliers on both, in case it was the result of some other activity on the system (even though this machine should have been pretty quiet over the weekend) and the difference fell to 0.09%.

I don't know if this difference is enough to worry about; it might depend on whether we're comparing to the unpatched version or the previous patch. If it comes to choosing between a 1% to 2% performance gain for a "normal" case versus elimination of O(N^2) behavior for a worst-case scenario, I'm not sure which should win -- especially in the absence of benchmarks showing the pessimal case. What would be the most productive focus for benchmarks at this point? The artificial pessimal case?

-Kevin
> That's about 0.52% slower with the patch. Because there was over 10% > variation in the numbers with the patch, I tried leaving out the four > highest outliers on both, in case it was the result of some other > activity on the system (even though this machine should have been > pretty quiet over the weekend) and the difference fell to 0.09%. > > I don't know if this difference is enough to worry about; it might > depend on whether we're comparing to the unpatched version or the > previous patch. If it comes to choosing between a 1% to 2% > performance gain for a "normal" case versus elimination of O(N^2) > behavior for a worst-case scenario, I'm not sure which should win -- > especially in the absence of benchmarks showing the pessimal case. > What would be the most productive focus for benchmarks at this point? > The artificial pessimal case? > > > My instinct says that the variation is probably just noise, of no great significance. I'm personally not terribly worried about the O(n^2) case, but I think the patch is probably useful anyway, as a basis for other people to try to improve the item selection algorithm further. cheers andrew
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Over the weekend I ran 40 restores of Milwaukee County's production > data using Friday's snapshot with and without the patch. I alternated > between patched and unpatched. It appears that this latest version is > slightly slower for our production database on the same machine and > configuration where the previous patch appeared to be 1% to 2% faster > than unpatched (although I had fewer samples of that). I think we can conclude that for this particular test case, the effects of the patch are pretty much masked by noise. I definitely see no way that the latest version of the patch could really be slower than the original; it has the same job-scheduling behavior and strictly less list-munging overhead. Now the patch could be slower than unpatched as a result of different job-scheduling behavior ... but there's no evidence here of a consistently measurable benefit or loss from that. IIRC daveg was volunteering to do some tests with his own data; maybe we should wait for those results. regards, tom lane
> IIRC daveg was volunteering to do some tests with his own data; maybe > we should wait for those results. Unfortunately, I've lost access to the client's data which was showing bad behaviour under the first heuristic. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
On Mon, Aug 03, 2009 at 11:21:43AM -0400, Tom Lane wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > > Over the weekend I ran 40 restores of Milwaukee County's production > > data using Friday's snapshot with and without the patch. I alternated > > between patched and unpatched. It appears that this latest version is > > slightly slower for our production database on the same machine and > > configuration where the previous patch appeared to be 1% to 2% faster > > than unpatched (although I had fewer samples of that). > > I think we can conclude that for this particular test case, the effects > of the patch are pretty much masked by noise. I definitely see no way > that the latest version of the patch could really be slower than the > original; it has the same job-scheduling behavior and strictly less > list-munging overhead. Now the patch could be slower than unpatched > as a result of different job-scheduling behavior ... but there's no > evidence here of a consistently measurable benefit or loss from that. > > IIRC daveg was volunteering to do some tests with his own data; maybe > we should wait for those results. I have run extensive tests with three trials of each configuration on two hosts with a variety of db sizes from 3GB to 142GB. These just finished, and I will send a more detailed summary later, but at the moment I don't see any significant difference between the patched and vanilla pg_restore. -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
On Mon, Aug 03, 2009 at 10:03:47AM -0500, Kevin Grittner wrote: > That's about 0.52% slower with the patch. Because there was over 10% > variation in the numbers with the patch, I tried leaving out the four > highest outliers on both, in case it was the result of some other > activity on the system (even though this machine should have been > pretty quiet over the weekend) and the difference fell to 0.09%.

What do people do when testing this? I think I'd look to something like Student's t-test to check for statistical significance. My working would go something like:

I assume the variance is the same because it's being tested on the same machine.

samples = 20
stddev  = 144.26
avg1    = 4783.13
avg2    = 4758.46
t       = 0.54   ((avg1 - avg2) / (stddev * sqrt(2/samples)))

We then have to choose how certain we want to be that they're actually different; 90% is a reasonably easy level to hit (i.e. one part in ten, with 95% being more commonly quoted). For 20 samples we have 19 degrees of freedom--giving us a cut-off[1] of 1.328. 0.54 is obviously well below this, allowing us to say that there's no "statistical significance" between the two samples at a 90% level.

-- Sam http://samason.me.uk/

[1] http://en.wikipedia.org/wiki/Student's_t-distribution#Table_of_selected_values
Sam Mason <sam@samason.me.uk> writes: > t = 0.54 ((avg1 - avg2) / (stddev * sqrt(2/samples))) > We then have to choose how certain we want to be that they're actually > different, 90% is a reasonably easy level to hit (i.e. one part in ten, > with 95% being more commonly quoted). For 20 samples we have 19 degrees > of freedom--giving us a cut-off[1] of 1.328. 0.54 is obviously well > below this allowing us to say that there's no "statistical significance" > between the two samples at a 90% level. Hmm, so what about 95% or 99% confidence? regards, tom lane
On Tue, Aug 04, 2009 at 10:45:52AM -0400, Tom Lane wrote: > Sam Mason <sam@samason.me.uk> writes: > > t = 0.54 ((avg1 - avg2) / (stddev * sqrt(2/samples))) > > > We then have to choose how certain we want to be that they're actually > > different, 90% is a reasonably easy level to hit (i.e. one part in ten, > > with 95% being more commonly quoted). For 20 samples we have 19 degrees > > of freedom--giving us a cut-off[1] of 1.328. 0.54 is obviously well > > below this allowing us to say that there's no "statistical significance" > > between the two samples at a 90% level. > > Hmm, so what about 95% or 99% confidence?

The cut-off goes up to 1.729 for 95% and to 2.539 for 99%. These values are only really for 20 samples with the above calculation; the link I gave above gives a nice table for different values.

I've also realized that I did the standard deviation all wrong. I should have calculated them independently and then taken the mean:

stddev1 = 159.9699
stddev2 = 129.6466
stddev  = 144.8083   ((stddev1 + stddev2) / 2)

Here it makes absolutely no difference, but when they were really different distributions it would.

-- Sam http://samason.me.uk/
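Since the timings are easy to keep in a database anyway, the same calculation can be run directly in SQL. A rough sketch, assuming a hypothetical table timings(variant text, seconds numeric) with one row per run and equal run counts per variant, as in these tests:

    -- Two-sample t-statistic, averaging the two sample standard deviations
    -- as above (equal sample sizes assumed).
    SELECT (p.avg_s - u.avg_s) / (((p.sd + u.sd) / 2) * sqrt(2.0 / p.n)) AS t
      FROM (SELECT avg(seconds) AS avg_s, stddev_samp(seconds) AS sd, count(*) AS n
              FROM timings WHERE variant = 'patched') AS p,
           (SELECT avg(seconds) AS avg_s, stddev_samp(seconds) AS sd, count(*) AS n
              FROM timings WHERE variant = 'unpatched') AS u;

The result still has to be compared against a t-distribution cut-off separately.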
Sam Mason <sam@samason.me.uk> wrote: > What do people do when testing this? I think I'd look to something > like Student's t-test to check for statistical significance. My > working would go something like: > > I assume the variance is the same because it's being tested on the > same machine. > > samples = 20 > stddev = 144.26 > avg1 = 4783.13 > avg2 = 4758.46 > t = 0.54 ((avg1 - avg2) / (stddev * sqrt(2/samples))) > > We then have to choose how certain we want to be that they're > actually different, 90% is a reasonably easy level to hit (i.e. one > part in ten, with 95% being more commonly quoted). For 20 samples > we have 19 degrees of freedom--giving us a cut-off[1] of 1.328. > 0.54 is obviously well below this allowing us to say that there's no > "statistical significance" between the two samples at a 90% level.

Thanks for the link; that looks useful. To confirm that I understand what this has established (or get a bit of help putting it in perspective), what this says to me, in the least technical jargon I can muster, is "With this many samples and this degree of standard deviation, the average difference is not large enough to have a 90% confidence level that the difference is significant." In fact, looking at the chart, it isn't enough to reach a 75% confidence level that the difference is significant. Significance here would seem to mean that at least the given percentage of the time, picking this many samples from an infinite set with an average difference that really was this big or bigger would generate a value for t this big or bigger.

Am I close? I like to be clear, because it's easy to get confused and take the above to mean that there's a 90% confidence that there is no actual significant difference in performance based on that sampling. (Given Tom's assurance that this version of the patch should have similar performance to the last, and the samples from the prior patch went the other direction, I'm convinced there is not a significant difference, but if I'm going to use the referenced calculations, I want to be clear how to interpret the results.)

-Kevin
On Fri, Aug 07, 2009 at 10:19:20AM -0500, Kevin Grittner wrote: > Sam Mason <sam@samason.me.uk> wrote: > > > What do people do when testing this? I think I'd look to something > > like Student's t-test to check for statistical significance. My > > working would go something like: > > > > I assume the variance is the same because it's being tested on the > > same machine. > > > > samples = 20 > > stddev = 144.26 > > avg1 = 4783.13 > > avg2 = 4758.46 > > t = 0.54 ((avg1 - avg2) / (stddev * sqrt(2/samples))) > > > > We then have to choose how certain we want to be that they're > > actually different, 90% is a reasonably easy level to hit (i.e. one > > part in ten, with 95% being more commonly quoted). For 20 samples > > we have 19 degrees of freedom--giving us a cut-off[1] of 1.328. > > 0.54 is obviously well below this allowing us to say that there's no > > "statistical significance" between the two samples at a 90% level. > > Thanks for the link; that looks useful. To confirm that I understand > what this has established (or get a bit of help putting in in > perspective), what this says to me, in the least technical jargon I > can muster, is "With this many samples and this degree of standard > deviation, the average difference is not large enough to have a 90% > confidence level that the difference is significant." In fact, > looking at the chart, it isn't enough to reach a 75% confidence level > that the difference is significant. Significance here would seem to > mean that at least the given percentage of the time, picking this many > samples from an infinite set with an average difference that really > was this big or bigger would generate a value for t this big or > bigger. > > Am I close? Yes, all that sounds as though you've got it. Note that running the test more times will tend to reduce the standard deviation a bit as well, so it may well become significant. In this case it's unlikely to affect it much though. > I like to be clear, because it's easy to get confused and take the > above to mean that there's a 90% confidence that there is no actual > significant difference in performance based on that sampling. (Given > Tom's assurance that this version of the patch should have similar > performance to the last, and the samples from the prior patch went the > other direction, I'm convinced there is not a significant difference, > but if I'm going to use the referenced calculations, I want to be > clear how to interpret the results.) All we're saying is that we're less than 90% confident that there's something "significant" going on. All the fiddling with standard deviations and sample sizes is just easiest way (that I know of) that statistics currently gives us of determining this more formally than a hand-wavy "it looks OK to me". Science tells us that humans are liable to say things are OK when they're not, as well as vice versa; statistics gives us a way to work past these limitations in some common and useful situations. -- Sam http://samason.me.uk/
Sam Mason <sam@samason.me.uk> wrote: > Yes, all that sounds as though you've got it. Thanks. I read through it carefully a few times, but I was still only 80% confident that I had it more-or-less right. ;-) That does seem like a good test, with the advantage of being relatively easy to calculate. Thanks again for suggesting it. -Kevin
On Fri, Aug 07, 2009 at 10:39:19AM -0500, Kevin Grittner wrote: > Sam Mason <sam@samason.me.uk> wrote: > > Yes, all that sounds as though you've got it. > > Thanks. I read through it carefully a few times, but I was still only > 80% confident that I had it more-or-less right. ;-) And which method did you use to determine that you're 80% confident? :) -- Sam http://samason.me.uk/
Sam Mason <sam@samason.me.uk> wrote: > All we're saying is that we're less than 90% confident that there's > something "significant" going on. All the fiddling with standard > deviations and sample sizes is just easiest way (that I know of) > that statistics currently gives us of determining this more formally > than a hand-wavy "it looks OK to me". Science tells us that humans > are liable to say things are OK when they're not, as well as vice > versa; statistics gives us a way to work past these limitations in > some common and useful situations. Following up, I took the advice offered in the referenced article, and used a spreadsheet with a TDIST function for more accurate results than available through the table included in the article. That allows what I think is a more meaningful number: the probability that taking a sample that big would have resulted in a t-statistic larger than was actually achieved if there was no real difference. With the 20 samples from that last round of tests, the answer (rounded to the nearest percent) is 60%, so "probably noise" is a good summary. Combined with the 12 samples from earlier comparable runs with the prior version of the patch, it goes to a 90% probability that noise would generate a difference at least that large, so I think we've gotten to "almost certainly noise". :-) To me, that seems more valuable for this situation than saying "we haven't reached 90% confidence that it's a real difference." I used the same calculations up through the t-statistic. The one question I have left for this technique is why you went with ((avg1 - avg2) / (stddev * sqrt(2/samples))) instead of ((avg1 - avg2) / (stddev / sqrt(samples))) I assume that it's because the baseline was a set of samples rather than a fixed mark, but I couldn't pick out a specific justification for this in the literature (although I might have just missed it), so I'd feel more comfy if you could clarify. Given the convenience of capturing benchmarking data in a database, has anyone tackled implementation of something like the spreadsheet TDIST function within PostgreSQL? -Kevin
On Fri, Aug 7, 2009 at 3:08 PM, Kevin Grittner<Kevin.Grittner@wicourts.gov> wrote: > Sam Mason <sam@samason.me.uk> wrote: > >> All we're saying is that we're less than 90% confident that there's >> something "significant" going on. All the fiddling with standard >> deviations and sample sizes is just easiest way (that I know of) >> that statistics currently gives us of determining this more formally >> than a hand-wavy "it looks OK to me". Science tells us that humans >> are liable to say things are OK when they're not, as well as vice >> versa; statistics gives us a way to work past these limitations in >> some common and useful situations. > > Following up, I took the advice offered in the referenced article, and > used a spreadsheet with a TDIST function for more accurate results > than available through the table included in the article. That allows > what I think is a more meaningful number: the probability that taking > a sample that big would have resulted in a t-statistic larger than was > actually achieved if there was no real difference. > > With the 20 samples from that last round of tests, the answer (rounded > to the nearest percent) is 60%, so "probably noise" is a good summary. > Combined with the 12 samples from earlier comparable runs with the > prior version of the patch, it goes to a 90% probability that noise > would generate a difference at least that large, so I think we've > gotten to "almost certainly noise". :-) So should we give up on this patch? ...Robert
On Fri, Aug 07, 2009 at 02:08:21PM -0500, Kevin Grittner wrote: > With the 20 samples from that last round of tests, the answer (rounded > to the nearest percent) is 60%, so "probably noise" is a good summary. > Combined with the 12 samples from earlier comparable runs with the > prior version of the patch, it goes to a 90% probability that noise > would generate a difference at least that large, so I think we've > gotten to "almost certainly noise". :-) > > To me, that seems more valuable for this situation than saying "we > haven't reached 90% confidence that it's a real difference." I used > the same calculations up through the t-statistic. The stats people in our group just tend to say that things are significant or not at a specific level; never bothered to find out why, I'll ask someone when I get a chance. > The one question I have left for this technique is why you went with > > ((avg1 - avg2) / (stddev * sqrt(2/samples))) > instead of > ((avg1 - avg2) / (stddev / sqrt(samples))) I was just doing a literal translation of what was on the Wikipedia page: http://en.wikipedia.org/wiki/Student's_t-test#Independent_two-sample_t-test If you really want to find out, there should be much better implementations in the pl/r language already in PG. I'd trust R much more than Wikipedia, but for things like this Wikipedia is reasonable. > I assume that it's because the baseline was a set of samples rather > than a fixed mark, but I couldn't pick out a specific justification > for this in the literature (although I might have just missed it), so > I'd feel more comfy if you could clarify. Sorry, that's about my limit! I've never studied stats, I'm a computer science person who just happens to be around people who use stats on a day-to-day basis and think it needs more use in the software world. I think you're right and you're aggregating the errors from two (assumed independent) datasets hence you want to keep a bit more of the error in there. As to the formal justification (and probably proof) I've no real idea. > Given the convenience of capturing benchmarking data in a database, > has anyone tackled implementation of something like the spreadsheet > TDIST function within PostgreSQL? Again, pl/r is what you want! -- Sam http://samason.me.uk/
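A very rough sketch of that suggestion, assuming PL/R is installed and that (as I recall its conventions, worth double-checking against the PL/R documentation) unnamed arguments are visible to the R code as arg1, arg2, and so on:

    -- Two-tailed p-value for a t-statistic, leaning on R's pt() through PL/R;
    -- roughly what the spreadsheet TDIST(x, df, 2) call computes.
    CREATE OR REPLACE FUNCTION tdist(float8, float8)
    RETURNS float8 AS '
        2 * (1 - pt(abs(arg1), arg2))
    ' LANGUAGE plr;

    -- usage: SELECT tdist(t_statistic, degrees_of_freedom);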
On Fri, Aug 07, 2009 at 03:18:54PM -0400, Robert Haas wrote: > On Fri, Aug 7, 2009 at 3:08 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > > With the 20 samples from that last round of tests, the answer (rounded > > to the nearest percent) is 60%, so "probably noise" is a good summary. > > So should we give up on this patch?

That's the joy of stats: it only tells you *very* precisely about the *exact* thing you've chosen to test. Interpreting the result is still awkward, but it does remove one problem!

If you think the tests that've been done cover the use cases that the new code was designed to help with, and you're not showing any benefit, I'd probably give up and put it down to a learning experience. Sorry, but I've not been following enough to comment on this much more.

-- Sam http://samason.me.uk/
Robert Haas <robertmhaas@gmail.com> wrote: > So should we give up on this patch?

No, this is not news; just confirmation of the earlier gut feelings and less convincing statistics that there is no problem. Tom's argument (that if there's no slowdown for common cases, preventing O(N^2) behavior for extreme cases is reason enough) is compelling for me, and we've beaten up on it enough for me to feel comfortable that it doesn't break anything.

I held off on investigating the artificial extreme cases when I thought we might possibly have a small performance problem here; but this statistics exercise has just gotten me from "gut feel" that it was noise and "not having 90% confidence that we made things worse" to "90% of the time noise would produce a bigger difference". It's one thing to require 90% confidence that an improvement is real before accepting something; it's another to accept a change on the basis of not having 90% confidence that there is degradation -- so I wanted to see a more compelling statistic.

Personally, I'm happy with it being in "Ready for Committer" status. I remember someone else on the thread saying that besides the elimination of O(N^2) behavior, it provided better structure for future enhancements. I'll run a few more benchmarks over the next few weeks to try to characterize the improvements in extreme cases, "just for the record", but I don't think we want to wait for that; we've got justification enough as is.

-Kevin
On Fri, Aug 7, 2009 at 3:36 PM, Sam Mason<sam@samason.me.uk> wrote: > On Fri, Aug 07, 2009 at 03:18:54PM -0400, Robert Haas wrote: >> On Fri, Aug 7, 2009 at 3:08 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> > With the 20 samples from that last round of tests, the answer (rounded >> > to the nearest percent) is 60%, so "probably noise" is a good summary. >> >> So should we give up on this patch? > > That's the joy of stats: it only tells you *very* precisely about the > *exact* thing you've chosen to test. Interpreting the result is still > awkward, but it does remove one problem! > > If you think the tests that've been done cover the use cases that the > new code was designed to help with, and you're not showing any > benefit, I'd probably give up and put it down to a learning experience. > Sorry, but I've not been following enough to comment on this much more.

Yeah, I more wanted to hear from Tom or Kevin or anyone else who had a technical thought about this. I haven't looked at the patch, but there may be reasons other than performance to commit it - or there may not. Tom posted a note on the commitfest suggesting that maybe we should give up on this, and I don't care one way or the other, except that in my role as CommitFest manager (or Mom) I'm trying to push the remaining patches to some sort of conclusion: either committed, or rejected, or "needs more work, please resubmit for the next CommitFest".

...Robert
I wrote: > I remember someone else on the thread saying [...] > it provided better structure for future enhancements. Found the reference: http://archives.postgresql.org/pgsql-hackers/2009-08/msg00078.php This was the email which I thought confirmed that the changes were worth it, even in the absence of benchmarks showing performance improvement. I guess the counter-argument is that so far this has been framed as a performance patch, and I've seen posts before which say that we don't accept those without benchmarks to show an actual performance improvement. Accepting it on the basis of evidence and comments so far would, I suppose, put it more in the category of a refactoring for cleaner code. Should we leave it to the code committer to make the final call? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Robert Haas <robertmhaas@gmail.com> wrote: >> So should we give up on this patch? > No, this is not news; just confirmation of the earlier gut feelings > and less convincing statistics that there is no problem. Tom's > argument that if there's no slowdown for common cases, preventing > O(N^2) behavior for extreme cases is compelling for me, and we've > beaten up on it enough for me to feel comfortable that it doesn't > break anything. Yeah. I had hoped to see some evidence of actual improvement for common cases, but we haven't got that. What we do have is evidence that it's not making common cases worse, so avoiding the possible O(N^2) behavior for extreme cases seems like a reasonable argument for committing it anyway. I'll go ahead and commit it just to get it out of the queue ... we can always revert the commit if new info surfaces. regards, tom lane