Thread: 9.4 regression
Hi all,

I recently tried a simple benchmark to see how far 9.4 had come since 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for performance. After checking 9.2 and 9.3 (as per Kevin Grittner's suggestion), I found that those were fine, so the issue must be in 9.4devel. I used identical configurations for each test, and used 9.1's pgbench to ensure changes in pgbench didn't affect the outcome.

The common config changes were:

max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
checkpoint_segments = 32
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
commit_delay = 50
commit_siblings = 15

System info:

8GB DDR3 RAM (yes, I know the config isn't optimal here)
64-bit Linux Mint 15 (3.8.0 kernel)
ext4

Only build option used was --enable-depend. I did have --enable-cassert for the shorter 5 min benchmarks, but it was removed for the 30 min tests.

Here are the results:

pgbench -j 80 -c 80 -T 300:

8.4 - 535.990042
9.2 - 820.798141
9.3 - 828.395498
9.4 - 197.851720

pgbench -j 20 -c 20 -T 300:

8.4 - 496.529343
9.2 - 569.626729
9.3 - 575.831264
9.4 - 385.658893

pgbench -j 20 -c 400 -T 300:

8.4 - 580.186868
9.2 - 824.590252
9.3 - 784.638848
9.4 - 524.493308

These were only run for 5 minutes each, but subsequent ones were run for 30 mins. All tests were run with -s 20.

Given a few key commits Robert Haas directed me to, I narrowed the regression down to a time period, and then benchmarked a few select commits.

pgbench -j 80 -c 80 -T 1800:

8.4: 812.482108
9.4 HEAD: 355.397658
9.4 e5592c (9th July): 356.485625
9.4 537227 (7th July): 365.992518
9.4 9b2543 (7th July): 362.587339
9.4 269e78 (5th July): 359.439143
9.4 8800d8 (5th July): 821.933082
9.4 568d41 (2nd July): 822.991160

269e78 was the commit immediately after 8800d8, so it appears that commit introduced the regression:

"Use posix_fallocate() for new WAL files, where available."

Ironically, that was intended to be a performance improvement.

Regards

Thom
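For readers trying to reproduce this, a minimal sketch of the kind of pgbench runs described above; the database name and the explicit initialisation step are assumptions rather than quotes from the thread (the thread notes that the 9.1 pgbench binary was used against every server version):

createdb pgb
pgbench -i -s 20 pgb               # initialise at scale factor 20
pgbench -j 80 -c 80 -T 300 pgb     # 80 threads, 80 clients, 5 minutes
pgbench -j 20 -c 20 -T 300 pgb
pgbench -j 20 -c 400 -T 300 pgb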
On 2013-08-07 17:21:01 +0100, Thom Brown wrote:
> Only build option used was --enable-depend. I did have
> --enable-cassert for the shorter 5 min benchmarks, but it was removed
> for the 30 min tests.

> pgbench -j 80 -c 80 -T 300:
>
> 8.4 - 535.990042
> 9.2 - 820.798141
> 9.3 - 828.395498
> 9.4 - 197.851720

> pgbench -j 80 -c 80 -T 1800:
>
> 8.4: 812.482108
> 9.4 HEAD: 355.397658
> 9.4 e5592c (9th July): 356.485625
> 9.4 537227 (7th July): 365.992518
> 9.4 9b2543 (7th July): 362.587339
> 9.4 269e78 (5th July): 359.439143
> 9.4 8800d8 (5th July): 821.933082
> 9.4 568d41 (2nd July): 822.991160

The differences between those runs look too small for enable/disable cassert to me. Are you sure you properly rebuilt for that?

> 269e78 was the commit immediately after 8800d8, so it appears that
> introduced the regression.
>
> "Use posix_fallocate() for new WAL files, where available."

This is curious. Could you either run a longer test before/after the commit or reduce checkpoint_timeout to something like 3min?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom@linux.com> wrote: > Hi all, > > I recently tried a simple benchmark to see how far 9.4 had come since > 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for > performance. After checking 9.2 and 9.3 (as per Kevin Grittner's > suggestion), I found that those were fine, so the issue must be in > 9.4devel. I used identical configurations for each test, and used > 9.1's pgbench to ensure changes in pgbench didn't affect the outcome. > The common config changes were: ... > 8.4: 812.482108 > 9.4 HEAD: 355.397658 > 9.4 e5592c (9th July): 356.485625 > 9.4 537227 (7th July): 365.992518 > 9.4 9b2543 (7th July): 362.587339 > 9.4 269e78 (5th July): 359.439143 > 9.4 8800d8 (5th July): 821.933082 > 9.4 568d41 (2nd July): 822.991160 > > 269e78 was the commit immediately after 8800d8, so it appears that > introduced the regression. > > "Use posix_fallocate() for new WAL files, where available." > > Ironically, that was intended to be a performance improvement. Would it be possible for you to download, compile, and run the test program as described and located in this email: http://www.postgresql.org/message-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com I also wonder if there is a problem with the 3.8.0 kernel specifically. -- Jon
On 7 August 2013 17:49, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-08-07 17:21:01 +0100, Thom Brown wrote:
>> Only build option used was --enable-depend. I did have
>> --enable-cassert for the shorter 5 min benchmarks, but it was removed
>> for the 30 min tests.
>
>> pgbench -j 80 -c 80 -T 300:
>>
>> 8.4 - 535.990042
>> 9.2 - 820.798141
>> 9.3 - 828.395498
>> 9.4 - 197.851720
>
>> pgbench -j 80 -c 80 -T 1800:
>>
>> 8.4: 812.482108
>> 9.4 HEAD: 355.397658
>> 9.4 e5592c (9th July): 356.485625
>> 9.4 537227 (7th July): 365.992518
>> 9.4 9b2543 (7th July): 362.587339
>> 9.4 269e78 (5th July): 359.439143
>> 9.4 8800d8 (5th July): 821.933082
>> 9.4 568d41 (2nd July): 822.991160
>
> The differences between those runs look too small for enable/disable
> cassert to me. Are you sure you properly rebuilt for that?
>
>> 269e78 was the commit immediately after 8800d8, so it appears that
>> introduced the regression.
>>
>> "Use posix_fallocate() for new WAL files, where available."
>
> This is curious. Could you either run a longer test before/after the
> commit or reduce checkpoint_timeout to something like 3min?

Okay, I'll rerun the test for both those commits at 1 hour each with checkpoint_timeout set at 3mins, but with all other configuration settings the same.

--
Thom
On 7 August 2013 17:54, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom@linux.com> wrote:
>> Hi all,
>>
>> I recently tried a simple benchmark to see how far 9.4 had come since
>> 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for
>> performance. After checking 9.2 and 9.3 (as per Kevin Grittner's
>> suggestion), I found that those were fine, so the issue must be in
>> 9.4devel. I used identical configurations for each test, and used
>> 9.1's pgbench to ensure changes in pgbench didn't affect the outcome.
>> The common config changes were:
>
> ...
>
>> 8.4: 812.482108
>> 9.4 HEAD: 355.397658
>> 9.4 e5592c (9th July): 356.485625
>> 9.4 537227 (7th July): 365.992518
>> 9.4 9b2543 (7th July): 362.587339
>> 9.4 269e78 (5th July): 359.439143
>> 9.4 8800d8 (5th July): 821.933082
>> 9.4 568d41 (2nd July): 822.991160
>>
>> 269e78 was the commit immediately after 8800d8, so it appears that
>> introduced the regression.
>>
>> "Use posix_fallocate() for new WAL files, where available."
>>
>> Ironically, that was intended to be a performance improvement.
>
> Would it be possible for you to download, compile, and run the test
> program as described and located in this email:
>
> http://www.postgresql.org/message-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com

I shall do so after the 2 x 1 hour benchmarks are complete.

> I also wonder if there is a problem with the 3.8.0 kernel specifically.

Well, my laptop has the same kernel (and also 64-bit Linux Mint), so I took 3 quick sample benchmarks on those two commits, and I get the following (all without --enable-cassert):

269e78: 1162.593665 / 1158.644302 / 1153.955611
8800d8: 2446.087618 / 2443.797252 / 2321.876431

And running your test program gives the following (again, just on my laptop):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.6380s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s
method: classic. 2 open/close iterations, 1 rewrite in 1.2908s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s
method: classic. 5 open/close iterations, 1 rewrite in 3.1419s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s
method: classic. 10 open/close iterations, 1 rewrite in 6.2912s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s
method: classic. 100 open/close iterations, 1 rewrite in 67.4174s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s

--
Thom
On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown <thom@linux.com> wrote: > On 7 August 2013 17:54, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: >> On Wed, Aug 7, 2013 at 11:21 AM, Thom Brown <thom@linux.com> wrote: >>> Hi all, >>> >>> I recently tried a simple benchmark to see how far 9.4 had come since >>> 8.4, but I discovered that I couldn't get 9.4 to even touch 8.4 for >>> performance. After checking 9.2 and 9.3 (as per Kevin Grittner's >>> suggestion), I found that those were fine, so the issue must be in >>> 9.4devel. I used identical configurations for each test, and used >>> 9.1's pgbench to ensure changes in pgbench didn't affect the outcome. >>> The common config changes were: >> >> ... >> >>> 8.4: 812.482108 >>> 9.4 HEAD: 355.397658 >>> 9.4 e5592c (9th July): 356.485625 >>> 9.4 537227 (7th July): 365.992518 >>> 9.4 9b2543 (7th July): 362.587339 >>> 9.4 269e78 (5th July): 359.439143 >>> 9.4 8800d8 (5th July): 821.933082 >>> 9.4 568d41 (2nd July): 822.991160 >>> >>> 269e78 was the commit immediately after 8800d8, so it appears that >>> introduced the regression. >>> >>> "Use posix_fallocate() for new WAL files, where available." >>> >>> Ironically, that was intended to be a performance improvement. >> >> Would it be possible for you to download, compile, and run the test >> program as described and located in this email: >> >> http://www.postgresql.org/message-id/CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com > > I shall do after the 2 x 1 hour benchmarks are complete. > >> I also wonder if there is a problem with the 3.8.0 kernel specifically. > > Well my laptop has the same kernel (and also 64-bit Linux Mint), so > took 3 quick sample benchmarks on those two commits, and I get the > following (all without --enable-cassert): > > 269e78: 1162.593665 / 1158.644302 / 1153.955611 > 8800d8: 2446.087618 / 2443.797252 / 2321.876431 > > And running your test program gives the following (again, just on my laptop): > > for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done > method: classic. 1 open/close iterations, 1 rewrite in 0.6380s > method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s > method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s > method: classic. 2 open/close iterations, 1 rewrite in 1.2908s > method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s > method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s > method: classic. 5 open/close iterations, 1 rewrite in 3.1419s > method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s > method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s > method: classic. 10 open/close iterations, 1 rewrite in 6.2912s > method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s > method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s > method: classic. 100 open/close iterations, 1 rewrite in 67.4174s > method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s > method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s OK. That's interesting, and a good data point. One thing you could try manually disabling the use of posix_fallocate in 269e78. After running ./configure --stuff-here edit src/include/pg_config.h and comment out the following line (on or around line 374) #define HAVE_POSIX_FALLOCATE 1 *then* build postgresql and see if the performance hit is still there. -- Jon
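A minimal sketch of that procedure, assuming the commands are run from a source checkout; the sed invocation is just one way of commenting the define out:

./configure --enable-depend
sed -i 's|^#define HAVE_POSIX_FALLOCATE 1|/* #undef HAVE_POSIX_FALLOCATE */|' \
    src/include/pg_config.h        # force the zero-fill code path
make && make install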
On 7 August 2013 18:49, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown <thom@linux.com> wrote: >> for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done >> method: classic. 1 open/close iterations, 1 rewrite in 0.6380s >> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s >> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s >> method: classic. 2 open/close iterations, 1 rewrite in 1.2908s >> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s >> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s >> method: classic. 5 open/close iterations, 1 rewrite in 3.1419s >> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s >> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s >> method: classic. 10 open/close iterations, 1 rewrite in 6.2912s >> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s >> method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s >> method: classic. 100 open/close iterations, 1 rewrite in 67.4174s >> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s >> method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s > > OK. That's interesting, and a good data point. > > One thing you could try manually disabling the use of posix_fallocate in 269e78. > After running ./configure --stuff-here > edit src/include/pg_config.h and comment out the following line (on or > around line 374) > #define HAVE_POSIX_FALLOCATE 1 > > *then* build postgresql and see if the performance hit is still there. Okay, done that. The TPS increases again: 2308.807568 / 2554.264572 / 2563.190204 And I did run ./configure... before removing the line, and built it after the change. -- Thom
On 7 August 2013 17:54, Thom Brown <thom@linux.com> wrote:
> On 7 August 2013 17:49, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-08-07 17:21:01 +0100, Thom Brown wrote:
>>> Only build option used was --enable-depend. I did have
>>> --enable-cassert for the shorter 5 min benchmarks, but it was removed
>>> for the 30 min tests.
>>
>>> pgbench -j 80 -c 80 -T 300:
>>>
>>> 8.4 - 535.990042
>>> 9.2 - 820.798141
>>> 9.3 - 828.395498
>>> 9.4 - 197.851720
>>
>>> pgbench -j 80 -c 80 -T 1800:
>>>
>>> 8.4: 812.482108
>>> 9.4 HEAD: 355.397658
>>> 9.4 e5592c (9th July): 356.485625
>>> 9.4 537227 (7th July): 365.992518
>>> 9.4 9b2543 (7th July): 362.587339
>>> 9.4 269e78 (5th July): 359.439143
>>> 9.4 8800d8 (5th July): 821.933082
>>> 9.4 568d41 (2nd July): 822.991160
>>
>> The differences between those runs look too small for enable/disable
>> cassert to me. Are you sure you properly rebuilt for that?
>>
>>> 269e78 was the commit immediately after 8800d8, so it appears that
>>> introduced the regression.
>>>
>>> "Use posix_fallocate() for new WAL files, where available."
>>
>> This is curious. Could you either run a longer test before/after the
>> commit or reduce checkpoint_timeout to something like 3min?
>
> Okay, I'll rerun the test for both those commits at 1 hour each with
> checkpoint_timeout set at 3mins, but with all other configuration
> settings the same

Results (checkpoint_timeout = 3min):

pgbench -j 80 -c 80 -T 3600

269e78: 606.268013
8800d8: 779.583129

Jon, here are the test results you asked for:

$ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done
method: classic. 1 open/close iterations, 1 rewrite in 0.9157s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s
method: classic. 2 open/close iterations, 1 rewrite in 1.1417s
method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s
method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s
method: classic. 5 open/close iterations, 1 rewrite in 3.6490s
method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s
method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s
method: classic. 10 open/close iterations, 1 rewrite in 5.7562s
method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s
method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s
method: classic. 100 open/close iterations, 1 rewrite in 64.9483s
method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s
method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s

--
Thom
Thom Brown <thom@linux.com> wrote:

> pgbench -j 80 -c 80 -T 3600
>
> 269e78: 606.268013
> 8800d8: 779.583129

As another data point, I duplicated Thom's original tests:

max_connections = 500
shared_buffers = 4GB
effective_cache_size = 12GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
checkpoint_segments = 32
checkpoint_timeout = 15min
checkpoint_completion_target = 0.8
commit_delay = 50
commit_siblings = 15

pgbench -i -s 20 pgbench
pgbench -j 80 -c 80 -T 1800 pgbench

(note the high-contention pgbench setup)

On this system:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (hyperthreading enabled)
2 x 15k RPM drives in software RAID 1, ext4, write cache disabled
3.5.0-37-generic #58-Ubuntu SMP Mon Jul 8 22:07:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

commit 8800d8061dd151d6556f5f8d58f8211fd830169f
number of transactions actually processed: 1298864
tps = 721.384612 (including connections establishing)
tps = 721.388938 (excluding connections establishing)

commit 269e780822abb2e44189afaccd6b0ee7aefa7ddd
number of transactions actually processed: 875900
tps = 486.524741 (including connections establishing)
tps = 486.527891 (excluding connections establishing)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Did you report information about the system affected? What filesystem is it on? If it's ext4, does it have extents enabled?

I think on ext3, or ext4 without extents, it won't have any benefit, but it shouldn't really be any slower either, since the libc implementation is very similar to what we used to do.

I wouldn't expect the benefit from it to have a huge effect either. The main benefit is actually the reduced fragmentation, so it would be pretty dependent on the filesystem it's on anyway. Avoiding stalls when creating new WAL files is also nice, but that shouldn't happen often enough in your benchmarks to cause an effect on the average numbers.
On Wed, Aug 7, 2013 at 4:18 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > Thom Brown <thom@linux.com> wrote: > >> pgbench -j 80 -c 80 -T 3600 >> >> 269e78: 606.268013 >> 8800d8: 779.583129 I have also been running some tests and - as yet - they are inconclusive. What I can say about them so far is that - at times - they agree with these results and do show a performance hit. I took the settings as posted and adjusted them slightly for my much lower-powered personal laptop, changing checkpoint_completion_target to 1.0 and checkpoint_timeout to 1min. I am testing with the latest git HEAD, turning fallocate support on and off by editing xlog.c directly. Furthermore, before each run I would try to reduce the number of existing WAL files by making a "pre" run with checkpoint_segments = 3 before changing it to 32. For reasons I don't entirely understand, when WAL files are being created (rather than recycled) I found a performance hit, but when they were being recycled I got a slight performance gain (this may be due to reduced fragmentation) but the gain was never more than 10% and frequently less than that. I can't explain - yet (if ever!) - why my initial tests (and many of those done subsequently) showed improvement - it may very well have something to do with how the code interacts with these settings. -- Jon
On 7 August 2013 23:40, Greg Stark <stark@mit.edu> wrote: > Did you report information about the system affected? What filesystem is it > on? If it's ext4 does it have extents enabled? Yes, ext4. It's using whatever the default options are, but running lsattr on my data dir shows that extents are being used. -- Thom
On 8 August 2013 00:04, Thom Brown <thom@linux.com> wrote:
> On 7 August 2013 23:40, Greg Stark <stark@mit.edu> wrote:
>> Did you report information about the system affected? What filesystem is it
>> on? If it's ext4 does it have extents enabled?
>
> Yes, ext4. It's using whatever the default options are, but running
> lsattr on my data dir shows that extents are being used.

I've just created an ext2 partition (so no extents), and did some quick benchmarks of those 2 commits. This is on the original system (not my laptop), with the same config settings as before. All tests initialised with -s 20 and tested with -j 80 -c 80.

30 seconds:
8800d8: 1327.983018 / 1317.848514
269e78: 1388.009819 / 1285.929100

5 minutes:
8800d8: 1744.446592 / 1729.602706
269e78: 1677.561277 / 1677.967873

10 minutes:
8800d8: 1753.147609 / 1806.600335
269e78: 1815.862842 / 1804.906769

So this appears to be okay.

--
Thom
On 08/07/2013 05:40 PM, Thom Brown wrote: > On 8 August 2013 00:04, Thom Brown <thom@linux.com> wrote: >> On 7 August 2013 23:40, Greg Stark <stark@mit.edu> wrote: >>> Did you report information about the system affected? What filesystem is it >>> on? If it's ext4 does it have extents enabled? >> >> Yes, ext4. It's using whatever the default options are, but running >> lsattr on my data dir shows that extents are being used. Would it be worth testing XFS? If so, any particular options I should use? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 2013-08-07 20:23:55 +0100, Thom Brown wrote: > >>> 269e78 was the commit immediately after 8800d8, so it appears that > >>> introduced the regression. > >>> > >>> "Use posix_fallocate() for new WAL files, where available." > >> > >> This is curious. Could you either run a longer test before/after the > >> commit or reduce checkpoint_timeout to something like 3min? > > > > Okay, I'll rerun the test for both those commits at 1 hour each with > > checkpoint_timeout set at 3mins, but with all other configuration > > settings the same > > Results > (checkpoint_timeout = 3min) > > pgbench -j 80 -c 80 -T 3600 > > 269e78: 606.268013 > 8800d8: 779.583129 Ok, so the performance difference is lower. But, it seems to be still to high to be explaineable by performance differences during initialization/fallocate. In a very quick test, I don't see the same on my laptop. I am currently travelling and I don't have convenient access to anything else. Could you: - run filefrag on a couple of wal segments of both clusters after the test and post the results here? - enable log_checkpoints, post the lines spat out together with the results - run pgbench without reinitializing the cluster/pgbench tables inbetween? Basically, I'd like to know whether its the initialization that's slow (measurable because we're regularly creating new files because of a too low checkpoint_segments) or whether it's the actual writes. > Jon, here are the test results you asked for: > > $ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done > method: classic. 1 open/close iterations, 1 rewrite in 0.9157s > method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s > method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s > method: classic. 2 open/close iterations, 1 rewrite in 1.1417s > method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s > method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s > method: classic. 5 open/close iterations, 1 rewrite in 3.6490s > method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s > method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s > method: classic. 10 open/close iterations, 1 rewrite in 5.7562s > method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s > method: glibc emulation. 10 open/close iterations, 1 rewrite in 7.1426s > method: classic. 100 open/close iterations, 1 rewrite in 64.9483s > method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s > method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s Ok, this seems to indicate that fallocate works nicely. Jon, wasn't there a version of the test that rewrote the file afterwards? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
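For reference, a sketch of the checks being requested here; the data directory location and run length are placeholders:

filefrag $PGDATA/pg_xlog/*                         # extent count per WAL segment
echo "log_checkpoints = on" >> $PGDATA/postgresql.conf
pg_ctl -D $PGDATA restart
pgbench -j 80 -c 80 -T 600 pgb                     # rerun without re-initialising the tables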
On Wed, Aug 7, 2013 at 10:05 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-08-07 20:23:55 +0100, Thom Brown wrote: >> >>> 269e78 was the commit immediately after 8800d8, so it appears that >> >>> introduced the regression. >> >>> >> >>> "Use posix_fallocate() for new WAL files, where available." >> >> >> >> This is curious. Could you either run a longer test before/after the >> >> commit or reduce checkpoint_timeout to something like 3min? >> > >> > Okay, I'll rerun the test for both those commits at 1 hour each with >> > checkpoint_timeout set at 3mins, but with all other configuration >> > settings the same >> >> Results >> (checkpoint_timeout = 3min) >> >> pgbench -j 80 -c 80 -T 3600 >> >> 269e78: 606.268013 >> 8800d8: 779.583129 > > Ok, so the performance difference is lower. But, it seems to be still to > high to be explaineable by performance differences during > initialization/fallocate. > In a very quick test, I don't see the same on my laptop. I am currently > travelling and I don't have convenient access to anything else. > > Could you: > - run filefrag on a couple of wal segments of both clusters after the > test and post the results here? For me, there is no meaningful difference, but I have a relatively fresh filesystem (ext4) with lots of free space. > - enable log_checkpoints, post the lines spat out together with the results > - run pgbench without reinitializing the cluster/pgbench tables > inbetween? When I do this (as I note below), the performance difference (for me) disappears. > Basically, I'd like to know whether its the initialization that's slow > (measurable because we're regularly creating new files because of a too > low checkpoint_segments) or whether it's the actual writes. What I've found so far is very confusing. I start by using initdb to initialize a temporary cluster, copy in a postgresql.conf (with the modifcations from Thom Brown tweaked for my small laptop), start the cluster, create a test database, initialize it with pgbench, and then run. I'm also only running for two minutes at this time. Every time I test, the non-fallocate version is faster. I modifed xlog.c to use posix_fallocate (or not) based on an environment variable. Once the WAL files have been rewritten at least once, then it doesn't seem to matter which method is used to allocate them (although posix_fallocate seems to have a slight edge). I even went to far as to modify the code to posix_fallocate the file *and then zero it out anyway*, and it was almost as fast as zeroing alone, and faster than using posix_fallocate alone. >> Jon, here are the test results you asked for: >> >> $ for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done >> method: classic. 1 open/close iterations, 1 rewrite in 0.9157s >> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.5314s >> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6018s >> method: classic. 2 open/close iterations, 1 rewrite in 1.1417s >> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6505s >> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.8900s >> method: classic. 5 open/close iterations, 1 rewrite in 3.6490s >> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.9841s >> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1053s >> method: classic. 10 open/close iterations, 1 rewrite in 5.7562s >> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2015s >> method: glibc emulation. 
10 open/close iterations, 1 rewrite in 7.1426s >> method: classic. 100 open/close iterations, 1 rewrite in 64.9483s >> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 36.3748s >> method: glibc emulation. 100 open/close iterations, 1 rewrite in 64.8386s > > Ok, this seems to indicate that fallocate works nicely. Jon, wasn't > there a version of the test that rewrote the file afterwards? Yes. If you use a different number besides '1' as the third argument in the command line above, it will rewrite the file that many times. -- Jon
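To make the comparison concrete, here is a simplified, self-contained sketch of the two preallocation strategies under discussion (zero-filling with write() versus posix_fallocate()). It is not the actual xlog.c code; the chunk size, segment size, and file names are illustrative only.

/*
 * Sketch: two ways to preallocate a 16MB WAL-like segment.
 * Not the PostgreSQL implementation -- sizes and names are illustrative.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)
#define CHUNK    8192               /* stand-in for block-sized writes */

/* "Classic" approach: write zeros over the whole file, then fsync. */
static int
create_segment_zerofill(const char *path)
{
    char    buf[CHUNK];
    long    remaining = SEG_SIZE;
    int     fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;
    memset(buf, 0, sizeof(buf));
    while (remaining > 0)
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        {
            close(fd);
            return -1;
        }
        remaining -= sizeof(buf);
    }
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* fallocate approach: reserve the space without writing it. */
static int
create_segment_fallocate(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;
    /* Blocks are allocated but left "unwritten" by the filesystem. */
    if (posix_fallocate(fd, 0, SEG_SIZE) != 0 || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

int
main(void)
{
    /* Hypothetical file names, purely for demonstration. */
    create_segment_zerofill("seg_zerofill");
    create_segment_fallocate("seg_fallocate");
    return 0;
}

The thread's observation is that the fallocate variant creates the file much faster, but on ext4 the first real writes into the still-unwritten extents can then be slower.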
On 2013-08-07 23:55:22 -0500, Jon Nelson wrote: > > - enable log_checkpoints, post the lines spat out together with the results > > - run pgbench without reinitializing the cluster/pgbench tables > > inbetween? > > When I do this (as I note below), the performance difference (for me) > disappears. Ok. In this case the files have been written to and fsync()ed before the test is run. > > Basically, I'd like to know whether its the initialization that's slow > > (measurable because we're regularly creating new files because of a too > > low checkpoint_segments) or whether it's the actual writes. > > What I've found so far is very confusing. > I start by using initdb to initialize a temporary cluster, copy in a > postgresql.conf (with the modifcations from Thom Brown tweaked for my > small laptop), start the cluster, create a test database, initialize > it with pgbench, and then run. I'm also only running for two minutes > at this time. Hm. In a 2 minutes run there won't be any checkpoints.. > Every time I test, the non-fallocate version is faster. I modifed > xlog.c to use posix_fallocate (or not) based on an environment > variable. > Once the WAL files have been rewritten at least once, then it doesn't > seem to matter which method is used to allocate them (although > posix_fallocate seems to have a slight edge). I even went to far as to > modify the code to posix_fallocate the file *and then zero it out > anyway*, and it was almost as fast as zeroing alone, and faster than > using posix_fallocate alone. Ok, so it seems there's performance issues with allocating the data even though the allocation supposedly has already been made. Access by several backends may play a role here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
A follow-up. I found this thread that seems to explain some things:

http://comments.gmane.org/gmane.comp.file-systems.ext4/33024

Short version: if we are writing into the "middle" of the newly-fallocated file on ext4 (with extents), the extent tree can fragment badly, causing poor performance due to extent merging.

I also ran some tests with xfs, and the results were even stranger. xfs performed very slightly better with fallocate vs. without:

xfs (fallocate) 216 tps vs. (no fallocate) 206
ext4 (no fallocate) 605 vs. (fallocate) 134

I made an ext4 filesystem without extents using the same block device as above - a real, single hard drive with nothing else on it. On this filesystem, the performance remained the same (or - perhaps - improved very slightly): 633 tps (with fallocate) vs. 607.

Using the following additions to postgresql.conf:

max_connections = 500
shared_buffers = 1GB
effective_cache_size = 1GB
random_page_cost = 2.0
cpu_tuple_cost = 0.03
wal_buffers = 32MB
work_mem = 100MB
maintenance_work_mem = 512MB
commit_delay = 50
commit_siblings = 15
checkpoint_segments = 256
log_checkpoints = on

and this partial script for testing:

pg_ctl initdb -D tt -w
cp ~/postgresql.conf tt
pg_ctl -D tt -w start
createdb -p 54320 pgb
pgbench -s 20 -p 54320 -d pgb -i
pgbench -j 80 -c 80 -T 120 -p 54320 pgb
pg_ctl -D tt -w stop

--
Jon
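For reference, a sketch of how such filesystems can be created for this kind of comparison; the device name is a placeholder, and these commands destroy whatever is on it:

mkfs.ext4 -O ^extent /dev/sdX1     # ext4 without the extent feature
mkfs.ext4 /dev/sdX1                # ext4 with defaults (extents enabled)
mkfs.xfs -f /dev/sdX1              # xfs for comparison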
On 8 August 2013 04:05, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-08-07 20:23:55 +0100, Thom Brown wrote:
>> >>> 269e78 was the commit immediately after 8800d8, so it appears that
>> >>> introduced the regression.
>> >>>
>> >>> "Use posix_fallocate() for new WAL files, where available."
>> >>
>> >> This is curious. Could you either run a longer test before/after the
>> >> commit or reduce checkpoint_timeout to something like 3min?
>> >
>> > Okay, I'll rerun the test for both those commits at 1 hour each with
>> > checkpoint_timeout set at 3mins, but with all other configuration
>> > settings the same
>>
>> Results
>> (checkpoint_timeout = 3min)
>>
>> pgbench -j 80 -c 80 -T 3600
>>
>> 269e78: 606.268013
>> 8800d8: 779.583129
>
> Ok, so the performance difference is lower. But, it seems to be still to
> high to be explaineable by performance differences during
> initialization/fallocate.
> In a very quick test, I don't see the same on my laptop. I am currently
> travelling and I don't have convenient access to anything else.
>
> Could you:
> - run filefrag on a couple of wal segments of both clusters after the
> test and post the results here?
I ran a benchmark for 10 minutes. I ran filefrag against all files in pg_xlog and they all say "1 extent found"
> - enable log_checkpoints, post the lines spat out together with the results
LOG: database system was shut down at 2013-08-08 16:38:28 BST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.041 s, sync=0.034 s, total=0.132 s; sync files=3, longest=0.017 s, average=0.011 s
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.015 s, sync=0.000 s, total=0.071 s; sync files=0, longest=0.000 s, average=0.000 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 39137 buffers (7.5%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=143.757 s, sync=5.659 s, total=149.620 s; sync files=44, longest=3.814 s, average=0.128 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 21141 buffers (4.0%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=143.082 s, sync=13.144 s, total=156.421 s; sync files=13, longest=10.336 s, average=1.011 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 27702 buffers (5.3%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=143.739 s, sync=4.470 s, total=148.337 s; sync files=12, longest=2.545 s, average=0.372 s
LOG: checkpoint starting: time
> - run pgbench without reinitializing the cluster/pgbench tables
> inbetween?
Set 1 (10 mins each):
First run after initialisation: 289.410795
Re-run without re-initialisation: 519.694545
These additional lines were added to the log for the re-run:
LOG: checkpoint complete: wrote 14793 buffers (2.8%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=143.299 s, sync=2.318 s, total=145.691 s; sync files=13, longest=1.910 s, average=0.178 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 37236 buffers (7.1%); 0 transaction log file(s) added, 0 removed, 13 recycled; write=143.410 s, sync=7.363 s, total=150.981 s; sync files=16, longest=4.871 s, average=0.460 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 28271 buffers (5.4%); 0 transaction log file(s) added, 0 removed, 20 recycled; write=143.597 s, sync=11.761 s, total=155.472 s; sync files=13, longest=8.494 s, average=0.904 s
LOG: checkpoint starting: time
Set 2 (10 mins each):
First run after initialisation: 295.925985
Re-run without re-initialisation: 503.239516
Here's the log for set 2 after the first run:
LOG: database system was shut down at 2013-08-08 17:06:41 BST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.041 s, sync=0.034 s, total=0.153 s; sync files=3, longest=0.017 s, average=0.011 s
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.014 s, sync=0.000 s, total=0.057 s; sync files=0, longest=0.000 s, average=0.000 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 39148 buffers (7.5%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=143.322 s, sync=5.519 s, total=149.256 s; sync files=44, longest=3.691 s, average=0.125 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 21447 buffers (4.1%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=143.161 s, sync=13.459 s, total=156.843 s; sync files=13, longest=10.915 s, average=1.035 s
LOG: checkpoint starting: time
After the re-run in set 2, these lines were added:
LOG: checkpoint complete: wrote 26609 buffers (5.1%); 0 transaction log file(s) added, 0 removed, 15 recycled; write=143.253 s, sync=6.588 s, total=150.036 s; sync files=15, longest=2.606 s, average=0.439 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 16931 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=143.171 s, sync=16.489 s, total=160.014 s; sync files=14, longest=12.660 s, average=1.177 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 35345 buffers (6.7%); 0 transaction log file(s) added, 0 removed, 14 recycled; write=143.170 s, sync=10.596 s, total=153.977 s; sync files=16, longest=7.996 s, average=0.662 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 27907 buffers (5.3%); 0 transaction log file(s) added, 0 removed, 19 recycled; write=143.184 s, sync=12.562 s, total=155.969 s; sync files=14, longest=7.844 s, average=0.897 s
LOG: checkpoint starting: time
I rechecked WAL file fragmentation after each run, and it always says "1 extent found".
--
Thom
On 08/08/2013 05:28 PM, Jon Nelson wrote: > A follow-up. > I found this thread that seems to explain some things: > > http://comments.gmane.org/gmane.comp.file-systems.ext4/33024 > > Short version: if we are writing into the "middle" of the > newly-fallocated file on ext4 (with extents) the extent tree can > fragment badly, causing poor performance due to extent merging. > > I also ran some tests with xfs, and the results were even stranger: > > xfs performed very slightly better with fallocate vs. without. > xfs (fallocate) 216 tps vs. (no fallocate) 206 > ext4 (no fallocate) 605 vs (fallocate) 134. > > I made an ext4 filesystem without extents using the same block device > as above - a real, single hard drive with nothing else on it. > On this filesystem, the performance remained the same (or - perhaps - > improved very slightly): 633tps (with fallocate) vs. 607. > > > Using the following additions postgresql.conf: > > max_connections = 500 > shared_buffers = 1GB > effective_cache_size = 1GB > random_page_cost = 2.0 > cpu_tuple_cost = 0.03 > wal_buffers = 32MB > work_mem = 100MB > maintenance_work_mem = 512MB > commit_delay = 50 > commit_siblings = 15 > checkpoint_segments = 256 > log_checkpoints = on > > and this partial script for testing: > > pg_ctl initdb -D tt -w > cp ~/postgresql.conf tt > pg_ctl -D tt -w start > createdb -p 54320 pgb > pgbench -s 20 -p 54320 -d pgb -i Just an idea - can you check if using a fillfactor different form 100 changes anything pgbench -s 20 -p 54320 -d pgb -F 90 -i > pgbench -j 80 -c 80 -T 120 -p 54320 pgb > pg_ctl -D tt -w stop That is, does extending tables and indexes to add updated tuples play any role here -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > On 08/08/2013 05:28 PM, Jon Nelson wrote: ... > Just an idea - can you check if using a fillfactor different form 100 > changes anything > > pgbench -s 20 -p 54320 -d pgb -F 90 -i > > >> pgbench -j 80 -c 80 -T 120 -p 54320 pgb >> pg_ctl -D tt -w stop > > That is, does extending tables and indexes to add updated tuples play > any role here I gave it a go - it didn't make any difference at all. At this point I'm convinced that the issue is a pathological case in ext4. The performance impact disappears as soon as the unwritten extent(s) are written to with real data. Thus, even though allocating files with posix_fallocate is - frequently - orders of magnitude quicker than doing it with write(2), the subsequent re-write can be more expensive. At least, that's what I'm gathering from the various threads. Why this issue didn't crop up in earlier testing and why I can't seem to make test_fallocate do it (even when I modify test_fallocate to write to the newly-allocated file in a mostly-random fashion) has me baffled. Should this feature be reconsidered? -- Jon
On Thu, Aug 8, 2013 at 04:12:06PM -0500, Jon Nelson wrote: > On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: > > On 08/08/2013 05:28 PM, Jon Nelson wrote: > ... > > > Just an idea - can you check if using a fillfactor different form 100 > > changes anything > > > > pgbench -s 20 -p 54320 -d pgb -F 90 -i > > > > > >> pgbench -j 80 -c 80 -T 120 -p 54320 pgb > >> pg_ctl -D tt -w stop > > > > That is, does extending tables and indexes to add updated tuples play > > any role here > > I gave it a go - it didn't make any difference at all. > At this point I'm convinced that the issue is a pathological case in > ext4. The performance impact disappears as soon as the unwritten > extent(s) are written to with real data. Thus, even though allocating > files with posix_fallocate is - frequently - orders of magnitude > quicker than doing it with write(2), the subsequent re-write can be > more expensive. At least, that's what I'm gathering from the various > threads. Why this issue didn't crop up in earlier testing and why I > can't seem to make test_fallocate do it (even when I modify > test_fallocate to write to the newly-allocated file in a mostly-random > fashion) has me baffled. Should this feature be reconsidered? How much slower would it be if we wrote it with zeros after posix_fallocate() --- that would still give use single extents. Has anyone tested to see if the write without test_fallocate() still gives us one extent? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, Aug 8, 2013 at 4:24 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Thu, Aug 8, 2013 at 04:12:06PM -0500, Jon Nelson wrote: >> On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote: >> > On 08/08/2013 05:28 PM, Jon Nelson wrote: >> ... >> >> > Just an idea - can you check if using a fillfactor different form 100 >> > changes anything >> > >> > pgbench -s 20 -p 54320 -d pgb -F 90 -i >> > >> > >> >> pgbench -j 80 -c 80 -T 120 -p 54320 pgb >> >> pg_ctl -D tt -w stop >> > >> > That is, does extending tables and indexes to add updated tuples play >> > any role here >> >> I gave it a go - it didn't make any difference at all. >> At this point I'm convinced that the issue is a pathological case in >> ext4. The performance impact disappears as soon as the unwritten >> extent(s) are written to with real data. Thus, even though allocating >> files with posix_fallocate is - frequently - orders of magnitude >> quicker than doing it with write(2), the subsequent re-write can be >> more expensive. At least, that's what I'm gathering from the various >> threads. Why this issue didn't crop up in earlier testing and why I >> can't seem to make test_fallocate do it (even when I modify >> test_fallocate to write to the newly-allocated file in a mostly-random >> fashion) has me baffled. Should this feature be reconsidered? > > How much slower would it be if we wrote it with zeros after > posix_fallocate() --- that would still give use single extents. Has > anyone tested to see if the write without test_fallocate() still gives > us one extent? Actually, I did that test - I found that there was no longer a performance drop and possibly a slight performance benefit. I couldn't quite parse your last sentence, I assume you mean "does skipping the posix_fallocate altogether still produce one extent?" Sadly, I can't really test that properly as the ext4 filesystem I'm testing on is quite fresh and has lots and lots of free space. -- Jon
Jon Nelson <jnelson+pgsql@jamponi.net> writes: > At this point I'm convinced that the issue is a pathological case in > ext4. The performance impact disappears as soon as the unwritten > extent(s) are written to with real data. Thus, even though allocating > files with posix_fallocate is - frequently - orders of magnitude > quicker than doing it with write(2), the subsequent re-write can be > more expensive. At least, that's what I'm gathering from the various > threads. Why this issue didn't crop up in earlier testing and why I > can't seem to make test_fallocate do it (even when I modify > test_fallocate to write to the newly-allocated file in a mostly-random > fashion) has me baffled. Does your test program use all the same writing options that the real WAL writes do (like O_DIRECT)? > Should this feature be reconsidered? Well, ext4 isn't the whole world, but it's sure a big part of it. The real point though is that obviously we didn't do enough performance testing, so we'd better do more before deciding what to do. regards, tom lane
On Thu, Aug 8, 2013 at 04:33:05PM -0500, Jon Nelson wrote:
> > How much slower would it be if we wrote it with zeros after
> > posix_fallocate() --- that would still give use single extents. Has
> > anyone tested to see if the write without test_fallocate() still gives
> > us one extent?
>
> Actually, I did that test - I found that there was no longer a
> performance drop and possibly a slight performance benefit.
> I couldn't quite parse your last sentence, I assume you mean "does
> skipping the posix_fallocate altogether still produce one extent?"
> Sadly, I can't really test that properly as the ext4 filesystem I'm
> testing on is quite fresh and has lots and lots of free space.

Yes, that was the question. I meant to say: Has anyone tested to see if the write without posix_fallocate() still creates one extent per 16MB file, meaning the posix_fallocate() is unnecessary?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jon Nelson <jnelson+pgsql@jamponi.net> writes: >> At this point I'm convinced that the issue is a pathological case in >> ext4. The performance impact disappears as soon as the unwritten >> extent(s) are written to with real data. Thus, even though allocating >> files with posix_fallocate is - frequently - orders of magnitude >> quicker than doing it with write(2), the subsequent re-write can be >> more expensive. At least, that's what I'm gathering from the various >> threads. Why this issue didn't crop up in earlier testing and why I >> can't seem to make test_fallocate do it (even when I modify >> test_fallocate to write to the newly-allocated file in a mostly-random >> fashion) has me baffled. > > Does your test program use all the same writing options that the real > WAL writes do (like O_DIRECT)? I believe so. From xlog.c: /* do not use get_sync_bit() here --- want to fsync only at end of fill */ fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT| O_EXCL | PG_BINARY, S_IRUSR | S_IWUSR); and from the test program: fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600); PG_BINARY expands to 0 on non-Windows. I also tried using O_WRONLY in xlog.c without change. -- Jon
Jon Nelson <jnelson+pgsql@jamponi.net> writes: > On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Does your test program use all the same writing options that the real >> WAL writes do (like O_DIRECT)? > I believe so. >> From xlog.c: > /* do not use get_sync_bit() here --- want to fsync only at end of fill */ > fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY, > S_IRUSR | S_IWUSR); > and from the test program: > fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600); Maybe I misunderstood, but I thought the performance complaint had to do with the actual writes of WAL data, not with the pre-fill. That is, you should not just be measuring how long the pre-fill takes, but what is the speed of real writes to the file later on (which will be using get_sync_bit, for various values of the sync settings). regards, tom lane
On Thu, Aug 8, 2013 at 5:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jon Nelson <jnelson+pgsql@jamponi.net> writes: >> On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Does your test program use all the same writing options that the real >>> WAL writes do (like O_DIRECT)? > >> I believe so. > >>> From xlog.c: > >> /* do not use get_sync_bit() here --- want to fsync only at end of fill */ >> fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY, >> S_IRUSR | S_IWUSR); > >> and from the test program: > >> fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600); > > Maybe I misunderstood, but I thought the performance complaint had to do > with the actual writes of WAL data, not with the pre-fill. That is, you > should not just be measuring how long the pre-fill takes, but what is the > speed of real writes to the file later on (which will be using > get_sync_bit, for various values of the sync settings). Ah, no, I misunderstood your question. I'm fairly certain the test program doesn't open up files with any sort of sync. bit set. I'll have to see what PostgreSQL is using, exactly, and get back to you. -- Jon
On Thu, Aug 8, 2013 at 5:25 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > On Thu, Aug 8, 2013 at 5:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Jon Nelson <jnelson+pgsql@jamponi.net> writes: >>> On Thu, Aug 8, 2013 at 4:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Does your test program use all the same writing options that the real >>>> WAL writes do (like O_DIRECT)? >> >>> I believe so. >> >>>> From xlog.c: >> >>> /* do not use get_sync_bit() here --- want to fsync only at end of fill */ >>> fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY, >>> S_IRUSR | S_IWUSR); >> >>> and from the test program: >> >>> fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600); >> >> Maybe I misunderstood, but I thought the performance complaint had to do >> with the actual writes of WAL data, not with the pre-fill. That is, you >> should not just be measuring how long the pre-fill takes, but what is the >> speed of real writes to the file later on (which will be using >> get_sync_bit, for various values of the sync settings). > > Ah, no, I misunderstood your question. > I'm fairly certain the test program doesn't open up files with any > sort of sync. bit set. > I'll have to see what PostgreSQL is using, exactly, and get back to you. My reading of xlog.c seems to indicate that if the sync mode is fdatasync, then the file is not opened with any special sync bit, but that an fdatasync occurs at the end. That is also what the test program does. -- Jon
On 2013-08-08 16:12:06 -0500, Jon Nelson wrote:
> On Thu, Aug 8, 2013 at 2:50 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> > On 08/08/2013 05:28 PM, Jon Nelson wrote:
> ...
>
> > Just an idea - can you check if using a fillfactor different form 100
> > changes anything
> >
> > pgbench -s 20 -p 54320 -d pgb -F 90 -i
> >
> >> pgbench -j 80 -c 80 -T 120 -p 54320 pgb
> >> pg_ctl -D tt -w stop
> >
> > That is, does extending tables and indexes to add updated tuples play
> > any role here
>
> I gave it a go - it didn't make any difference at all.
> At this point I'm convinced that the issue is a pathological case in
> ext4. The performance impact disappears as soon as the unwritten
> extent(s) are written to with real data. Thus, even though allocating
> files with posix_fallocate is - frequently - orders of magnitude
> quicker than doing it with write(2), the subsequent re-write can be
> more expensive. At least, that's what I'm gathering from the various
> threads.

> Why this issue didn't crop up in earlier testing and why I
> can't seem to make test_fallocate do it (even when I modify
> test_fallocate to write to the newly-allocated file in a mostly-random
> fashion) has me baffled.

It might be kernel version specific, and concurrency seems to play a role. If you reproduce the problem, could you run a "perf record -ga" to collect a systemwide profile?

There are some more things to test:

- is the slowdown dependent on the scale? I.e. is it visible with -j 1 -c 1?
- Does it also occur in synchronous_commit=off configurations? Those don't fdatasync() from so many backends; that might play a role.
- do bulkloads see it? E.g. the initial pgbench load?

> Should this feature be reconsidered?

Let's try to get to the bottom of this a bit more. From the outset we've said that we're using this as a testbed for extending the fallocate() usage. It's not like 9.4 beta1 is going to be released anytime soon.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Aug 8, 2013 at 9:27 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-08-08 16:12:06 -0500, Jon Nelson wrote: ... >> At this point I'm convinced that the issue is a pathological case in >> ext4. The performance impact disappears as soon as the unwritten >> extent(s) are written to with real data. Thus, even though allocating >> files with posix_fallocate is - frequently - orders of magnitude >> quicker than doing it with write(2), the subsequent re-write can be >> more expensive. At least, that's what I'm gathering from the various >> threads. > > >> Why this issue didn't crop up in earlier testing and why I >> can't seem to make test_fallocate do it (even when I modify >> test_fallocate to write to the newly-allocated file in a mostly-random >> fashion) has me baffled. > > It might be kernel version specific and concurrency seems to play a > role. If you reproduce the problem, could you run a "perf record -ga" to > collect a systemwide profile? Finally, an excuse to learn how to use 'perf'! I'll try to provide that info when I am able. > There's some more things to test: > - is the slowdown dependent on the scale? I.e is it visible with -j 1 -c > 1? scale=1 (-j 1 -c 1): with fallocate: 685 tps without: 727 scale=20 with fallocate: 129 without: 402 scale=40 with fallocate: 163 without: 511 > - Does it also occur in synchronous_commit=off configurations? Those > don't fdatasync() from so many backends, that might play a role. With synchronous_commit=off, the performance is vastly improved. Interestingly, the fallocate case is (immaterially) faster than the non-fallocate case: 3766tps vs 3700tps. I tried a few other wal_sync_methods besides the default of fdatasync, all with scale=80. fsync: 198 tps (with fallocate) vs 187. open_sync: 195 tps (with fallocate) vs. 192 > - do bulkloads see it? E.g. the initial pgbench load? time pgbench -s 200 -p 54320 -d pgb -i with fallocate: 2m47s without: 2m50s Hopefully the above is useful. -- Jon
On 2013-08-08 22:58:42 -0500, Jon Nelson wrote: > On Thu, Aug 8, 2013 at 9:27 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-08-08 16:12:06 -0500, Jon Nelson wrote: > ... > > >> At this point I'm convinced that the issue is a pathological case in > >> ext4. The performance impact disappears as soon as the unwritten > >> extent(s) are written to with real data. Thus, even though allocating > >> files with posix_fallocate is - frequently - orders of magnitude > >> quicker than doing it with write(2), the subsequent re-write can be > >> more expensive. At least, that's what I'm gathering from the various > >> threads. > > > > > >> Why this issue didn't crop up in earlier testing and why I > >> can't seem to make test_fallocate do it (even when I modify > >> test_fallocate to write to the newly-allocated file in a mostly-random > >> fashion) has me baffled. > > > > It might be kernel version specific and concurrency seems to play a > > role. If you reproduce the problem, could you run a "perf record -ga" to > > collect a systemwide profile? > > Finally, an excuse to learn how to use 'perf'! I'll try to provide > that info when I am able. Running perf record as above during the first minute and then doing a perf report > somefile (redirected to get the noninteractive version) should get you started. > > There's some more things to test: > > - is the slowdown dependent on the scale? I.e is it visible with -j 1 -c > > 1? > > scale=1 (-j 1 -c 1): > with fallocate: 685 tps > without: 727 > > scale=20 > with fallocate: 129 > without: 402 > > scale=40 > with fallocate: 163 > without: 511 Ok, so there's some clear correlation with the number of writers. > > - Does it also occur in synchronous_commit=off configurations? Those > > don't fdatasync() from so many backends, that might play a role. > > With synchronous_commit=off, the performance is vastly improved. > Interestingly, the fallocate case is (immaterially) faster than the > non-fallocate case: 3766tps vs 3700tps. That's interesting because in the synchronous_commit=off case most of the writing and syncing should be done by the wal writer. So there's another hint that there's some scalability issue at play, presumably in the kernel. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 7, 2013 at 07:04:43PM +0100, Thom Brown wrote: > On 7 August 2013 18:49, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > > On Wed, Aug 7, 2013 at 12:42 PM, Thom Brown <thom@linux.com> wrote: > >> for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done > >> method: classic. 1 open/close iterations, 1 rewrite in 0.6380s > >> method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3204s > >> method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.6274s > >> method: classic. 2 open/close iterations, 1 rewrite in 1.2908s > >> method: posix_fallocate. 2 open/close iterations, 1 rewrite in 0.6596s > >> method: glibc emulation. 2 open/close iterations, 1 rewrite in 1.2666s > >> method: classic. 5 open/close iterations, 1 rewrite in 3.1419s > >> method: posix_fallocate. 5 open/close iterations, 1 rewrite in 1.5930s > >> method: glibc emulation. 5 open/close iterations, 1 rewrite in 3.1516s > >> method: classic. 10 open/close iterations, 1 rewrite in 6.2912s > >> method: posix_fallocate. 10 open/close iterations, 1 rewrite in 3.2626s > >> method: glibc emulation. 10 open/close iterations, 1 rewrite in 6.3667s > >> method: classic. 100 open/close iterations, 1 rewrite in 67.4174s > >> method: posix_fallocate. 100 open/close iterations, 1 rewrite in 37.8788s > >> method: glibc emulation. 100 open/close iterations, 1 rewrite in 55.0714s > > > > OK. That's interesting, and a good data point. > > > > One thing you could try manually disabling the use of posix_fallocate in 269e78. > > After running ./configure --stuff-here > > edit src/include/pg_config.h and comment out the following line (on or > > around line 374) > > #define HAVE_POSIX_FALLOCATE 1 > > > > *then* build postgresql and see if the performance hit is still there. > > Okay, done that. The TPS increases again: > > 2308.807568 / 2554.264572 / 2563.190204 > > And I did run ./configure... before removing the line, and built it > after the change. Where are we on this issue? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
> Where are we on this issue? I've been able to replicate it pretty easily with PostgreSQL and continue to look into it. I've contacted Theodore Ts'o and have gotten some useful information, however I'm unable to replicate the behavior with the test program (even one that's been modified). What I've learned is: - XLogWrite appears to take approx. 2.5 times longer when writing to a file allocated with posix_fallocate, but only the first time the file contents are overwritten. This is partially explained by how ext4 handles extents and uninitialized data, but 2.5x is MUCH more expensive than anticipated or expected here. - Writing zeroes to a file allocated with posix_fallocate (essentially adding a posix_fallocate step before the usual write-zeroes-in-a-loop approach) not only doesn't seem to hurt performance, it seems to help or at least have parity, *and* the space is guaranteed to exist on disk. At the very least that seems useful. -- Jon
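To make the second point concrete, here is a minimal C sketch of the hybrid approach Jon describes: reserve the space with posix_fallocate() and then overwrite it with zeroes anyway, so that no later WAL write lands on an unwritten extent. The helper name, file name and sizes are invented for illustration; this is not the actual patch.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE   (16 * 1024 * 1024)   /* illustrative segment size */
#define FILL_BLOCK 8192                 /* illustrative block size */

/* Hypothetical helper: reserve the segment, zero it, sync it, return the fd. */
static int prefill_segment(const char *path)
{
    char  zbuf[FILL_BLOCK];
    off_t off;
    int   fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;

    /* Reserve the full segment up front; on success the space exists on disk. */
    if (posix_fallocate(fd, 0, SEG_SIZE) != 0)
    {
        close(fd);
        return -1;
    }

    /* Zero it anyway, so later writes never touch an unwritten extent. */
    memset(zbuf, 0, sizeof(zbuf));
    for (off = 0; off < SEG_SIZE; off += FILL_BLOCK)
    {
        if (write(fd, zbuf, FILL_BLOCK) != FILL_BLOCK)
        {
            close(fd);
            return -1;
        }
    }

    if (fdatasync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = prefill_segment("walseg.tmp");

    if (fd < 0)
    {
        fprintf(stderr, "prefill_segment failed\n");
        return 1;
    }
    close(fd);
    return 0;
}

The extra pass costs one sequential write of the segment, which is what the old code paid anyway, and it keeps the out-of-space guarantee that a successful posix_fallocate() provides.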
On Thu, Aug 15, 2013 at 12:08:57PM -0500, Jon Nelson wrote: > > Where are we on this issue? > > I've been able to replicate it pretty easily with PostgreSQL and > continue to look into it. I've contacted Theodore Ts'o and have gotten > some useful information, however I'm unable to replicate the behavior > with the test program (even one that's been modified). What I've > learned is: > > - XLogWrite appears to take approx. 2.5 times longer when writing to a > file allocated with posix_fallocate, but only the first time the file > contents are overwritten. This is partially explained by how ext4 > handles extents and uninitialized data, but 2.5x is MUCH more > expensive than anticipated or expected here. > - Writing zeroes to a file allocated with posix_fallocate (essentially > adding a posix_fallocate step before the usual write-zeroes-in-a-loop > approach) not only doesn't seem to hurt performance, it seems to help > or at least have parity, *and* the space is guaranteed to exist on > disk. At the very least that seems useful. Is it time to revert this patch until we know more? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, Aug 16, 2013 at 3:57 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Thu, Aug 15, 2013 at 12:08:57PM -0500, Jon Nelson wrote: >> > Where are we on this issue? >> >> I've been able to replicate it pretty easily with PostgreSQL and >> continue to look into it. I've contacted Theodore Ts'o and have gotten >> some useful information, however I'm unable to replicate the behavior >> with the test program (even one that's been modified). What I've >> learned is: >> >> - XLogWrite appears to take approx. 2.5 times longer when writing to a >> file allocated with posix_fallocate, but only the first time the file >> contents are overwritten. This is partially explained by how ext4 >> handles extents and uninitialized data, but 2.5x is MUCH more >> expensive than anticipated or expected here. >> - Writing zeroes to a file allocated with posix_fallocate (essentially >> adding a posix_fallocate step before the usual write-zeroes-in-a-loop >> approach) not only doesn't seem to hurt performance, it seems to help >> or at least have parity, *and* the space is guaranteed to exist on >> disk. At the very least that seems useful. > > Is it time to revert this patch until we know more? While I'm not qualified to say, my inclination is to say yes. It can always be added back later. The only caveat there would be that - perhaps - a small modification of the patch would be warranted. Specifically, with posix_fallocate, I saw no undesirable behavior when the (newly allocated) file was manually zeroed anyway. The only advantages (that I can see) to doing it this way versus not using posix_fallocate at all are (a) a potential reduction in the number of extents and (b) the space is guaranteed to be on disk if posix_fallocate succeeds. My reading of the patch is that even if posix_fallocate fails due to out-of-space conditions, we will still try to create the file by writing out zeroes, so perhaps the out-of-disk-space scenario isn't all that useful anyway. I'm awaiting more information from Theodore Ts'o, but I don't expect things to materially change in the near future. -- Jon
On 2013-08-19 14:40:07 -0500, Jon Nelson wrote: > On Fri, Aug 16, 2013 at 3:57 PM, Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Aug 15, 2013 at 12:08:57PM -0500, Jon Nelson wrote: > >> > Where are we on this issue? > >> > >> I've been able to replicate it pretty easily with PostgreSQL and > >> continue to look into it. I've contacted Theodore Ts'o and have gotten > >> some useful information, however I'm unable to replicate the behavior > >> with the test program (even one that's been modified). What I've > >> learned is: > >> > >> - XLogWrite appears to take approx. 2.5 times longer when writing to a > >> file allocated with posix_fallocate, but only the first time the file > >> contents are overwritten. This is partially explained by how ext4 > >> handles extents and uninitialized data, but 2.5x is MUCH more > >> expensive than anticipated or expected here. > >> - Writing zeroes to a file allocated with posix_fallocate (essentially > >> adding a posix_fallocate step before the usual write-zeroes-in-a-loop > >> approach) not only doesn't seem to hurt performance, it seems to help > >> or at least have parity, *and* the space is guaranteed to exist on > >> disk. At the very least that seems useful. > > > > Is it time to revert this patch until we know more? > > While I'm not qualified to say, my inclination is to say yes. It can > always be added back later. The only caveat there would be that - > perhaps - a small modification of the patch would be warranted. > Specifically, with posix_fallocate, I saw no undesirable behavior > when the (newly allocated) file was manually zeroed anyway. The only > advantages (that I can see) to doing it this way versus not using > posix_fallocate at all are (a) a potential reduction in the number of > extents I vote for adapting the patch to additionally zero out the file via write(). In your tests that seemed to perform at least as well as the old method... It also has the advantage that we can use it a little bit more as a testbed for possibly using it for heap extensions one day. We're pretty early in the cycle, so I am not worried about this too much... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
* Andres Freund (andres@2ndquadrant.com) wrote: > I vote for adapting the patch to additionally zero out the file via > write(). In your tests that seemed to perform at least as well as the > old method... It also has the advantage that we can use it a little bit > more as a testbed for possibly using it for heap extensions one day. > We're pretty early in the cycle, so I am not worried about this too much... I dunno, I'm pretty disappointed that this doesn't actually improve things. Just following this casually, it looks like it might be some kind of locking issue in the kernel that's causing it to be slower; or at least some code path that isn't exercised terribly much and therefore hasn't been given the love that it should. Definitely interested in what Ts'o says, but if we can't figure out why it's slower *without* writing out the zeros, I'd say we punt on this until Linux and the other OS folks improve the situation. Thanks, Stephen
On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote: > * Andres Freund (andres@2ndquadrant.com) wrote: > > I vote for adapting the patch to additionally zero out the file via > > write(). In your tests that seemed to perform at least as good as the > > old method... It also has the advantage that we can use it a littlebit > > more as a testbed for possibly using it for heap extensions one day. > > We're pretty early in the cycle, so I am not worried about this too much... > > I dunno, I'm pretty disappointed that this doesn't actually improve > things. Just following this casually, it looks like it might be some > kind of locking issue in the kernel that's causing it to be slower; or > at least some code path that isn't exercise terribly much and therefore > hasn't been given the love that it should. > > Definitely interested in what Ts'o says, but if we can't figure out why > it's slower *without* writing out the zeros, I'd say we punt on this > until Linux and the other OS folks improve the situation. Agreed. Anyone with an affected kernel really can't be doing performance tests right now, and that isn't good. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Thu, 2013-08-08 at 17:42 -0400, Tom Lane wrote: > Jon Nelson <jnelson+pgsql@jamponi.net> writes: > > At this point I'm convinced that the issue is a pathological case in > > ext4. The performance impact disappears as soon as the unwritten > > extent(s) are written to with real data. Thus, even though allocating > > files with posix_fallocate is - frequently - orders of magnitude > > quicker than doing it with write(2), the subsequent re-write can be > > more expensive. At least, that's what I'm gathering from the various > > threads. Why this issue didn't crop up in earlier testing and why I > > can't seem to make test_fallocate do it (even when I modify > > test_fallocate to write to the newly-allocated file in a mostly-random > > fashion) has me baffled. > > Does your test program use all the same writing options that the real > WAL writes do (like O_DIRECT)? > > > Should this feature be reconsidered? > > Well, ext4 isn't the whole world, but it's sure a big part of it. > The real point though is that obviously we didn't do enough performance > testing, so we'd better do more before deciding what to do. Greg Smith, Do you have some thoughts about this? More tests we should run? We may need to revert the patch. Regards,Jeff Davis
On 2013-09-04 08:01:30 -0700, Jeff Davis wrote: > On Thu, 2013-08-08 at 17:42 -0400, Tom Lane wrote: > > Jon Nelson <jnelson+pgsql@jamponi.net> writes: > > > At this point I'm convinced that the issue is a pathological case in > > > ext4. The performance impact disappears as soon as the unwritten > > > extent(s) are written to with real data. Thus, even though allocating > > > files with posix_fallocate is - frequently - orders of magnitude > > > quicker than doing it with write(2), the subsequent re-write can be > > > more expensive. At least, that's what I'm gathering from the various > > > threads. Why this issue didn't crop up in earlier testing and why I > > > can't seem to make test_fallocate do it (even when I modify > > > test_fallocate to write to the newly-allocated file in a mostly-random > > > fashion) has me baffled. > > > > Does your test program use all the same writing options that the real > > WAL writes do (like O_DIRECT)? > > > > > Should this feature be reconsidered? > > > > Well, ext4 isn't the whole world, but it's sure a big part of it. > > The real point though is that obviously we didn't do enough performance > > testing, so we'd better do more before deciding what to do. > > Greg Smith, > > Do you have some thoughts about this? More tests we should run? We may > need to revert the patch. I'd vote for adding zeroing *after* the fallocate() first. That's what's suggested by kernel hackers and what several other large applications do. As it looks like it's what we would have to do if we ever get to use fallocate for relation extension where we would have actual benefits from it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
* Andres Freund (andres@2ndquadrant.com) wrote: > I'd vote for adding zeroing *after* the fallocate() first. That's what's > suggested by kernel hackers and what several other large applications > do. As it looks like it's what we would have to do if we ever get to use > fallocate for relation extension where we would have actual benefits > from it. Does that actually end up doing anything different from what we were doing pre-patch here? At best, it *might* end up using a larger extent, but unless we can actually be confident that it does, I'm not convinced the additional complexity is worth it and would rather see this simply reverted. One might ask why the kernel guys aren't doing this themselves or figuring out why it's necessary to make it worthwhile. Thanks, Stephen
Hi, On 2013-09-04 11:15:37 -0400, Stephen Frost wrote: > * Andres Freund (andres@2ndquadrant.com) wrote: > > I'd vote for adding zeroing *after* the fallocate() first. That's what's > > suggested by kernel hackers and what several other large applications > > do. As it looks like it's what we would have to do if we ever get to use > > fallocate for relation extension where we would have actual benefits > > from it. > > Does that actually end up doing anything different from what we were > doing pre-patch here? At best, it *might* end up using a larger extent, > but unless we can actually be confident that it does, I'm not convinced > the additional complexity is worth it and would rather see this simply > reverted. It does end up with larger extents. That's why e.g. several of the virtualization solutions do it for container files. I was sceptical of the idea of using fallocate() for WAL file allocation from the get-go and voiced that repeatedly, but the argument was made that it's a convenient testcase to familiarize ourselves with fallocate(). As that argument won, we should stick to it and learn the ropes. Part of that is zeroing, so we should do that. It's a single write() + error checking, so I don't really see much complexity that way. > One might ask why the kernel guys aren't doing this themselves or > figuring out why it's necessary to make it worthwhile. So, if I understood Ted correctly, the reason it's slower is that fallocate() only marks the extents as containing 'zero' without actually touching the data. When a partial write into the area is then done, the rest of the page has to be actively zeroed on disk. I am not sure that explains all of the slowdown, but according to tests published in this thread, the additional zeroing gets us back to the former performance plus a small win... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
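The behaviour Ted describes can be observed with a toy program along the following lines (assumptions: Linux with ext4; the file name, sizes and per-block fdatasync() are invented for illustration). The first pass of synced overwrites into a freshly fallocated file pays for converting the unwritten extents; a second pass over the same, now-initialized blocks does not.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define SEG_SIZE   (16 * 1024 * 1024)   /* illustrative segment size */
#define FILL_BLOCK 8192                 /* illustrative block size */

/* Overwrite the whole file in blocks, syncing each one, and time the pass. */
static double rewrite_pass(int fd)
{
    char            buf[FILL_BLOCK];
    off_t           off;
    struct timespec t0, t1;

    memset(buf, 'x', sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (off = 0; off < SEG_SIZE; off += FILL_BLOCK)
    {
        if (pwrite(fd, buf, FILL_BLOCK, off) != FILL_BLOCK)
            perror("pwrite");
        fdatasync(fd);          /* crude stand-in for per-commit WAL flushes */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int fd = open("fallocated.tmp", O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0 || posix_fallocate(fd, 0, SEG_SIZE) != 0)
    {
        fprintf(stderr, "could not create and fallocate the test file\n");
        return 1;
    }
    printf("first overwrite (unwritten extents): %.3fs\n", rewrite_pass(fd));
    printf("second overwrite (initialized):      %.3fs\n", rewrite_pass(fd));
    close(fd);
    return 0;
}

On older glibc versions you may need to link with -lrt for clock_gettime().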
Stephen Frost <sfrost@snowman.net> writes: > * Andres Freund (andres@2ndquadrant.com) wrote: >> I'd vote for adding zeroing *after* the fallocate() first. That's what's >> suggested by kernel hackers and what several other large applications >> do. As it looks like it's what we would have to do if we ever get to use >> fallocate for relation extension where we would have actual benefits >> from it. > Does that actually end up doing anything different from what we were > doing pre-patch here? At best, it *might* end up using a larger extent, > but unless we can actually be confident that it does, I'm not convinced > the additional complexity is worth it and would rather see this simply > reverted. > One might ask why the kernel guys aren't doing this themselves or > figuring out why it's necessary to make it worthwhile. The larger picture is that that isn't the committed behavior, but a different one, one which would need performance testing. At this point, I vote for reverting the patch and allowing it to be resubmitted for a fresh round of testing with the zeroing added. And this time we'll need to do the testing more carefully. regards, tom lane
On Wed, Sep 4, 2013 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Stephen Frost <sfrost@snowman.net> writes: >> * Andres Freund (andres@2ndquadrant.com) wrote: >>> I'd vote for adding zeroing *after* the fallocate() first. That's what's >>> suggested by kernel hackers and what several other large applications >>> do. As it looks like it's what we would have to do if we ever get to use >>> fallocate for relation extension where we would have actual benefits >>> from it. > >> Does that actually end up doing anything different from what we were >> doing pre-patch here? At best, it *might* end up using a larger extent, >> but unless we can actually be confident that it does, I'm not convinced >> the additional complexity is worth it and would rather see this simply >> reverted. > >> One might ask why the kernel guys aren't doing this themselves or >> figuring out why it's necessary to make it worthwhile. > > The larger picture is that that isn't the committed behavior, > but a different one, one which would need performance testing. > > At this point, I vote for reverting the patch and allowing it to be > resubmitted for a fresh round of testing with the zeroing added. > And this time we'll need to do the testing more carefully. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
(2013/09/05 0:04), Andres Freund wrote: > I'd vote for adding zeroing *after* the fallocate() first. +1, with the FALLOC_FL_KEEP_SIZE flag. At least, fallocate with the FALLOC_FL_KEEP_SIZE flag is faster than not using it at all in my in-development sorted-checkpoint work. I applied it to relation files, so I don't know about WAL files. With the FALLOC_FL_KEEP_SIZE flag, fallocate does not change the file size and does not zero-fill; it seems to only reserve contiguous physical disk space in the filesystem. Here is a draft patch. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
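For readers unfamiliar with it, the Linux-specific call being referred to looks roughly like the sketch below (assumptions: glibc's fallocate(2) wrapper with _GNU_SOURCE; the file name and size are illustrative). With FALLOC_FL_KEEP_SIZE the visible file size is unchanged and nothing is zero-filled; only the underlying blocks are reserved.

#define _GNU_SOURCE             /* for fallocate() and FALLOC_FL_KEEP_SIZE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("relseg.tmp", O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Reserve 16MB of blocks behind the file. st_size stays 0 and nothing
     * is zeroed; the filesystem merely sets aside (ideally contiguous)
     * physical space for later writes. */
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 16 * 1024 * 1024) != 0)
        perror("fallocate(FALLOC_FL_KEEP_SIZE)");

    close(fd);
    return 0;
}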
On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote: > * Andres Freund (andres@2ndquadrant.com) wrote: > > I vote for adapting the patch to additionally zero out the file via > > write(). In your tests that seemed to perform at least as good as the > > old method... It also has the advantage that we can use it a littlebit > > more as a testbed for possibly using it for heap extensions one day. > > We're pretty early in the cycle, so I am not worried about this too much... > > I dunno, I'm pretty disappointed that this doesn't actually improve > things. Just following this casually, it looks like it might be some > kind of locking issue in the kernel that's causing it to be slower; or > at least some code path that isn't exercise terribly much and therefore > hasn't been given the love that it should. > > Definitely interested in what Ts'o says, but if we can't figure out why > it's slower *without* writing out the zeros, I'd say we punt on this > until Linux and the other OS folks improve the situation. FYI, the patch has been reverted. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 5 September 2013 22:24, Bruce Momjian <bruce@momjian.us> wrote: > On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote: >> * Andres Freund (andres@2ndquadrant.com) wrote: >> > I vote for adapting the patch to additionally zero out the file via >> > write(). In your tests that seemed to perform at least as good as the >> > old method... It also has the advantage that we can use it a littlebit >> > more as a testbed for possibly using it for heap extensions one day. >> > We're pretty early in the cycle, so I am not worried about this too much... >> >> I dunno, I'm pretty disappointed that this doesn't actually improve >> things. Just following this casually, it looks like it might be some >> kind of locking issue in the kernel that's causing it to be slower; or >> at least some code path that isn't exercise terribly much and therefore >> hasn't been given the love that it should. >> >> Definitely interested in what Ts'o says, but if we can't figure out why >> it's slower *without* writing out the zeros, I'd say we punt on this >> until Linux and the other OS folks improve the situation. > > FYI, the patch has been reverted. Is there an updated patch available for this? And did anyone hear from Ts'o? -- Thom
On Thu, Oct 24, 2013 at 5:41 AM, Thom Brown <thom@linux.com> wrote: > On 5 September 2013 22:24, Bruce Momjian <bruce@momjian.us> wrote: >> On Mon, Aug 19, 2013 at 09:27:57PM -0400, Stephen Frost wrote: >>> * Andres Freund (andres@2ndquadrant.com) wrote: >>> > I vote for adapting the patch to additionally zero out the file via >>> > write(). In your tests that seemed to perform at least as good as the >>> > old method... It also has the advantage that we can use it a littlebit >>> > more as a testbed for possibly using it for heap extensions one day. >>> > We're pretty early in the cycle, so I am not worried about this too much... >>> >>> I dunno, I'm pretty disappointed that this doesn't actually improve >>> things. Just following this casually, it looks like it might be some >>> kind of locking issue in the kernel that's causing it to be slower; or >>> at least some code path that isn't exercise terribly much and therefore >>> hasn't been given the love that it should. >>> >>> Definitely interested in what Ts'o says, but if we can't figure out why >>> it's slower *without* writing out the zeros, I'd say we punt on this >>> until Linux and the other OS folks improve the situation. >> >> FYI, the patch has been reverted. > > Is there an updated patch available for this? And did anyone hear from Ts'o? After the patch was reverted, it was not re-submitted. I have tried 3 or 4 times to get more info out of Ts'o , without luck. -- Jon