Thread: fallocate / posix_fallocate for new WAL file creation (etc...)
Pertinent to another thread titled [HACKERS] corrupt pages detected by enabling checksums I hope to explore the possibility of using fallocate (or posix_fallocate) for new WAL file creation. Most modern Linux filesystems support fast fallocate/posix_fallocate, reducing extent fragmentation (where extents are used) and frequently offering a pretty significant speed improvement. In my tests, using posix_fallocate (followed by pg_fsync) is at least 28 times quicker than using the current method (which writes zeroes followed by pg_fsync). I have written up a patch to use posix_fallocate in new WAL file creation, including configuration by way of a GUC variable, but I've not contributed to the PostgreSQL project before. Therefore, I'm fairly certain the patch is not formatted properly or does not conform to the appropriate style guides. Currently, the patch is based on 9.2, and is quite small in size - 3.6KiB. Advice on how to proceed is appreciated. -- Jon
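For reference, the two allocation strategies being compared look roughly like this (a minimal sketch, not the attached patch; the segment and block sizes and the function names are illustrative assumptions):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define SEG_SIZE (16 * 1024 * 1024)     /* one WAL segment */
    #define BLOCK_SIZE 8192

    /* Current method: fill the new segment with zeroes, block by block. */
    static int
    zero_fill_segment(int fd)
    {
        char    zbuf[BLOCK_SIZE];
        int     written;

        memset(zbuf, 0, sizeof(zbuf));
        for (written = 0; written < SEG_SIZE; written += BLOCK_SIZE)
            if (write(fd, zbuf, BLOCK_SIZE) != (ssize_t) BLOCK_SIZE)
                return -1;
        return fsync(fd);
    }

    /* Proposed method: one call asking the filesystem to reserve the space. */
    static int
    fallocate_segment(int fd)
    {
        int     rc = posix_fallocate(fd, 0, SEG_SIZE);

        if (rc != 0)        /* NB: the error comes back as the return value */
            return rc;
        return fsync(fd);
    }

On filesystems with real fallocate support, the second version replaces a couple of thousand 8kB writes with a single metadata operation, which is where the speedup comes from.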
On Mon, May 13, 2013 at 08:54:39PM -0500, Jon Nelson wrote: > Pertinent to another thread titled > [HACKERS] corrupt pages detected by enabling checksums > I hope to explore the possibility of using fallocate (or > posix_fallocate) for new WAL file creation. > > Most modern Linux filesystems support fast fallocate/posix_fallocate, > reducing extent fragmentation (where extents are used) and frequently > offering a pretty significant speed improvement. In my tests, using > posix_fallocate (followed by pg_fsync) is at least 28 times quicker > than using the current method (which writes zeroes followed by > pg_fsync). > > I have written up a patch to use posix_fallocate in new WAL file > creation, including configuration by way of a GUC variable, but I've > not contributed to the PostgreSQL project before. Therefore, I'm > fairly certain the patch is not formatted properly or does not conform to the > appropriate style guides. Currently, the patch is based on 9.2, and is > quite small in size - 3.6KiB. > > Advice on how to proceed is appreciated. Thanks for hopping in! Please re-base the patch vs. git master, as new features like this go there. Please also send along the tests you're doing so others can riff. Tests that find any weak points are also good. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, May 13, 2013 at 9:54 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > Pertinent to another thread titled > [HACKERS] corrupt pages detected by enabling checksums > I hope to explore the possibility of using fallocate (or > posix_fallocate) for new WAL file creation. > > Most modern Linux filesystems support fast fallocate/posix_fallocate, > reducing extent fragmentation (where extents are used) and frequently > offering a pretty significant speed improvement. In my tests, using > posix_fallocate (followed by pg_fsync) is at least 28 times quicker > than using the current method (which writes zeroes followed by > pg_fsync). > > I have written up a patch to use posix_fallocate in new WAL file > creation, including configuration by way of a GUC variable, but I've > not contributed to the PostgreSQL project before. Therefore, I'm > fairly certain the patch is not formatted properly or does not conform to the > appropriate style guides. Currently, the patch is based on 9.2, and is > quite small in size - 3.6KiB. > > Advice on how to proceed is appreciated. Make sure to list it here: https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 14, 2013 at 9:43 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, May 13, 2013 at 9:54 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: >> Pertinent to another thread titled >> [HACKERS] corrupt pages detected by enabling checksums >> I hope to explore the possibility of using fallocate (or >> posix_fallocate) for new WAL file creation. >> >> Most modern Linux filesystems support fast fallocate/posix_fallocate, >> reducing extent fragmentation (where extents are used) and frequently >> offering a pretty significant speed improvement. In my tests, using >> posix_fallocate (followed by pg_fsync) is at least 28 times quicker >> than using the current method (which writes zeroes followed by >> pg_fsync). >> >> I have written up a patch to use posix_fallocate in new WAL file >> creation, including configuration by way of a GUC variable, but I've >> not contributed to the PostgreSQL project before. Therefore, I'm >> fairly certain the patch is not formatted properly or does not conform to the >> appropriate style guides. Currently, the patch is based on 9.2, and is >> quite small in size - 3.6KiB. I have re-based and reformatted the code, and basic testing shows a reduction in WAL-file creation time of a fairly significant amount. I ran 'make test' and did additional local testing without issue. Therefore, I am attaching the patch. I will try to add it to the commitfest page. -- Jon
Attachment
Hi, On 2013-05-15 16:26:15 -0500, Jon Nelson wrote: > >> I have written up a patch to use posix_fallocate in new WAL file > >> creation, including configuration by way of a GUC variable, but I've > >> not contributed to the PostgreSQL project before. Therefore, I'm > >> fairly certain the patch is not formatted properly or does not conform to the > >> appropriate style guides. Currently, the patch is based on 9.2, and is > >> quite small in size - 3.6KiB. > > I have re-based and reformatted the code, and basic testing shows a > reduction in WAL-file creation time of a fairly significant amount. > I ran 'make test' and did additional local testing without issue. > Therefore, I am attaching the patch. I will try to add it to the > commitfest page. Some quick comments, without thinking about this: * needs a configure check for posix_fallocate. The current version will e.g. fail to compile on Windows or many other non-Linux systems. Check how it's done for posix_fadvise. * Is wal file creation performance actually relevant? Is the performance of a system running on fallocate()d wal files any different? * According to the man page posix_fallocate doesn't set errno but rather returns the error code. * I wonder whether we ever want to actually disable this? Afair the libc contains emulation for posix_fallocate if the filesystem doesn't support it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 15, 2013 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > On 2013-05-15 16:26:15 -0500, Jon Nelson wrote: >> >> I have written up a patch to use posix_fallocate in new WAL file >> >> creation, including configuration by way of a GUC variable, but I've >> >> not contributed to the PostgreSQL project before. Therefore, I'm >> >> fairly certain the patch is not formatted properly or does not conform to the >> >> appropriate style guides. Currently, the patch is based on 9.2, and is >> >> quite small in size - 3.6KiB. >> >> I have re-based and reformatted the code, and basic testing shows a >> reduction in WAL-file creation time of a fairly significant amount. >> I ran 'make test' and did additional local testing without issue. >> Therefore, I am attaching the patch. I will try to add it to the >> commitfest page. > > Some quick comments, without thinking about this: Thank you for the kind feedback. > * needs a configure check for posix_fallocate. The current version will > e.g. fail to compile on Windows or many other non-Linux systems. Check > how it's done for posix_fadvise. I will address as soon as I am able. > * Is wal file creation performance actually relevant? Is the performance > of a system running on fallocate()d wal files any different? In my limited testing, I noticed a drop of approx. 100ms per WAL file. I do not have a good idea for how to really stress the WAL-file creation area without calling pg_start_backup and pg_stop_backup over and over (with archiving enabled). However, a file allocated with fallocate is (supposed to be) less fragmented than one created by the traditional means. > * According to the man page posix_fallocate doesn't set errno but rather > returns the error code. That's true. I originally wrote the patch using fallocate(2). What would be appropriate here? Should I switch on the return value and the six (6) or so relevant error codes? > * I wonder whether we ever want to actually disable this? Afair the libc > contains emulation for posix_fallocate if the filesystem doesn't support > it. I know that glibc does, but I don't know about other libc implementations. -- Jon
On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > On Wed, May 15, 2013 at 4:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: .. >> Some quick comments, without thinking about this: > > Thank you for the kind feedback. > >> * needs a configure check for posix_fallocate. The current version will >> e.g. fail to compile on Windows or many other non-Linux systems. Check >> how it's done for posix_fadvise. The following patch includes the changes to configure.in. I had to make other changes (not included here) because my local system uses autoconf 2.69, but I did test this successfully. > That's true. I originally wrote the patch using fallocate(2). What > would be appropriate here? Should I switch on the return value and the > six (6) or so relevant error codes? I addressed this, hopefully in a reasonable way. -- Jon
Attachment
Jon Nelson wrote: > On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > > That's true. I originally wrote the patch using fallocate(2). What > > would be appropriate here? Should I switch on the return value and the > > six (6) or so relevant error codes? > > I addressed this, hopefully in a reasonable way. Would it work to just assign the value you got from posix_fallocate (if nonzero) to errno and then use %m in the errmsg() call in ereport()? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Jon Nelson escribió: >> On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > >> > That's true. I originally wrote the patch using fallocate(2). What >> > would be appropriate here? Should I switch on the return value and the >> > six (6) or so relevant error codes? >> >> I addressed this, hopefully in a reasonable way. > > Would it work to just assign the value you got from posix_fallocate (if > nonzero) to errno and then use %m in the errmsg() call in ereport()? That strikes me as a better way. I'll work something up soon. Thanks! -- Jon
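A sketch of that suggestion (illustrative only; XLogSegSize is PostgreSQL's segment-size constant, while fd, path, and the message wording are placeholders, not the patch's actual code):

    int     rc = posix_fallocate(fd, 0, XLogSegSize);

    if (rc != 0)
    {
        /*
         * posix_fallocate() returns the error code rather than setting
         * errno, so copy it over for ereport's %m to pick up.
         */
        errno = rc;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not allocate file \"%s\": %m", path)));
    }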
On Wed, May 15, 2013 at 10:36 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Jon Nelson escribió: >>> On Wed, May 15, 2013 at 4:46 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: >> >>> > That's true. I originally wrote the patch using fallocate(2). What >>> > would be appropriate here? Should I switch on the return value and the >>> > six (6) or so relevant error codes? >>> >>> I addressed this, hopefully in a reasonable way. >> >> Would it work to just assign the value you got from posix_fallocate (if >> nonzero) to errno and then use %m in the errmsg() call in ereport()? > > That strikes me as a better way. I'll work something up soon. > Thanks! Please find attached version 3. Am I doing this the right way? Should I be posting the full patch each time, or incremental patches? -- Jon
Attachment
Jon Nelson wrote: > Am I doing this the right way? Should I be posting the full patch each > time, or incremental patches? Full patch each time is okay. Context-format patch is even better. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 5/16/13 9:16 AM, Jon Nelson wrote: > Am I doing this the right way? Should I be posting the full patch each > time, or incremental patches? There are guidelines for getting your patch in the right format at https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git that would improve this one. You have some formatting issues with tab spacing at lines 120 through 133 in your v3 patch. And it looks like there was a formatting change on line 146 that is making the diff larger than it needs to be. The biggest thing missing from this submission is information about what performance testing you did. Ideally performance patches are submitted with enough information for a reviewer to duplicate the same test the author did, as well as hard before/after performance numbers from your test system. It often turns tricky to duplicate a performance gain, and being able to run the same test used for initial development eliminates a lot of the problems. Second bit of nitpicking. There are already some GUC values that appear or disappear based on compile time options. They're all debugging related things though. I would prefer not to see this one go away when its implementation isn't available. That's going to break any scripts that SHOW the setting to see if it's turned on or not as a first problem. I think the right model to follow here is the IFDEF setup used for effective_io_concurrency. I wouldn't worry about this too much though. Having a wal_use_fallocate GUC is good for testing. But if it works out well, when it's ready for commit I don't see why anyone would want it turned off on platforms where it works. There are already too many performance tweaking GUCs. Something has to be very likely to be changed from the default before it's worth adding one for it. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
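For illustration, the effective_io_concurrency model Greg points to amounts to something like the following guc.c entry (a sketch only; the GUC name, context, group, and defaults here are assumptions, not committed code):

    /* in the ConfigureNamesBool array in src/backend/utils/misc/guc.c */
    {
        {"wal_use_fallocate", PGC_SIGHUP, WAL_SETTINGS,
         gettext_noop("Uses posix_fallocate() to allocate new WAL files."),
         NULL
        },
        &wal_use_fallocate,
    #ifdef HAVE_POSIX_FALLOCATE
        true,               /* platform support found: default to on */
    #else
        false,              /* still visible via SHOW, but fixed off */
    #endif
        NULL, NULL, NULL
    },

That keeps the setting visible everywhere while confining the platform dependence to one place.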
On 2013-05-15 16:46:33 -0500, Jon Nelson wrote: > > * Is wal file creation performance actually relevant? Is the performance > > of a system running on fallocate()d wal files any different? > > In my limited testing, I noticed a drop of approx. 100ms per WAL file. > I do not have a good idea for how to really stress the WAL-file > creation area without calling pg_start_backup and pg_stop_backup over > and over (with archiving enabled). My point is that wal file creation usually isn't all that performance sensitive. Once the cluster has enough WAL files it will usually recycle them and thus never allocate new ones. So for this to be really beneficial it would be interesting to show different performance during normal running. You could also check how many extents a wal file is made of with fallocate in comparison to the old-style method (filefrag will give you that for most filesystems). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-05-15 16:46:33 -0500, Jon Nelson wrote: >> > * Is wal file creation performance actually relevant? Is the performance >> > of a system running on fallocate()d wal files any different? >> >> In my limited testing, I noticed a drop of approx. 100ms per WAL file. >> I do not have a good idea for how to really stress the WAL-file >> creation area without calling pg_start_backup and pg_stop_backup over >> and over (with archiving enabled). > > My point is that wal file creation usually isn't all that performance > sensitive. Once the cluster has enough WAL files it will usually recycle > them and thus never allocate new ones. So for this to be really > beneficial it would be interesting to show different performance during > normal running. You could also check how many extents a wal file > is made of with fallocate in comparison to the old-style method > (filefrag will give you that for most filesystems). But why does it have to be *really* beneficial? We're already making optional posix_fxxx calls and fallocate seems to do exactly what we would want in this context. Even if the 100ms drop doesn't show up all that often, I'd still take it just for the defragmentation benefits, and the patch is fairly tiny. merlin
On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> On 2013-05-15 16:46:33 -0500, Jon Nelson wrote: >>> > * Is wal file creation performance actually relevant? Is the performance >>> > of a system running on fallocate()d wal files any different? >>> >>> In my limited testing, I noticed a drop of approx. 100ms per WAL file. >>> I do not have a good idea for how to really stress the WAL-file >>> creation area without calling pg_start_backup and pg_stop_backup over >>> and over (with archiving enabled). >> >> My point is that wal file creation usually isn't all that performance >> sensitive. Once the cluster has enough WAL files it will usually recycle >> them and thus never allocate new ones. So for this to be really >> beneficial it would be interesting to show different performance during >> normal running. You could also check how many extents a wal file >> is made of with fallocate in comparison to the old-style method >> (filefrag will give you that for most filesystems). > > But why does it have to be *really* beneficial? We're already making > optional posix_fxxx calls and fallocate seems to do exactly what we > would want in this context. Even if the 100ms drop doesn't show up > all that often, I'd still take it just for the defragmentation > benefits, and the patch is fairly tiny. Here is sample output of filefrag on a somewhat busy database from our testing environment that exactly duplicates our production workloads. It does a lot of batch processing at night and a mix of 80% oltp / 20% olap during the day. This is on ext3. Interestingly, on ext4 servers I never saw more than 2 extents per file (but those servers are mostly not as busy). [root@rpisatysw001 pg_xlog]# filefrag * 00000001000006D200000064: 490 extents found, perfection would be 1 extent 00000001000006D200000065: 33 extents found, perfection would be 1 extent 00000001000006D200000066: 43 extents found, perfection would be 1 extent 00000001000006D200000067: 71 extents found, perfection would be 1 extent 00000001000006D200000068: 43 extents found, perfection would be 1 extent 00000001000006D200000069: 156 extents found, perfection would be 1 extent 00000001000006D20000006A: 52 extents found, perfection would be 1 extent 00000001000006D20000006B: 108 extents found, perfection would be 1 extent merlin
On 2013-05-17 15:48:38 -0500, Merlin Moncure wrote: > On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > > On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> On 2013-05-15 16:46:33 -0500, Jon Nelson wrote: > >>> > * Is wal file creation performance actually relevant? Is the performance > >>> > of a system running on fallocate()d wal files any different? > >>> > >>> In my limited testing, I noticed a drop of approx. 100ms per WAL file. > >>> I do not have a good idea for how to really stress the WAL-file > >>> creation area without calling pg_start_backup and pg_stop_backup over > >>> and over (with archiving enabled). > >> > >> My point is that wal file creation usually isn't all that performance > >> sensitive. Once the cluster has enough WAL files it will usually recycle > >> them and thus never allocate new ones. So for this to be really > >> beneficial it would be interesting to show different performance during > >> normal running. You could also check how many extents a wal file > >> is made of with fallocate in comparison to the old-style method > >> (filefrag will give you that for most filesystems). > > > > But why does it have to be *really* beneficial? We're already making > > optional posix_fxxx calls and fallocate seems to do exactly what we > > would want in this context. Even if the 100ms drop doesn't show up > > all that often, I'd still take it just for the defragmentation > > benefits, and the patch is fairly tiny. Well, it needs to be tested et al. And it's a fairly critical code path. I seem to remember that there were older glibc versions that didn't do such a great job at emulating fallocate for example. > Here is sample output of filefrag on a somewhat busy database from our > testing environment that exactly duplicates our production workloads. > It does a lot of batch processing at night and a mix of 80% oltp / 20% > olap during the day. This is on ext3. Interestingly, on ext4 servers > I never saw more than 2 extents per file (but those servers are mostly > not as busy). Ok, that's pretty bad. 490 extents in one file? Really? I'd consider shutting down the cluster, copying the wal files at a moment when there is enough free space. Just don't forget to sync afterwards. EXT4 is notably better at allocating space in growing files than ext3 due to delayed allocation (and other things), so it wouldn't surprise me to see similar differences in fragmentation even if the load were comparable. Ext3 doesn't have fallocate btw, so it wouldn't benefit from such a patch anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 17, 2013 at 4:18 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-05-17 15:48:38 -0500, Merlin Moncure wrote: >> On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> > On Fri, May 17, 2013 at 4:47 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> >> On 2013-05-15 16:46:33 -0500, Jon Nelson wrote: >> >>> > * Is wal file creation performance actually relevant? Is the performance >> >>> > of a system running on fallocate()d wal files any different? >> >>> >> >>> In my limited testing, I noticed a drop of approx. 100ms per WAL file. >> >>> I do not have a good idea for how to really stress the WAL-file >> >>> creation area without calling pg_start_backup and pg_stop_backup over >> >>> and over (with archiving enabled). >> >> >> >> My point is that wal file creation usually isn't all that performance >> >> sensitive. Once the cluster has enough WAL files it will usually recycle >> >> them and thus never allocate new ones. So for this to be really >> >> beneficial it would be interesting to show different performance during >> >> normal running. You could also check how many extents a wal file >> >> is made of with fallocate in comparison to the old-style method >> >> (filefrag will give you that for most filesystems). >> > >> > But why does it have to be *really* beneficial? We're already making >> > optional posix_fxxx calls and fallocate seems to do exactly what we >> > would want in this context. Even if the 100ms drop doesn't show up >> > all that often, I'd still take it just for the defragmentation >> > benefits, and the patch is fairly tiny. > > Well, it needs to be tested et al. And it's a fairly critical code > path. I seem to remember that there were older glibc versions that > didn't do such a great job at emulating fallocate for example. > >> Here is sample output of filefrag on a somewhat busy database from our >> testing environment that exactly duplicates our production workloads. >> It does a lot of batch processing at night and a mix of 80% oltp / 20% >> olap during the day. This is on ext3. Interestingly, on ext4 servers >> I never saw more than 2 extents per file (but those servers are mostly >> not as busy). > > Ok, that's pretty bad. 490 extents in one file? Really? I'd consider > shutting down the cluster, copying the wal files at a moment when there > is enough free space. Just don't forget to sync afterwards. > EXT4 is notably better at allocating space in growing files than ext3 > due to delayed allocation (and other things), so it wouldn't surprise me > to see similar differences in fragmentation even if the load were comparable. > > Ext3 doesn't have fallocate btw, so it wouldn't benefit from such a > patch anyway. yeah -- I see your point. The object lesson isn't so much 'improve postgres' as it is 'use a modern filesystem'. merlin
On Thu, May 16, 2013 at 7:05 PM, Greg Smith <greg@2ndquadrant.com> wrote: > On 5/16/13 9:16 AM, Jon Nelson wrote: >> >> Am I doing this the right way? Should I be posting the full patch each >> time, or incremental patches? > > > There are guidelines for getting your patch in the right format at > https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git > that would improve this one. You have some formatting issues with tab > spacing at lines 120 through 133 in your v3 patch. And it looks like there > was a formatting change on line 146 that is making the diff larger than it > needs to be. I've corrected the formatting change (end-of-line whitespace was stripped) on line 146. The other whitespace changes are - I think - due to newly-indented code from a new code block. Included please find a v4 patch which uses context diffs per the above URL. > The biggest thing missing from this submission is information about what > performance testing you did. Ideally performance patches are submitted with > enough information for a reviewer to duplicate the same test the author did, > as well as hard before/after performance numbers from your test system. It > often turns tricky to duplicate a performance gain, and being able to run > the same test used for initial development eliminates a lot of the problems. This has been a bit of a struggle. While it's true that WAL file creation doesn't happen with great frequency, and while it's also true that - with strace and other tests - it can be proven that fallocate(16MB) is much quicker than writing it zeroes by hand, proving that in the larger context of a running install has been challenging. Attached you'll find a small test script (t.sh) which creates a new cluster in 'foo', changes some config values, starts the cluster, and then times how long it takes pgbench to prepare a database. I've used "wal_level = hot_standby" in the hopes that this generates the largest number of WAL files (and I set the number of such files to 1024). The hardware is an AMD 9150e with a 2-disk software RAID1 (SATA disks) on kernel 3.9.2 and ext4 (x86_64, openSUSE 12.3). The test results are not that surprising. The longer the test (the larger the scale factor), the less of a difference using posix_fallocate makes. With a scale factor of 100, I see an average of 10-11% reduction in the time taken to initialize the database. With 300, it's about 5.5% and with 900, it's between 0 and 1.2%. I will be doing more testing but this is what I started with. I'm very open to suggestions. > Second bit of nitpicking. There are already some GUC values that appear or > disappear based on compile time options. They're all debugging related > things though. I would prefer not to see this one go away when its > implementation isn't available. That's going to break any scripts that SHOW > the setting to see if it's turned on or not as a first problem. I think the > right model to follow here is the IFDEF setup used for > effective_io_concurrency. I wouldn't worry about this too much though. > Having a wal_use_fallocate GUC is good for testing. But if it works out > well, when it's ready for commit I don't see why anyone would want it turned > off on platforms where it works. There are already too many performance > tweaking GUCs. Something has to be very likely to be changed from the > default before it's worth adding one for it. Ack.
I've revised the patch to always have the GUC (for now), default to false, and if configure can't find posix_fallocate (or the user disables it by way of pg_config_manual.h) then it remains a GUC that simply can't be changed. I'll also be re-running the tests. -- Jon
Attachment
On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: >> The biggest thing missing from this submission is information about what >> performance testing you did. Ideally performance patches are submitted with >> enough information for a reviewer to duplicate the same test the author did, >> as well as hard before/after performance numbers from your test system. It >> often turns tricky to duplicate a performance gain, and being able to run >> the same test used for initial development eliminates a lot of the problems. > > This has been a bit of a struggle. While it's true that WAL file > creation doesn't happen with great frequency, and while it's also true > that - with strace and other tests - it can be proven that > fallocate(16MB) is much quicker than writing it zeroes by hand, > proving that in the larger context of a running install has been > challenging. It's nice to be able to test things in the context of a running install, but sometimes a microbenchmark is just as good. I mean, if posix_fallocate() is faster, then it's just faster, right? It's likely to be pretty hard to get reproducible numbers for how much this actually helps in the real world because write tests are inherently pretty variable depending on a lot of factors we don't control, so even if Jon has got the best possible test, the numbers may bounce around so much that you can't really measure the (probably small) gain from this approach. But that doesn't seem like a reason not to adopt the approach and take whatever gain there is. At least, not that I can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-05-28 10:03:58 -0400, Robert Haas wrote: > On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > >> The biggest thing missing from this submission is information about what > >> performance testing you did. Ideally performance patches are submitted with > >> enough information for a reviewer to duplicate the same test the author did, > >> as well as hard before/after performance numbers from your test system. It > >> often turns tricky to duplicate a performance gain, and being able to run > >> the same test used for initial development eliminates a lot of the problems. > > > > This has been a bit of a struggle. While it's true that WAL file > > creation doesn't happen with great frequency, and while it's also true > > that - with strace and other tests - it can be proven that > > fallocate(16MB) is much quicker than writing it zeroes by hand, > > proving that in the larger context of a running install has been > > challenging. > > It's nice to be able to test things in the context of a running > install, but sometimes a microbenchmark is just as good. I mean, if > posix_fallocate() is faster, then it's just faster, right? Well, it's a bit more complex than that. Fallocate doesn't actually initialize the disk space in most filesystems, just marks it as allocated and zeroed, which is one of the reasons it can be noticeably faster. But that can make the runtime overhead of writing to those pages higher. I wonder whether noticeably upping checkpoint_segments and then a) COPYing in a large table or b) running a pgbench on a previously initialized table would show a difference. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 28, 2013 at 10:15 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-05-28 10:03:58 -0400, Robert Haas wrote: >> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: >> >> The biggest thing missing from this submission is information about what >> >> performance testing you did. Ideally performance patches are submitted with >> >> enough information for a reviewer to duplicate the same test the author did, >> >> as well as hard before/after performance numbers from your test system. It >> >> often turns tricky to duplicate a performance gain, and being able to run >> >> the same test used for initial development eliminates a lot of the problems. >> > >> > This has been a bit of a struggle. While it's true that WAL file >> > creation doesn't happen with great frequency, and while it's also true >> > that - with strace and other tests - it can be proven that >> > fallocate(16MB) is much quicker than writing it zeroes by hand, >> > proving that in the larger context of a running install has been >> > challenging. >> >> It's nice to be able to test things in the context of a running >> install, but sometimes a microbenchmark is just as good. I mean, if >> posix_fallocate() is faster, then it's just faster, right? > > Well, it's a bit more complex than that. Fallocate doesn't actually > initialize the disk space in most filesystems, just marks it as > allocated and zeroed, which is one of the reasons it can be noticeably > faster. But that can make the runtime overhead of writing to those pages > higher. Maybe it would be good to measure that impact. Something like this: 1. Write 16MB of zeroes to an empty file in the same size chunks we're currently using (8kB?). Time that. Rewrite the file with real data. Time that. 2. posix_fallocate() an empty file out to 16MB. Time that. Rewrite the file with real data. Time that. Personally, I have trouble believing that writing 16MB of zeroes by hand is "better" than telling the OS to do it for us. If that's so, the OS is just stupid, because it ought to be able to optimize the crap out of that compared to anything we can do. Of course, it is more than possible that the OS is in fact stupid. But I'd like to hope not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 28, 2013 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 28, 2013 at 10:15 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> On 2013-05-28 10:03:58 -0400, Robert Haas wrote: >>> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: >>> >> The biggest thing missing from this submission is information about what >>> >> performance testing you did. Ideally performance patches are submitted with >>> >> enough information for a reviewer to duplicate the same test the author did, >>> >> as well as hard before/after performance numbers from your test system. It >>> >> often turns tricky to duplicate a performance gain, and being able to run >>> >> the same test used for initial development eliminates a lot of the problems. >>> > >>> > This has been a bit of a struggle. While it's true that WAL file >>> > creation doesn't happen with great frequency, and while it's also true >>> > that - with strace and other tests - it can be proven that >>> > fallocate(16MB) is much quicker than writing it zeroes by hand, >>> > proving that in the larger context of a running install has been >>> > challenging. >>> >>> It's nice to be able to test things in the context of a running >>> install, but sometimes a microbenchmark is just as good. I mean, if >>> posix_fallocate() is faster, then it's just faster, right? >> >> Well, it's a bit more complex than that. Fallocate doesn't actually >> initialize the disk space in most filesystems, just marks it as >> allocated and zeroed, which is one of the reasons it can be noticeably >> faster. But that can make the runtime overhead of writing to those pages >> higher. > > Maybe it would be good to measure that impact. Something like this: > > 1. Write 16MB of zeroes to an empty file in the same size chunks we're > currently using (8kB?). Time that. Rewrite the file with real data. > Time that. > 2. posix_fallocate() an empty file out to 16MB. Time that. Rewrite > the file with real data. Time that. > > Personally, I have trouble believing that writing 16MB of zeroes by > hand is "better" than telling the OS to do it for us. If that's so, > the OS is just stupid, because it ought to be able to optimize the > crap out of that compared to anything we can do. Of course, it is > more than possible that the OS is in fact stupid. But I'd like to > hope not. I wrote a little C program to do something very similar to that (which I hope to post later today). It opens a new file, fallocates 16MB, calls fdatasync. Then it loops 10 times: seeks to the start of the file, writes 16MB of ones, calls fdatasync. Then it closes and removes the file, re-opens it, and this time writes out 16MB of zeroes, calls fdatasync, and then does the same loop as above. The program times the process from file open to file unlink, inclusive. The results - for me - are pretty consistent: using fallocate is 12-13% quicker than writing out zeroes. I used fdatasync twice to (attempt to) mimic what the WAL writer does. -- Jon
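A condensed sketch of that kind of test program (reconstructed from the description above, not the program Jon attaches later; error checking is omitted for brevity):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define SEG_SIZE (16 * 1024 * 1024)
    #define CHUNK 8192

    /* Seek back to the start and rewrite the whole segment, ten times over. */
    static void
    rewrite_loop(int fd, const char *buf)
    {
        for (int i = 0; i < 10; i++)
        {
            lseek(fd, 0, SEEK_SET);
            for (int off = 0; off < SEG_SIZE; off += CHUNK)
                write(fd, buf, CHUNK);
            fdatasync(fd);
        }
    }

    static double
    elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int
    main(void)
    {
        char   *ones = malloc(CHUNK);
        char   *zeroes = calloc(1, CHUNK);
        struct timespec t0, t1;
        int     fd;

        memset(ones, 1, CHUNK);

        /* Strategy 1: posix_fallocate the segment, then rewrite it. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fd = open("bench.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0600);
        posix_fallocate(fd, 0, SEG_SIZE);
        fdatasync(fd);  /* see the next reply: this should really be fsync() */
        rewrite_loop(fd, ones);
        close(fd);
        unlink("bench.tmp");
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("fallocate: %.3fs\n", elapsed(t0, t1));

        /* Strategy 2: write the zeroes out by hand, then rewrite. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fd = open("bench.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0600);
        for (int off = 0; off < SEG_SIZE; off += CHUNK)
            write(fd, zeroes, CHUNK);
        fdatasync(fd);
        rewrite_loop(fd, ones);
        close(fd);
        unlink("bench.tmp");
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("zero-fill: %.3fs\n", elapsed(t0, t1));
        return 0;
    }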
On 2013-05-28 10:12:05 -0500, Jon Nelson wrote: > On Tue, May 28, 2013 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, May 28, 2013 at 10:15 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> On 2013-05-28 10:03:58 -0400, Robert Haas wrote: > >>> On Sat, May 25, 2013 at 2:55 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > >>> >> The biggest thing missing from this submission is information about what > >>> >> performance testing you did. Ideally performance patches are submitted with > >>> >> enough information for a reviewer to duplicate the same test the author did, > >>> >> as well as hard before/after performance numbers from your test system. It > >>> >> often turns tricky to duplicate a performance gain, and being able to run > >>> >> the same test used for initial development eliminates a lot of the problems. > >>> > > >>> > This has been a bit of a struggle. While it's true that WAL file > >>> > creation doesn't happen with great frequency, and while it's also true > >>> > that - with strace and other tests - it can be proven that > >>> > fallocate(16MB) is much quicker than writing it zeroes by hand, > >>> > proving that in the larger context of a running install has been > >>> > challenging. > >>> > >>> It's nice to be able to test things in the context of a running > >>> install, but sometimes a microbenchmark is just as good. I mean, if > >>> posix_fallocate() is faster, then it's just faster, right? > >> > >> Well, it's a bit more complex than that. Fallocate doesn't actually > >> initialize the disk space in most filesystems, just marks it as > >> allocated and zeroed, which is one of the reasons it can be noticeably > >> faster. But that can make the runtime overhead of writing to those pages > >> higher. > > > > Maybe it would be good to measure that impact. Something like this: > > > > 1. Write 16MB of zeroes to an empty file in the same size chunks we're > > currently using (8kB?). Time that. Rewrite the file with real data. > > Time that. > > 2. posix_fallocate() an empty file out to 16MB. Time that. Rewrite > > the file with real data. Time that. > > > > Personally, I have trouble believing that writing 16MB of zeroes by > > hand is "better" than telling the OS to do it for us. If that's so, > > the OS is just stupid, because it ought to be able to optimize the > > crap out of that compared to anything we can do. Of course, it is > > more than possible that the OS is in fact stupid. But I'd like to > > hope not. > > I wrote a little C program to do something very similar to that (which > I hope to post later today). > It opens a new file, fallocates 16MB, calls fdatasync. You need to call fsync() not fdatasync() the first time round. fdatasync doesn't guarantee metadata is synced. > Then it closes and removes the file, re-opens it, and this time writes > out 16MB of zeroes, calls fdatasync, and then does the same loop as > above. The program times the process from file open to file unlink, > inclusive. > > The results - for me - are pretty consistent: using fallocate is > 12-13% quicker than writing out zeroes. Cool! > I used fdatasync twice to (attempt to) mimic what the WAL writer does. Not sure what you mean by that though? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 5/28/13 11:12 AM, Jon Nelson wrote: > It opens a new file, fallocates 16MB, calls fdatasync. Outside of the run for performance testing, I think it would be good at this point to validate that there is really a 16MB file full of zeroes resulting from these operations. I am not really concerned that posix_fallocate might be slower in some cases; that seems unlikely. I am concerned that it might result in a file that isn't structurally the same as the 16MB of zero writes implementation used now. The timing program you're writing has some aspects that are similar to the contrib/pg_test_fsync program. You might borrow some code from there usefully. To clarify the suggestion I was making before about including performance test results: that doesn't necessarily mean the testing code must run using only the database. That's better if possible, but as Robert says it may not be for some optimizations. The important thing is to have something measuring the improvement that a reviewer can duplicate, and if that's a standalone benchmark problem that's still very useful. The main thing I'm wary of is any "this should be faster" claims that don't come with any repeatable measurements at all. Very often theories about the fastest way to do something don't match what's actually seen in testing. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Tue, May 28, 2013 at 10:36 AM, Greg Smith <greg@2ndquadrant.com> wrote: > On 5/28/13 11:12 AM, Jon Nelson wrote: >> >> It opens a new file, fallocates 16MB, calls fdatasync. > > > Outside of the run for performance testing, I think it would be good at this > point to validate that there is really a 16MB file full of zeroes resulting > from these operations. I am not really concerned that posix_fallocate might > be slower in some cases; that seems unlikely. I am concerned that it might > result in a file that isn't structurally the same as the 16MB of zero writes > implementation used now. util-linux comes with fallocate which (might?) suffice for testing in that respect, no? If that is a real concern, it could be made part of the autoconf testing, perhaps. > The timing program you're writing has some aspects that are similar to the > contrib/pg_test_fsync program. You might borrow some code from there > usefully. Thanks! If it looks like what I'm attaching will not do, then I'll look at that as a possible next step. > To clarify the suggestion I was making before about including performance > test results: that doesn't necessarily mean the testing code must run using > only the database. That's better if possible, but as Robert says it may not > be for some optimizations. The important thing is to have something > measuring the improvement that a reviewer can duplicate, and if that's a > standalone benchmark problem that's still very useful. The main thing I'm > wary of is any "this should be faster" claims that don't come with any > repeatable measurements at all. Very often theories about the fastest way > to do something don't match what's actually seen in testing. Ack. A note: The attached test program uses *fsync* instead of *fdatasync* after calling fallocate (or writing out 16MB of zeroes), per an earlier suggestion. -- Jon
Attachment
On 5/28/13 10:00 PM, Jon Nelson wrote: > On Tue, May 28, 2013 at 10:36 AM, Greg Smith <greg@2ndquadrant.com> wrote: >> On 5/28/13 11:12 AM, Jon Nelson wrote: >>> >>> It opens a new file, fallocates 16MB, calls fdatasync. >> >> >> Outside of the run for performance testing, I think it would be good at this >> point to validate that there is really a 16MB file full of zeroes resulting >> from these operations. I am not really concerned that posix_fallocate might >> be slower in some cases; that seems unlikely. I am concerned that it might >> result in a file that isn't structurally the same as the 16MB of zero writes >> implementation used now. > > util-linux comes with fallocate which (might?) suffice for testing in > that respect, no? > If that is a real concern, it could be made part of the autoconf > testing, perhaps. I was just thinking of something to run in your test program, not another build time check. Just run the new allocation sequence, and then check the resulting WAL file for a) correct length, and b) 16MB of zero bytes. I would like to build some confidence that posix_fallocate is operating correctly in this context on at least one platform. My experience with Linux handling this class of functions correctly has left me skeptical of them working until that's proven to be the case. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
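A sketch of that validation (illustrative only; it allocates a scratch file rather than a real WAL segment):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define SEG_SIZE (16 * 1024 * 1024)

    int
    main(void)
    {
        char        buf[8192];
        struct stat st;
        ssize_t     n;
        int         fd = open("check.tmp", O_CREAT | O_TRUNC | O_RDWR, 0600);

        if (fd < 0 || posix_fallocate(fd, 0, SEG_SIZE) != 0)
            return 1;
        fsync(fd);

        /* a) correct length */
        if (fstat(fd, &st) != 0 || st.st_size != SEG_SIZE)
        {
            printf("unexpected length: %ld\n", (long) st.st_size);
            return 1;
        }

        /* b) every byte reads back as zero */
        lseek(fd, 0, SEEK_SET);
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            for (ssize_t i = 0; i < n; i++)
                if (buf[i] != 0)
                {
                    printf("nonzero byte found\n");
                    return 1;
                }

        printf("ok: %d bytes, all zero\n", SEG_SIZE);
        close(fd);
        unlink("check.tmp");
        return 0;
    }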
On 5/28/13 11:36 AM, Greg Smith wrote: > Outside of the run for performance testing, I think it would be good at > this point to validate that there is really a 16MB file full of zeroes > resulting from these operations. I am not really concerned that > posix_fallocate might be slower in some cases; that seems unlikely. I > am concerned that it might result in a file that isn't structurally the > same as the 16MB of zero writes implementation used now. I see nothing in the posix_fallocate() man pages that says that the allocated space is filled with any kind of data or zeroes. It will likely be garbage data, but that should be fine for a new WAL file.
* Peter Eisentraut (peter_e@gmx.net) wrote: > On 5/28/13 11:36 AM, Greg Smith wrote: > > Outside of the run for performance testing, I think it would be good at > > this point to validate that there is really a 16MB file full of zeroes > > resulting from these operations. I am not really concerned that > > posix_fallocate might be slower in some cases; that seems unlikely. I > > am concerned that it might result in a file that isn't structurally the > > same as the 16MB of zero writes implementation used now. > > I see nothing in the posix_fallocate() man pages that says that the > allocated space is filled with any kind of data or zeroes. It will > likely be garbage data, but that should be fine for a new WAL file. I *really* hope that the Linux kernel, and other, folks are smart enough to realize that they can't just re-use random blocks from an I/O device without cleaning it first. That would be one massive security hole. I expect posix_fallocate() actually works more like sparse files, except that it also counts the space as being 'taken', but it doesn't go out and actually pull blocks to use until you actually go to write to it. At which point, perhaps there's an optimization that says "if the first thing done with this is writing, then just write out whatever data is requested and then fill the rest of the block out with zeros", and a similar read operation which says "if we haven't formally assigned a block for this, just return zeros". Hopefully it's smart enough to avoid writing out all zeros and then turning around and writing out whatever data is given, though since it'd all be in memory, perhaps that wouldn't be too bad and might be simpler to implement. Thanks, Stephen
On 2013-05-29 10:36:07 -0400, Stephen Frost wrote: > * Peter Eisentraut (peter_e@gmx.net) wrote: > > On 5/28/13 11:36 AM, Greg Smith wrote: > > > Outside of the run for performance testing, I think it would be good at > > > this point to validate that there is really a 16MB file full of zeroes > > > resulting from these operations. I am not really concerned that > > > posix_fallocate might be slower in some cases; that seems unlikely. I > > > am concerned that it might result in a file that isn't structurally the > > > same as the 16MB of zero writes implementation used now. > > > > I see nothing in the posix_fallocate() man pages that says that the > > allocated space is filled with any kind of data or zeroes. It will > > likely be garbage data, but that should be fine for a new WAL file. > > I *really* hope that the Linux kernel, and other, folks are smart enough > to realize that they can't just re-use random blocks from an I/O device > without cleaning it first. FWIW, posix' description about posix_fallocate() doesn't actually say *anything* about reading. The guarantee it makes is: "If posix_fallocate() returns successfully, subsequent writes to the specified file data shall not fail due to the lack of free space on the file system storage media.". http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html So we don't even know whether we can read. I think that means we need to zero the file anyway... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 5/29/13 10:42 AM, Andres Freund wrote: > On 2013-05-29 10:36:07 -0400, Stephen Frost wrote: >> I *really* hope that the Linux kernel, and other, folks are smart enough >> to realize that they can't just re-use random blocks from an I/O device >> without cleaning it first. > > FWIW, posix' description about posix_fallocate() doesn't actually say > *anything* about reading. The guarantee it makes is: > "If posix_fallocate() returns successfully, subsequent writes to the > specified file data shall not fail due to the lack of free space on the > file system storage media.". > > http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html > > So we don't even know whether we can read. I think that means we need to > zero the file anyway... We could use Linux fallocate(), which does guarantee that the file reads back as zeroes. Or we use posix_fallocate() and write over the first few bytes, enough for a subsequent reader to detect that it shouldn't read any further. But all of this is getting very complicated for such a marginal improvement.
On Wed, May 29, 2013 at 10:42 AM, Andres Freund <andres@2ndquadrant.com> wrote: > FWIW, posix' description about posix_fallocate() doesn't actually say > *anything* about reading. The guarantee it makes is: > "If posix_fallocate() returns successfully, subsequent writes to the > specified file data shall not fail due to the lack of free space on the > file system storage media.". > > http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html > > So we don't even know whether we can read. I think that means we need to > zero the file anyway... Surely this is undue pessimism. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 5/30/13 6:49 AM, Robert Haas wrote: > On Wed, May 29, 2013 at 10:42 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> So we don't even know whether we can read. I think that means we need to >> zero the file anyway... > > Surely this is undue pessimism. There have been many occasions where I've found the Linux kernel defining support for POSIX behavior with a NOP stub that basically says "we should make this work one day". I don't know whether the fallocate code is one of those or a fully implemented call. Based on that history, until I see a reader that validates the resulting files are good I have to assume they're not. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 2013-05-30 06:49:42 -0400, Robert Haas wrote: > On Wed, May 29, 2013 at 10:42 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > FWIW, posix' description about posix_fallocate() doesn't actually say > > *anything* about reading. The guarantee it makes is: > > "If posix_fallocate() returns successfully, subsequent writes to the > > specified file data shall not fail due to the lack of free space on the > > file system storage media.". > > > > http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html > > > > So we don't even know whether we can read. I think that means we need to > > zero the file anyway... > > Surely this is undue pessimism. Why? The spec doesn't specify that case, which very well allows other behaviour. Glibc sure does behave sensibly and zeroes the data (sysdeps/posix/posix_fallocate64.c for the generic implementation) and so does Linux's fallocate() syscall, but that doesn't say much about other implementations. None of the manpages I could find, nor the spec, says anything about the file's contents in the extended range. Given there were at least three manpages of different origins that didn't specify that behaviour I am not too optimistic. Why they didn't address that completely obvious question is hard to understand from my pov. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
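One conservative way to cope with that underspecification, sketched here as an illustration (the fall-back error codes are an assumption about how a libc without support might fail, not documented behaviour; HAVE_POSIX_FALLOCATE stands in for a configure-provided macro, and zero_fill_segment is the write-zeroes loop sketched earlier):

    #include <errno.h>

    static int
    allocate_segment(int fd)
    {
    #ifdef HAVE_POSIX_FALLOCATE
        int     rc = posix_fallocate(fd, 0, SEG_SIZE);

        if (rc == 0)
            return 0;
        if (rc != EINVAL && rc != EOPNOTSUPP)
            return rc;      /* a real failure: report it */
        /* otherwise fall through and zero the file the traditional way */
    #endif
        return zero_fill_segment(fd);
    }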
On 2013-05-30 06:55:16 -0400, Greg Smith wrote: > On 5/30/13 6:49 AM, Robert Haas wrote: > >On Wed, May 29, 2013 at 10:42 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >>So we don't even know whether we can read. I think that means we need to > >>zero the file anyway... > > > >Surely this is undue pessimism. > > There have been many occasions where I've found the Linux kernel defining > support for POSIX behavior with a NOP stub that basically says "we should > make this work one day". I don't know whether the fallocate code is one of > those or a fully implemented call. Based on that history, until I see a > reader that validates the resulting files are good I have to assume they're > not. That argument in contrast I find not very convincing though. What was the last instance of such a system call that did not just error out with ENOTSUPP or such? The Linux fallocate call is fully specified for this behaviour and got added in 2.6.23; there wasn't a stub before, so I am far less worried about it than about the underspecification of posix_fallocate(). Also, if some system call doesn't follow its documented specifications it's not fully our problem anymore. If we rely on undocumented behaviour though... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 5/30/13 7:17 AM, Andres Freund wrote: > That argument in contrast I find not very convincing though. What was > the last instance of such a system call that did not just error out > with ENOTSUPP or such? http://linux.die.net/man/2/posix_fadvise talks about POSIX_FADV_NOREUSE and POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op, depending on your version. I know there were more examples than just that one that popped up during the testing of effective_io_concurrency. My starting position has to assume that posix_fallocate can have the same sort of surprising behavior that showed up repeatedly when we were trying to use posix_fadvise more aggressively. The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same thing) was a similar issue, and that's the first one that left me forever skeptical of Linux kernel claims in this area until they are explicitly validated: http://lwn.net/Articles/350225/ -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 2013-05-30 07:48:51 -0400, Greg Smith wrote: > On 5/30/13 7:17 AM, Andres Freund wrote: > >That argument in contrast I find not very convincing though. What was > >the last instance of such a system call that did not just error out > >with ENOTSUPP or such? > > http://linux.die.net/man/2/posix_fadvise talks about POSIX_FADV_NOREUSE and > POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op, > depending on your version. I know there were more examples than just that > one that popped up during the testing of effective_io_concurrency. My > starting position has to assume that posix_fallocate can have the same sort > of surprising behavior that showed up repeatedly when we were trying to use > posix_fadvise more aggressively. Uh. How is that a correctness problem? fadvise is a hint which is pretty different from a fallocate where ignoring would have much more severe consequences. I don't think that's a very meaningful comparison. > The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same thing) > was a similar issue, and that's the first one that left me forever skeptical > of Linux kernel claims in this area until they are explicitly validated: > http://lwn.net/Articles/350225/ Yea, but that mistake is literally decades old... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 5/30/13 7:52 AM, Andres Freund wrote: > fadvise is a hint which is pretty > different from a fallocate where ignoring it would have far more > severe consequences. Yes, it will. That's why I want to see it tested. There are more than enough past examples of bad behavior here to be skeptical that this sort of API may not work exactly as specified. If you're willing to believe the spec, that's fine, but I think that's dangerously optimistic. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, May 30, 2013 at 7:13 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> Surely this is undue pessimism. > > Why? The spec doesn't specify that case and that very well allows other > behaviour. Glibc sure does behave sensibly and zeroes the data > (sysdeps/posix/posix_fallocate64.c for the generic implementation) and > so does linux' fallocate() syscall, but that doesn't say much about > other implementations. > > None of the manpages I could find, nor the spec says anything about the > file's contents in the extended range. Given there were at least three > manpages of different origins that didn't specify that behaviour I am > not too optimistic. Why they didn't specify that completely obvious > question is hard to understand from my pov. I think they didn't specify it because it IS obvious. As Stephen says, it's been understood for decades that allowing unzeroed pages to be reallocated to some other file is a major security hole. I think we can assume that no credible OS does that. If there's some OS out there that chooses to fill the pre-extended pages with 0x55 or cat /dev/urandom instead of 0x00, they probably deserve what they get. It's hard for me to believe that anything that silly actually exists. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-05-30 08:02:56 -0400, Robert Haas wrote: > On Thu, May 30, 2013 at 7:13 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> Surely this is undue pessimism. > > > > Why? The spec doesn't specify that case and that very well allows other > > behaviour. Glibc sure does behave sensibly and zeroes the data > > (sysdeps/posix/posix_fallocate64.c for the generic implementation) and > > so does linux' fallocate() syscall, but that doesn't say much about > > other implementations. > > > > None of the manpages I could find, nor the spec says anything about the > > file's contents in the extended range. Given there were at least three > > manpages of different origins that didn't specify that behaviour I am > > not too optimistic. Why they didn't specify that completely obvious > > question is hard to understand from my pov. > > I think they didn't specify it because it IS obvious. As Stephen > says, it's been understood for decades that allowing unzeroed pages to > be reallocated to some other file is a major security hole. I think > we can assume that no credible OS does that. If there's some OS out > there that chooses to fill the pre-extended pages with 0x55 or cat > /dev/urandom instead of 0x00, they probably deserve what they get. > It's hard for me to believe that anything that silly actually > exists. I don't think there's much danger of getting uninitialized data or such. That clearly would be insane. I think somebody might interpret it as read(2) returning an error until the page has been written to which isn't completely crazy. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 5/30/13 7:13 AM, Andres Freund wrote: > Why? The spec doesn't specify that case and that very well allows other > behaviour. Glibc sure does behave sensibly and zeroes the data > (sysdeps/posix/posix_fallocate64.c for the generic implementation) and > so does linux' fallocate() syscall, but that doesn't say much about > other implementations. glibc actually only writes one byte to every file system block, to make sure the block is allocated. It doesn't actually zero every byte.
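To make that concrete, here is a minimal standalone sketch of the one-byte-per-block technique Peter describes, assuming a fixed 4KiB block size; the real glibc code derives the block size from fstat and takes care not to clobber existing data, so treat this as illustrative only:

/* Sketch of glibc's generic posix_fallocate fallback: write one zero
   byte at the end of each filesystem block so the block gets
   allocated, without rewriting every byte.  4KiB block size assumed. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define FILE_SIZE  (16 * 1024 * 1024)

int
main(void)
{
    off_t   offset;
    int     fd = open("emulation-sketch", O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    for (offset = BLOCK_SIZE - 1; offset < FILE_SIZE; offset += BLOCK_SIZE)
    {
        if (pwrite(fd, "", 1, offset) != 1)   /* writes a single '\0' */
        {
            perror("pwrite");
            return 1;
        }
    }
    close(fd);
    return 0;
}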
On 5/30/13 8:02 AM, Robert Haas wrote: > If there's some OS out > there that chooses to fill the pre-extended pages with 0x55 or cat > /dev/urandom instead of 0x00, they probably deserve what they get. Even that wouldn't be a problem for our purpose. The only problem would be if you can't read from the allocated region at all.
On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote: > On 5/30/13 8:02 AM, Robert Haas wrote: > > If there's some OS out > > there that chooses to fill the pre-extended pages with 0x55 or cat > > /dev/urandom instead of 0x00, they probably deserve what they get. > > Even that wouldn't be a problem for our purpose. The only problem would > be if you can't read from the allocated region at all. Well, only as long as we only use it for preallocation of wal files. I am much, much more interested in doing that for the heap. And there that surely would be a problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-05-30 08:17:28 -0400, Peter Eisentraut wrote: > On 5/30/13 7:13 AM, Andres Freund wrote: > > Why? The spec doesn't specify that case and that very well allows other > > behaviour. Glibc sure does behave sensibly and zeroes the data > > (sysdeps/posix/posix_fallocate64.c for the generic implementation) and > > so does linux' fallocate() syscall, but that doesn't say much about > > other implementations. > > glibc actually only writes one byte to every file system block, to make > sure the block is allocated. It doesn't actually zero every byte. Which is fine since that guarantees we can read from those areas... And unless I misremember something that actually guarantees that the rest of the data is initialized to zero as well. Yes: "subsequent reads of data in the gap shall return bytes with the value 0 until data is actually written into the gap". But really, I am not at all concerned about some obscure values being returned, but about a read() not being successful.. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
* Peter Eisentraut (peter_e@gmx.net) wrote: > On 5/30/13 7:13 AM, Andres Freund wrote: > > Why? The spec doesn't specify that case and that very well allows other > > behaviour. Glibc sure does behave sensibly and zeroes the data > > (sysdeps/posix/posix_fallocate64.c for the generic implementation) and > > so does linux' fallocate() syscall, but that doesn't say much about > > other implementations. > > glibc actually only writes one byte to every file system block, to make > sure the block is allocated. It doesn't actually zero every byte. That goes back to the 'sane implementation' question.. Is there a case where that would actually be different from writing zeros for the entire block..? Is there some OS that gives you random data for the 'hole' when you write a byte, seek to the start of the next block and then write another byte? That actually *would* be against what's documented and required by spec, no? Thanks, Stephen
* Andres Freund (andres@2ndquadrant.com) wrote: > On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote: > > On 5/30/13 8:02 AM, Robert Haas wrote: > > > If there's some OS out > > > there that chooses to fill the pre-extended pages with 0x55 or cat > > > /dev/urandom instead of 0x00, they probably deserve what they get. > > > > Even that wouldn't be a problem for our purpose. The only problem would > > be if you can't read from the allocated region at all. > > Well, only as long as we only use it for preallocation of wal files. I > am much, much more interested in doing that for the heap. And there that > surely would be a problem. Yes, that was my thinking as well. If posix_fallocate is faster than writing out 8K of zeros, and the block can immediately be read as if it had actually been written to, then I'd be very interested in using it to extend heap files. As I mentioned in this thread (or perhaps it was another), I don't think this solves the locking issue around the relation extension lock, but it might help some and it may make it easier to tweak that logic to allocate larger chunks in the future. Thanks, Stephen
* Andres Freund (andres@2ndquadrant.com) wrote: > But really, I am not at all concerned about some obscure values being > returned, but about a read() not being successful.. Alright, so what do we need to do to test this? We really just need a short C program written up and then a bunch of folks to run it on various architectures, right? Gee, sounds like what the buildfarm was made for (alright, alright, PostgreSQL isn't exactly a 'short C program', but you get the idea). As I recall, Andrew reworked the buildfarm code to be more modular too.. Anyone have thoughts about how we could run these kinds of tests with it? Or do people think that's a bad idea? Thanks, Stephen
On 2013-05-30 08:53:37 -0400, Stephen Frost wrote: > * Andres Freund (andres@2ndquadrant.com) wrote: > > But really, I am not at all concerned about some obscure values being > > returned, but about a read() not being successful.. > > Alright, so what do we need to do to test this? We really just need a > short C program written up and then a bunch of folks to run it on > various architectures, right? After a bit of perusing the standard, writing a single byte to the end of the file after the fallocate ought to make at least the reading guaranteed to be defined. If we did seek(last_byte); write(); posix_fallocate() we should even always have defined content. Yuck. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
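Spelled out, the sequence Andres is describing would look roughly like the sketch below (file name and segment size invented for the example; pwrite stands in for the seek-then-write pair, and note that posix_fallocate reports errors through its return value rather than errno):

/* Sketch: pin down the file size (and therefore defined, zero-filled
   read-back contents) with a write of the last byte, then ask the OS
   to actually reserve all the blocks. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEGMENT_SIZE (16 * 1024 * 1024)

int
main(void)
{
    int     rc;
    int     fd = open("wal-sketch", O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    if (pwrite(fd, "", 1, SEGMENT_SIZE - 1) != 1)   /* seek + write */
    {
        perror("pwrite");
        return 1;
    }
    rc = posix_fallocate(fd, 0, SEGMENT_SIZE);
    if (rc != 0)
    {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
        return 1;
    }
    close(fd);
    return 0;
}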
* Andres Freund (andres@2ndquadrant.com) wrote: > After a bit of standard perusing writing a single byte to the end of the > file after the fallocate ought to make at least the reading guaranteed > to be defined. If we did seek(last_byte); write(); posix_fallocate() we > should even always have defined content. Yuck. Alright, but would that actually be any better than just doing what glibc's posix_fallocate() does in the generic case? And, to be honest, it makes me a bit nervous to seek/write like that because it looks like the typical "create a hole" setup, which we certainly aren't intending, yet if the posix_fallocate() call disappeared, or did nothing, or this code was copied w/o it, or someone didn't understand what it did, we could end up with that. Not a fan. :( Thanks, Stephen
On 5/30/13 8:50 AM, Stephen Frost wrote: > I don't think this solves the locking issue around the > relation extension lock, but it might help some and it may make it > easier to tweak that logic to allocate larger chunks in the future. It might make it a bit faster, but it doesn't make it any easier to implement. The messy part of extending relations in larger chunks is how to communicate that back into the buffer manager usefully. The extension path causing trouble is RelationGetBufferForTuple calling ReadBufferBI. All of that is passing a single buffer around. There's no simple way I can see to rewrite it to handle more than one at a time. I have a test case for relation extension that I'm going to package up soon. That makes it easy to see where the problem is at. Far as I can tell the reason it hasn't been fixed before now is that it's a pain to write the code. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, May 30, 2013 at 8:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I don't think there's much danger of getting uninitialized data or > such. That clearly would be insane. I think somebody might interpret it > as read(2) returning an error until the page has been written to which > isn't completely crazy. In the absence of tangible evidence of some implementation that behaves that way, I think that's just paranoia. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greg Smith escribió: > The messy part of extending relations in larger chunks > is how to communicate that back into the buffer manager usefully. > The extension path causing trouble is RelationGetBufferForTuple > calling ReadBufferBI. All of that is passing a single buffer > around. There's no simple way I can see to rewrite it to handle > more than one at a time. No, but we can have it create several pages and insert them into the FSM. So they aren't returned to the original caller but are available to future users. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
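A very rough sketch of that idea, using buffer-manager and FSM calls that exist today but ignoring WAL logging, the relation extension lock, and error handling; this is the shape of the approach rather than committable server code:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/freespace.h"
#include "utils/rel.h"

/* Hypothetical: extend a relation by 'count' pages, publishing all but
   the last through the free space map so other backends can use them.
   The last page is returned, still exclusively locked, to the caller. */
static Buffer
extend_relation_in_bulk(Relation rel, int count)
{
    Buffer  buf = InvalidBuffer;
    int     i;

    for (i = 0; i < count; i++)
    {
        buf = ReadBuffer(rel, P_NEW);
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        PageInit(BufferGetPage(buf), BufferGetPageSize(buf), 0);
        MarkBufferDirty(buf);

        if (i < count - 1)
        {
            BlockNumber blkno = BufferGetBlockNumber(buf);

            UnlockReleaseBuffer(buf);
            /* make the fresh page visible to future inserters */
            RecordPageWithFreeSpace(rel, blkno,
                                    BLCKSZ - SizeOfPageHeaderData);
        }
    }
    return buf;
}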
On 5/30/13 11:21 AM, Alvaro Herrera wrote: > Greg Smith escribió: > >> The messy part of extending relations in larger chunks >> is how to communicate that back into the buffer manager usefully. >> The extension path causing trouble is RelationGetBufferForTuple >> calling ReadBufferBI. All of that is passing a single buffer >> around. There's no simple way I can see to rewrite it to handle >> more than one at a time. > > No, but we can have it create several pages and insert them into the > FSM. So they aren't returned to the original caller but are available > to future users. There's actually a code comment wondering about this topic for the pages that are already created, in src/backend/access/heap/hio.c : "Remember the new page as our target for future insertions. XXX should we enter the new page into the free space map immediately, or just keep it for this backend's exclusive use in the short run (until VACUUM sees it)? Seems to depend on whether you expect the current backend to make more insertions or not, which is probably a good bet most of the time. So for now, don't add it to FSM yet." We have to be careful about touching too much at that particular point, because it's holding a relation extension lock at the obvious spot to make a change. There's an interesting overlap with these questions about how files are extended too, with this comment in that file too, just before the above: "XXX This does an lseek - rather expensive - but at the moment it is the only way to accurately determine how many blocks are in a relation. Is it worth keeping an accurate file length in shared memory someplace, rather than relying on the kernel to do it for us?" That whole sequence of code took the easy way forward when it was written, but it's obvious the harder one (also touching the FSM) was considered even then. The whole sequence needs to be revisited to pull off multiple page extension. I wouldn't say it's hard, but it's enough work that I haven't been able to find a block of time to go through the whole thing. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, May 30, 2013 at 02:58:26PM +0200, Andres Freund wrote: > > * Andres Freund (andres@2ndquadrant.com) wrote: > > > But really, I am not at all concerned about some obscure values being > > > returned, but about a read() not being successful.. > After a bit of perusing the standard, writing a single byte to the end of the > file after the fallocate ought to make at least the reading guaranteed > to be defined. If we did seek(last_byte); write(); posix_fallocate() we > should even always have defined content. Yuck. This portion of the posix_fallocate() specification requires the hoped-for effect on subsequent read() calls: If the offset+len is beyond the current file size, then posix_fallocate() shall adjust the file size to offset+len. Otherwise, the file size shall not be changed. -- http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_fallocate.html When the file size increases, read()'s defined behavior switches from returning short to retrieving zeros. There's no need for an additional write() to ensure that. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
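That spec language is also easy to check empirically. Here is a small sketch of the kind of reader Greg asked for earlier, which fallocates a file and then verifies the size is right and every byte reads back as zero (file name and size invented for the example):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)

int
main(void)
{
    char    buf[8192];
    ssize_t n;
    off_t   total = 0;
    int     rc, fd = open("readback-check", O_CREAT | O_EXCL | O_RDWR, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    rc = posix_fallocate(fd, 0, SEG_SIZE);
    if (rc != 0)
    {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)
    {
        ssize_t i;

        for (i = 0; i < n; i++)
        {
            if (buf[i] != 0)
            {
                fprintf(stderr, "nonzero byte at offset %lld\n",
                        (long long) (total + i));
                return 1;
            }
        }
        total += n;
    }
    printf("read %lld bytes, all zero\n", (long long) total);
    return total == SEG_SIZE ? 0 : 1;
}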
There hasn't been much activity here recently. I'm curious, then, if there are questions that I can answer. It may be useful to summarize some things here: - the purpose of the patch is to use posix_fallocate when creating new WAL files, because it's (usually) much quicker - using posix_fallocate is *one* system call versus 2048 calls to write(2) - additionally, using posix_fallocate /guarantees/ that the filesystem has space for the WAL file (by spec) - reportedly (difficult to test or prove), using posix_fallocate *may* reduce file fragmentation - the (limited) testing I've done bears this out: the more new WAL file creation there is, the more the improvement. Once the number of WAL files reaches a constant point, there does not appear to be either a positive or a negative performance impact. This is as expected. - a test program (C) was also written and used which creates, allocates, and then writes to files as fast as possible. This test program also shows the expected performance benefits. - the performance benefits range from a few percent up to about 15 percent Concerns: - some were concerned that the spec makes no claims about posix_fallocate being able to guarantee that the space allocated has zeroes in it. This was discussed here and on the Linux Kernel mailing list, wherein the expected behavior is that it does provide zeroes - most systems don't allocate a great many new WAL files, so the performance benefit is small - <your concern here> Benefits: - new WAL file allocation is much quicker, more efficient (fewer system calls) - the patch is (reportedly - I'm not a good judge here!) quite small -- Jon
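The syscall arithmetic in that summary is simply 16MiB / 8KiB = 2048 write(2) calls for a zero-filled segment. Side by side, the two strategies reduce to something like the following sketch (XLOG_BLCKSZ matches the default WAL block size; the file names and function names are invented for illustration):

/* Illustrative comparison of the two allocation strategies. */
#include <fcntl.h>
#include <unistd.h>

#define SEG_SIZE    (16 * 1024 * 1024)
#define XLOG_BLCKSZ 8192

static int
allocate_zero_fill(int fd)      /* current method: 2048 write() calls */
{
    static const char zbuf[XLOG_BLCKSZ];    /* zero-initialized */
    int     i;

    for (i = 0; i < SEG_SIZE / XLOG_BLCKSZ; i++)
    {
        if (write(fd, zbuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
            return -1;
    }
    return 0;
}

static int
allocate_fallocate(int fd)      /* proposed method: one system call */
{
    return posix_fallocate(fd, 0, SEG_SIZE);
}

int
main(void)
{
    int     fd1 = open("seg-write", O_CREAT | O_EXCL | O_WRONLY, 0600);
    int     fd2 = open("seg-falloc", O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd1 < 0 || fd2 < 0)
        return 1;
    return (allocate_zero_fill(fd1) == 0 &&
            allocate_fallocate(fd2) == 0) ? 0 : 1;
}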
On 2013-11.06 17:28, Jon Nelson wrote: > There hasn't been much activity here recently. I'm curious, then, if > there are questions that I can answer. > It may be useful to summarize some things here: > > - the purpose of the patch is to use posix_fallocate when creating new > WAL files, because it's (usually) much quicker > - using posix_fallocate is *one* system call versus 2048 calls to write(2) > - additionally, using posix_fallocate /guarantees/ that the filesystem > has space for the WAL file (by spec) > - reportedly (difficult to test or prove), using posix_fallocate *may* > reduce file fragmentation > - the (limited) testing I've done bears this out: the more new WAL > file creation there is, the more the improvement. Once the number of > WAL files reaches a constant point, there does not appear to be either > a positive or a negative performance impact. This is as expected. > - a test program (C) was also written and used which creates, > allocates, and then writes to files as fast as possible. This test > program also shows the expected performance benefits. > - the performance benefits range from a few percent up to about 15 percent I tried the test program from the patch at least a bit. Retrieved it from: http://www.postgresql.org/message-id/attachment/29088/test_fallocate.c on an oldish linux box (Kernel 2.6.32, x86_64) following $ gcc -o test_fallocate test_fallocate.c a short $ ./test_fallocate foo 1 1 yields: without posix_fallocate: 1 open/close iterations, 1 rewrite in 26.1993s with posix_fallocate: 1 open/close iterations, 1 rewrite in 13.3299s on another box (Kernel 3.2.0, x86_64) same procedure yields: without posix_fallocate: 1 open/close iterations, 1 rewrite in 19.1972s with posix_fallocate: 1 open/close iterations, 1 rewrite in 9.9280s Note, when trying gcc -O2 test_fallocate.c fails to compile with: In file included from /usr/include/fcntl.h:252:0, from test_fallocate.c:3: In function ‘open’, inlined from ‘main’ at test_fallocate.c:68:16: /usr/include/x86_64-linux-gnu/bits/fcntl2.h:51:24: error: call to ‘__open_missing_mode’ declared with attribute error: open with O_CREAT in second argument needs 3 arguments > Concerns: > - some were concerned that the spec makes no claims about > posix_fallocate being able to guarantee that the space allocated has > zeroes in it. This was discussed here and on the Linux Kernel mailing > list, wherein the expected behavior is that it does provide zeroes > - most systems don't allocate a great many new WAL files, so the > performance benefit is small > - <your concern here> > > Benefits: > - new WAL file allocation is much quicker, more efficient (fewer system calls) > - the patch is (reportedly - I'm not a good judge here!) quite small HTH, Stefan.
On Tue, Jun 11, 2013 at 11:08 AM, Stefan Drees <stefan@drees.name> wrote: > On 2013-11.06 17:28, Jon Nelson wrote: >> >> There hasn't been much activity here recently. I'm curious, then, if >> there are questions that I can answer. >> It may be useful to summarize some things here: >> >> - the purpose of the patch is to use posix_fallocate when creating new >> WAL files, because it's (usually) much quicker >> - using posix_fallocate is *one* system call versus 2048 calls to write(2) >> - additionally, using posix_fallocate /guarantees/ that the filesystem >> has space for the WAL file (by spec) >> - reportedly (difficult to test or prove), using posix_fallocate *may* >> reduce file fragmentation >> - the (limited) testing I've done bears this out: the more new WAL >> file creation there is, the more the improvement. Once the number of >> WAL files reaches a constant point, there does not appear to be either >> a positive or a negative performance impact. This is as expected. >> - a test program (C) was also written and used which creates, >> allocates, and then writes to files as fast as possible. This test >> program also shows the expected performance benefits. >> - the performance benefits range from a few percent up to about 15 percent > > > tried the test program of the patch at least a bit. > > Retrieved it from: > http://www.postgresql.org/message-id/attachment/29088/test_fallocate.c > > on an oldish linux box (Kernel 2.6.32, x86_64) following > $ gcc -o test_fallocate test_fallocate.c > a short > $ ./test_fallocate foo 1 1 > yields: > without posix_fallocate: 1 open/close iterations, 1 rewrite in 26.1993s > with posix_fallocate: 1 open/close iterations, 1 rewrite in 13.3299s > > on another box (Kernel 3.2.0, x86_64) same procedure yields: > without posix_fallocate: 1 open/close iterations, 1 rewrite in 19.1972s > with posix_fallocate: 1 open/close iterations, 1 rewrite in 9.9280s > > Note, when trying gcc -O2 test_fallocate.c fails to compile with: > > In file included from /usr/include/fcntl.h:252:0, > from test_fallocate.c:3: > In function ‘open’, > inlined from ‘main’ at test_fallocate.c:68:16: > /usr/include/x86_64-linux-gnu/bits/fcntl2.h:51:24: error: call to > ‘__open_missing_mode’ declared with attribute error: open with O_CREAT in > second argument needs 3 arguments It's understood that posix_fallocate is faster at this -- the question on the table is 'does this matter in context of postgres?'. Personally I think this patch should go in regardless -- the concerns made IMNSHO are specious. merlin
* Merlin Moncure (mmoncure@gmail.com) wrote: > It's understood that posix_fallocate is faster at this -- the question > on the table is 'does this matter in context of postgres?'. > Personally I think this patch should go in regardless -- the concerns > made IMNSHO are specious. I've not had a chance to look at this patch, but I tend to agree with Merlin. My main question is really- would this be useful for extending *relations*? Apologies if it's already been discussed; I do plan to go back and read the threads about this more fully, but I wanted to voice my support for using posix_fallocate, when available, in general. Thanks, Stephen
On 6/11/13 12:22 PM, Merlin Moncure wrote: > Personally I think this patch should go in regardless -- the concerns > made IMNSHO are specious. That's nice, but we have this process for validating whether features go in or not that relies on review instead of opinions. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 2013-06-11 19:45 CEST, Greg Smith wrote: > On 6/11/13 12:22 PM, Merlin Moncure wrote: > >> Personally I think this patch should go in regardless -- the concerns >> made IMNSHO are specious. > > That's nice, but we have this process for validating whether features go > in or not that relies on review instead of opinions. > ;-) that's why I played with the test_fallocate.c, as it was easy to do, and I understood the author (of the patch) wanted to trigger some reviews ... I do not (yet) know anything about the core code, so I leave this to the hackers. My review result was that with newer gccs you should specify an open mode flag as the third argument of the open call; beyond that, testing only with the test tool, I found nothing important. Stefan.
On Tue, Jun 11, 2013 at 12:49 PM, Stefan Drees <stefan@drees.name> wrote: > On 2013-06-11 19:45 CEST, Greg Smith wrote: >> >> On 6/11/13 12:22 PM, Merlin Moncure wrote: >> >>> Personally I think this patch should go in regardless -- the concerns >>> made IMNSHO are specious. >> >> >> That's nice, but we have this process for validating whether features go >> in or not that relies on review instead of opinions. >> > ;-) that's why I played with the test_fallocate.c, as it was easy to do and > I understood, the author (of the patch) wanted to trigger some reviews ... I > do not (yet) know anything about the core codes, so I leave this to the > hackers. My review result was, that with newer gcc's you should specify an > open mode flag as third argument of the fopen call, as only with the test > tool nothing important found. If you change line 68 to read: fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600); then it should work just fine (assuming you have not already done so). -- Jon
On Tue, Jun 11, 2013 at 12:45 PM, Greg Smith <greg@2ndquadrant.com> wrote: > On 6/11/13 12:22 PM, Merlin Moncure wrote: > >> Personally I think this patch should go in regardless -- the concerns >> made IMNSHO are specious. > > That's nice, but we have this process for validating whether features go in > or not that relies on review instead of opinions. Sure: I phrased that badly. By 'go in' I meant 'go through the review process', not commit out of hand. merlin
On Tue, Jun 11, 2013 at 12:58 PM, Stephen Frost <sfrost@snowman.net> wrote: > * Merlin Moncure (mmoncure@gmail.com) wrote: >> It's understood that posix_fallocate is faster at this -- the question >> on the table is 'does this matter in context of postgres?'. >> Personally I think this patch should go in regardless -- the concerns >> made IMNSHO are specious. > > I've not had a chance to look at this patch, but I tend to agree with > Merlin. I also think this is a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote: > Ack. I've revised the patch to always have the GUC (for now), default > to false, and if configure can't find posix_fallocate (or the user > disables it by way of pg_config_manual.h) then it remains a GUC that > simply can't be changed. Why have a GUC here at all? Perhaps this was already discussed, and I missed it? Is it just for testing purposes, or did you intend for it to be in the final version? If it's supported, it seems like we always want it. I doubt there are cases where it hurts performance; but if there are, it's pretty hard for a DBA to know what those cases are, anyway. Also: * The other code assumes that no errno means ENOSPC. We should be consistent about that assumption, and do the same thing in the fallocate case. * You check for the presence of posix_fallocate at configure time, but don't #ifdef the call site. It looks like you removed this from the v2 patch; was there a reason for that? Won't that cause build errors for platforms without it? I started looking at this patch and it looks like we are getting a consensus that it's the right approach. Microbenchmarks appear to show a benefit, and (thanks to Noah's comment) it seems like the change is safe. Are there any remaining questions or objections? Regards, Jeff Davis
On 6/14/13 1:06 PM, Jeff Davis wrote: > Why have a GUC here at all? Perhaps this was already discussed, and I > missed it? Is it just for testing purposes, or did you intend for it to > be in the final version? You have guessed correctly! I suggested it stay in there only to make review benchmarking easier. > I started looking at this patch and it looks like we are getting a > consensus that it's the right approach. Microbenchmarks appear to show a > benefit, and (thanks to Noah's comment) it seems like the change is > safe. Are there any remaining questions or objections? I'm planning to duplicate Jon's test program on a few machines here, and then see if that turns into a useful latency improvement for clients. I'm trying to get this pgbench rate limit stuff working first though, because one of the tests I had in mind for WAL creation overhead would benefit from it. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Tue, 2013-06-11 at 12:58 -0400, Stephen Frost wrote: > My main question is really- would this be useful for extending > *relations*? Apologies if it's already been discussed; I do plan to go > back and read the threads about this more fully, but I wanted to voice > my support for using posix_fallocate, when available, in general. +1, though separate from this patch. Andres also pointed out that we can try to track a point in the file that is below any place where a zero page might still exist. That will allow us to call zero pages invalid unless they are related to a recent extension, which is a weakness in the current checksums code. Regards, Jeff Davis
On Fri, 2013-06-14 at 13:21 -0400, Greg Smith wrote: > I'm planning to duplicate Jon's test program on a few machines here, and > then see if that turns into a useful latency improvement for clients. > I'm trying to get this pgbench rate limit stuff working first though, > because one of the tests I had in mind for WAL creation overhead would > benefit from it. Great, I'll wait on those results. Regards, Jeff Davis
On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote: >> Ack. I've revised the patch to always have the GUC (for now), default >> to false, and if configure can't find posix_fallocate (or the user >> disables it by way of pg_config_manual.h) then it remains a GUC that >> simply can't be changed. > > Why have a GUC here at all? Perhaps this was already discussed, and I > missed it? Is it just for testing purposes, or did you intend for it to > be in the final version? As Greg Smith noted, right now it's only for testing purposes. > * The other code assumes that no errno means ENOSPC. We should be > consistent about that assumption, and do the same thing in the fallocate > case. Unlike write(2), posix_fallocate does *not* set errno, but instead returns a subset of the errno values as its return code. The current approach was suggested to me by Andres Freund and Alvaro Herrera. > * You check for the presence of posix_fallocate at configure time, but > don't #ifdef the call site. It looks like you removed this from the v2 > patch; was there a reason for that? Won't that cause build errors for > platforms without it? Indeed! I believe I have corrected the error and will be sending out a revised patch (v5), once the performance and back-testing have completed. -- Jon
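Since that return-value convention trips people up, here is a hedged sketch of a wrapper that adapts posix_fallocate to the errno style the surrounding write()-based code expects (the wrapper name is invented for the example):

#include <errno.h>
#include <fcntl.h>

/* posix_fallocate returns the error number directly and leaves errno
   alone; translate that into the usual -1/errno convention. */
static int
fallocate_setting_errno(int fd, off_t offset, off_t len)
{
    int     rc = posix_fallocate(fd, offset, len);

    if (rc != 0)
    {
        errno = rc;
        return -1;
    }
    return 0;
}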
On Sun, Jun 16, 2013 at 9:59 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote: .. >> * You check for the presence of posix_fallocate at configure time, but >> don't #ifdef the call site. It looks like you removed this from the v2 >> patch, was there a reason for that? Won't that cause build errors for >> platforms without it? > > Indeed! I believe I have corrected the error and will be sending out a > revised patch (v5), once the performance and back-testing have > completed. Please find attached v5 of the patch, with the above correction in place. The only change from the v4 patch is wrapping the if (wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE. Thanks! -- Jon
Attachment
On Sun, 2013-06-16 at 23:53 -0500, Jon Nelson wrote: > Please find attached v5 of the patch, with the above correction in place. > The only change from the v4 patch is wrapping the if > (wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE. > Thanks! Thank you. Greg, are you still working on those tests? Regards, Jeff Davis
On 06/20/2013 07:19 PM, Jeff Davis wrote: > On Sun, 2013-06-16 at 23:53 -0500, Jon Nelson wrote: >> Please find attached v5 of the patch, with the above correction in place. >> The only change from the v4 patch is wrapping the if >> (wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE. >> Thanks! > > Thank you. > > Greg, are you still working on those tests? Since Greg seems to be busy, what needs to be done to test this? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, 2013-05-28 at 22:10 -0400, Greg Smith wrote: > I was just thinking of something to run in your test program, not > another build time check. Just run the new allocation sequence, and > then check the resulting WAL file for a) correct length, and b) 16K of > zero bytes. I would like to build some confidence that posix_fallocate > is operating correctly in this context on at least one platform. My > experience with Linux handling this class of functions correctly has > left me skeptical of them working until that's proven to be the case. As I understand it, you are basically asking if posix_fallocate() works at all anywhere. Simple test program attached, which creates two files and fills them: one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then, I just cmp the resulting files (and also "ls" them, to make sure they are 16MB). Passes on my workstation: $ uname -a Linux jdavis 3.5.0-34-generic #55-Ubuntu SMP Thu Jun 6 20:18:19 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux Regards, Jeff Davis
Attachment
On Fri, 2013-06-28 at 11:38 -0700, Josh Berkus wrote: > Since Greg seems to be busy, what needs to be done to test this? As I understand it, he was mainly asking if posix_fallocate works at all. I tried to address that question with a simple test, which behaves as I expected it to: http://www.postgresql.org/message-id/1372615313.19747.13.camel@jdavis Unless something surprising comes up, or someone thinks an objection has been missed, I am going to commit this soon. (Of course, to avoid your wrath, I'll get rid of the GUC which was added for testing.) Regards, Jeff Davis
On Sun, 2013-06-30 at 11:11 -0700, Jeff Davis wrote: > Unless something surprising comes up, or someone thinks an objection > has been missed, I am going to commit this soon. Quick question to anyone who happens to know: What is the standard procedure for changes to pg_config.h.win32? I looked at an old patch of mine that Tom (CC'd) committed: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b966dd6c4228d696b291c1cdcb5ab8c8475fefa8 and I see that it modifies pg_config.h.win32. Should I modify it in a similar way for this fallocate patch? Right now, pg_config.h.win32 seems a little inconsistent because it has an entry for HAVE_POSIX_SIGNALS but not HAVE_POSIX_FADVISE. It says it's a generated file, but I don't have a windows environment or MingW. Regards, Jeff Davis
On 06/30/2013 03:50 PM, Jeff Davis wrote: > On Sun, 2013-06-30 at 11:11 -0700, Jeff Davis wrote: >> Unless something surprising comes up, or someone thinks an objection >> has been missed, I am going to commit this soon. > Quick question to anyone who happens to know: > > What is the standard procedure for changes to pg_config.h.win32? I > looked at an old patch of mine that Tom (CC'd) committed: > > http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b966dd6c4228d696b291c1cdcb5ab8c8475fefa8 > > and I see that it modifies pg_config.h.win32. Should I modify it in a > similar way for this fallocate patch? > > Right now, pg_config.h.win32 seems a little inconsistent because it has > an entry for HAVE_POSIX_SIGNALS but not HAVE_POSIX_FADVISE. It says it's > a generated file, but I don't have a windows environment or MingW. > It was originally generated. Since then it's been maintained by hand. If you need a Windows environment to test on, see the Amazon recipe at <http://wiki.postgresql.org/wiki/Building_With_MinGW> cheers andrew
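For what it's worth, the hand-maintained hunk would presumably mirror the neighboring entries; a guess at what it might look like (Windows has no posix_fallocate, so the symbol would stay commented out, just like the HAVE_SYNC_FILE_RANGE entry Jeff mentions):

/* Define to 1 if you have the `posix_fallocate' function. */
/* #undef HAVE_POSIX_FALLOCATE */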
On 6/30/13 2:01 PM, Jeff Davis wrote: > Simple test program attached, which creates two files and fills them: > one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then, > I just cmp the resulting files (and also "ls" them, to make sure they > are 16MB). This makes platform level testing a lot easier, thanks. Attached is an updated copy of that program with some error checking. If the files it creates already existed, the code didn't notice, and a series of write errors happened. If you set the test up right it's not a problem, but it's better if a bad setup is caught. I wrapped the whole test with a shell script, also attached, which insures the right test sequence and checks. Your C test program compiles and passes on RHEL5/6 here, doesn't on OS X Darwin. No surprises there, there's a long list of platforms that don't support this call at https://www.gnu.org/software/gnulib/manual/html_node/posix_005ffallocate.html and the Mac is on it. Many other platforms I was worried about don't support it too--older FreeBSD, HP-UX 11, Solaris 10, mingw, MSVC--so that cuts down on testing quite a bit. If it runs faster on Linux, that's the main target here, just like the existing effective_io_concurrency fadvise code. The specific thing I was worried about is that this interface might have a stub that doesn't work perfectly in older Linux kernels. After being surprised to find this interface worked on RHEL5 with your test program, I dug into this more. It works there, but it may actually be slower. posix_fallocate is actually implemented by glibc on Linux. Been there since 2.1.94 according to the Linux man pages. But Linux itself didn't add the feature until kernel 2.6.20: http://lwn.net/Articles/226436/ The biggest thing I was worried about--the call might be there in early kernels but with a non-functional implementation--that's not the case. Looking at the diff, before that patch there's no fallocate at all. So what happened in earlier kernels, where there was no kernel level fallocate available? According to https://www.redhat.com/archives/fedora-devel-list/2009-April/msg00110.html what glibc does is check for kernel fallocate(), and if it's not there it writes a bunch of zeros to create the file instead. What is actually happening on a RHEL5 system (with kernel 2.6.18) is that calling posix_fallocate does this fallback behavior, where it basically does the same thing the existing WAL clearing code does. I can even prove that's the case. On RHEL5, if you run "strace -o out ./fallocate" the main write loop looks like this: write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 But when you call posix_fallocate, you still get a bunch of writes, but 4 bytes at a time: pwrite(4, "\0", 1, 16769023) = 1 pwrite(4, "\0", 1, 16773119) = 1 pwrite(4, "\0", 1, 16777215) = 1 That's glibc helpfully converting your call to posix_fallocate into small writes, because the OS doesn't provide a better way in that kernel. It's not hard to imagine this being slower than what the WAL code is doing right now. I'm not worried about correctness issues anymore, but my gut paranoia about this not working as expected on older systems was justified. Everyone who thought I was just whining owes me a cookie. This is what I plan to benchmark specifically next. If the posix_fallocate approach is actually slower than what's done now when it's not getting kernel acceleration, which is the case on RHEL5 era kernels, we might need to make the configure time test more complicated. 
Whether posix_fallocate is defined isn't sensitive enough; on Linux it may be the case that this only is usable when fallocate() is also there. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Attachment
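The guard Greg sketches verbally could be as small as the following, where HAVE_FALLOCATE is a hypothetical new configure-time symbol for the kernel-level call (none of this is in the posted patch):

/* Only trust posix_fallocate on Linux when the kernel-level
   fallocate() was also detected at configure time, so the slow glibc
   emulation path never gets used for WAL file creation. */
#if defined(HAVE_POSIX_FALLOCATE) && \
    (!defined(__linux__) || defined(HAVE_FALLOCATE))
#define USE_POSIX_FALLOCATE 1
#endif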
On Sun, Jun 30, 2013 at 5:55 PM, Greg Smith <greg@2ndquadrant.com> wrote: > > > pwrite(4, "\0", 1, 16769023) = 1 > pwrite(4, "\0", 1, 16773119) = 1 > pwrite(4, "\0", 1, 16777215) = 1 > > That's glibc helpfully converting your call to posix_fallocate into small > writes, because the OS doesn't provide a better way in that kernel. It's > not hard to imagine this being slower than what the WAL code is doing right > now. I'm not worried about correctness issues anymore, but my gut paranoia > about this not working as expected on older systems was justified. Everyone > who thought I was just whining owes me a cookie. I had noted in the very early part of the thread that glibc emulates posix_fallocate when the (Linux-specific) 'fallocate' system call fails. In this case, it's writing 4 bytes of zeros and then essentially seeking forward 4092 (4096-4) bytes. This prevents files with holes in them because the holes have to be at least 4KiB in size, if I recall properly. It's *not* writing out 16MiB in 4 byte increments. -- Jon
On 5/28/13 10:00 PM, Jon Nelson wrote: > A note: The attached test program uses *fsync* instead of *fdatasync* > after calling fallocate (or writing out 16MB of zeroes), per an > earlier suggestion. I tried this out on the RHEL5 platform I'm worried about now. There's something weird about the test program there. If I run it once it shows posix_fallocate running much faster: without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s The problem is that I'm seeing the gap between the two get smaller the more iterations I run, which makes me wonder if the test is completely fair: without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s You didn't show any output from the latest program on your system, so I'm not sure how it behaved for you here. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Sun, Jun 30, 2013 at 6:49 PM, Greg Smith <greg@2ndquadrant.com> wrote: > On 5/28/13 10:00 PM, Jon Nelson wrote: > >> A note: The attached test program uses *fsync* instead of *fdatasync* >> after calling fallocate (or writing out 16MB of zeroes), per an >> earlier suggestion. > > > I tried this out on the RHEL5 platform I'm worried about now. There's > something weird about the test program there. If I run it once it shows > posix_fallocate running much faster: > > without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s > with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s Assuming the platform chosen is using the glibc approach of pwrite(4 bytes) every 4KiB, then the results ought to be similar, and I'm at a loss to explain why it's performing better (unless - grasping at straws - simply the *volume* of data transferred from userspace to the kernel is at play, in which case posix_fallocate will result in 4096 calls to pwrite but at 4 bytes each versus 2048 calls to write at 8KiB each.) Ultimately the same amount of data gets written to disk (one would imagine), but otherwise I can't really think of much. I have also found several errors in the test_fallocate.c program I posted, corrected below. One of them is: it is missing two pairs of parentheses around two #defines: #define SIXTEENMB 1024*1024*16 #define EIGHTKB 1024*8 should be: #define SIXTEENMB (1024*1024*16) #define EIGHTKB (1024*8) Otherwise the program will end up writing (131072) 8KiB blocks instead of 2048. This actually makes the comparison between writing 8KiB blocks and using posix_fallocate favor the latter more strongly in the results (also seen below). > The problem is that I'm seeing the gap between the two get smaller the more > iterations I run, which makes me wonder if the test is completely fair: > > without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s > with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s > > > without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s > with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s > > without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s > with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s > > You didn't show any output from the latest program on your system, so I'm > not sure how it behaved for you here. On the platform I use - openSUSE (12.3, x86_64, kernel 3.9.7 currently) I never see posix_fallocate perform worse. Typically better, sometimes much better. To set the number of times the file is overwritten to just 1 (one): for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done I am including a revised version of test_fallocate.c that corrects the above noted error, one typo (from when I changed fdatasync to fsync) that did not alter program behavior, corrects a mis-placed gettimeofday (which does change the results) and includes a new test that aims (perhaps poorly) to emulate the glibc style of pwrite(4 bytes) for every 4KiB, and tests the resulting file size to make sure it is 16MiB in size. The performance of the latter (new) test sometimes seems to perform worse and sometimes seems to perform better (usually worse) than either of the other two. In all cases, posix_fallocate performs better, but I don't have a sufficiently old kernel to test with. The new results on one machine are below. With 0 (zero) rewrites (testing *just* open/some_type_of_allocation/fsync/close): method: classic.
100 open/close iterations, 0 rewrite in 29.6060s method: posix_fallocate. 100 open/close iterations, 0 rewrite in 2.1054s method: glibc emulation. 100 open/close iterations, 0 rewrite in 31.7445s And with the same number of rewrites as open/close cycles: method: classic. 1 open/close iterations, 1 rewrite in 0.6297s method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3028s method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.5521s method: classic. 2 open/close iterations, 2 rewrite in 1.6455s method: posix_fallocate. 2 open/close iterations, 2 rewrite in 1.0409s method: glibc emulation. 2 open/close iterations, 2 rewrite in 1.5604s method: classic. 5 open/close iterations, 5 rewrite in 7.5916s method: posix_fallocate. 5 open/close iterations, 5 rewrite in 6.9177s method: glibc emulation. 5 open/close iterations, 5 rewrite in 8.1137s method: classic. 10 open/close iterations, 10 rewrite in 29.2816s method: posix_fallocate. 10 open/close iterations, 10 rewrite in 28.4400s method: glibc emulation. 10 open/close iterations, 10 rewrite in 31.2693s -- Jon
Attachment
On 6/30/13 9:28 PM, Jon Nelson wrote: > The performance of the latter (new) test sometimes seems to perform > worse and sometimes seems to perform better (usually worse) than > either of the other two. In all cases, posix_fallocate performs > better, but I don't have a sufficiently old kernel to test with. This updated test program looks reliable now. The numbers are very tight when I'd expect them to be, and there's nowhere with the huge differences I saw in the earlier test program. Here's results from a few sets of popular older platforms: RHEL5, ext3 method: classic. 10 open/close iterations, 10 rewrite in 22.6949s method: posix_fallocate. 10 open/close iterations, 10 rewrite in 23.0113s method: glibc emulation. 10 open/close iterations, 10 rewrite in 22.4921s method: classic. 10 open/close iterations, 10 rewrite in 23.2808s method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.4736s method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.9871s method: classic. 10 open/close iterations, 10 rewrite in 22.4812s method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.2393s method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.6940s RHEL6, ext4 method: classic. 10 open/close iterations, 10 rewrite in 56.0483s method: posix_fallocate. 10 open/close iterations, 10 rewrite in 61.5092s method: glibc emulation. 10 open/close iterations, 10 rewrite in 53.8569s method: classic. 10 open/close iterations, 10 rewrite in 57.0361s method: posix_fallocate. 10 open/close iterations, 10 rewrite in 55.9840s method: glibc emulation. 10 open/close iterations, 10 rewrite in 64.9437s RHEL6, ext3 method: classic. 10 open/close iterations, 10 rewrite in 14.4080s method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.1395s method: glibc emulation. 10 open/close iterations, 10 rewrite in 16.9657s method: classic. 10 open/close iterations, 10 rewrite in 15.2825s method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.5315s method: glibc emulation. 10 open/close iterations, 10 rewrite in 14.8115s The win for posix_fallocate is there in most cases, but it's pretty hard to see in these older systems. That could be OK. As long as the difference is no more than noise, and that is the case, this could be good enough to commit. If there are significantly better results on the new platforms, the old ones need to just not get worse. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Sun, Jun 30, 2013 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote: > On 6/30/13 9:28 PM, Jon Nelson wrote: >> >> The performance of the latter (new) test sometimes seems to perform >> worse and sometimes seems to perform better (usually worse) than >> either of the other two. In all cases, posix_fallocate performs >> better, but I don't have a sufficiently old kernel to test with. > > > This updated test program looks reliable now. The numbers are very tight > when I'd expect them to be, and there's nowhere with the huge differences I > saw in the earlier test program. > > Here's results from a few sets of popular older platforms: If you found yourself with a spare moment, could you run these again with the number of open/close cycles set high (say, 100) and the number of rewrites set to 0 and also to 1? Most of the time spent is actually spent overwriting the files so by reducing or eliminating that aspect it might be easier to get a handle on the actual performance difference. -- Jon
On Sun, 2013-06-30 at 16:09 -0400, Andrew Dunstan wrote: > It was originally generated. Since then it's been maintained by hand. What is the procedure for maintaining it by hand? Why are HAVE_POSIX_SIGNALS and HAVE_SYNC_FILE_RANGE in there (though commented out), but not HAVE_POSIX_FADVISE? Regards, Jeff Davis
On Sun, 2013-06-30 at 18:55 -0400, Greg Smith wrote: > This makes platform level testing a lot easier, thanks. Attached is an > updated copy of that program with some error checking. If the files it > creates already existed, the code didn't notice, and a series of write > errors happened. If you set the test up right it's not a problem, but > it's better if a bad setup is caught. I wrapped the whole test with a > shell script, also attached, which insures the right test sequence and > checks. Thank you. > That's glibc helpfully converting your call to posix_fallocate into > small writes, because the OS doesn't provide a better way in that > kernel. It's not hard to imagine this being slower than what the WAL > code is doing right now. I'm not worried about correctness issues > anymore, but my gut paranoia about this not working as expected on older > systems was justified. Everyone who thought I was just whining owes me > a cookie. So your theory is that it may be slower because there are twice as many syscalls (one per 4K page rather than one per 8K page)? Interesting observation. > This is what I plan to benchmark specifically next. In the interest of keeping this patch moving forward, do you have an estimate for when this testing will be complete? > If the > posix_fallocate approach is actually slower than what's done now when > it's not getting kernel acceleration, which is the case on RHEL5 era > kernels, we might need to make the configure time test more complicated. > Whether posix_fallocate is defined isn't sensitive enough; on Linux it > may be the case that this only is usable when fallocate() is also there. I'd say that if posix_fallocate is slower than the existing code on pretty much any platform, we shouldn't commit the patch at all. I would be quite surprised if that was the case, however. Regards, Jeff Davis
On Tue, Jul 2, 2013 at 1:55 AM, Jeff Davis <pgsql@j-davis.com> wrote: > On Sun, 2013-06-30 at 18:55 -0400, Greg Smith wrote: >> This makes platform level testing a lot easier, thanks. Attached is an >> updated copy of that program with some error checking. If the files it >> creates already existed, the code didn't notice, and a series of write >> errors happened. If you set the test up right it's not a problem, but >> it's better if a bad setup is caught. I wrapped the whole test with a >> shell script, also attached, which insures the right test sequence and >> checks. > > Thank you. > >> That's glibc helpfully converting your call to posix_fallocate into >> small writes, because the OS doesn't provide a better way in that >> kernel. It's not hard to imagine this being slower than what the WAL >> code is doing right now. I'm not worried about correctness issues >> anymore, but my gut paranoia about this not working as expected on older >> systems was justified. Everyone who thought I was just whining owes me >> a cookie. > > So your theory is that it may be slower because there are twice as many > syscalls (one per 4K page rather than one per 8K page)? Interesting > observation. > >> This is what I plan to benchmark specifically next. > > In the interest of keeping this patch moving forward, do you have an > estimate for when this testing will be complete? > >> If the >> posix_fallocate approach is actually slower than what's done now when >> it's not getting kernel acceleration, which is the case on RHEL5 era >> kernels, we might need to make the configure time test more complicated. >> Whether posix_fallocate is defined isn't sensitive enough; on Linux it >> may be the case that this only is usable when fallocate() is also there. > > I'd say that if posix_fallocate is slower than the existing code on > pretty much any platform, we shouldn't commit the patch at all. Even in that case, if a user can easily know which platform posix_fallocate should be used in, we can commit the patch with the configurable GUC parameter. Regards, -- Fujii Masao
On Tue, 2013-07-02 at 02:13 +0900, Fujii Masao wrote: > Even in that case, if a user can easily know which platform posix_fallocate > should be used in, we can commit the patch with the configurable GUC > parameter. I disagree here. We're not talking about a huge win; this speedup may not even be detectable on a running system. I think Robert summarized the reason for the patch best: "I mean, if posix_fallocate() is faster, then it's just faster, right?". But if we need a new GUC, and DBAs now have one more thing they need to test about their platform, then that argument goes out the window. Regards, Jeff Davis
On Mon, Jul 1, 2013 at 2:17 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2013-07-02 at 02:13 +0900, Fujii Masao wrote: >> Even in that case, if a user can easily know which platform posix_fallocate >> should be used in, we can commit the patch with the configurable GUC >> parameter. > > I disagree here. We're not talking about a huge win; this speedup may > not even be detectable on a running system. > > I think Robert summarized the reason for the patch best: "I mean, if > posix_fallocate() is faster, then it's just faster, right?". But if we > need a new GUC, and DBAs now have one more thing they need to test about > their platform, then that argument goes out the window. Yeah. If the patch isn't going to be a win on RHEL 5, I'd consider that a good reason to scrap it for now and revisit it in 3 years. There are still a LOT of people running RHEL 5, and the win isn't big enough to engineer a more complex solution. But ultimately RHEL 5, like any other old system, will go away. However, Greg's latest testing results seem to show that there isn't much to worry about, so we may be fine anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/1/13 12:34 PM, Jeff Davis wrote:
> On Sun, 2013-06-30 at 16:09 -0400, Andrew Dunstan wrote:
>> It was originally generated. Since then it's been maintained by hand.
>
> What is the procedure for maintaining it by hand?

Edit away.

> Why are HAVE_POSIX_SIGNALS and HAVE_SYNC_FILE_RANGE in there (though
> commented out), but not HAVE_POSIX_FADVISE?

Because maintaining it by hand is prone to inconsistencies. I wouldn't
worry about it.
On 7/1/13 3:44 PM, Robert Haas wrote:
> Yeah. If the patch isn't going to be a win on RHEL 5, I'd consider
> that a good reason to scrap it for now and revisit it in three years.
> There are still a LOT of people running RHEL 5, and the win isn't big
> enough to justify engineering a more complex solution.

I'm still testing, and expect to have this wrapped up on my side by
tomorrow. So much of the runtime here is file setup/close that a 2:1
difference in the number of writes (which is what happens on the old
platforms) is hard to get excited about.

I don't think the complexity to lock out RHEL5 here is that bad, even
if it turns out to be a good idea. Just add another configure check
for fallocate, and on Linux, if it's not there, don't use
posix_fallocate either. Maybe 5 lines of macro code? RHEL5 sure isn't
going anywhere anytime soon, but at the same time there won't be that
many 9.4 deployments on that version.

I've been digging into the main situation where this feature helps,
and it won't be easy to duplicate in a benchmark situation. Using
Linux's fallocate works as a hint that the whole 16MB should be
allocated at once, and therefore kept together on disk if feasible.
The resulting WAL files should be less prone to fragmentation. That's
actually the biggest win of this approach, but I can't easily
duplicate the sort of real-world fragmentation I see on live servers
here.

Given that, I'm leaning toward saying that unless there's a clear
regression on older platforms, above the noise floor, this is still
the right thing to do. I fully agree that this needs to be fully
automatic--no GUC--before it's worth committing. If we can't figure
out the right thing to do now, there's little hope anyone else will
in a later tuning expedition.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
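The "maybe 5 lines of macro code" Greg mentions could look roughly like
the following (a hypothetical sketch, not the actual patch; HAVE_FALLOCATE
and USE_POSIX_FALLOCATE are assumed symbol names, with the former standing
in for an additional configure check):

/*
 * Hypothetical platform guard. On Linux, only use posix_fallocate()
 * when the kernel-level fallocate() was also found at configure time,
 * so glibc's slow write-based emulation is never the path taken; on
 * other platforms, HAVE_POSIX_FALLOCATE alone is trusted.
 */
#if defined(HAVE_POSIX_FALLOCATE) && \
	(!defined(__linux__) || defined(HAVE_FALLOCATE))
#define USE_POSIX_FALLOCATE
#endif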
On Mon, 2013-07-01 at 00:52 -0400, Greg Smith wrote:
> On 6/30/13 9:28 PM, Jon Nelson wrote:
>> The performance of the latter (new) test sometimes seems to perform
>> worse and sometimes seems to perform better (usually worse) than
>> either of the other two. In all cases, posix_fallocate performs
>> better, but I don't have a sufficiently old kernel to test with.
>
> This updated test program looks reliable now. The numbers are very
> tight when I'd expect them to be, and there's nowhere with the huge
> differences I saw in the earlier test program.
>
> Here are results from a few sets of popular older platforms: ...
>
> The win for posix_fallocate is there in most cases, but it's pretty
> hard to see on these older systems. That could be OK. As long as the
> difference is no more than noise, and that is the case, this could be
> good enough to commit. If there are significantly better results on
> the new platforms, the old ones just need to not get worse.

I ran my own little test on my workstation[1] with the attached
programs. One does what we do now, another mimics the glibc emulation
you described earlier, and another uses posix_fallocate(). Each does
an allocation phase, an fsync, a single rewrite, and then another
fsync. The program runs this 64 times for 64 different 16MB files.

write1 and write2 are almost exactly even at 25.5s. write3 is about
14.5s, which is a pretty big improvement.

So, my simple conclusion is that the glibc emulation should be about
the same as what we're doing now, so there's no reason to avoid it.
That means, if posix_fallocate() is present, we should use it, because
it's either the same (if emulated in glibc) or significantly faster
(if implemented in the kernel).

If that is your conclusion as well, it looks like this patch is about
ready for commit. What do you think?

Regards,
	Jeff Davis

[1] workstation specs (ubuntu 12.10, ext4):
$ uname -a
Linux jdavis 3.5.0-34-generic #55-Ubuntu SMP Thu Jun 6 20:18:19 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
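The attached programs themselves are not reproduced in the archive. Based
on the description above, the posix_fallocate() variant (write3) might
look roughly like the sketch below; write1 would replace the allocation
phase with a loop of 8K zero writes, and write2 with a single-byte write
every 4K. File names and constants are assumptions taken from the prose:

/*
 * Sketch of a write3-style test: 64 files of 16MB each, allocated
 * with posix_fallocate(), fsync'd, rewritten once in 8K blocks, and
 * fsync'd again. Not the actual attached program.
 */
#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE	(16 * 1024 * 1024)	/* one WAL-segment-sized file */
#define NUM_FILES	64
#define BLOCK_SIZE	8192

int
main(void)
{
	char		buf[BLOCK_SIZE];
	char		name[64];
	int			i, j;

	memset(buf, 0, sizeof(buf));

	for (i = 0; i < NUM_FILES; i++)
	{
		int			fd, rc;

		snprintf(name, sizeof(name), "testfile.%d", i);
		fd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0600);
		if (fd < 0)
		{
			perror("open");
			return 1;
		}

		/* allocation phase, then fsync */
		rc = posix_fallocate(fd, 0, FILE_SIZE);
		if (rc != 0)
		{
			fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
			return 1;
		}
		if (fsync(fd) != 0)
		{
			perror("fsync");
			return 1;
		}

		/* a single rewrite pass, then another fsync */
		for (j = 0; j < FILE_SIZE / BLOCK_SIZE; j++)
		{
			if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE)
			{
				perror("write");
				return 1;
			}
		}
		if (fsync(fd) != 0)
		{
			perror("fsync");
			return 1;
		}
		close(fd);
	}
	return 0;
}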
On 7/5/13 2:50 AM, Jeff Davis wrote:
> So, my simple conclusion is that the glibc emulation should be about
> the same as what we're doing now, so there's no reason to avoid it.
> That means, if posix_fallocate() is present, we should use it, because
> it's either the same (if emulated in glibc) or significantly faster
> (if implemented in the kernel).

That's what I'm seeing everywhere too. I'm happy that we've spent
enough time chasing after potential issues without finding anything
now. Pull out the GUC that was added for the default and this is ready
to commit.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
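In code terms, the conclusion being converged on here might take roughly
the following shape (an illustrative sketch, not the actual patch; the
function name and constants are made up):

/*
 * Sketch of the allocate-or-zero-fill choice under discussion. With
 * posix_fallocate() available, it replaces the zero-fill loop; the
 * existing write-zeroes method remains as the fallback. Assumes
 * segsize is a multiple of the 8K block size, as 16MB WAL segments are.
 */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int
allocate_wal_file(int fd, size_t segsize)
{
#ifdef HAVE_POSIX_FALLOCATE
	/* fast path: let the filesystem allocate the whole segment at once */
	return posix_fallocate(fd, 0, (off_t) segsize);
#else
	/* fallback: the existing method, writing zeroes in 8K blocks */
	char		buf[8192];
	size_t		written;

	memset(buf, 0, sizeof(buf));
	for (written = 0; written < segsize; written += sizeof(buf))
	{
		if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
			return errno;
	}
	return 0;
#endif
}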
On Fri, Jul 5, 2013 at 2:23 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 7/5/13 2:50 AM, Jeff Davis wrote:
>> So, my simple conclusion is that the glibc emulation should be about
>> the same as what we're doing now, so there's no reason to avoid it.
>> That means, if posix_fallocate() is present, we should use it,
>> because it's either the same (if emulated in glibc) or significantly
>> faster (if implemented in the kernel).
>
> That's what I'm seeing everywhere too. I'm happy that we've spent
> enough time chasing after potential issues without finding anything
> now. Pull out the GUC that was added for the default and this is
> ready to commit.

Wonderful. Is removing the GUC something that I should do, or should
that be done by somebody who knows more about what they are doing? (I
am happy to give it a go!)

Should the small test program that I made also be included somewhere
in the source tree?

--
Jon
On Fri, 2013-07-05 at 08:21 -0500, Jon Nelson wrote:
> Wonderful. Is removing the GUC something that I should do, or should
> that be done by somebody who knows more about what they are doing?
> (I am happy to give it a go!)

I'll take care of that.

> Should the small test program that I made also be included somewhere
> in the source tree?

I don't think that's necessary; the fact that it's in the mailing list
archives is enough.

Regards,
	Jeff Davis