Thread: Can we trust fsync?
I'm really concerned by this post on Linux's fsync and disk flush behaviour: http://milek.blogspot.com.au/2010/12/linux-osync-and-write-barriers.html and seeking opinions from folks here who've been deeply involved in write reliability work. The amount of change in write reliability behaviour in Linux across kernel versions, file systems and storage abstraction layers is worrying - different results for LVM vs !LVM, md vs !md, ext3 vs other, etc. If this isn't something that's already been seen and dealt with then I'll see if I can take a look into it once the RLS work is dealt with. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 11/21/2013 07:45 AM, Craig Ringer wrote: > I'm really concerned by this post on Linux's fsync and disk flush behaviour: > > http://milek.blogspot.com.au/2010/12/linux-osync-and-write-barriers.html ... and yes, I realise that's partly why we have the "fsync" param to control different sync modes. Just concerned it's even more variable than I thought. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> On 11/21/2013 07:45 AM, Craig Ringer wrote: >> I'm really concerned by this post on Linux's fsync and disk flush behaviour: >> >> http://milek.blogspot.com.au/2010/12/linux-osync-and-write-barriers.html > > ... and yes, I realise that's partly why we have the "fsync" param to > control different sync modes. Just concerned it's even more variable > than I thought. So on linux, we don't have any safe option for wal_sync_method? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
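For reference, the PostgreSQL documentation lists fsync, fdatasync, open_sync, open_datasync and fsync_writethrough as values for wal_sync_method. A minimal sketch of what the first four amount to at the system-call level, assuming Linux and Python's os module. The helper name durable_write is hypothetical and this is not PostgreSQL code; none of these calls helps if a layer below the kernel ignores the flush request.

```python
import os

def durable_write(path, data, method="fdatasync"):
    """Write data to path, forcing it to stable storage with one strategy.

    Hypothetical helper for illustration only; assumes a Linux-like system.
    """
    flags = os.O_WRONLY | os.O_CREAT
    if method == "open_sync":
        flags |= os.O_SYNC       # each write() returns after data and metadata are flushed
    elif method == "open_datasync":
        flags |= os.O_DSYNC      # each write() returns after the data is flushed
    elif method not in ("fsync", "fdatasync"):
        raise ValueError("unknown method: %r" % (method,))

    fd = os.open(path, flags, 0o600)
    try:
        written = 0
        while written < len(data):          # os.write may write fewer bytes than asked
            written += os.write(fd, data[written:])
        if method == "fsync":
            os.fsync(fd)                    # flush data and file metadata
        elif method == "fdatasync":
            os.fdatasync(fd)                # flush data, skip non-essential metadata
    finally:
        os.close(fd)
```

Whether any of these is actually safe on a given machine still depends on the kernel, the file system and the storage stack honouring the request, which is the concern raised above.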
Craig Ringer <craig@2ndquadrant.com> writes: > The amount of change in write reliability behaviour in Linux across > kernel versions, file systems and storage abstraction layers is worrying > - different results for LVM vs !LVM, md vs !md, ext3 vs other, etc. Well, we pretty much *have to* trust fsync --- there's not a lot we can do if the kernel doesn't get this right. My takeaway is that you don't want to be running a production database on bleeding-edge kernels or filesystem stacks. If you want to use Linux, use a distro from a vendor with a track record for caring about stability. (I'll omit the commercial for my former employers, but ...) Also, it's not that hard to do plug-pull testing to verify that your system is telling the truth about fsync. This really ought to be part of acceptance testing for any new DB server. regards, tom lane
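The idea behind a plug-pull test can be sketched roughly as follows: write numbered records, fsync after each one, and only then report the record as durable to something that survives the power cut (another machine, a serial console, a notebook). After the reboot, every record that was reported must still be readable. This is a minimal sketch in Python, not any existing tool; the record layout and the names write_records, verify and ack are hypothetical.

```python
import os
import struct
import zlib

# Hypothetical record layout: 8-byte sequence number plus CRC32 of those bytes.
RECORD = struct.Struct("<QI")

def write_records(path, count, ack):
    """Append count records, fsync after each, then call ack(seq).

    ack must deliver the sequence number somewhere that survives the
    power cut; that part is outside this sketch.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for seq in range(count):
            payload = struct.pack("<Q", seq)
            os.write(fd, RECORD.pack(seq, zlib.crc32(payload)))
            os.fsync(fd)
            ack(seq)  # only now is the record claimed to be durable
    finally:
        os.close(fd)

def verify(path, last_acked):
    """Return acknowledged sequence numbers that are missing or corrupt."""
    if not os.path.exists(path):
        return list(range(last_acked + 1))
    intact = set()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file or a torn final record
            seq, crc = RECORD.unpack(chunk)
            if zlib.crc32(struct.pack("<Q", seq)) == crc:
                intact.add(seq)
    return [s for s in range(last_acked + 1) if s not in intact]
```

Run write_records, pull the plug partway through, then call verify with the last sequence number that was acknowledged. A non-empty result means the system acknowledged an fsync for data that did not reach stable storage. An empty result on one run proves little, so the test would need repeating several times.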
On 11/20/2013 03:45 PM, Craig Ringer wrote: > > I'm really concerned by this post on Linux's fsync and disk flush behaviour: > > http://milek.blogspot.com.au/2010/12/linux-osync-and-write-barriers.html > > and seeking opinions from folks here who've been deeply involved in > write reliability work. > > The amount of change in write reliability behaviour in Linux across > kernel versions, file systems and storage abstraction layers is worrying > - different results for LVM vs !LVM, md vs !md, ext3 vs other, etc. > > If this isn't something that's already been seen and dealt with then > I'll see if I can take a look into it once the RLS work is dealt with. > I thought Greg did some testing on this a while back and determined which versions were safe... (/me looks for post) JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
On 11/21/2013 12:45 AM, Craig Ringer wrote: > I'm really concerned by this post on Linux's fsync and disk flush behaviour: > > http://milek.blogspot.com.au/2010/12/linux-osync-and-write-barriers.html > > and seeking opinions from folks here who've been deeply involved in > write reliability work. With ext4 and XFS on plain/LVM/md block devices, this issue should really be a thing of the past. I think the kernel folks would treat this as bugs nowadays, too. -- Florian Weimer / Red Hat Product Security Team
On Thu, Nov 21, 2013 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, it's not that hard to do plug-pull testing to verify that your > system is telling the truth about fsync. This really ought to be part > of acceptance testing for any new DB server. I've never tried it but I always wondered how easy it was to do. How would you ever know you had tested it enough? The original mail was referencing a problem with syncing *meta* data though. The semantics around meta data syncs are much less clearly specified, in part because file systems traditionally made nearly all meta data operations synchronous. Doing plug-pull testing on Postgres would not test meta data syncing very well since Postgres specifically avoids doing much meta data operations by overwriting existing files and blocks as much as possible. You would have to test doing table extensions or pulling the plug immediately after switching xlog files repeatedly to have any coverage at all there. -- greg
Greg Stark <stark@mit.edu> writes: > On Thu, Nov 21, 2013 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, it's not that hard to do plug-pull testing to verify that your >> system is telling the truth about fsync. This really ought to be part >> of acceptance testing for any new DB server. > I've never tried it but I always wondered how easy it was to do. How would > you ever know you had tested it enough? I used the program Greg Smith recommends on our wiki (can't remember the name offhand) when I got a new house server this spring. With the RAID card configured for writethrough and no battery, it failed all over the place. Fixed those configuration bugs, it was okay three or four times in a row, which was good enough for me. > The original mail was referencing a problem with syncing *meta* data > though. The semantics around meta data syncs are much less clearly > specified, in part because file systems traditionally made nearly all meta > data operations synchronous. Doing plug-pull testing on Postgres would not > test meta data syncing very well since Postgres specifically avoids doing > much meta data operations by overwriting existing files and blocks as much > as possible. True. You're better off with a specialized testing program. (Though now you mention it, I wonder whether that program was stressing metadata or not.) regards, tom lane
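To make the metadata point concrete: a test that only overwrites existing blocks never exercises directory updates. A minimal sketch of a metadata-heavy operation, assuming Linux semantics where a new directory entry is made durable by calling fsync on the directory itself. The names fsync_dir and create_durably are hypothetical and this is not how any existing test program is known to work.

```python
import os

def fsync_dir(dirpath):
    """Flush a directory's entries (creations, renames) to stable storage."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def create_durably(dirpath, name, data):
    """Create a new file so that both its contents and its name survive a crash."""
    tmp = os.path.join(dirpath, name + ".tmp")
    final = os.path.join(dirpath, name)
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        written = 0
        while written < len(data):      # os.write may write fewer bytes than asked
            written += os.write(fd, data[written:])
        os.fsync(fd)                    # file contents and inode
    finally:
        os.close(fd)
    os.rename(tmp, final)               # metadata-only operation
    fsync_dir(dirpath)                  # make the new directory entry durable
    return final
```

A plug-pull test that calls something like create_durably in a loop, acknowledging each file name only after it returns, would check whether acknowledged files still exist under their final names after power loss.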
On Fri, Nov 22, 2013 at 1:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The original mail was referencing a problem with syncing *meta* data >> though. The semantics around meta data syncs are much less clearly >> specified, in part because file systems traditionally made nearly all meta >> data operations synchronous. Doing plug-pull testing on Postgres would not >> test meta data syncing very well since Postgres specifically avoids doing >> much meta data operations by overwriting existing files and blocks as much >> as possible. > > True. You're better off with a specialized testing program. (Though > now you mention it, I wonder whether that program was stressing metadata > or not.) You can always stress metadata by leaving atime updates in their full setting (whatever it is for that filesystem).
On Fri, Nov 22, 2013 at 11:16:06AM -0500, Tom Lane wrote: > Greg Stark <stark@mit.edu> writes: > > On Thu, Nov 21, 2013 at 1:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Also, it's not that hard to do plug-pull testing to verify that your > >> system is telling the truth about fsync. This really ought to be part > >> of acceptance testing for any new DB server. > > > I've never tried it but I always wondered how easy it was to do. How would > > you ever know you had tested it enough? > > I used the program Greg Smith recommends on our wiki (can't remember the > name offhand) when I got a new house server this spring. With the RAID > card configured for writethrough and no battery, it failed all over the > place. Fixed those configuration bugs, it was okay three or four times > in a row, which was good enough for me. > > > The original mail was referencing a problem with syncing *meta* data > > though. The semantics around meta data syncs are much less clearly > > specified, in part because file systems traditionally made nearly all meta > > data operations synchronous. Doing plug-pull testing on Postgres would not > > test meta data syncing very well since Postgres specifically avoids doing > > much meta data operations by overwriting existing files and blocks as much > > as possible. > > True. You're better off with a specialized testing program. (Though > now you mention it, I wonder whether that program was stressing metadata > or not.) The program is diskchecker: http://brad.livejournal.com/2116715.html I got the author to re-host the source code on github a few years ago. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Fri, Nov 22, 2013 at 2:57 PM, Bruce Momjian <bruce@momjian.us> wrote: > The program is diskchecker: > > http://brad.livejournal.com/2116715.html > > I got the author to re-host the source code on github a few years ago. It might be worth re-implementing this for -contrib. The fact that we mention diskchecker.pl in the docs, and it is a pretty obscure Perl script on some guy's personal website doesn't inspire much confidence. -- Peter Geoghegan
On Fri, Nov 22, 2013 at 03:06:31PM -0800, Peter Geoghegan wrote: > On Fri, Nov 22, 2013 at 2:57 PM, Bruce Momjian <bruce@momjian.us> wrote: > > The program is diskchecker: > > > > http://brad.livejournal.com/2116715.html > > > > I got the author to re-host the source code on github a few years ago. > > It might be worth re-implementing this for -contrib. The fact that we > mention diskchecker.pl in the docs, and it is a pretty obscure Perl > script on some guy's personal website doesn't inspire much confidence. Well, it was his idea, and quite a good one. I guess we could reimplement this in C if someone wants to do the legwork. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On 11/22/2013 03:23 PM, Bruce Momjian wrote: > On Fri, Nov 22, 2013 at 03:06:31PM -0800, Peter Geoghegan wrote: >> On Fri, Nov 22, 2013 at 2:57 PM, Bruce Momjian <bruce@momjian.us> wrote: >>> The program is diskchecker: >>> >>> http://brad.livejournal.com/2116715.html >>> >>> I got the author to re-host the source code on github a few years ago. >> >> It might be worth re-implementing this for -contrib. The fact that we >> mention diskchecker.pl in the docs, and it is a pretty obscure Perl >> script on some guy's personal website doesn't inspire much confidence. > > Well, it was his idea, and quite a good one. I guess we could > reimplement this in C if someone wants to do the legwork. Yeah, too bad Brad didn't post a license for it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Nov 22, 2013 at 03:27:29PM -0800, Josh Berkus wrote: > On 11/22/2013 03:23 PM, Bruce Momjian wrote: > > On Fri, Nov 22, 2013 at 03:06:31PM -0800, Peter Geoghegan wrote: > >> On Fri, Nov 22, 2013 at 2:57 PM, Bruce Momjian <bruce@momjian.us> wrote: > >>> The program is diskchecker: > >>> > >>> http://brad.livejournal.com/2116715.html > >>> > >>> I got the author to re-host the source code on github a few years ago. > >> > >> It might be worth re-implementing this for -contrib. The fact that we > >> mention diskchecker.pl in the docs, and it is a pretty obscure Perl > >> script on some guy's personal website doesn't inspire much confidence. > > > > Well, it was his idea, and quite a good one. I guess we could > > reimplement this in C if someone wants to do the legwork. > > Yeah, too bad Brad didn't post a license for it. We can ask him. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Sat, Nov 23, 2013 at 8:06 AM, Peter Geoghegan <pg@heroku.com> wrote: > On Fri, Nov 22, 2013 at 2:57 PM, Bruce Momjian <bruce@momjian.us> wrote: >> The program is diskchecker: >> >> http://brad.livejournal.com/2116715.html >> >> I got the author to re-host the source code on github a few years ago. > > It might be worth re-implementing this for -contrib. The fact that we > mention diskchecker.pl in the docs, and it is a pretty obscure Perl > script on some guy's personal website doesn't inspire much confidence. Yes, having that in contrib would be useful. It would be a plus when testing disks for Postgres. -- Michael