Re: Is pg_control file crashsafe? - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Is pg_control file crashsafe?
Msg-id CAA4eK1Kn62ZHHCsUrq+X_31t9wNfxNAW5TwbkWCvmJafTXQA=A@mail.gmail.com
In response to Re: Is pg_control file crashsafe?  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Thu, May 5, 2016 at 11:52 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Thu, May 5, 2016 at 4:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> >> How about using 512 bytes as a write size and perform direct writes rather
> >> than going via OS buffer cache for control file?
> >
> > Wouldn't that fail outright under a lot of implementations of direct write;
> > ie the request needs to be page-aligned, for some not-very-determinate
> > value of page size?
> >

Right, it should be at least the page size.
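
To illustrate the alignment point, here is a rough sketch of rounding a write size up to a multiple of an alignment. The helper name `padded_write_size` is hypothetical, and using `mmap.PAGESIZE` as the alignment is only an assumption: the alignment that direct I/O actually requires depends on the platform, filesystem, and device.

```python
import mmap


def padded_write_size(payload_len: int, alignment: int = mmap.PAGESIZE) -> int:
    """Round payload_len up to the next multiple of alignment.

    Hypothetical helper: mmap.PAGESIZE is only a stand-in for whatever
    alignment a given direct-I/O implementation really demands.
    """
    if payload_len <= 0 or alignment <= 0:
        raise ValueError("payload_len and alignment must be positive")
    # Ceiling division, then scale back up to a whole number of blocks.
    return -(-payload_len // alignment) * alignment
```

With a 4096-byte alignment, a 512-byte request would have to be padded to 4096 bytes, which is why a bare 512-byte direct write may be rejected.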

>
> > To repeat, I'm pretty hesitant to change this logic.  While this is not
> > the first report we've ever heard of loss of pg_control, I believe I could
> > count those reports without running out of fingers on one hand --- and
> > that's counting since the last century. It will take quite a lot of
> > evidence to convince me that some other implementation will be more
> > reliable.  If you just come and present a patch to use direct write, or
> > rename, or anything else for that matter, I'm going to reject it out of
> > hand unless you provide very strong evidence that it's going to be more
> > reliable than the current code across all the systems we support.
>
> I'm not sure how those ideas address the reported problem anyway: the
> *length* was unexpectedly zero after a crash.  UpdateControlFile
> doesn't change the length of the control file, since it doesn't
> specify O_TRUNC or O_APPEND and it always writes the same size.  So it
> seems like a pretty weird failure mode affecting filesystem metadata
> (which I wouldn't expect to change anyway, but I would expect to be
> journaled if it did), not a file-contents-atomicity problem.  Whether
> or not the page cache is involved in a write to a preallocated file
> doesn't seem relevant to a case of unexpected truncation, and the
> atomic rename trick doesn't seem relevant either unless someone with
> expert knowledge of NTFS could explain how a crash could lead to
> truncation in the first place, and how rename would help.
>

I think the real cause of the truncation is not known, or at least has not been discussed here.  It seems to me that these ideas are being discussed on mere speculation that the current way of writing the file can lead to corruption in certain cases.  I think it would be better to first dig into the actual cause of the problem.
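
As a starting point for that kind of investigation, the observation quoted above (that a same-size write without O_TRUNC or O_APPEND should not change the file length) can be checked with a small sketch. This is not the UpdateControlFile code; the function names, the demo file name, and the 8192-byte size are assumptions made for illustration only.

```python
import os
import tempfile

CONTROL_FILE_SIZE = 8192  # assumed fixed size, for illustration only


def overwrite_in_place(path: str, payload: bytes, size: int = CONTROL_FILE_SIZE) -> None:
    """Rewrite a preallocated file with a fixed-size buffer, then fsync it."""
    if len(payload) > size:
        raise ValueError("payload larger than the fixed file size")
    buf = payload.ljust(size, b"\0")  # always write exactly `size` bytes
    # Neither O_TRUNC nor O_APPEND: the write starts at offset 0 of the existing file.
    flags = os.O_RDWR | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        if os.write(fd, buf) != size:
            raise OSError("short write")
        os.fsync(fd)  # push the data to stable storage before returning
    finally:
        os.close(fd)


def demo() -> tuple[int, int]:
    """Return the file length before and after an in-place rewrite."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "control_demo")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"\0" * CONTROL_FILE_SIZE)  # preallocate the file
        before = os.path.getsize(path)
        overwrite_in_place(path, b"new state")
        after = os.path.getsize(path)
        return before, after
```

In normal operation the two lengths returned by `demo()` are equal, so a zero-length file after a crash points at something other than this write pattern itself.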


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
