Thread: Torn page hazard in ginRedoUpdateMetapage()

Torn page hazard in ginRedoUpdateMetapage()

From
Noah Misch
Date:
When GIN changes a metapage, we WAL-log its ex-header content and never use a
backup block.  This reduces WAL volume since the vast majority of the metapage
is unused.  However, ginRedoUpdateMetapage() only restores the WAL-logged
content if the metapage LSN predates the WAL record LSN.  If a metapage write
tore and updated the LSN but not the other content, we would fail to complete
the update.  Instead, unconditionally reinitialize the metapage similar to how
_bt_restore_meta() handles the situation.

I found this problem by code reading and did not attempt to build a test case
illustrating its practical consequences.  It's possible that there's no
problem in practice on account of some reason I haven't contemplated.

Thanks,
nm

Attachment

Re: Torn page hazard in ginRedoUpdateMetapage()

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> When GIN changes a metapage, we WAL-log its ex-header content and never use a
> backup block.  This reduces WAL volume since the vast majority of the metapage
> is unused.  However, ginRedoUpdateMetapage() only restores the WAL-logged
> content if the metapage LSN predates the WAL record LSN.  If a metapage write
> tore and updated the LSN but not the other content, we would fail to complete
> the update.  Instead, unconditionally reinitialize the metapage similar to how
> _bt_restore_meta() handles the situation.

> I found this problem by code reading and did not attempt to build a test case
> illustrating its practical consequences.  It's possible that there's no
> problem in practice on account of some reason I haven't contemplated.

I think there's no problem in practice; the reason is that the
GinMetaPageData struct isn't large enough to extend past the first
physical sector of the page.  So it's in the same disk sector as the
LSN and tearing is impossible.  Still, this might be a good
future-proofing move, in case GinMetaPageData gets larger.
        regards, tom lane


Re: Torn page hazard in ginRedoUpdateMetapage()

From
Noah Misch
Date:
On Mon, Apr 30, 2012 at 02:35:20PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > When GIN changes a metapage, we WAL-log its ex-header content and never use a
> > backup block.  This reduces WAL volume since the vast majority of the metapage
> > is unused.  However, ginRedoUpdateMetapage() only restores the WAL-logged
> > content if the metapage LSN predates the WAL record LSN.  If a metapage write
> > tore and updated the LSN but not the other content, we would fail to complete
> > the update.  Instead, unconditionally reinitialize the metapage similar to how
> > _bt_restore_meta() handles the situation.
> 
> > I found this problem by code reading and did not attempt to build a test case
> > illustrating its practical consequences.  It's possible that there's no
> > problem in practice on account of some reason I haven't contemplated.
> 
> I think there's no problem in practice; the reason is that the
> GinMetaPageData struct isn't large enough to extend past the first
> physical sector of the page.  So it's in the same disk sector as the
> LSN and tearing is impossible.  Still, this might be a good
> future-proofing move, in case GinMetaPageData gets larger.

Can we indeed assume that all support-worthy filesystems align the start of
every file to a physical sector?  I know little about modern filesystem
design, but these references leave me wary of that assumption:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html
http://en.wikipedia.org/wiki/Block_suballocation

If it is a safe assumption, we could exploit it elsewhere.

Thanks,
nm


Re: Torn page hazard in ginRedoUpdateMetapage()

From
Daniel Farina
Date:
On Wed, May 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:
> Can we indeed assume that all support-worthy filesystems align the start of
> every file to a physical sector?  I know little about modern filesystem
> design, but these references leave me wary of that assumption:
>
> http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html
> http://en.wikipedia.org/wiki/Block_suballocation
>
> If it is a safe assumption, we could exploit it elsewhere.

Not to say whether this is safe or not, but it *is* exploited
elsewhere, as I understand it: the pg_control information, whose
justification for its safety is its small size.  That may point to a
very rare problem with pg_control rather the safety of the assumption
it makes.

--
fdr


Re: Torn page hazard in ginRedoUpdateMetapage()

From
Tom Lane
Date:
Daniel Farina <daniel@heroku.com> writes:
> On Wed, May 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote:
>> Can we indeed assume that all support-worthy filesystems align the start of
>> every file to a physical sector?  I know little about modern filesystem
>> design, but these references leave me wary of that assumption:
>> 
>> http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html
>> http://en.wikipedia.org/wiki/Block_suballocation
>> 
>> If it is a safe assumption, we could exploit it elsewhere.

> Not to say whether this is safe or not, but it *is* exploited
> elsewhere, as I understand it: the pg_control information, whose
> justification for its safety is its small size.  That may point to a
> very rare problem with pg_control rather the safety of the assumption
> it makes.

I think it's somewhat common now for filesystems to attempt to optimize
very small files (on the order of a few dozen bytes) in that way.  It's
hard to see where's the upside for changing the conventional storage
allocation when the file is sector-sized or larger; the file system does
have to be prepared to rewrite the file on demand, and moving it from
one place to another isn't cheap.

That wikipedia reference argues for doing this type of optimization on
the last partial block of a file, which is entirely irrelevant for our
purposes since we always ask for page-multiples of space.  (The fact
that much of that might be useless padding is, I think, unknown to the
filesystem.)

Having said all that, I wasn't really arguing that this was a guaranteed
safe thing for us to rely on; just pointing out that it's quite likely
that the issue hasn't been seen in the field because of this type of
consideration.
        regards, tom lane


Re: Torn page hazard in ginRedoUpdateMetapage()

From
Robert Haas
Date:
On Thu, May 3, 2012 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Having said all that, I wasn't really arguing that this was a guaranteed
> safe thing for us to rely on; just pointing out that it's quite likely
> that the issue hasn't been seen in the field because of this type of
> consideration.

Well, we do rely, in numerous places, on writes << 512 bytes not
getting torn.  pd_prune_xid, index tuple kills, heap tuple hint bits,
relmapper files, etc.  We generally assume, for example, that a 4-byte
write which is 4-byte aligned does not need to be WAL-logged, which
would be necessary if we thought that the write might be torn.

Are you planning to commit Noah's patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Torn page hazard in ginRedoUpdateMetapage()

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Are you planning to commit Noah's patch?

I wasn't intending to do so personally in the near future; I've got
other things on my to-do list.  I won't object if somebody else
commits it though.
        regards, tom lane


Re: Torn page hazard in ginRedoUpdateMetapage()

From
Robert Haas
Date:
On Mon, Apr 30, 2012 at 1:34 PM, Noah Misch <noah@leadboat.com> wrote:
> When GIN changes a metapage, we WAL-log its ex-header content and never use a
> backup block.  This reduces WAL volume since the vast majority of the metapage
> is unused.  However, ginRedoUpdateMetapage() only restores the WAL-logged
> content if the metapage LSN predates the WAL record LSN.  If a metapage write
> tore and updated the LSN but not the other content, we would fail to complete
> the update.  Instead, unconditionally reinitialize the metapage similar to how
> _bt_restore_meta() handles the situation.
>
> I found this problem by code reading and did not attempt to build a test case
> illustrating its practical consequences.  It's possible that there's no
> problem in practice on account of some reason I haven't contemplated.

The attached patch doesn't apply any more, but it looks like this
issue still exists.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Torn page hazard in ginRedoUpdateMetapage()

From
Heikki Linnakangas
Date:
On 03/10/2014 09:44 PM, Robert Haas wrote:
> On Mon, Apr 30, 2012 at 1:34 PM, Noah Misch <noah@leadboat.com> wrote:
>> When GIN changes a metapage, we WAL-log its ex-header content and never use a
>> backup block.  This reduces WAL volume since the vast majority of the metapage
>> is unused.  However, ginRedoUpdateMetapage() only restores the WAL-logged
>> content if the metapage LSN predates the WAL record LSN.  If a metapage write
>> tore and updated the LSN but not the other content, we would fail to complete
>> the update.  Instead, unconditionally reinitialize the metapage similar to how
>> _bt_restore_meta() handles the situation.
>>
>> I found this problem by code reading and did not attempt to build a test case
>> illustrating its practical consequences.  It's possible that there's no
>> problem in practice on account of some reason I haven't contemplated.
>
> The attached patch doesn't apply any more, but it looks like this
> issue still exists.

Fixed.

- Heikki



Re: Torn page hazard in ginRedoUpdateMetapage()

From
Robert Haas
Date:
On Wed, Mar 12, 2014 at 4:23 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>> The attached patch doesn't apply any more, but it looks like this
>> issue still exists.
>
> Fixed.

Did you forget to push?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Torn page hazard in ginRedoUpdateMetapage()

From
Heikki Linnakangas
Date:
On 03/12/2014 02:05 PM, Robert Haas wrote:
> On Wed, Mar 12, 2014 at 4:23 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>> The attached patch doesn't apply any more, but it looks like this
>>> issue still exists.
>>
>> Fixed.
>
> Did you forget to push?

Yep. Pushed now.

- Heikki