Thread: Torn page hazard in ginRedoUpdateMetapage()
When GIN changes a metapage, we WAL-log its ex-header content and never use a backup block. This reduces WAL volume since the vast majority of the metapage is unused. However, ginRedoUpdateMetapage() only restores the WAL-logged content if the metapage LSN predates the WAL record LSN. If a metapage write tore and updated the LSN but not the other content, we would fail to complete the update. Instead, unconditionally reinitialize the metapage similar to how _bt_restore_meta() handles the situation. I found this problem by code reading and did not attempt to build a test case illustrating its practical consequences. It's possible that there's no problem in practice on account of some reason I haven't contemplated. Thanks, nm
Attachment
Noah Misch <noah@leadboat.com> writes: > When GIN changes a metapage, we WAL-log its ex-header content and never use a > backup block. This reduces WAL volume since the vast majority of the metapage > is unused. However, ginRedoUpdateMetapage() only restores the WAL-logged > content if the metapage LSN predates the WAL record LSN. If a metapage write > tore and updated the LSN but not the other content, we would fail to complete > the update. Instead, unconditionally reinitialize the metapage similar to how > _bt_restore_meta() handles the situation. > I found this problem by code reading and did not attempt to build a test case > illustrating its practical consequences. It's possible that there's no > problem in practice on account of some reason I haven't contemplated. I think there's no problem in practice; the reason is that the GinMetaPageData struct isn't large enough to extend past the first physical sector of the page. So it's in the same disk sector as the LSN and tearing is impossible. Still, this might be a good future-proofing move, in case GinMetaPageData gets larger. regards, tom lane
On Mon, Apr 30, 2012 at 02:35:20PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > When GIN changes a metapage, we WAL-log its ex-header content and never use a > > backup block. This reduces WAL volume since the vast majority of the metapage > > is unused. However, ginRedoUpdateMetapage() only restores the WAL-logged > > content if the metapage LSN predates the WAL record LSN. If a metapage write > > tore and updated the LSN but not the other content, we would fail to complete > > the update. Instead, unconditionally reinitialize the metapage similar to how > > _bt_restore_meta() handles the situation. > > > I found this problem by code reading and did not attempt to build a test case > > illustrating its practical consequences. It's possible that there's no > > problem in practice on account of some reason I haven't contemplated. > > I think there's no problem in practice; the reason is that the > GinMetaPageData struct isn't large enough to extend past the first > physical sector of the page. So it's in the same disk sector as the > LSN and tearing is impossible. Still, this might be a good > future-proofing move, in case GinMetaPageData gets larger. Can we indeed assume that all support-worthy filesystems align the start of every file to a physical sector? I know little about modern filesystem design, but these references leave me wary of that assumption: http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html http://en.wikipedia.org/wiki/Block_suballocation If it is a safe assumption, we could exploit it elsewhere. Thanks, nm
On Wed, May 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote: > Can we indeed assume that all support-worthy filesystems align the start of > every file to a physical sector? I know little about modern filesystem > design, but these references leave me wary of that assumption: > > http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html > http://en.wikipedia.org/wiki/Block_suballocation > > If it is a safe assumption, we could exploit it elsewhere. Not to say whether this is safe or not, but it *is* exploited elsewhere, as I understand it: the pg_control information, whose justification for its safety is its small size. That may point to a very rare problem with pg_control rather the safety of the assumption it makes. -- fdr
Daniel Farina <daniel@heroku.com> writes: > On Wed, May 2, 2012 at 6:06 PM, Noah Misch <noah@leadboat.com> wrote: >> Can we indeed assume that all support-worthy filesystems align the start of >> every file to a physical sector? I know little about modern filesystem >> design, but these references leave me wary of that assumption: >> >> http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html >> http://en.wikipedia.org/wiki/Block_suballocation >> >> If it is a safe assumption, we could exploit it elsewhere. > Not to say whether this is safe or not, but it *is* exploited > elsewhere, as I understand it: the pg_control information, whose > justification for its safety is its small size. That may point to a > very rare problem with pg_control rather the safety of the assumption > it makes. I think it's somewhat common now for filesystems to attempt to optimize very small files (on the order of a few dozen bytes) in that way. It's hard to see where's the upside for changing the conventional storage allocation when the file is sector-sized or larger; the file system does have to be prepared to rewrite the file on demand, and moving it from one place to another isn't cheap. That wikipedia reference argues for doing this type of optimization on the last partial block of a file, which is entirely irrelevant for our purposes since we always ask for page-multiples of space. (The fact that much of that might be useless padding is, I think, unknown to the filesystem.) Having said all that, I wasn't really arguing that this was a guaranteed safe thing for us to rely on; just pointing out that it's quite likely that the issue hasn't been seen in the field because of this type of consideration. regards, tom lane
On Thu, May 3, 2012 at 12:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Having said all that, I wasn't really arguing that this was a guaranteed > safe thing for us to rely on; just pointing out that it's quite likely > that the issue hasn't been seen in the field because of this type of > consideration. Well, we do rely, in numerous places, on writes << 512 bytes not getting torn. pd_prune_xid, index tuple kills, heap tuple hint bits, relmapper files, etc. We generally assume, for example, that a 4-byte write which is 4-byte aligned does not need to be WAL-logged, which would be necessary if we thought that the write might be torn. Are you planning to commit Noah's patch? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Are you planning to commit Noah's patch? I wasn't intending to do so personally in the near future; I've got other things on my to-do list. I won't object if somebody else commits it though. regards, tom lane
On Mon, Apr 30, 2012 at 1:34 PM, Noah Misch <noah@leadboat.com> wrote: > When GIN changes a metapage, we WAL-log its ex-header content and never use a > backup block. This reduces WAL volume since the vast majority of the metapage > is unused. However, ginRedoUpdateMetapage() only restores the WAL-logged > content if the metapage LSN predates the WAL record LSN. If a metapage write > tore and updated the LSN but not the other content, we would fail to complete > the update. Instead, unconditionally reinitialize the metapage similar to how > _bt_restore_meta() handles the situation. > > I found this problem by code reading and did not attempt to build a test case > illustrating its practical consequences. It's possible that there's no > problem in practice on account of some reason I haven't contemplated. The attached patch doesn't apply any more, but it looks like this issue still exists. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 03/10/2014 09:44 PM, Robert Haas wrote: > On Mon, Apr 30, 2012 at 1:34 PM, Noah Misch <noah@leadboat.com> wrote: >> When GIN changes a metapage, we WAL-log its ex-header content and never use a >> backup block. This reduces WAL volume since the vast majority of the metapage >> is unused. However, ginRedoUpdateMetapage() only restores the WAL-logged >> content if the metapage LSN predates the WAL record LSN. If a metapage write >> tore and updated the LSN but not the other content, we would fail to complete >> the update. Instead, unconditionally reinitialize the metapage similar to how >> _bt_restore_meta() handles the situation. >> >> I found this problem by code reading and did not attempt to build a test case >> illustrating its practical consequences. It's possible that there's no >> problem in practice on account of some reason I haven't contemplated. > > The attached patch doesn't apply any more, but it looks like this > issue still exists. Fixed. - Heikki
On Wed, Mar 12, 2014 at 4:23 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> The attached patch doesn't apply any more, but it looks like this >> issue still exists. > > Fixed. Did you forget to push? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 03/12/2014 02:05 PM, Robert Haas wrote: > On Wed, Mar 12, 2014 at 4:23 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >>> The attached patch doesn't apply any more, but it looks like this >>> issue still exists. >> >> Fixed. > > Did you forget to push? Yep. Pushed now. - Heikki