Thread: Error with index on unlogged table
Hi,
I was attempting to set up a data set to test pg_rewind when I encountered an error. I created a primary and standby, then:

# create table test (id serial primary key, thing text);
CREATE TABLE
# create unlogged table utest (id serial primary key, thing text);
CREATE TABLE
# insert into test (thing) values ('jifj');
INSERT 0 1
# insert into utest (thing) values ('jifj');
INSERT 0 1
$ pg_ctl -D standby1 promote
$ psql -p 5531 postgres # standby
# insert into test (thing) values ('moomoo');
INSERT 0 1
# insert into utest (thing) values ('moomoo');
ERROR: index "utest_pkey" contains unexpected zero page at block 0
HINT: Please REINDEX it.
This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
--
Thom
On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown <thom@linux.com> wrote:
> I was attempting to set up a data set to test pg_rewind, when I encountered
> an error. I created a primary and standby, then:
>
> [...]
>
> # insert into utest (thing) values ('moomoo');
> ERROR: index "utest_pkey" contains unexpected zero page at block 0
> HINT: Please REINDEX it.
>
> This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.

Unlogged tables are not in WAL, and cannot be accessed while in recovery, so having an empty index relation is expected on a promoted standby IMO. Now perhaps we could have a friendlier error message in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an additional HINT mentioning unlogged tables, but I am not sure that is worth much. Mentioning this behavior in the docs would be good instead.
--
Michael
On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
>On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown <thom@linux.com> wrote:
>> I was attempting to set up a data set to test pg_rewind, when I
>> encountered an error. I created a primary and standby, then:
>>
>> [...]
>>
>> # insert into utest (thing) values ('moomoo');
>> ERROR: index "utest_pkey" contains unexpected zero page at block 0
>> HINT: Please REINDEX it.
>>
>> This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
>
>Unlogged tables are not in WAL, and cannot be accessed while in
>recovery, so having an empty index relation is expected on a promoted
>standby IMO. Now perhaps we could have a more friendly error message
>in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an
>additional HINT to mention unlogged tables, but I am not sure that
>this is much worth it. Mentioning this behavior in the docs would be
>good instead.

I think Thom's point is that he promoted the node...

Thom, are you sure this was transient?

Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
On 24 March 2015 at 11:37, Andres Freund <andres@anarazel.de> wrote:
> On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
>> Unlogged tables are not in WAL, and cannot be accessed while in
>> recovery, so having an empty index relation is expected on a promoted
>> standby IMO. [...]
>
> I think Thom's point is that he promoted the node...
>
> Thom, are you sure this was transient?

The index is unlogged until reindexing...

# select oid, relname, relpersistence from pg_class where relname in ('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
-------+------------+----------------
 16387 | test       | p
 16394 | test_pkey  | p
 16398 | utest      | u
 16405 | utest_pkey | u
(4 rows)

# reindex index utest_pkey;
REINDEX

# select oid, relname, relpersistence from pg_class where relname in ('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
-------+------------+----------------
 16387 | test       | p
 16394 | test_pkey  | p
 16398 | utest      | u
 16405 | utest_pkey | p
(4 rows)

Which I think also raises the question: why are unlogged indexes made persistent by a reindex?

--
Thom
On 24 March 2015 at 11:46, Thom Brown <thom@linux.com> wrote:
On 24 March 2015 at 11:37, Andres Freund <andres@anarazel.de> wrote:
> On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown <thom@linux.com> wrote:
>>> I was attempting to set up a data set to test pg_rewind, when I
>>> encountered an error. I created a primary and standby, then:
>>>
>>> [...]
>>>
>>> # insert into utest (thing) values ('moomoo');
>>> ERROR: index "utest_pkey" contains unexpected zero page at block 0
>>> HINT: Please REINDEX it.
>>>
>>> This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
>>
>> Unlogged tables are not in WAL, and cannot be accessed while in
>> recovery, so having an empty index relation is expected on a promoted
>> standby IMO. Now perhaps we could have a more friendly error message
>> in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an
>> additional HINT to mention unlogged tables, but I am not sure that
>> this is much worth it. Mentioning this behavior in the docs would be
>> good instead.
>
> I think Thom's point is that he promoted the node...
>
> Thom, are you sure this was transient?

The index is unlogged until reindexing...
# select oid, relname, relpersistence from pg_class where relname in ('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
-------+------------+----------------
 16387 | test       | p
 16394 | test_pkey  | p
 16398 | utest      | u
 16405 | utest_pkey | u
(4 rows)
# reindex index utest_pkey;
REINDEX
# select oid, relname, relpersistence from pg_class where relname in ('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
-------+------------+----------------
 16387 | test       | p
 16394 | test_pkey  | p
 16398 | utest      | u
 16405 | utest_pkey | p
(4 rows)

Which I think also raises the question: why are unlogged indexes made persistent by a reindex?
I should also mention that it becomes unlogged again when running VACUUM FULL or CLUSTER on the table.
--
Thom
On Tue, Mar 24, 2015 at 8:37 PM, Andres Freund <andres@anarazel.de> wrote:
> On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier wrote:
> I think Thom's point is that he promoted the node...
>
> Thom, are you sure this was transient?

Well, I got his point :) I was just thinking that this error message is legit, as ResetUnloggedRelationsInDbspaceDir reinitializes unlogged relations and their indexes at the end of recovery.
--
Michael
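[Editorial note: the end-of-recovery reset Michael refers to can be illustrated with a standalone toy model. This is a sketch only; names like `toy_reset_unlogged` and `toy_page_is_new` are hypothetical stand-ins for PostgreSQL's reinit.c logic and PageIsNew(), and the 8-byte "page" is purely illustrative.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLCKSZ 8            /* toy page size, not PostgreSQL's BLCKSZ */

typedef struct { unsigned char data[TOY_BLCKSZ]; } ToyPage;

/* all-zero page, in the spirit of PageIsNew() */
static bool toy_page_is_new(const ToyPage *p)
{
    for (int i = 0; i < TOY_BLCKSZ; i++)
        if (p->data[i] != 0)
            return false;
    return true;
}

/* end-of-recovery reset: copy the on-disk init fork over the main fork,
 * as the reset routines do for unlogged relations */
static void toy_reset_unlogged(ToyPage *main_fork, const ToyPage *init_fork_on_disk)
{
    memcpy(main_fork->data, init_fork_on_disk->data, TOY_BLCKSZ);
}
```

If the init fork made it to disk, the reset produces a usable empty index; if it never did (the issue debated later in this thread), the main fork ends up all zeroes, which is exactly the "unexpected zero page" complaint.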
On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
> The index is unlogged until reindexing...
>
> [...]
> Which I think also raises the question: why are unlogged indexes made
> persistent by a reindex?

That's a bug in HEAD; ~9.4 keeps the index unlogged even after REINDEX INDEX. What happens is that ReindexIndex relies on the relpersistence provided by makeRangeVar at parse time, which is incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch attached fixes that...
--
Michael
Attachment
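[Editorial note: a minimal standalone sketch of the bug Michael describes. The names mirror PostgreSQL's (RangeVar, makeRangeVar, relpersistence), but the bodies are hypothetical toy stand-ins, not the real parser or REINDEX code.]

```c
#include <assert.h>

#define RELPERSISTENCE_PERMANENT 'p'
#define RELPERSISTENCE_UNLOGGED  'u'

typedef struct ToyRangeVar
{
    const char *relname;
    char        relpersistence;     /* filled in at parse time */
} ToyRangeVar;

/* parse-time constructor: knows nothing about the catalog, so it can
 * only guess, and it always guesses "permanent" */
static ToyRangeVar toy_makeRangeVar(const char *relname)
{
    ToyRangeVar rv = { relname, RELPERSISTENCE_PERMANENT };
    return rv;
}

/* stand-in for looking up pg_class.relpersistence of the real relation */
static char toy_catalog_persistence(const char *relname)
{
    (void) relname;
    return RELPERSISTENCE_UNLOGGED;     /* utest_pkey is unlogged */
}

/* buggy path: trust the parse-time value when rebuilding the index */
static char toy_reindex_buggy(ToyRangeVar rv)
{
    return rv.relpersistence;
}

/* fixed path: take the persistence from the relation actually opened */
static char toy_reindex_fixed(ToyRangeVar rv)
{
    return toy_catalog_persistence(rv.relname);
}
```

The fix described in the thread amounts to switching from the first path to the second: use the opened relation's catalog persistence, never the parser's default.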
On Wednesday, March 25, 2015, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
>> The index is unlogged until reindexing...
>>
>> [...]
>> Which I think also raises the question: why are unlogged indexes made
>> persistent by a reindex?
>
> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
> REINDEX INDEX. What happens is that ReindexIndex relies on
> the relpersistence provided by makeRangeVar at parse time, which is
> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
> attached fixes that...

How about VACUUM FULL and CLUSTER, as the problem seems to have been reported to be there too?

Amit
On 25 March 2015 at 12:22, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wednesday, March 25, 2015, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
>>> The index is unlogged until reindexing...
>>>
>>> [...]
>>> Which I think also raises the question: why are unlogged indexes made
>>> persistent by a reindex?
>>
>> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
>> REINDEX INDEX. What happens is that ReindexIndex relies on
>> the relpersistence provided by makeRangeVar at parse time, which is
>> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
>> attached fixes that...
>
> How about VACUUM FULL and CLUSTER, as the problem seems to have been
> reported to be there too?
No, those are okay. They actually revert the index back to the same persistence level as the table they're attached to.
--
Thom
On Wednesday, March 25, 2015, Thom Brown <thom@linux.com> wrote:
> On 25 March 2015 at 12:22, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Wednesday, March 25, 2015, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
>>> REINDEX INDEX. What happens is that ReindexIndex relies on
>>> the relpersistence provided by makeRangeVar at parse time, which is
>>> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
>>> attached fixes that...
>>
>> How about VACUUM FULL and CLUSTER, as the problem seems to have been
>> reported to be there too?
>
> No, those are okay. They actually revert the index back to the same
> persistence level as the table they're attached to.
Ah, I misread then; sorry about the noise.
Amit
Hi,

On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
> On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
>> The index is unlogged until reindexing...
>>
>> [...]
>> Which I think also raises the question: why are unlogged indexes made
>> persistent by a reindex?
>
> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
> REINDEX INDEX. What happens is that ReindexIndex relies on
> the relpersistence provided by makeRangeVar at parse time, which is
> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
> attached fixes that...

What the hell? That's somewhat nasty. Nice that it got caught before 9.5 was released.

Did you check whether a similar bug was made in other places of 85b506bb? Could you additionally add a regression test to this end? Seems like something worth testing.

Greetings,

Andres Freund
On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
>> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
>> REINDEX INDEX. [...]
>
> What the hell? That's somewhat nasty. Nice that it got caught before 9.5
> was released.

Unfortunately this is very nasty. Sorry!

> Did you check whether a similar bug was made in other places of
> 85b506bb? Could you additionally add a regression test to this end?
> Seems like something worth testing.

I'm checking it and adding some regression tests.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello <fabriziomello@gmail.com> wrote:
>
>
> On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
>
> > Did you check whether a similar bug was made in other places of
> > 85b506bb? Could you additionally add a regression test to this end?
> > Seems like something worth testing.
> >
>
> I'm checking it and adding some regression tests.
>
I didn't find any other similar bug introduced by 85b506bb.
Attached the original patch provided by Michael with some regression tests.
Regards,
--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
Attachment
On Wed, Mar 25, 2015 at 10:53 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
>> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
>> REINDEX INDEX. What happens is that ReindexIndex relies on
>> the relpersistence provided by makeRangeVar at parse time, which is
>> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
>> attached fixes that...
>
> What the hell? That's somewhat nasty. Nice that it got caught before 9.5
> was released.
>
> Did you check whether a similar bug was made in other places of
> 85b506bb?

Yeah, I took a look at the other code paths, particularly CLUSTER and matviews, and the relpersistence used is taken directly from a Relation.

> Could you additionally add a regression test to this end?
> Seems like something worth testing.

Definitely. And I guess that Fabrizio already did that...
--
Michael
On Thu, Mar 26, 2015 at 1:02 AM, Fabrízio de Royes Mello <fabriziomello@gmail.com> wrote:
> On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello
> <fabriziomello@gmail.com> wrote:
>> I'm checking it and adding some regression tests.
>
> I didn't find any other similar bug introduced by 85b506bb.
>
> Attached the original patch provided by Michael with some regression tests.

Thanks for adding a test; this looks fine to me (I did some sanity checks and tutti quanti, for people wondering). On temporary tables this was failing with an error in md.c...
--
Michael
On 26 March 2015 at 00:55, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Mar 26, 2015 at 1:02 AM, Fabrízio de Royes Mello
> <fabriziomello@gmail.com> wrote:
>> On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello
>> <fabriziomello@gmail.com> wrote:
>>> On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> Did you check whether a similar bug was made in other places of
>>>> 85b506bb? Could you additionally add a regression test to this end?
>>>> Seems like something worth testing.
>>>
>>> I'm checking it and adding some regression tests.
>>
>> I didn't find any other similar bug introduced by 85b506bb.
>>
>> Attached the original patch provided by Michael with some regression tests.
>
> Thanks for adding a test; this looks fine to me (I did some sanity
> checks and tutti quanti, for people wondering). On temporary tables
> this was failing with an error in md.c...
Thanks to both of you for fixing this.
I still, however, have a problem with the separate and original issue of:
# insert into utest (thing) values ('moomoo');
ERROR: index "utest_pkey" contains unexpected zero page at block 0
HINT: Please REINDEX it.
I don't see why the user should need to go re-indexing all unlogged tables each time a standby is promoted. The index should just be empty and ready to use.
--
Thom
On 2015-03-26 13:55:22 +0000, Thom Brown wrote:
> I still, however, have a problem with the separate and original issue of:
>
> # insert into utest (thing) values ('moomoo');
> ERROR: index "utest_pkey" contains unexpected zero page at block 0
> HINT: Please REINDEX it.
>
> I don't see why the user should need to go re-indexing all unlogged tables
> each time a standby is promoted. The index should just be empty and ready
> to use.

There's definitely something rather broken here. Investigating.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-03-26 15:13:41 +0100, Andres Freund wrote:
> On 2015-03-26 13:55:22 +0000, Thom Brown wrote:
>> I still, however, have a problem with the separate and original issue of:
>>
>> # insert into utest (thing) values ('moomoo');
>> ERROR: index "utest_pkey" contains unexpected zero page at block 0
>> HINT: Please REINDEX it.
>>
>> I don't see why the user should need to go re-indexing all unlogged tables
>> each time a standby is promoted. The index should just be empty and ready
>> to use.
>
> There's definitely something rather broken here. Investigating.

As far as I can see this has been broken at least since the introduction of fast promotion. WAL replay will update the init fork in shared memory, but it'll not be guaranteed to be flushed to disk when the reset happens. d3586fc8a et al. then also made it possible to hit the issue without fast promotion. To hit the issue there may not be a restartpoint (requiring a checkpoint on the primary) since the creation of the unlogged table.

I think the problem here is that the *primary* makes no such assumptions. Init forks are logged via stuff like:

    smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
              (char *) metapage, true);
    if (XLogIsNeeded())
        log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
                    BTREE_METAPAGE, metapage, false);

    /*
     * An immediate sync is required even if we xlog'd the page, because the
     * write did not go through shared_buffers and therefore a concurrent
     * checkpoint may have moved the redo pointer past our xlog record.
     */
    smgrimmedsync(index->rd_smgr, INIT_FORKNUM);

i.e. the data is written out directly to disk, circumventing shared_buffers. It's pretty bad that we don't do the same on the standby. For master I think we should just add a bit to the XLOG_FPI record saying the data should be forced out to disk. I'm less sure what's to be done in the back branches. Flushing every HEAP_NEWPAGE record isn't really an option.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
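[Editorial note: Andres's point, that a write bypassing the buffer manager is invisible to checkpoints and therefore needs its own immediate sync, can be modeled with a small standalone sketch. The flags and `toy_*` names are illustrative stand-ins, not PostgreSQL's actual bufmgr/smgr API.]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state of one init-fork page */
typedef struct
{
    bool    in_shared_buffers;  /* dirty copy tracked by the buffer manager */
    bool    durable;            /* would survive a crash */
} ToyBufPage;

/* write through shared_buffers: a later checkpoint will flush it */
static void toy_buffered_write(ToyBufPage *p)
{
    p->in_shared_buffers = true;
}

/* a checkpoint flushes only what the buffer manager knows about */
static void toy_checkpoint(ToyBufPage *p)
{
    if (p->in_shared_buffers)
    {
        p->durable = true;
        p->in_shared_buffers = false;
    }
}

/* smgrwrite()-style direct write: invisible to the buffer manager */
static void toy_direct_write(ToyBufPage *p)
{
    p->in_shared_buffers = false;
}

/* smgrimmedsync()-style immediate flush */
static void toy_immedsync(ToyBufPage *p)
{
    p->durable = true;
}
```

In this model, a direct write followed only by a checkpoint never becomes durable, which is the standby's situation; the primary avoids it with the immediate sync.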
Hello,

At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund <andres@2ndquadrant.com> wrote in <20150326175024.GJ451@alap3.anarazel.de>:
> I think the problem here is that the *primary* makes no such
> assumptions. Init forks are logged via stuff like
>
>     smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
> ...
> i.e. the data is written out directly to disk, circumventing
> shared_buffers. It's pretty bad that we don't do the same on the
> standby. For master I think we should just add a bit to the XLOG_FPI
> record saying the data should be forced out to disk. I'm less sure
> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
> record isn't really an option.

The problem exists only for INIT_FORKNUM, so I suppose it is enough to check the fork number to decide whether to sync immediately. Specifically for this instance, syncing buffers of INIT_FORKNUM at the end of the XLOG_FPI block in xlog_redo fixed the problem.

Another (ugly!) solution would be syncing only the buffers that are for INIT_FORKNUM and BM_DIRTY in ResetUnloggedRelations(op = UNLOGGED_RELATION_INIT). This is a catch-all-at-once solution, though it is a kind of reversion of fast promotion. But the buffers to be synced here should be pretty few.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
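[Editorial note: Horiguchi-san's first suggestion can be sketched as a standalone toy. The names are hypothetical; the real change would live in xlog_redo's XLOG_FPI handling, and the fork numbering mirrors PostgreSQL's ForkNumber enum.]

```c
#include <assert.h>

typedef enum ToyForkNumber
{
    TOY_MAIN_FORKNUM = 0,
    TOY_FSM_FORKNUM,
    TOY_VISIBILITYMAP_FORKNUM,
    TOY_INIT_FORKNUM
} ToyForkNumber;

static int toy_fsync_calls = 0;

static void toy_flush_page(void)
{
    toy_fsync_calls++;
}

/* replay one full-page image; force it to disk only for init forks,
 * so the end-of-recovery copy of the init fork reads valid data */
static void toy_redo_fpi(ToyForkNumber forknum)
{
    /* restore the page into shared buffers (elided) ... */
    if (forknum == TOY_INIT_FORKNUM)
        toy_flush_page();
}
```

The point of gating on the fork number is that ordinary FPIs stay cheap; only the rare init-fork pages pay for an immediate sync.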
Michael Paquier wrote:
> On Thu, Mar 26, 2015 at 1:02 AM, Fabrízio de Royes Mello
> <fabriziomello@gmail.com> wrote:
>> I didn't find any other similar bug introduced by 85b506bb.
>>
>> Attached the original patch provided by Michael with some regression tests.
>
> Thanks for adding a test; this looks fine to me (I did some sanity
> checks and tutti quanti, for people wondering). On temporary tables
> this was failing with an error in md.c...

Yeah, I extended the test a bit to use a temp table too, and pushed. Thanks everybody.

--
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 27 March 2015 at 04:54, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund <andres@2ndquadrant.com> wrote in <20150326175024.GJ451@alap3.anarazel.de>:
>> I think the problem here is that the *primary* makes no such
>> assumptions. Init forks are logged via stuff like
>>     smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
>> ...
>> i.e. the data is written out directly to disk, circumventing
>> shared_buffers. It's pretty bad that we don't do the same on the
>> standby. For master I think we should just add a bit to the XLOG_FPI
>> record saying the data should be forced out to disk. I'm less sure
>> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
>> record isn't really an option.
>
> The problem exists only for INIT_FORKNUM, so I suppose it is enough
> to check the fork number to decide whether to sync immediately.
>
> Specifically for this instance, syncing buffers of INIT_FORKNUM
> at the end of the XLOG_FPI block in xlog_redo fixed the problem.
>
> Another (ugly!) solution would be syncing only the buffers that are
> for INIT_FORKNUM and BM_DIRTY in ResetUnloggedRelations(op =
> UNLOGGED_RELATION_INIT). This is a catch-all-at-once solution, though
> it is a kind of reversion of fast promotion. But the buffers to be
> synced here should be pretty few.

This bug still exists.

Thom
On Thu, Nov 19, 2015 at 7:19 AM, Thom Brown <thom@linux.com> wrote:
> On 27 March 2015 at 04:54, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> [...]
>
> This bug still exists.

Hmm. This probably should have been on the open items list. I didn't pay too much attention to this before because it seemed like Andres and Michael were all over it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 20, 2015 at 7:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 7:19 AM, Thom Brown <thom@linux.com> wrote:
>> This bug still exists.
>
> Hmm. This probably should have been on the open items list.

I have added an open item for 9.5. That's not a bug we want in the next GA release, and it's not limited to 9.5: it is easily reproducible in older stable branches as well.

> I didn't pay too much attention to this before because it seemed like
> Andres and Michael were all over it.

This completely fell off my radar; sorry about that. For back branches, I tend to agree with what Horiguchi-san mentioned upthread: we had better issue an unconditional fsync on the relation when INIT_FORKNUM is used for it when replaying an XLOG_FPI record. That would localize the impact. An example patch is attached that applies on top of REL9_4_STABLE. It's rather ugly, but it shows the idea and fixes the issue.

For master and perhaps 9.5, I would think that a dedicated WAL record type enforcing the fsync is the way to go. This special treatment would apply only to btree and spgist when they use INIT_FORKNUM, and we would not impact other relation types and code paths with this behavior. Thoughts?
--
Michael
Attachment
On Fri, Nov 20, 2015 at 4:11 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> For master and perhaps 9.5, I would think that a dedicated WAL record
> type enforcing the fsync is the way to go. This special treatment
> would apply only to btree and spgist when they use INIT_FORKNUM, and
> we would not impact other relation types and code paths with this
> behavior.

So, I have been looking at that, and finished with the attached to implement this idea. This way, it is possible to control at replay whether a relation should be synced to disk just after replaying an FPI or a set of FPIs. This makes btree and spgist init forks more consistent at replay with what is done on the primary by syncing them immediately, which is not a bad thing to me.

Now, and actually my last email has been misleading as well, this patch as well as the previous patch I sent for ~9.4 do not actually fix the initialization of indexes for unlogged tables after promoting a standby. Hence I guess that we are still missing a trick when reinitializing those relations at the end of recovery. It is a bit late here, so I am attaching a patch reflecting the progress I have made. Comments are welcome.
--
Michael
Attachment
On Sat, Nov 21, 2015 at 11:30 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Nov 20, 2015 at 4:11 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> For master and perhaps 9.5, I would think that a dedicated WAL record
>> type enforcing the fsync is the way to go. [...]
>
> So, I have been looking at that, and finished with the attached to
> implement this idea. This way, it is possible to control at replay
> whether a relation should be synced to disk just after replaying an
> FPI or a set of FPIs. This makes btree and spgist init forks more
> consistent at replay with what is done on the primary by syncing them
> immediately, which is not a bad thing to me.

Depending on the type of index used on an unlogged table, the failure is quite varied. With gist and btree, a promoted standby will complain about an empty page. With brin, the standby will complain with a floating-point error:

ERROR: 22P01: floating-point exception
DETAIL: An invalid floating-point operation was signaled. This probably means an out-of-range result or an invalid operation, such as division by zero.
LOCATION: FloatExceptionHandler, postgres.c:2702

Now with gin, the system becomes actually unresponsive, being stuck on PGSemaphoreLock and unable to answer signals:

frame #1: 0x000000010a29f7cf postgres`PGSemaphoreLock(sema=0x00000001138df118) + 63 at pg_sema.c:387
frame #2: 0x000000010a348e81 postgres`LWLockAcquire(lock=0x000000010acc5dc0, mode=LW_EXCLUSIVE) + 369 at lwlock.c:1026
frame #3: 0x000000010a3190b1 postgres`LockBuffer(buffer=208, mode=2) + 289 at bufmgr.c:3240
frame #4: 0x0000000109f341e1 postgres`ginHeapTupleFastInsert(ginstate=0x00007fff55d06a08, collector=0x00007fff55d069d8) + 849 at ginfast.c:309
frame #5: 0x0000000109f1383f postgres`gininsert(fcinfo=0x00007fff55d09010) + 431 at gininsert.c:531

I am still investigating a correct fix; looking at reinit.c, the code in charge of copying the init fork as the main fork for a relation at the end of recovery looks to be doing its job correctly...

Regards,
--
Michael
On Fri, Nov 27, 2015 at 3:42 PM, Michael Paquier wrote:
> I am still investigating a correct fix; looking at reinit.c, the
> code in charge of copying the init fork as the main fork for a
> relation at the end of recovery looks to be doing its job correctly...

Attached is a patch that fixes the issue for me in master and 9.5. Actually, in the last patch I forgot a call to smgrwrite to ensure that the INIT_FORKNUM is correctly synced to disk when those pages are replayed at recovery, letting the reset routines for unlogged relations do their job correctly. I have noticed as well that we need to do the same for gin and brin relations. In this case I think that we could limit the flush to unlogged relations; my patch does it unconditionally, though, to generalize the logic. Thoughts?
--
Michael
Attachment
Hello, I studied your latest patch. At Fri, 27 Nov 2015 16:59:20 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRoaCMhr4hjEgq4rCZ4GaCB-6=cH8b2U7K7T5-kBGC5bA@mail.gmail.com> > On Fri, Nov 27, 2015 at 3:42 PM, Michael Paquier wrote: > > I am still investigating for a correct fix, looking at reinit.c the > > code in charge of copying the init fork as the main fork for a > > relation at the end of recovery looks to be doing its job correctly... > > Attached is a patch that fixes the issue for me in master and 9.5. > Actually in the last patch I forgot a call to smgrwrite to ensure that > the INIT_FORKNUM is correctly synced to disk when those pages are > replayed at recovery, letting the reset routines for unlogged > relations do their job correctly. I have noticed as well that we need > to do the same for gin and brin relations. In this case I think that > we could limit the flush to unlogged relations, my patch does it > unconditionally though to generalize the logic. Thoughts? I feel quite uncomfortable that it solves a problem rooted in the nature of unlogged objects by arbitrary flagging that does not fully correspond to that nature. If we can deduce the necessity of the fsync from some intrinsic property, that would be preferable. In the current patch, is_sync for log_newpage is generally true for, and only for, INIT_FORKNUM pages. The exceptions, as far as I can see, are:
copy_relation_data: called with an arbitrary forknum, but it doesn't set is_fsync even when copying INIT_FORKNUM. (Is this not a problem?)
spgbuildempty, ginbuildempty: these emit two or three newpage logs at once, so only the last one has is_fsync set, for performance reasons.
And other anomalies are:
ginbuildempty, gistbuildempty: these functions don't seem to fsync immediately, but is_fsync is set for INIT_FORKNUM. Of course it wouldn't be a problem.
In short, it seems to me that the reason to choose using XLOG_FPI_FOR_SYNC here is only the performance of processing successive FPIs for INIT_FORKNUM.
INIT_FORKNUM is generated only for unlogged tables and their belongings. I suppose such successive fsyncs don't cause an observable performance drop, assuming that the number of unlogged tables and their belongings is not so high, especially with smarter storage. All we need to do is fsync just for INIT_FORKNUM's FPIs in that case. If the performance does matter even so, we can still fsync the last md-file when any WAL record other than an FPI for the INIT_FORK comes. (But this would be a bit complex..) By the way, I suppose that fsyncing only the last page in successive new pages can still theoretically cause this problem when the last page is not in the same file as the other pages. That cannot occur for INIT_FORKNUM files in reality, though :) Thoughts? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
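[The two positions in this exchange can be summarized as two hypothetical replay-side predicates. This is an illustrative sketch only; neither helper exists in PostgreSQL in this form.]

```
/* Alternative (a), Horiguchi's proposal: deduce the flush from the
 * fork itself; every restored init-fork FPI gets forced to disk, and
 * no new record type is needed. */
static bool
fpi_needs_flush_by_fork(ForkNumber forknum)
{
    return forknum == INIT_FORKNUM;
}

/* Alternative (b), the patch's approach: trust an explicit info code
 * (or flag) set by the master, which, unlike the startup process,
 * knows the relation's persistence at record-creation time. */
static bool
fpi_needs_flush_by_info(uint8 info)
{
    return info == XLOG_FPI_FOR_SYNC;
}
```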
On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, I studied your latest patch. Thanks! > I feel quite uncomfortable that it solves the problem from a kind > of nature of unlogged object by arbitrary flagging which is not > fully corresponds to the nature. If we can deduce the necessity > of fsync from some nature, it would be preferable. INIT_FORKNUM is not something only related to unlogged relations; indexes use it as well. If you look for example at BRIN indexes, which do not immediately sync their INIT_FORKNUM when the index is created, I think that we are still going to need a new flag to control the sync at WAL replay, because the startup process cannot know a relation's persistence, something we do know when the XLOG_FPI record is created. For BRIN indexes, we particularly want to not sync the INIT_FORKNUM when the relation is not an unlogged one. > In the current patch, is_sync for log_newpage is generally true > for and only for INIT_FORKNUM pages. Exceptions as far as I can > see are, Yep, that's not always the case. > copy_relation_data: called with arbitrary forknum but it doesn't > set is_fsync even for coying INIT_FORKNUM. (Is this not a > problem?) Oh. And actually, it seems that use_wal is broken there as well. I think that we would still want to issue an XLOG_FPI record for an unlogged relation should it be moved to a new tablespace, to copy its INIT_FORKNUM correctly to its new home. After moving an unlogged relation to a new tablespace, and after promoting a standby, the standby fails to start because of this error: FATAL: could not create file "pg_tblspc/16388/PG_9.6_201511071/16384/16389": File exists This could be fixed separately though. > spgbuildempty, ginbuildempty: these emits two or three newpage > logs at once so only the last one is set is_fsync for > performance reason. It is not a problem to fsync just at the last one.
Each one of them would be replayed either way, the last one triggering the sync, no? > In short, it seems to me that the reason to choose using > XLOG_FPI_FOR_SYNC here is only performance of processing > successive FPIs for INIT_FORKNUM. Yeah, there is a one-way link between this WAL record and INIT_FORKNUM. However, please note that having an INIT_FORKNUM does not always imply that a sync is wanted. copy_relation_data is an example of that. > INIT_FORKNUM is generated only for unlogged tables and their > belongings. I suppose such successive fsyncs doesn't cause > observable performance drop assuming that the number of unlogged > tables and belongings is not so high, especially with smarter > storages. All we should do is that just fsync only for > INIT_FORKNUM's FPIs for the case. If the performance does matter > even so, we still can fsync the last md-file when any wal record > other than FPI for INIT_FORK comes. (But this would be a bit > complex..) Hm. If a system uses a bunch of temporary relations with brin or other indexes included, I would not say so. For back branches we may have to do it unconditionally using INIT_FORKNUM, but having a control flag to have it done only for unlogged relations would alleviate that. -- Michael
Hello, At Tue, 1 Dec 2015 11:53:35 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSMENEK7nqmwGsiyLTSbrZNJKx80tBX3qF6cQsS49sjag@mail.gmail.com> > On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, I studied your latest patch. > > Thanks! > > > I feel quite uncomfortable that it solves the problem from a kind > > of nature of unlogged object by arbitrary flagging which is not > > fully corresponds to the nature. If we can deduce the necessity > > of fsync from some nature, it would be preferable. > > INIT_FORKNUM is not something only related to unlogged relations, > indexes use them as well. And that's actually > If you look at for example BRIN indexes that do not sync immediately > their INIT_FORKNUM when index is created, I think that we still are > going to need a new flag to control the sync at WAL replay because > startup process cannot know a relation's persistence, thing that we > can know when the XLOG_FPI record is created. For BRIN indexes, we > want particularly to not sync the INIT_FORKNUM when the relation is > not an unlogged one. (The comment added in brinbuildempty looks wrong, since it actually doesn't fsync it immediately..) Hmm, I've already seen that, and given your explanation I wonder why brinbuildempty issues WAL for what does not need to be persistent at the moment. Isn't it breaking the agreement of Write-Ahead Logging? INIT_FORKNUM and unconditional fsync'ing would be equally tied, excluding that anomaly about WAL. (Except for successive newpages.) > > In the current patch, is_sync for log_newpage is generally true > > for and only for INIT_FORKNUM pages. Exceptions as far as I can > > see are, > > Yep, that's not always the case. > > > copy_relation_data: called with arbitrary forknum but it doesn't > > set is_fsync even for coying INIT_FORKNUM. (Is this not a > > problem?) > > Oh. And actually, it seems that use_wal is broken there as well. 
I > think that we would still want to issue a XLOG_FPI record for an > unlogged relation should it be moved to a new tablespace to copy its > INIT_FORKNUM correctly to its new home. Agreed. > After moving an unlogged relation to a new tablespace, and after > promoting a standby the standby fails to start because of this error: > FATAL: could not create file > "pg_tblspc/16388/PG_9.6_201511071/16384/16389": File exists > This could be fixed separately though. > > > spgbuildempty, ginbuildempty: these emits two or three newpage > > logs at once so only the last one is set is_fsync for > > performance reason. > > It doesn't matter to fsync just at the last one. Each one of them > would be replayed either way, the last one triggering the sync, no? Yes, it does (except for some unreal case below). It was just a confirmation and I didn't see this as a problem at all. Sorry for the ambiguous writing. > > In short, it seems to me that the reason to choose using > > XLOG_FPI_FOR_SYNC here is only performance of processing > > successive FPIs for INIT_FORKNUM. > > Yeah, there is a one-way link between this WAL record a INIT_FORKNUM. > However please note that having a INIT_FORKNUM does not always imply > that a sync is wanted. copy_relation_data is an example of that. As I wrote above, I suppose we should fix(?) the irregular relationship between WAL and the init fork of brin and so on. > > INIT_FORKNUM is generated only for unlogged tables and their > > belongings. I suppose such successive fsyncs doesn't cause > > observable performance drop assuming that the number of unlogged > > tables and belongings is not so high, especially with smarter > > storages. All we should do is that just fsync only for > > INIT_FORKNUM's FPIs for the case. If the performance does matter > > even so, we still can fsync the last md-file when any wal record > > other than FPI for INIT_FORK comes. (But this would be a bit > > complex..) > > Hm. 
If a system uses a bunch of temporary relations with brin index or > other included I would not say so. For back branches we may have to do > it unconditionally using INIT_FORKNUM, but having a control flag to > have it only done for unlogged relations would leverage that. It could, and should do so. And if we take such systems with a bunch of temp relations as significant (I agree with this), XLogRegisterBlock() looks to be able to register multiple blocks into a single WAL record, and we could use it to eliminate arbitrary flagging on individual FPI records. Would that be possible? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 1, 2015 at 3:06 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Tue, 1 Dec 2015 11:53:35 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSMENEK7nqmwGsiyLTSbrZNJKx80tBX3qF6cQsS49sjag@mail.gmail.com> >> On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Hello, I studied your latest patch. >> >> Thanks! >> >> > I feel quite uncomfortable that it solves the problem from a kind >> > of nature of unlogged object by arbitrary flagging which is not >> > fully corresponds to the nature. If we can deduce the necessity >> > of fsync from some nature, it would be preferable. >> >> INIT_FORKNUM is not something only related to unlogged relations, >> indexes use them as well. And that's actually >> If you look at for example BRIN indexes that do not sync immediately >> their INIT_FORKNUM when index is created, I think that we still are >> going to need a new flag to control the sync at WAL replay because >> startup process cannot know a relation's persistence, thing that we >> can know when the XLOG_FPI record is created. For BRIN indexes, we >> want particularly to not sync the INIT_FORKNUM when the relation is >> not an unlogged one. > > (The comment added in brinbuildempty looks wrong since it > actually doesn't fsync it immediately..) > > Hmm, I've already seen that, and having your explanation I wonder > why brinbuidempty issues WAL for what is not necessary to be > persistent at the mement. Isn't it breaking agreements about > Write Ahead Log? INIT_FORKNUM and unconditionally fsync'ing would > be equally tied excluding the anormally about WAL. (Except for > succeeding newpages.) Alvaro, your thoughts regarding those lines? When building an empty INIT_FORKNUM for a brin index, its data is saved into a shared buffer and not immediately synced to disk. Shouldn't that sync be done, at least for unlogged relations? 
>> > In short, it seems to me that the reason to choose using >> > XLOG_FPI_FOR_SYNC here is only performance of processing >> > successive FPIs for INIT_FORKNUM. >> >> Yeah, there is a one-way link between this WAL record a INIT_FORKNUM. >> However please note that having a INIT_FORKNUM does not always imply >> that a sync is wanted. copy_relation_data is an example of that. > > As I wrote above, I suppose we should fix(?) the irregular > relationship between WAL and init fork of brin and so. Yep. >> > INIT_FORKNUM is generated only for unlogged tables and their >> > belongings. I suppose such successive fsyncs doesn't cause >> > observable performance drop assuming that the number of unlogged >> > tables and belongings is not so high, especially with smarter >> > storages. All we should do is that just fsync only for >> > INIT_FORKNUM's FPIs for the case. If the performance does matter >> > even so, we still can fsync the last md-file when any wal record >> > other than FPI for INIT_FORK comes. (But this would be a bit >> > complex..) >> >> Hm. If a system uses a bunch of temporary relations with brin index or >> other included I would not say so. For back branches we may have to do >> it unconditionally using INIT_FORKNUM, but having a control flag to >> have it only done for unlogged relations would leverage that. > > It could, and should do so. And if we take such systems with > bunch of temp relations as significant (I agree with this), > XLogRegisterBlock() looks to be able to register multiple blocks > into single wal record and we could eliminate arbitrary flagging > on individual FPI records using it. Is it possible? I thought about using a BKPBLOCK flag, but all of them are already taken, if that's what you meant. It seems cheaper to do that at the record level... -- Michael
Hi, On 2015-11-20 16:11:15 +0900, Michael Paquier wrote: > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > index cc845d2..4883697 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -9503,6 +9503,14 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record) > data += sizeof(BkpBlock); > > RestoreBackupBlockContents(lsn, bkpb, data, false, false); > + > + if (bkpb.fork == INIT_FORKNUM) > + { > + SMgrRelation srel; > + srel = smgropen(bkpb.node, InvalidBackendId); > + smgrimmedsync(srel, INIT_FORKNUM); > + smgrclose(srel); > + } > } > else if (info == XLOG_BACKUP_END) > { A smgrwrite() instead of a smgrimmedsync() should be sufficient here. - Andres
On 2015-11-27 16:59:20 +0900, Michael Paquier wrote: > Attached is a patch that fixes the issue for me in master and 9.5. > Actually in the last patch I forgot a call to smgrwrite to ensure that > the INIT_FORKNUM is correctly synced to disk when those pages are > replayed at recovery, letting the reset routines for unlogged > relations do their job correctly. I have noticed as well that we need > to do the same for gin and brin relations. In this case I think that > we could limit the flush to unlogged relations, my patch does it > unconditionally though to generalize the logic. Thoughts? I think it's a good idea to limit this to unlogged relations. For a dataset that fits into shared_buffers and creates many relations, the additional disk writes could be noticeable. > diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c > index 99337b0..d7964ac 100644 > --- a/src/backend/access/brin/brin.c > +++ b/src/backend/access/brin/brin.c > @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS) > brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), > BRIN_CURRENT_VERSION); > MarkBufferDirty(metabuf); > - log_newpage_buffer(metabuf, false); > + /* > + * When this full page image is replayed, there is no guarantee that > + * this page will be present to disk when replayed particularly for > + * unlogged relations, hence enforce it to be flushed to disk. > + */ The grammar seems a bit off here. > + /* > + * Initialize and xlog metabuffer and root buffer. When those full > + * pages are replayed, it is not guaranteed that those relation > + * init forks will be flushed to disk after replaying them, hence > + * enforce those pages to be flushed to disk at replay, only the > + * last record will enforce a flush for performance reasons and > + * because it is actually unnecessary to do it multiple times. > + */ That comment needs some love too. I think it really only makes sense if you know the current state. 
There's also some language polishing needed. > @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS) > START_CRIT_SECTION(); > GinInitMetabuffer(MetaBuffer); > MarkBufferDirty(MetaBuffer); > - log_newpage_buffer(MetaBuffer, false); > + log_newpage_buffer(MetaBuffer, false, false); > GinInitBuffer(RootBuffer, GIN_LEAF); > MarkBufferDirty(RootBuffer); > - log_newpage_buffer(RootBuffer, false); > + log_newpage_buffer(RootBuffer, false, true); > END_CRIT_SECTION(); > Why isn't the metapage flushed to disk? > --- a/src/backend/access/spgist/spginsert.c > +++ b/src/backend/access/spgist/spginsert.c > @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS) > (char *) page, true); > if (XLogIsNeeded()) > log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, > - SPGIST_METAPAGE_BLKNO, page, false); > + SPGIST_METAPAGE_BLKNO, page, false, false); > > /* Likewise for the root page. */ > SpGistInitPage(page, SPGIST_LEAF); > @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS) > (char *) page, true); > if (XLogIsNeeded()) > log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, > - SPGIST_ROOT_BLKNO, page, true); > + SPGIST_ROOT_BLKNO, page, true, false); > I don't think it's correct to not flush in these cases. Yes, the primary does an smgrimmedsync() - but that's not done on the standby. > @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record) > * when checksums are enabled. There is no difference in handling > * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info > * code just to distinguish them for statistics purposes. > + * > + * XLOG_FPI_FOR_SYNC records are generated when a relation needs to > + * be sync'ed just after replaying a full page. This is important > + * to match the master behavior in the case where a page is written > + * directly to disk without going through shared buffers, particularly > + * when writing init forks for index relations. 
> */ How about FPI_FOR_SYNC records indicate that the page immediately needs to be written to disk, not just to shared buffers. This is important if the on-disk state is to be the authoritative, not the state in shared buffers. E.g. because on-disk files may later be copied directly. > diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c > index 0ddde72..eb22a51 100644 > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, > * space. > */ > if (use_wal) > - log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false); > + log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, > + false, false); Shouldn't this flush data if it's an init fork? Currently that's an academic distinction, given there'll never be a page, but it still seems prudent. > extern void InitXLogInsert(void); > diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h > index ad1eb4b..91445f1 100644 > --- a/src/include/catalog/pg_control.h > +++ b/src/include/catalog/pg_control.h > @@ -73,6 +73,7 @@ typedef struct CheckPoint > #define XLOG_END_OF_RECOVERY 0x90 > #define XLOG_FPI_FOR_HINT 0xA0 > #define XLOG_FPI 0xB0 > +#define XLOG_FPI_FOR_SYNC 0xC0 I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too ambiguous for my taste. How about either naming it XLOG_FPI_FLUSH or instead adding actual record data and a 'flags' field in there? I slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are different, XLOG_FPI_FOR_SYNC not so much. Greetings, Andres Freund
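[The second option suggested here, carrying a flags field in the record data instead of burning a new info code, could look roughly like this. This is a hypothetical layout, not the format that was ultimately committed.]

```
/* Hypothetical record data for XLOG_FPI, letting new behaviors be
 * added later without consuming another info code: */
typedef struct xl_fpi
{
    uint8       flags;          /* XL_FPI_* flags, see below */
} xl_fpi;

/* Replay must write the page straight to disk, not just into shared
 * buffers, because the on-disk file may later be copied directly
 * (e.g. init forks of unlogged relations at end of recovery). */
#define XL_FPI_FLUSH    0x01
```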
Andres Freund wrote: > On 2015-11-20 16:11:15 +0900, Michael Paquier wrote: >> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c >> index cc845d2..4883697 100644 >> --- a/src/backend/access/transam/xlog.c >> +++ b/src/backend/access/transam/xlog.c >> @@ -9503,6 +9503,14 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record) >> data += sizeof(BkpBlock); >> >> RestoreBackupBlockContents(lsn, bkpb, data, false, false); >> + >> + if (bkpb.fork == INIT_FORKNUM) >> + { >> + SMgrRelation srel; >> + srel = smgropen(bkpb.node, InvalidBackendId); >> + smgrimmedsync(srel, INIT_FORKNUM); >> + smgrclose(srel); >> + } >> } >> else if (info == XLOG_BACKUP_END) >> { > > A smgrwrite() instead of a smgrimmedsync() should be sufficient here. Check. On Fri, Dec 4, 2015 at 6:35 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-11-27 16:59:20 +0900, Michael Paquier wrote: >> Attached is a patch that fixes the issue for me in master and 9.5. >> Actually in the last patch I forgot a call to smgrwrite to ensure that >> the INIT_FORKNUM is correctly synced to disk when those pages are >> replayed at recovery, letting the reset routines for unlogged >> relations do their job correctly. I have noticed as well that we need >> to do the same for gin and brin relations. In this case I think that >> we could limit the flush to unlogged relations, my patch does it >> unconditionally though to generalize the logic. Thoughts? > > I think it's a good idea to limit this to unlogged relations. For a > dataset that fits into shared_buffers and creates many relations, the > additional disk writes could be noticeable. OK, then I have switched the patch this way to limit the flush to unlogged relations where needed. 
>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c >> index 99337b0..d7964ac 100644 >> --- a/src/backend/access/brin/brin.c >> +++ b/src/backend/access/brin/brin.c >> @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS) >> brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), >> BRIN_CURRENT_VERSION); >> MarkBufferDirty(metabuf); >> - log_newpage_buffer(metabuf, false); >> + /* >> + * When this full page image is replayed, there is no guarantee that >> + * this page will be present to disk when replayed particularly for >> + * unlogged relations, hence enforce it to be flushed to disk. >> + */ > > The grammar seems a bit off here. Check. >> + /* >> + * Initialize and xlog metabuffer and root buffer. When those full >> + * pages are replayed, it is not guaranteed that those relation >> + * init forks will be flushed to disk after replaying them, hence >> + * enforce those pages to be flushed to disk at replay, only the >> + * last record will enforce a flush for performance reasons and >> + * because it is actually unnecessary to do it multiple times. >> + */ > > That comment needs some love too. I think it really only makes sense if > you know the current state. There's also some language polishing needed. Check. >> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS) >> START_CRIT_SECTION(); >> GinInitMetabuffer(MetaBuffer); >> MarkBufferDirty(MetaBuffer); >> - log_newpage_buffer(MetaBuffer, false); >> + log_newpage_buffer(MetaBuffer, false, false); >> GinInitBuffer(RootBuffer, GIN_LEAF); >> MarkBufferDirty(RootBuffer); >> - log_newpage_buffer(RootBuffer, false); >> + log_newpage_buffer(RootBuffer, false, true); >> END_CRIT_SECTION(); >> > Why isn't the metapage flushed to disk? I was not sure if we should only flush only at the last page, as pages just before would be already replayed. Switched. 
>> --- a/src/backend/access/spgist/spginsert.c >> +++ b/src/backend/access/spgist/spginsert.c >> @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS) >> (char *) page, true); >> if (XLogIsNeeded()) >> log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, >> - SPGIST_METAPAGE_BLKNO, page, false); >> + SPGIST_METAPAGE_BLKNO, page, false, false); >> >> /* Likewise for the root page. */ >> SpGistInitPage(page, SPGIST_LEAF); >> @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS) >> (char *) page, true); >> if (XLogIsNeeded()) >> log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, >> - SPGIST_ROOT_BLKNO, page, true); >> + SPGIST_ROOT_BLKNO, page, true, false); >> > > I don't think it's correct to not flush in these cases. Yes, the primary > does an smgrimmedsync() - but that's not done on the standby. OK. Switched. >> @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record) >> * when checksums are enabled. There is no difference in handling >> * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info >> * code just to distinguish them for statistics purposes. >> + * >> + * XLOG_FPI_FOR_SYNC records are generated when a relation needs to >> + * be sync'ed just after replaying a full page. This is important >> + * to match the master behavior in the case where a page is written >> + * directly to disk without going through shared buffers, particularly >> + * when writing init forks for index relations. >> */ > > How about > > FPI_FOR_SYNC records indicate that the page immediately needs to be > written to disk, not just to shared buffers. This is important if the > on-disk state is to be the authoritative, not the state in shared > buffers. E.g. because on-disk files may later be copied directly. OK. 
>> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c >> index 0ddde72..eb22a51 100644 >> --- a/src/backend/commands/tablecmds.c >> +++ b/src/backend/commands/tablecmds.c >> @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, >> * space. >> */ >> if (use_wal) >> - log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false); >> + log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, >> + false, false); > > Shouldn't this flush data if it's an init fork? Currently that's an > academic distinction, given there'll never be a page, but it still seems > prudent. Yeah, it should do that for all INIT_FORKNUM pages, because we don't know the page type here. Note that use_wal is broken as well. We had better log and flush the INIT_FORKNUM as well for unlogged relations. >> extern void InitXLogInsert(void); >> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h >> index ad1eb4b..91445f1 100644 >> --- a/src/include/catalog/pg_control.h >> +++ b/src/include/catalog/pg_control.h >> @@ -73,6 +73,7 @@ typedef struct CheckPoint >> #define XLOG_END_OF_RECOVERY 0x90 >> #define XLOG_FPI_FOR_HINT 0xA0 >> #define XLOG_FPI 0xB0 >> +#define XLOG_FPI_FOR_SYNC 0xC0 > > > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or > instead adding actual record data and a 'flags' field in there? I > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are > different, XLOG_FPI_FOR_SYNC not so much. Let's go for XLOG_FPI_FLUSH. Attached is an updated patch. For back branches, I am still not sure what would be a good idea though. I don't see any other option than unconditionally flushing the INIT_FORKNUM when replaying an FPI... But that would impact some systems severely. Regards, -- Michael
Attachment
On 2015-12-04 17:00:13 +0900, Michael Paquier wrote: > Andres Freud wrote: > >> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS) > >> START_CRIT_SECTION(); > >> GinInitMetabuffer(MetaBuffer); > >> MarkBufferDirty(MetaBuffer); > >> - log_newpage_buffer(MetaBuffer, false); > >> + log_newpage_buffer(MetaBuffer, false, false); > >> GinInitBuffer(RootBuffer, GIN_LEAF); > >> MarkBufferDirty(RootBuffer); > >> - log_newpage_buffer(RootBuffer, false); > >> + log_newpage_buffer(RootBuffer, false, true); > >> END_CRIT_SECTION(); > >> > > Why isn't the metapage flushed to disk? > > I was not sure if we should only flush only at the last page, as pages > just before would be already replayed. Switched. Uh, that's not how it works! The earlier pages would just be in shared buffers. Neither smgrwrite, nor smgrimmedsync does anything about that! > >> extern void InitXLogInsert(void); > >> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h > >> index ad1eb4b..91445f1 100644 > >> --- a/src/include/catalog/pg_control.h > >> +++ b/src/include/catalog/pg_control.h > >> @@ -73,6 +73,7 @@ typedef struct CheckPoint > >> #define XLOG_END_OF_RECOVERY 0x90 > >> #define XLOG_FPI_FOR_HINT 0xA0 > >> #define XLOG_FPI 0xB0 > >> +#define XLOG_FPI_FOR_SYNC 0xC0 > > > > > > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too > > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or > > instead adding actual record data and a 'flags' field in there? I > > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are > > different, XLOG_FPI_FOR_SYNC not so much. > > Let's go for XLOG_FPI_FLUSH. I think the other way is a bit better, because we can add new flags without changing the WAL format. 
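[With this correction applied, ginbuildempty() would request a flush at replay for both pages, since a later record's flush does not write out an earlier page that sits only in shared buffers. This sketch follows the patch's log_newpage_buffer(buffer, page_std, flush) signature; it is illustrative, not the committed code.]

```
/* Sketch: both the metapage and the root page are marked for flush at
 * replay. Flushing only the last record would leave the metapage
 * dirty in shared buffers only, so it would never reach the on-disk
 * init fork that gets copied at end of recovery. */
START_CRIT_SECTION();
GinInitMetabuffer(MetaBuffer);
MarkBufferDirty(MetaBuffer);
log_newpage_buffer(MetaBuffer, false, true);    /* flush meta too */
GinInitBuffer(RootBuffer, GIN_LEAF);
MarkBufferDirty(RootBuffer);
log_newpage_buffer(RootBuffer, false, true);    /* flush root */
END_CRIT_SECTION();
```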
> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c > index 99337b0..b646101 100644 > --- a/src/backend/access/brin/brin.c > +++ b/src/backend/access/brin/brin.c > @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS) > brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), > BRIN_CURRENT_VERSION); > MarkBufferDirty(metabuf); > - log_newpage_buffer(metabuf, false); > + > + /* > + * For unlogged relations, this page should be immediately flushed > + * to disk after being replayed. This is necessary to ensure that the > + * initial on-disk state of unlogged relations is preserved when > + * they get reset at the end of recovery. > + */ > + log_newpage_buffer(metabuf, false, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > END_CRIT_SECTION(); Maybe write the last sentence as '... as the on disk files are copied directly at the end of recovery.'? > @@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state) > MAIN_FORKNUM, > state->rs_blockno, > state->rs_buffer, > - true); > + true, > + false); > RelationOpenSmgr(state->rs_new_rel); > > PageSetChecksumInplace(state->rs_buffer, state->rs_blockno); > @@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup) > MAIN_FORKNUM, > state->rs_blockno, > page, > - true); > + true, > + false); Did you verify that that's ok when a unlogged table is clustered/vacuum full'ed? > @@ -181,6 +183,9 @@ xlog_identify(uint8 info) > case XLOG_FPI_FOR_HINT: > id = "FPI_FOR_HINT"; > break; > + case XLOG_FPI_FLUSH: > + id = "FPI_FOR_SYNC"; > + break; > } Old string. > --- a/src/backend/access/transam/xloginsert.c > +++ b/src/backend/access/transam/xloginsert.c > @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std) > * If the page follows the standard page layout, with a PageHeader and unused > * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows > * the unused space to be left out from the WAL record, making it smaller. 
> + * > + * If 'is_flush' is set to TRUE, relation will be requested to flush > + * immediately its data at replay after replaying this full page image. > */ s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the OS immediately after replaying the record'? Greetings, Andres Freund
On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-04 17:00:13 +0900, Michael Paquier wrote: >> Andres Freud wrote: >> >> extern void InitXLogInsert(void); >> >> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h >> >> index ad1eb4b..91445f1 100644 >> >> --- a/src/include/catalog/pg_control.h >> >> +++ b/src/include/catalog/pg_control.h >> >> @@ -73,6 +73,7 @@ typedef struct CheckPoint >> >> #define XLOG_END_OF_RECOVERY 0x90 >> >> #define XLOG_FPI_FOR_HINT 0xA0 >> >> #define XLOG_FPI 0xB0 >> >> +#define XLOG_FPI_FOR_SYNC 0xC0 >> > >> > >> > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too >> > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or >> > instead adding actual record data and a 'flags' field in there? I >> > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are >> > different, XLOG_FPI_FOR_SYNC not so much. >> >> Let's go for XLOG_FPI_FLUSH. > > I think the other way is a bit better, because we can add new flags > without changing the WAL format. Hm. On the contrary, I think that it would make more sense to have a flag as well for FOR_HINT honestly, those are really the same operations, and FOR_HINT is just here for statistic purposes. >> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c >> index 99337b0..b646101 100644 >> --- a/src/backend/access/brin/brin.c >> +++ b/src/backend/access/brin/brin.c >> @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS) >> brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), >> BRIN_CURRENT_VERSION); >> MarkBufferDirty(metabuf); >> - log_newpage_buffer(metabuf, false); >> + >> + /* >> + * For unlogged relations, this page should be immediately flushed >> + * to disk after being replayed. This is necessary to ensure that the >> + * initial on-disk state of unlogged relations is preserved when >> + * they get reset at the end of recovery. 
>> + */ >> + log_newpage_buffer(metabuf, false, >> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); >> END_CRIT_SECTION(); > > Maybe write the last sentence as '... as the on disk files are copied > directly at the end of recovery.'? Check. >> @@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state) >> MAIN_FORKNUM, >> state->rs_blockno, >> state->rs_buffer, >> - true); >> + true, >> + false); >> RelationOpenSmgr(state->rs_new_rel); >> >> PageSetChecksumInplace(state->rs_buffer, state->rs_blockno); >> @@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup) >> MAIN_FORKNUM, >> state->rs_blockno, >> page, >> - true); >> + true, >> + false); > > Did you verify that that's ok when a unlogged table is clustered/vacuum > full'ed? Yep. >> @@ -181,6 +183,9 @@ xlog_identify(uint8 info) >> case XLOG_FPI_FOR_HINT: >> id = "FPI_FOR_HINT"; >> break; >> + case XLOG_FPI_FLUSH: >> + id = "FPI_FOR_SYNC"; >> + break; >> } > > Old string. Yeah, that's now completely removed. >> --- a/src/backend/access/transam/xloginsert.c >> +++ b/src/backend/access/transam/xloginsert.c >> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std) >> * If the page follows the standard page layout, with a PageHeader and unused >> * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows >> * the unused space to be left out from the WAL record, making it smaller. >> + * >> + * If 'is_flush' is set to TRUE, relation will be requested to flush >> + * immediately its data at replay after replaying this full page image. >> */ > > s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the > OS immediately after replaying the record'? s/OS/stable storage? -- Michael
Attachment
On 2015-12-03 22:09:43 +0100, Andres Freund wrote: > On 2015-11-20 16:11:15 +0900, Michael Paquier wrote: > > + if (bkpb.fork == INIT_FORKNUM) > > + { > > + SMgrRelation srel; > > + srel = smgropen(bkpb.node, InvalidBackendId); > > + smgrimmedsync(srel, INIT_FORKNUM); > > + smgrclose(srel); > > + } > > A smgrwrite() instead of a smgrimmedsync() should be sufficient here. More importantly, the smgrimmedsync() won't actually achieve anything - RestoreBackupBlockContents() will just have written the data into shared buffers. smgrimmedsync() doesn't flush that. And further, just flushing the buffer directly to disk using smgrwrite(), without reflecting that in the in-memory state, doesn't seem prudent. At the very least we'll write all those blocks twice. It also likely misses dealing with checksums and such. What I think we really ought to do instead is to use "proper" buffer functionality, e.g. flush the buffer via FlushBuffer(). It seems slightly better not to directly expose FlushBuffer() and instead add a tiny wrapper. Couldn't come up with a grand name tho. For me the attached, preliminary, patch fixes the problem in master; previous branches ought to look mostly similar, except the flush moved to RestoreBackupBlockContents/RestoreBackupBlock. Does anybody have a better idea? Suitable for the back-branches? I'm kinda wondering if it wouldn't have been better to go through shared buffers in ResetUnloggedRelationsInDbspaceDir() instead of using copy_file(). Regards, Andres
Attachment
Hi, On 2015-12-04 21:57:54 +0900, Michael Paquier wrote: > On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <andres@anarazel.de> wrote: > >> Let's go for XLOG_FPI_FLUSH. > > > > I think the other way is a bit better, because we can add new flags > > without changing the WAL format. > > Hm. On the contrary, I think that it would make more sense to have a > flag as well for FOR_HINT honestly, those are really the same > operations, and FOR_HINT is just here for statistic purposes. I don't think it's all that much the same operation. And WRT statistics purpose: Being able to easily differentiate FOR_HINT is important for pg_xlogdump --stats, but not at all for XLOG_FPI_FLUSH. > >> --- a/src/backend/access/transam/xloginsert.c > >> +++ b/src/backend/access/transam/xloginsert.c > >> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std) > >> * If the page follows the standard page layout, with a PageHeader and unused > >> * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows > >> * the unused space to be left out from the WAL record, making it smaller. > >> + * > >> + * If 'is_flush' is set to TRUE, relation will be requested to flush > >> + * immediately its data at replay after replaying this full page image. > >> */ > > > > s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the > > OS immediately after replaying the record'? > > s/OS/stable storage? It's not flushed to stable storage here. Just to the OS. 
> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c > index 99337b0..fff48ab 100644 > --- a/src/backend/access/brin/brin.c > +++ b/src/backend/access/brin/brin.c > @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS) > brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), > BRIN_CURRENT_VERSION); > MarkBufferDirty(metabuf); > - log_newpage_buffer(metabuf, false); > + > + /* > + * For unlogged relations, this page should be immediately flushed > + * to disk after being replayed. This is necessary to ensure that the > + * initial on-disk state of unlogged relations is preserved as the > + * on-disk files are copied directly at the end of recovery. > + */ > + log_newpage_buffer(metabuf, false, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > END_CRIT_SECTION(); > > UnlockReleaseBuffer(metabuf); > diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c > index f876f62..572fe20 100644 > --- a/src/backend/access/brin/brin_pageops.c > +++ b/src/backend/access/brin/brin_pageops.c > @@ -865,7 +865,7 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer) > page = BufferGetPage(buffer); > brin_page_init(page, BRIN_PAGETYPE_REGULAR); > MarkBufferDirty(buffer); > - log_newpage_buffer(buffer, true); > + log_newpage_buffer(buffer, true, false); > END_CRIT_SECTION(); > > /* > diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c > index 49e9185..17c168a 100644 > --- a/src/backend/access/gin/gininsert.c > +++ b/src/backend/access/gin/gininsert.c > @@ -450,14 +450,22 @@ ginbuildempty(PG_FUNCTION_ARGS) > ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); > LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE); > > - /* Initialize and xlog metabuffer and root buffer. */ > + /* > + * Initialize and xlog metabuffer and root buffer. For unlogged > + * relations, those pages need to be immediately flushed to disk > + * after being replayed. 
This is necessary to ensure that the > + * initial on-disk state of unlogged relations is preserved when > + * they get reset at the end of recovery. > + */ > START_CRIT_SECTION(); > GinInitMetabuffer(MetaBuffer); > MarkBufferDirty(MetaBuffer); > - log_newpage_buffer(MetaBuffer, false); > + log_newpage_buffer(MetaBuffer, false, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > GinInitBuffer(RootBuffer, GIN_LEAF); > MarkBufferDirty(RootBuffer); > - log_newpage_buffer(RootBuffer, false); > + log_newpage_buffer(RootBuffer, false, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > END_CRIT_SECTION(); > > /* Unlock and release the buffers. */ > diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c > index 53bccf6..6a20031 100644 > --- a/src/backend/access/gist/gist.c > +++ b/src/backend/access/gist/gist.c > @@ -84,7 +84,8 @@ gistbuildempty(PG_FUNCTION_ARGS) > START_CRIT_SECTION(); > GISTInitBuffer(buffer, F_LEAF); > MarkBufferDirty(buffer); > - log_newpage_buffer(buffer, true); > + log_newpage_buffer(buffer, true, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > END_CRIT_SECTION(); > I might be missing something here - but isn't it pretty much guaranteed that all these are unlogged relations? Otherwise *buildempty() wouldn't have been called, right? > else if (info == XLOG_BACKUP_END) > { > @@ -178,9 +183,6 @@ xlog_identify(uint8 info) > case XLOG_FPI: > id = "FPI"; > break; > - case XLOG_FPI_FOR_HINT: > - id = "FPI_FOR_HINT"; > - break; > } Really don't want to do that. > @@ -9391,14 +9394,34 @@ xlog_redo(XLogReaderState *record) > * resource manager needs to generate conflicts, it has to define a > * separate WAL record type and redo routine. > * > - * XLOG_FPI_FOR_HINT records are generated when a page needs to be > - * WAL- logged because of a hint bit update. They are only generated > - * when checksums are enabled. 
There is no difference in handling > - * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info > - * code just to distinguish them for statistics purposes. > + * Records flagged with 'for_hint_bits' are generated when a page needs > + * to be WAL- logged because of a hint bit update. They are only > + * generated when checksums are enabled. There is no difference in > + * handling records when this flag is set, it is used for statistics > + * purposes. > + * > + * Records flagged with 'is_flush' indicate that the page immediately > + * needs to be written to disk, not just to shared buffers. This is > + * important if the on-disk state is to be the authoritative, not the > + * state in shared buffers. E.g. because on-disk files may later be > + * copied directly. > */ > if (XLogReadBufferForRedo(record, 0, &buffer) != BLK_RESTORED) > elog(ERROR, "unexpected XLogReadBufferForRedo result when restoring backup block"); > + > + if (xlrec.is_flush) > + { > + RelFileNode rnode; > + ForkNumber forknum; > + BlockNumber blkno; > + SMgrRelation srel; > + > + (void) XLogRecGetBlockTag(record, 0, &rnode, &forknum, &blkno); > + srel = smgropen(rnode, InvalidBackendId); > + smgrwrite(srel, forknum, blkno, BufferGetPage(buffer), false); > + smgrclose(srel); > + } That'd leave the dirty bit set on the buffer... Greetings, Andres Freund
On Tue, Dec 8, 2015 at 10:09 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-04 21:57:54 +0900, Michael Paquier wrote: >> On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <andres@anarazel.de> wrote: >> >> Let's go for XLOG_FPI_FLUSH. >> > >> > I think the other way is a bit better, because we can add new flags >> > without changing the WAL format. >> >> Hm. On the contrary, I think that it would make more sense to have a >> flag as well for FOR_HINT honestly, those are really the same >> operations, and FOR_HINT is just here for statistic purposes. > > I don't think it's all that much the same operation. And WRT statistics > purpose: Being able to easily differentiate FOR_HINT is important for > pg_xlogdump --stats, but not at all for XLOG_FPI_FLUSH. OK. Switched back to using XLOG_FPI_FLUSH. >> [stuff about unlogged relations in code] > > I might be missing something here - but isn't it pretty much guaranteed > that all these are unlogged relations? Otherwise *buildempty() wouldn't > have been called, right? Yep, that's ensured in index.c when calling ambuildempty. Let's flush it unconditionally then in those code paths. >> else if (info == XLOG_BACKUP_END) >> { >> @@ -178,9 +183,6 @@ xlog_identify(uint8 info) >> case XLOG_FPI: >> id = "FPI"; >> break; >> - case XLOG_FPI_FOR_HINT: >> - id = "FPI_FOR_HINT"; >> - break; >> } > > Really don't want to do that. OK, added back. >> @@ -9391,14 +9394,34 @@ xlog_redo(XLogReaderState *record) >> * resource manager needs to generate conflicts, it has to define a >> * separate WAL record type and redo routine. >> * >> - * XLOG_FPI_FOR_HINT records are generated when a page needs to be >> - * WAL- logged because of a hint bit update. They are only generated >> - * when checksums are enabled. There is no difference in handling >> - * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info >> - * code just to distinguish them for statistics purposes. 
>> + * Records flagged with 'for_hint_bits' are generated when a page needs >> + * to be WAL- logged because of a hint bit update. They are only >> + * generated when checksums are enabled. There is no difference in >> + * handling records when this flag is set, it is used for statistics >> + * purposes. >> + * >> + * Records flagged with 'is_flush' indicate that the page immediately >> + * needs to be written to disk, not just to shared buffers. This is >> + * important if the on-disk state is to be the authoritative, not the >> + * state in shared buffers. E.g. because on-disk files may later be >> + * copied directly. >> */ >> if (XLogReadBufferForRedo(record, 0, &buffer) != BLK_RESTORED) >> elog(ERROR, "unexpected XLogReadBufferForRedo result when restoring backup block"); >> + >> + if (xlrec.is_flush) >> + { >> + RelFileNode rnode; >> + ForkNumber forknum; >> + BlockNumber blkno; >> + SMgrRelation srel; >> + >> + (void) XLogRecGetBlockTag(record, 0, &rnode, &forknum, &blkno); >> + srel = smgropen(rnode, InvalidBackendId); >> + smgrwrite(srel, forknum, blkno, BufferGetPage(buffer), false); >> + smgrclose(srel); >> + } > > That'd leave the dirty bit set on the buffer... OK, I have used an equivalent of FlushBuffer, FlushSingleBuffer being a copycat of your previous patch... I am attaching as well a patch for ~9.4, which uses XLOG_HEAP_NEWPAGE instead. -- Michael
Attachment
On Tue, Dec 8, 2015 at 9:57 PM, Andres Freund <andres@anarazel.de> wrote: > For me the attached, preliminary, patch, fixes the problem in master; > previous branches ought to look mostly similar, except the flush moved > to RestoreBackupBlockContents/RestoreBackupBlock. > Does anybody have a better idea? Suitable for the back-branches? Not really, I am afraid. > I'm kinda wondering if it wouldn't have been better to go through shared > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using > copy_file(). For deployment with large shared_buffers settings, wouldn't that be actually more costly than the current way of doing? We would need to go through all the buffers at least once and look for the INIT_FORKNUM present to flush them. -- Michael
On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: > > I'm kinda wondering if it wouldn't have been better to go through shared > > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using > > copy_file(). > > For deployment with large shared_buffers settings, wouldn't that be > actually more costly than the current way of doing? We would need to > go through all the buffers at least once and look for the INIT_FORKNUM > present to flush them. We could just check the file sizes on disk, and then check the contents of all the pages for each file.
On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: >> > I'm kinda wondering if it wouldn't have been better to go through shared >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using >> > copy_file(). >> >> For deployment with large shared_buffers settings, wouldn't that be >> actually more costly than the current way of doing? We would need to >> go through all the buffers at least once and look for the INIT_FORKNUM >> present to flush them. > > We could just check the file sizes on disk, and the check for the > contents of all the pages for each file. By doing it at replay, the flushes are spread across time. And by doing it at the end of recovery, all the flushes would be grouped. Do you think that's fine? -- Michael
On 2015-12-09 19:36:11 +0900, Michael Paquier wrote: > On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: > >> > I'm kinda wondering if it wouldn't have been better to go through shared > >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using > >> > copy_file(). > >> > >> For deployment with large shared_buffers settings, wouldn't that be > >> actually more costly than the current way of doing? We would need to > >> go through all the buffers at least once and look for the INIT_FORKNUM > >> present to flush them. > > > > We could just check the file sizes on disk, and the check for the > > contents of all the pages for each file. > > By doing it at replay, the flushes are spread across time. And by > doing it at the end of recovery, all the flushes would be grouped. Do > you think that's fine? The point is that we'd need no flushes, because the data would come directly from shared buffers...
On Wed, Dec 9, 2015 at 8:04 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-09 19:36:11 +0900, Michael Paquier wrote: >> On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: >> >> > I'm kinda wondering if it wouldn't have been better to go through shared >> >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using >> >> > copy_file(). >> >> >> >> For deployment with large shared_buffers settings, wouldn't that be >> >> actually more costly than the current way of doing? We would need to >> >> go through all the buffers at least once and look for the INIT_FORKNUM >> >> present to flush them. >> > >> > We could just check the file sizes on disk, and the check for the >> > contents of all the pages for each file. >> >> By doing it at replay, the flushes are spread across time. And by >> doing it at the end of recovery, all the flushes would be grouped. Do >> you think that's fine? > > The point is that we'd no flushes, because the data would come directly > from shared buffers... Oh, OK. I didn't read though your lines correctly. So you basically mean that we would look at the init files that are on disk, and check if they are empty. If they are, we simply use XLogReadBufferExtended to fetch the INIT_FORKNUM content and fill in another buffer for the MAIN_FORKNUM. More or less right? -- Michael
On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: > Oh, OK. I didn't read though your lines correctly. So you basically > mean that we would look at the init files that are on disk, and check > if they are empty. If they are, we simply use XLogReadBufferExtended > to fetch the INIT_FORKNUM content and fill in another buffer for the > MAIN_FORKNUM. More or less right? We'd not just do so if they're empty, we'd just generally copy the file via shared buffers, instead of copy_file(). But we'd get the file size from the filesystem (which is fine, we make sure it is correct during replay).
On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: >> Oh, OK. I didn't read though your lines correctly. So you basically >> mean that we would look at the init files that are on disk, and check >> if they are empty. If they are, we simply use XLogReadBufferExtended >> to fetch the INIT_FORKNUM content and fill in another buffer for the >> MAIN_FORKNUM. More or less right? > > We would not just do so if they're empty, we would just generally copy the file > via shared buffers, instead of copy_file(). But we'd get the file size > from the filesystem (which is fine, we make sure it is correct during > replay). So, this suggestion is basically implementing the reverse operation of GetRelationPath() to be able to rebuild a RelFileNode from scratch and then look at the shared buffers needed. Isn't it a bit brittle for back-branches? RelFileNode stuff is available easily through records when replaying individual FPIs, but not really in this code path. -- Michael
On December 10, 2015 5:02:27 AM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote: >On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund <andres@anarazel.de> >wrote: >> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: >>> Oh, OK. I didn't read though your lines correctly. So you basically >>> mean that we would look at the init files that are on disk, and >check >>> if they are empty. If they are, we simply use XLogReadBufferExtended >>> to fetch the INIT_FORKNUM content and fill in another buffer for the >>> MAIN_FORKNUM. More or less right? >> >> We would not just do so if they're empty, we would just generally >copy the file >> via shared buffers, instead of copy_file(). But we'd get the file >size >> from the filesystem (which is fine, we make sure it is correct during >> replay). >So, this suggestion is basically implementing the reverse operation of >GetRelationPath() to be able to rebuild a RelFileNode from scratch and >then look at the shared buffers needed. Isn't it a bit brittle for >back-branches? RelFileNode stuff is available easily through records >when replaying individually FPIs, but not really in this code path. Oh, sorry. I'm definitely not thinking about doing this for the back branches. That was more musing about a way to optimize this. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Thu, Dec 10, 2015 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote: > On December 10, 2015 5:02:27 AM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote: >>On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund <andres@anarazel.de> >>wrote: >>> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: >>>> Oh, OK. I didn't read though your lines correctly. So you basically >>>> mean that we would look at the init files that are on disk, and >>check >>>> if they are empty. If they are, we simply use XLogReadBufferExtended >>>> to fetch the INIT_FORKNUM content and fill in another buffer for the >>>> MAIN_FORKNUM. More or less right? >>> >>> We would not just do so if they're empty, we would just generally >>copy the file >>> via shared buffers, instead of copy_file(). But we'd get the file >>size >>> from the filesystem (which is fine, we make sure it is correct during >>> replay). >> >>So, this suggestion is basically implementing the reverse operation of >>GetRelationPath() to be able to rebuild a RelFileNode from scratch and >>then look at the shared buffers needed. Isn't it a bit brittle for >>back-branches? RelFileNode stuff is available easily through records >>when replaying individually FPIs, but not really in this code path. > > Oh, sorry. In definitely not thinking about doing this for the back branches. That was more musing about a way to optimizethis. OK sure. Yeah that may be something worth investigating. The reverse engineering of GetRelationPath would be interesting perhaps for some utilities. So, do we go for something like the patch you attached in 20151208125716.GS4934@alap3.anarazel.de for master and 9.5, and for ~9.4 we use the one I wrote in CAB7nPqSxErpZJ+BZ-mfopzFZP5pAbiE9jWBUcJy6qaYertt4uw@mail.gmail.com? Note that in both cases the patches are not complete, we need to fix as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages are logged all the time. 
I am also thinking that the patch changing the WAL format of XLOG_FPI is overdoing it: we know that all the INIT_FORKNUM pages are generated for unlogged relations, so it is fine to flush INIT_FORKNUM buffers unconditionally at replay. -- Michael
> So, do we go for something like the patch you attached in > 20151208125716.GS4934@alap3.anarazel.de for master and 9.5, and for > ~9.4 we use the one I wrote in > CAB7nPqSxErpZJ+BZ-mfopzFZP5pAbiE9jWBUcJy6qaYertt4uw@mail.gmail.com? I'm more thinking of using something like my patch for all branches. Why would we want to go for the more complicated approach in the more distant branches? > Note that in both cases the patches are not complete, we need to fix > as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages > are logged all the time. Agreed. It's probably better treated as an entirely different - pretty ugly - bug though. I mean it's not some issue of a race during replay, it's entirely missing WAL logging for SET TABLESPACE of unlogged relations. Andres
On Thu, Dec 10, 2015 at 8:56 PM, Andres Freund <andres@anarazel.de> wrote: >> So, do we go for something like the patch you attached in >> 20151208125716.GS4934@alap3.anarazel.de for master and 9.5, and for >> ~9.4 we use the one I wrote in >> CAB7nPqSxErpZJ+BZ-mfopzFZP5pAbiE9jWBUcJy6qaYertt4uw@mail.gmail.com? > > I'm more thinking of using something like my patch for all branches. Why > would we want to go for the more complicated approach in the more > distant branches? That's not what I meant: I don't wish to do the complicated approach either anymore. I sent two patches in the mail mentioned above: one for master/9.5 that used the approach of changing the WAL format, and a second aimed at 9.4 and older versions that is close to what you sent. It looks like you did not look at the second patch, 20151209_replay_unlogged_94.patch, which does some stuff with XLOG_HEAP_NEWPAGE to fix the issue. >> Note that in both cases the patches are not complete, we need to fix >> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages >> are logged all the time. > > Agreed. It's probably better treated as an entirely different - pretty > ugly - bug though. I mean it's not some issue of a race during replay, > it's entirely missing WAL logging for SET TABLESPACE of unlogged > relations. Okidoki. -- Michael
On Thu, Dec 10, 2015 at 9:07 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Dec 10, 2015 at 8:56 PM, Andres Freund <andres@anarazel.de> wrote: >>> So, do we go for something like the patch you attached in >>> 20151208125716.GS4934@alap3.anarazel.de for master and 9.5, and for >>> ~9.4 we use the one I wrote in >>> CAB7nPqSxErpZJ+BZ-mfopzFZP5pAbiE9jWBUcJy6qaYertt4uw@mail.gmail.com? >> >> I'm more thinking of using something like my patch for all branches. Why >> would we want to go for the more complicated approach in the more >> distant branches? > > That's not what I think it meant: I don't wish to do the complicated > approach either anymore. I sent two patches on the mail mentioned > above: one for master/9.5 that used the approach of changing WAL, and > a second aimed for 9.4 and old versions that is close to what you > sent. It looks that you did not look at the second patch, named > 20151209_replay_unlogged_94.patch that does some stuff with > XLOG_HEAP_NEWPAGE to fix the issue. > >>> Note that in both cases the patches are not complete, we need to fix >>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages >>> are logged all the time. >> >> Agreed. It's probably better treated as an entirely different - pretty >> ugly - bug though. I mean it's not some issue of a race during replay, >> it's entirely missing WAL logging for SET TABLESPACE of unlogged >> relations. > > Okidoki. In short: should I send patches for all those things or are you on it? It seems that we are on the same page: using the simple approach, with XLOG_FPI that enforces the flushes for 9.5/master and XLOG_HEAP_NEWPAGE that does the same for ~9.4. For the second issue, I would just need to extract the fix from one of the patches upthread. -- Michael
On 2015-12-10 21:10:57 +0900, Michael Paquier wrote: > In short: should I send patches for all those things or are you on it? I'm on it. I don't think the new name you gave the function, and the new comment, are really an improvement. We already have 'SyncOneBuffer' (unusable for our purpose unfortunately), so going for Single here doesn't seem like a good idea. Andres
On Thu, Dec 10, 2015 at 9:13 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-10 21:10:57 +0900, Michael Paquier wrote: >> In short: should I send patches for all those things or are you on it? > > I'm on it. I don't think the new name you gave the function, and the new > comment, are really an improvement. We already have 'SyncOneBuffer' > (unusable for our purpose unfortunately), so going for Single here > doesn't seem like a good idea. OK, no problem. Thanks. -- Michael
Hi Michael, Robert, On 2015-12-10 21:10:57 +0900, Michael Paquier wrote: > On Thu, Dec 10, 2015 at 9:07 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > >>> Note that in both cases the patches are not complete, we need to fix > >>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages > >>> are logged all the time. > >> > >> Agreed. It's probably better treated as an entirely different - pretty > >> ugly - bug though. I mean it's not some issue of a race during replay, > >> it's entirely missing WAL logging for SET TABLESPACE of unlogged > >> relations. > > For the second issue, I would just need to extract the fix from one of > the patches upthread. I've pushed the fix for the promotion-related issue. I'm afraid that the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger than that, though. Robert, to catch you up: The issue, discovered by Michael somewhere in this thread, is that copy_relation_data(), used by SET TABLESPACE, does not deal with unlogged tables. Michael, what your earlier patch did was basically @@ -9892,9 +9892,14 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, /* * We need to log the copied data in WAL iff WAL archiving/streaming is - * enabled AND it's a permanent relation. - */ - use_wal = XLogIsNeeded() && relpersistence == RELPERSISTENCE_PERMANENT; + * enabled AND it's a permanent relation. Unlogged relations need to have + * their init fork logged as well, to ensure a consistent on-disk state + * when reset at the end of recovery. + */ + use_wal = XLogIsNeeded() && + (relpersistence == RELPERSISTENCE_PERMANENT || + (relpersistence == RELPERSISTENCE_UNLOGGED && + forkNum == INIT_FORKNUM)); ... nblocks = smgrnblocks(src, forkNum); for (blkno = 0; blkno < nblocks; blkno++) { ...
if (use_wal) log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page); But that's not sufficient for a bunch of reasons: First, and that's the big one, that approach doesn't guarantee that the file will be created in the new tablespace if the file does not have any blocks. Like e.g. the heap relation itself. This will lead to errors like: ERROR: 58P01: could not open file "pg_tblspc/16384/PG_9.6_201511071/12384/16438": No such file or LOCATION: mdopen, md.c:602 The real problem there imo isn't that the copy_relation_data() doesn't deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a unlogged table specific codepath like heap_create_with_catalog() has. A second problem is that the smgrimmedsync() in copy_relation_data() isn't called for the init fork of unlogged relations, even if it needs to. As ATExecSetTableSpace() is also used for indices, the problem exists there as well. Yuck. Greetings, Andres Freund
On Mon, Nov 30, 2015 at 9:53 PM, Michael Paquier <michael.paquier@gmail.com> wrote: >> I feel quite uncomfortable that it solves the problem from a kind >> of nature of unlogged object by arbitrary flagging which is not >> fully corresponds to the nature. If we can deduce the necessity >> of fsync from some nature, it would be preferable. > > INIT_FORKNUM is not something only related to unlogged relations, > indexes use them as well. Eh, what? Indexes use them if they are indexes on unlogged tables, but they'd better not use them in any other situation. Otherwise bad things are going to happen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund <andres@anarazel.de> wrote: > I've, pushed the fix for the promotion related issue. I'm afraid that > the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger > than it though. I think that I would have preferred to fix this by flushing unlogged buffers in bulk at the end of recovery, rather than by flushing them as they are generated. This approach seems needlessly inefficient. Maybe it doesn't matter, but you made the same sort of decision with the find_multixact_start() fix: flushing sooner instead of working harder to make the delayed flush safe. I think that's a bad direction in general, and that it will bite us. Nonetheless, I've been unresponsive on this thread, and it's certainly better to have fixed the bug in a way that isn't what I would prefer than to have not fixed it. > Robert, to catch you up: The isssue, discovered by Michael somewhere in > this thread, is that copy_relation_data(), used by SET TABLESPACE, does > not deal with unlogged tables. Hmm. [ ... ] > But that's not sufficient for a bunch of reasons: > > First, and that's the big one, that approach doesn't guarantee that the > file will be created in the new tablespace if the file does not have any > blocks. Like e.g. the heap relation itself. This will lead to errors > like: > ERROR: 58P01: could not open file "pg_tblspc/16384/PG_9.6_201511071/12384/16438": No such file or > LOCATION: mdopen, md.c:602 > > The real problem there imo isn't that the copy_relation_data() doesn't > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a > unlogged table specific codepath like heap_create_with_catalog() has. It looks to me like somewhere we need to do log_smgrcreate(..., INIT_FORKNUM) in the unlogged table case. RelationCreateStorage() skips this for the main forknum of an unlogged table, which seems OK, but there's nothing that even attempts it for the init fork, which does not seem OK. 
I guess that logic should happen in ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). > A second problem is that the smgrimmedsync() in copy_relation_data() > isn't called for the init fork of unlogged relations, even if it needs > to. That seems easy enough to fix. Can't we just sync all forks at that location for permanent relations, and additionally sync the init fork only for unlogged relations? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-12-10 12:19:12 -0500, Robert Haas wrote: > On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund <andres@anarazel.de> wrote: > > I've, pushed the fix for the promotion related issue. I'm afraid that > > the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger > > than it though. > > I think that I would have preferred to fix this by flushing unlogged > buffers in bulk at the end of recovery, rather than by flushing them > as they are generated. This approach seems needlessly inefficient. We touched on that somewhere in the thread - having to scan all of shared buffers isn't free either, and it'd mean that promotion would take longer because it'd do all the writes at once. As we don't fsync them during the flush itself, and as init forks don't ever get rewritten, I don't think it makes any actual difference? The total number of writes to the OS is the same, no? > Maybe it doesn't matter, but you made the same sort of decision with > the find_multixact_start() fix: flushing sooner instead of working > harder to make the delayed flush safe. I think that's a bad direction > in general, and that it will bite us. Working harder often means added complexity, and complexity for things executed relatively infrequently has the tendency to bite. I don't think that's a simple tradeoff. E.g. in the find_multixact_start() case the flushing isn't just about "itself", it's also about interactions with SlruScanDirectory. Requiring that to also check in-memory buffers wouldn't be entirely trivial... And all that for an option that's essentially executed infrequently. > > The real problem there imo isn't that the copy_relation_data() doesn't > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a > > unlogged table specific codepath like heap_create_with_catalog() has. > > It looks to me like somewhere we need to do log_smgrcreate(..., > INIT_FORKNUM) in the unlogged table case. Yes. 
> RelationCreateStorage() > skips this for the main forknum of an unlogged table, which seems OK, > but there's nothing that even attempts it for the init fork, which > does not seem OK. We unfortunately can't trivially delegate that work to RelationCreateStorage(). E.g. heap_create() documents that only the main fork is created :( > I guess that logic should happen in > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). Looks like it's the easiest place. It sounds worthwhile to check that other locations rewriting tables, e.g. cluster/vacuum full/reindex are safe. > > A second problem is that the smgrimmedsync() in copy_relation_data() > > isn't called for the init fork of unlogged relations, even if it needs > > to. > > That seems easy enough to fix. Can't we just sync all forks at that > location for permanent relations, and additionally sync the init fork > only for unlogged relations? Yes, it's not hard, I just wanted to mention it so it doesn't get forgotten. - Andres
On 2015-12-10 18:36:32 +0100, Andres Freund wrote: > On 2015-12-10 12:19:12 -0500, Robert Haas wrote: > > On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund <andres@anarazel.de> wrote: > > > I've pushed the fix for the promotion related issue. I'm afraid that > > > the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger > > > than I thought. > > > > I think that I would have preferred to fix this by flushing unlogged > > buffers in bulk at the end of recovery, rather than by flushing them > > as they are generated. This approach seems needlessly inefficient. > > We touched on that somewhere in the thread - having to scan all of > shared buffers isn't free either, and it'd mean that promotion would > take longer because it'd do all the writes at once. As we don't fsync > them during the flush itself, and as init forks don't ever get > rewritten, I don't think it makes any actual difference? The total > number of writes to the OS is the same, no? Oh, and the primary, in most places, immediately does an smgrimmedsync() after creating the init fork anyway. So in comparison to that a plain smgrwrite() during replay is nearly free.
Hi, On 2015-12-10 18:36:32 +0100, Andres Freund wrote: > On 2015-12-10 12:19:12 -0500, Robert Haas wrote: > > > The real problem there imo isn't that the copy_relation_data() doesn't > > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a > > > unlogged table specific codepath like heap_create_with_catalog() has. > > > > It looks to me like somewhere we need to do log_smgrcreate(..., > > INIT_FORKNUM) in the unlogged table case. > > Yes. > > > RelationCreateStorage() > > skips this for the main forknum of an unlogged table, which seems OK, > > but there's nothing that even attempts it for the init fork, which > > does not seem OK. > > We unfortunately can't trivially delegate that work to > RelationCreateStorage(). E.g. heap_create() documents that only the main > fork is created :( > > > I guess that logic should happen in > > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). > > Looks like it's the easiest place. > > > A second problem is that the smgrimmedsync() in copy_relation_data() > > > isn't called for the init fork of unlogged relations, even if it needs > > > to. Here's a patch doing that. It's not yet fully polished, but I wanted to get it out, because I noticed one thing: In ATExecSetTableSpace(), for !main forks, we currently call smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT relations. That seems a bit odd to me. It currently seems to be without further consequence because, if there's actual data in the fork, we'll just create the relation in _mdfd_getseg(); or we can cope with the relation not being there. But to me that feels wrong. It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT, not just INIT_FORKNUM. What do you guys think? > It sounds worthwhile to check that other locations rewriting tables, > e.g. cluster/vacuum full/reindex are safe. Seems to be ok, on a first glance. Andres
Attachment
On Fri, Dec 11, 2015 at 1:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Nov 30, 2015 at 9:53 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >>> I feel quite uncomfortable that it solves the problem from a kind >>> of nature of unlogged object by arbitrary flagging which is not >>> fully corresponds to the nature. If we can deduce the necessity >>> of fsync from some nature, it would be preferable. >> >> INIT_FORKNUM is not something only related to unlogged relations, >> indexes use them as well. > > Eh, what? > > Indexes use them if they are indexes on unlogged tables, but they'd > better not use them in any other situation. Otherwise bad things are > going to happen. Yes, this was badly formulated, caused by my lack of knowledge of unlogged tables; I think I got it now :) Why don't we actually put some asserts in those code paths to state that INIT_FORKNUM-specific code can only be used for unlogged relations? Just a thought... -- Michael
Hello, At Thu, 10 Dec 2015 20:27:01 +0100, Andres Freund <andres@anarazel.de> wrote in <20151210192701.GC11331@alap3.anarazel.de> > > > > A second problem is that the smgrimmedsync() in copy_relation_data() > > > > isn't called for the init fork of unlogged relations, even if it needs > > > > to. > > Here's a patch doing that. It's not yet fully polished, but I wanted to > get it out, because I noticed one thing: > > In ATExecSetTableSpace(), for !main forks, we currently call > smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT > relations. That seems a bit odd to me. It currently seems to be without > further consequence because, if there's actual data in the fork, we'll > just create the relation in _mdfd_getseg(); or we can cope with the > relation not being there. But to me that feels wrong. FWIW I agree with that. > It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT, > not just INIT_FORKNUM. What do you guys think? What it is doing seems reasonable to me, but copying_initfork doesn't seem to be necessary. Triggering both log_newpage() and smgrimmedsync() based on use_wal, with its value computed to take INIT_FORKNUM into account, would be more descriptive (more readable, in other words). > > It sounds worthwhile to check that other locations rewriting tables, > > e.g. cluster/vacuum full/reindex are safe. > > Seems to be ok, on a first glance. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Dec 11, 2015 at 1:32 AM, Andres Freund <andres@anarazel.de> wrote: > I've pushed the fix for the promotion related issue. Thanks! It is great to see this issue addressed. -- Michael
On Fri, Dec 11, 2015 at 4:27 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2015-12-10 18:36:32 +0100, Andres Freund wrote: >> On 2015-12-10 12:19:12 -0500, Robert Haas wrote: >> > > The real problem there imo isn't that the copy_relation_data() doesn't >> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a >> > > unlogged table specific codepath like heap_create_with_catalog() has. >> > >> > It looks to me like somewhere we need to do log_smgrcreate(..., >> > INIT_FORKNUM) in the unlogged table case. >> >> Yes. >> >> > RelationCreateStorage() >> > skips this for the main forknum of an unlogged table, which seems OK, >> > but there's nothing that even attempts it for the init fork, which >> > does not seem OK. >> >> We unfortunately can't trivially delegate that work to >> RelationCreateStorage(). E.g. heap_create() documents that only the main >> fork is created :( >> >> > I guess that logic should happen in >> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). >> >> Looks like it's the easiest place. > >> > > A second problem is that the smgrimmedsync() in copy_relation_data() >> > > isn't called for the init fork of unlogged relations, even if it needs >> > > to. > > Here's a patch doing that. It's not yet fully polished, but I wanted to > get it out, because I noticed one thing: > > In ATExecSetTableSpace(), for !main forks, we currently call > smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT > relations. That seems a bit odd to me. It currently seems to be without > further consequence because, if there's actual data in the fork, we'll > just create the relation in _mdfd_getseg(); or we can cope with the > relation not being there. But to me that feels wrong. > > It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT, > not just INIT_FORKNUM. What do you guys think? This fixes the problem in my environment. 
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+     (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+      forkNum == INIT_FORKNUM))
+     log_smgrcreate(&newrnode, forkNum);
There should be an XLogIsNeeded() check as well. Removing the check on RELPERSISTENCE_UNLOGGED is fine as well... Not mandatory though :)
+ * The init fork for an unlogged relation in many respects has to be
+ * treated the same as normal relation, changes need to be WAL logged and
+ * it needs to be synced to disk.
+ */
+ copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+     forkNum == INIT_FORKNUM;
Here as well just a check on INIT_FORKNUM would be fine. >> It sounds worthwhile to check that other locations rewriting tables, >> e.g. cluster/vacuum full/reindex are safe. > > Seems to be ok, on a first glance. Yeah. REINDEX relies on index_build to recreate what it should... The others are looking fine as well. I have tested it in any case, and the files produced are consistent on the standby and its master. -- Michael
On 2015-12-11 15:43:24 +0900, Kyotaro HORIGUCHI wrote: > What it is doing seems reasonable to me, but copying_initfork doesn't > seem to be necessary. Triggering both log_newpage() and smgrimmedsync() > based on use_wal, with its value computed to take INIT_FORKNUM into > account, would be more descriptive (more readable, in other words). The smgrimmedsync() has a different condition: it doesn't, and may not, check for XLogIsNeeded(). Andres
On Fri, Dec 11, 2015 at 4:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Dec 11, 2015 at 4:27 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2015-12-10 18:36:32 +0100, Andres Freund wrote: >>> On 2015-12-10 12:19:12 -0500, Robert Haas wrote: >>> > > The real problem there imo isn't that the copy_relation_data() doesn't >>> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a >>> > > unlogged table specific codepath like heap_create_with_catalog() has. >>> > >>> > It looks to me like somewhere we need to do log_smgrcreate(..., >>> > INIT_FORKNUM) in the unlogged table case. >>> >>> Yes. >>> >>> > RelationCreateStorage() >>> > skips this for the main forknum of an unlogged table, which seems OK, >>> > but there's nothing that even attempts it for the init fork, which >>> > does not seem OK. >>> >>> We unfortunately can't trivially delegate that work to >>> RelationCreateStorage(). E.g. heap_create() documents that only the main >>> fork is created :( >>> >>> > I guess that logic should happen in >>> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). >>> >>> Looks like it's the easiest place. >> >>> > > A second problem is that the smgrimmedsync() in copy_relation_data() >>> > > isn't called for the init fork of unlogged relations, even if it needs >>> > > to. >> >> Here's a patch doing that. It's not yet fully polished, but I wanted to >> get it out, because I noticed one thing: >> >> In ATExecSetTableSpace(), for !main forks, we currently call >> smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT >> relations. That seems a bit odd to me. It currently seems to be without >> further consequence because, if there's actual data in the fork, we'll >> just create the relation in _mdfd_getseg(); or we can cope with the >> relation not being there. But to me that feels wrong. >> >> It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT, >> not just INIT_FORKNUM. 
>> What do you guys think?
>
> This fixes the problem in my environment.
>
> + if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
> +     (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> +      forkNum == INIT_FORKNUM))
> +     log_smgrcreate(&newrnode, forkNum);
> There should be an XLogIsNeeded() check as well. Removing the check on
> RELPERSISTENCE_UNLOGGED is fine as well... Not mandatory though :)
>
> + * The init fork for an unlogged relation in many respects has to be
> + * treated the same as normal relation, changes need to be WAL logged and
> + * it needs to be synced to disk.
> + */
> + copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
> +     forkNum == INIT_FORKNUM;
> Here as well just a check on INIT_FORKNUM would be fine.
Should we consider this bug a 9.5 blocker? I feel uneasy about releasing a new major version while knowing it has bugs, and this one is not cool, so I have added it to the list of open items. We know the problem and there is a patch, so this is definitely solvable. -- Michael
On 2015-12-12 20:49:52 +0900, Michael Paquier wrote: > Should we consider this bug a 9.5 blocker? I don't think so. But either way, I'm right now working on getting it fixed anyway.
On 2015-12-11 16:54:45 +0900, Michael Paquier wrote:
> + if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
> +     (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> +      forkNum == INIT_FORKNUM))
> +     log_smgrcreate(&newrnode, forkNum);
> There should be an XLogIsNeeded() check as well.
RelationCreateStorage(), which creates the main fork, has no such check. Seems better to stay symmetric, even if it's not strictly necessary.
> Removing the check on RELPERSISTENCE_UNLOGGED is fine as well... Not mandatory though :)
I've gone back and forth on this; I can't really convince myself either way. We might end up reusing init forks for 'global temporary tables' or somesuch, so being a bit stricter seems slightly better. Anyway, it's of no actual consequence for now. Andres
On 2015-12-12 12:52:21 +0100, Andres Freund wrote: > On 2015-12-12 20:49:52 +0900, Michael Paquier wrote: > > Should we consider this bug a 9.5 blocker? > > I don't think so. But either way, I'm right now working on getting it > fixed anyway. And done. Took a bit longer than predicted - I had to adjust my test scripts to cope with < 9.4 not having tablespace mapping. Rather annoying. It'd be really convenient if tablespaces relative to the main data directory were supported...
Michael Paquier <michael.paquier@gmail.com> writes: > Should we consider this bug a 9.5 blocker? I feel uneasy with the fact > of releasing a new major version knowing that we know some bugs on it, > and this one is not cool so I have added it in the list of open items. > We know the problem and there is a patch, so this is definitely > solvable. FWIW, general project policy has been that pre-existing bugs are not release blockers as such. However, if we have a fix in hand that we would like to get some field testing on, it's certainly desirable to push it out first in a beta or RC release, rather than having it first hit the field in stable-branch updates. So we might delay a beta/RC to make sure such a fix is available for testing. But that's not quite the same thing as a release blocker. To my mind, a release blocker is something that would make it undesirable for users to upgrade to the new release compared to the previous branch, so bugs the branches have in common are never that. regards, tom lane
On Sat, Dec 12, 2015 at 10:31 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-12 12:52:21 +0100, Andres Freund wrote: >> On 2015-12-12 20:49:52 +0900, Michael Paquier wrote: >> > Should we consider this bug a 9.5 blocker? >> >> I don't think so. But either way, I'm right now working on getting it >> fixed anyway. > > And done. Took a bit longer than predicted - I had to adjust my test > scripts to cope with < 9.4 not having tablespace mapping. Rather > annoying. Thanks! Cool to see this thread finally addressed. > It'd be really convenient if tablespaces relative to the main data > directory were supported... I got bitten at some point as well by the fact that tablespaces created after a base backup is taken have their location map to the same point for a master and a standby :) -- Michael