Thread: Error with index on unlogged table
Hi,
I was attempting to set up a data set to test pg_rewind when I encountered an error. I created a primary and standby, then:

# create table test (id serial primary key, thing text);
CREATE TABLE
# create unlogged table utest (id serial primary key, thing text);
CREATE TABLE
# insert into test (thing) values ('jifj');
INSERT 0 1
# insert into utest (thing) values ('jifj');
INSERT 0 1
$ pg_ctl -D standby1 promote
$ psql -p 5531 postgres # standby
# insert into test (thing) values ('moomoo');
INSERT 0 1
# insert into utest (thing) values ('moomoo');
ERROR: index "utest_pkey" contains unexpected zero page at block 0
HINT: Please REINDEX it.
This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
--
Thom
On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown <thom@linux.com> wrote:
> I was attempting to set up a data set to test pg_rewind, when I encountered
> an error. I created a primary and standby, then:
>
> [...]
>
> # insert into utest (thing) values ('moomoo');
> ERROR: index "utest_pkey" contains unexpected zero page at block 0
> HINT: Please REINDEX it.
>
> This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.

Unlogged tables are not in WAL, and cannot be accessed while in recovery, so having an empty index relation is expected on a promoted standby IMO. Now perhaps we could have a friendlier error message in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an additional HINT mentioning unlogged tables, but I am not sure that is worth much. Mentioning this behavior in the docs would be good instead.
--
Michael
On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
>On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown <thom@linux.com> wrote:
>> I was attempting to set up a data set to test pg_rewind, when I
>> encountered an error. I created a primary and standby, then:
>>
>> [...]
>>
>> # insert into utest (thing) values ('moomoo');
>> ERROR: index "utest_pkey" contains unexpected zero page at block 0
>> HINT: Please REINDEX it.
>>
>> This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
>
>Unlogged tables are not in WAL, and cannot be accessed while in
>recovery, so having an empty index relation is expected on a promoted
>standby IMO. Now perhaps we could have a more friendly error message
>in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an
>additional HINT to mention unlogged tables, but I am not sure that
>this is much worth it. Mentioning this behavior in the docs would be
>good instead.

I think Thom's point is that he promoted the node...

Thom, are you sure this was transient?

Andres
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
On 24 March 2015 at 11:37, Andres Freund <andres@anarazel.de> wrote:
> On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
>> Unlogged tables are not in WAL, and cannot be accessed while in
>> recovery, so having an empty index relation is expected on a promoted
>> standby IMO. [...]
>
> I think Thom's point is that he promoted the node...
>
> Thom, are you sure this was transient?

The index is unlogged until reindexing...

# select oid, relname, relpersistence from pg_class where relname in ('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
-------+------------+----------------
 16387 | test       | p
 16394 | test_pkey  | p
 16398 | utest      | u
 16405 | utest_pkey | u
(4 rows)

# reindex index utest_pkey;
REINDEX

# select oid, relname, relpersistence from pg_class where relname in ('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
-------+------------+----------------
 16387 | test       | p
 16394 | test_pkey  | p
 16398 | utest      | u
 16405 | utest_pkey | p
(4 rows)

Which I think also raises the question: why are unlogged indexes made persistent by a reindex?

--
Thom
On 24 March 2015 at 11:46, Thom Brown <thom@linux.com> wrote:
On 24 March 2015 at 11:37, Andres Freund <andres@anarazel.de> wrote:
> On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Mar 24, 2015 at 5:53 PM, Thom Brown <thom@linux.com> wrote:
>>> I was attempting to set up a data set to test pg_rewind, when I
>>> encountered an error. I created a primary and standby, then:
>>>
>>> [...]
>>>
>>> # insert into utest (thing) values ('moomoo');
>>> ERROR: index "utest_pkey" contains unexpected zero page at block 0
>>> HINT: Please REINDEX it.
>>>
>>> This is built on commit e5f455f59fed0632371cddacddd79895b148dc07.
>>
>> Unlogged tables are not in WAL, and cannot be accessed while in
>> recovery, so having an empty index relation is expected on a promoted
>> standby IMO. Now perhaps we could have a more friendly error message
>> in _bt_checkpage(), _hash_checkpage() and gistcheckpage() with an
>> additional HINT to mention unlogged tables, but I am not sure that
>> this is much worth it. Mentioning this behavior in the docs would be
>> good instead.
>
> I think Thom's point is that he promoted the node...
>
> Thom, are you sure this was transient?

The index is unlogged until reindexing...
# select oid, relname, relpersistence from pg_class where relname in ('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
-------+------------+----------------
 16387 | test       | p
 16394 | test_pkey  | p
 16398 | utest      | u
 16405 | utest_pkey | u
(4 rows)
# reindex index utest_pkey;
REINDEX
# select oid, relname, relpersistence from pg_class where relname in ('test','test_pkey','utest','utest_pkey');
  oid  |  relname   | relpersistence
-------+------------+----------------
 16387 | test       | p
 16394 | test_pkey  | p
 16398 | utest      | u
 16405 | utest_pkey | p
(4 rows)

Which I think also raises the question: why are unlogged indexes made persistent by a reindex?
I should also mention that it becomes unlogged again when running VACUUM FULL or CLUSTER on the table.
--
Thom
On Tue, Mar 24, 2015 at 8:37 PM, Andres Freund <andres@anarazel.de> wrote:
> On March 24, 2015 12:35:28 PM GMT+01:00, Michael Paquier wrote:
> I think Thom's point is that he promoted the node...
>
> Thom, are you sure this was transient?

Well, I got his point :) I was just thinking that this error message is legit, as ResetUnloggedRelationsInDbspaceDir reinitializes unlogged relations and their indexes at the end of recovery.
--
Michael
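[Editorial note: the end-of-recovery reset Michael refers to can be illustrated with a standalone toy model. This is a sketch only; names like `toy_reset_unlogged` and `toy_page_is_new` are hypothetical stand-ins for PostgreSQL's reinit.c logic and PageIsNew(), and the 8-byte "page" is purely illustrative.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLCKSZ 8            /* toy page size, not PostgreSQL's BLCKSZ */

typedef struct { unsigned char data[TOY_BLCKSZ]; } ToyPage;

/* all-zero page, in the spirit of PageIsNew() */
static bool toy_page_is_new(const ToyPage *p)
{
    for (int i = 0; i < TOY_BLCKSZ; i++)
        if (p->data[i] != 0)
            return false;
    return true;
}

/* end-of-recovery reset: copy the on-disk init fork over the main fork,
 * as the reset routines do for unlogged relations */
static void toy_reset_unlogged(ToyPage *main_fork, const ToyPage *init_fork_on_disk)
{
    memcpy(main_fork->data, init_fork_on_disk->data, TOY_BLCKSZ);
}
```

If the init fork made it to disk, the reset produces a usable empty index; if it never did (the issue debated later in this thread), the main fork ends up all zeroes, which is exactly the "unexpected zero page" complaint.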
On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
> The index is unlogged until reindexing...
>
> [...]
> Which I think also raises the question: why are unlogged indexes made
> persistent by a reindex?

That's a bug in HEAD; ~9.4 keeps the index unlogged even after REINDEX INDEX. What happens is that ReindexIndex relies on the relpersistence provided by makeRangeVar at parse time, which is incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch attached fixes that...
--
Michael
Attachment
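[Editorial note: a minimal standalone sketch of the bug Michael describes. The names mirror PostgreSQL's (RangeVar, makeRangeVar, relpersistence), but the bodies are hypothetical toy stand-ins, not the real parser or REINDEX code.]

```c
#include <assert.h>

#define RELPERSISTENCE_PERMANENT 'p'
#define RELPERSISTENCE_UNLOGGED  'u'

typedef struct ToyRangeVar
{
    const char *relname;
    char        relpersistence;     /* filled in at parse time */
} ToyRangeVar;

/* parse-time constructor: knows nothing about the catalog, so it can
 * only guess, and it always guesses "permanent" */
static ToyRangeVar toy_makeRangeVar(const char *relname)
{
    ToyRangeVar rv = { relname, RELPERSISTENCE_PERMANENT };
    return rv;
}

/* stand-in for looking up pg_class.relpersistence of the real relation */
static char toy_catalog_persistence(const char *relname)
{
    (void) relname;
    return RELPERSISTENCE_UNLOGGED;     /* utest_pkey is unlogged */
}

/* buggy path: trust the parse-time value when rebuilding the index */
static char toy_reindex_buggy(ToyRangeVar rv)
{
    return rv.relpersistence;
}

/* fixed path: take the persistence from the relation actually opened */
static char toy_reindex_fixed(ToyRangeVar rv)
{
    return toy_catalog_persistence(rv.relname);
}
```

The fix described in the thread amounts to switching from the first path to the second: use the opened relation's catalog persistence, never the parser's default.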
On Wednesday, March 25, 2015, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
>> The index is unlogged until reindexing...
>>
>> [...]
>> Which I think also raises the question: why are unlogged indexes made
>> persistent by a reindex?
>
> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
> REINDEX INDEX. What happens is that ReindexIndex relies on
> the relpersistence provided by makeRangeVar at parse time, which is
> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
> attached fixes that...

How about VACUUM FULL and CLUSTER, as the problem seems to have been reported to be there too?

Amit
On 25 March 2015 at 12:22, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wednesday, March 25, 2015, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
>>> The index is unlogged until reindexing...
>>>
>>> [...]
>>> Which I think also raises the question: why are unlogged indexes made
>>> persistent by a reindex?
>>
>> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
>> REINDEX INDEX. What happens is that ReindexIndex relies on
>> the relpersistence provided by makeRangeVar at parse time, which is
>> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
>> attached fixes that...
>
> How about VACUUM FULL and CLUSTER, as the problem seems to have been
> reported to be there too?
No, those are okay. They actually revert the index back to the same persistence level as the table they're attached to.
--
Thom
On Wednesday, March 25, 2015, Thom Brown <thom@linux.com> wrote:
> On 25 March 2015 at 12:22, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Wednesday, March 25, 2015, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
>>> REINDEX INDEX. What happens is that ReindexIndex relies on
>>> the relpersistence provided by makeRangeVar at parse time, which is
>>> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
>>> attached fixes that...
>>
>> How about VACUUM FULL and CLUSTER, as the problem seems to have been
>> reported to be there too?
>
> No, those are okay. They actually revert the index back to the same
> persistence level as the table they're attached to.
Ah, I misread then; sorry about the noise.
Amit
Hi,

On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
> On Tue, Mar 24, 2015 at 8:46 PM, Thom Brown wrote:
>> The index is unlogged until reindexing...
>>
>> [...]
>> Which I think also raises the question: why are unlogged indexes made
>> persistent by a reindex?
>
> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
> REINDEX INDEX. What happens is that ReindexIndex relies on
> the relpersistence provided by makeRangeVar at parse time, which is
> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
> attached fixes that...

What the hell? That's somewhat nasty. Nice that it got caught before 9.5 was released.

Did you check whether a similar bug was made in other places of 85b506bb? Could you additionally add a regression test to this end? Seems like something worth testing.

Greetings,

Andres Freund
On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
>> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
>> REINDEX INDEX. [...]
>
> What the hell? That's somewhat nasty. Nice that it got caught before 9.5
> was released.

Unfortunately this is very nasty. Sorry!

> Did you check whether a similar bug was made in other places of
> 85b506bb? Could you additionally add a regression test to this end?
> Seems like something worth testing.

I'm checking it and adding some regression tests.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello <fabriziomello@gmail.com> wrote:
>
>
> On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
>
> > Did you check whether a similar bug was made in other places of
> > 85b506bb? Could you additionally add a regression test to this end?
> > Seems like something worth testing.
> >
>
> I'm checking it and adding some regression tests.
>
I didn't find any other similar bug introduced by 85b506bb.
Attached the original patch provided by Michael with some regression tests.
Regards,
--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
Attachment
On Wed, Mar 25, 2015 at 10:53 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2015-03-25 11:38:30 +0900, Michael Paquier wrote:
>> That's a bug in HEAD; ~9.4 keeps the index unlogged even after
>> REINDEX INDEX. What happens is that ReindexIndex relies on
>> the relpersistence provided by makeRangeVar at parse time, which is
>> incorrect as it uses RELPERSISTENCE_PERMANENT all the time. The patch
>> attached fixes that...
>
> What the hell? That's somewhat nasty. Nice that it got caught before 9.5
> was released.
>
> Did you check whether a similar bug was made in other places of
> 85b506bb?

Yeah, I took a look at the other code paths, particularly CLUSTER and matviews, and the relpersistence used is taken directly from a Relation.

> Could you additionally add a regression test to this end?
> Seems like something worth testing.

Definitely. And I guess that Fabrizio already did that...
--
Michael
On Thu, Mar 26, 2015 at 1:02 AM, Fabrízio de Royes Mello <fabriziomello@gmail.com> wrote:
> On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello
> <fabriziomello@gmail.com> wrote:
>> I'm checking it and adding some regression tests.
>
> I didn't find any other similar bug introduced by 85b506bb.
>
> Attached the original patch provided by Michael with some regression tests.

Thanks for adding a test; this looks fine to me (I did some sanity checks and tutti quanti, for people wondering). On temporary tables this was failing with an error in md.c...
--
Michael
On 26 March 2015 at 00:55, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Mar 26, 2015 at 1:02 AM, Fabrízio de Royes Mello
> <fabriziomello@gmail.com> wrote:
>> On Wed, Mar 25, 2015 at 12:46 PM, Fabrízio de Royes Mello
>> <fabriziomello@gmail.com> wrote:
>>> On Wed, Mar 25, 2015 at 10:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>>> Did you check whether a similar bug was made in other places of
>>>> 85b506bb? Could you additionally add a regression test to this end?
>>>> Seems like something worth testing.
>>>
>>> I'm checking it and adding some regression tests.
>>
>> I didn't find any other similar bug introduced by 85b506bb.
>>
>> Attached the original patch provided by Michael with some regression tests.
>
> Thanks for adding a test; this looks fine to me (I did some sanity
> checks and tutti quanti, for people wondering). On temporary tables
> this was failing with an error in md.c...
Thanks to both of you for fixing this.
I still, however, have a problem with the separate and original issue of:
# insert into utest (thing) values ('moomoo');
ERROR: index "utest_pkey" contains unexpected zero page at block 0
HINT: Please REINDEX it.
I don't see why the user should need to go re-indexing all unlogged tables each time a standby is promoted. The index should just be empty and ready to use.
--
Thom
On 2015-03-26 13:55:22 +0000, Thom Brown wrote:
> I still, however, have a problem with the separate and original issue of:
>
> # insert into utest (thing) values ('moomoo');
> ERROR: index "utest_pkey" contains unexpected zero page at block 0
> HINT: Please REINDEX it.
>
> I don't see why the user should need to go re-indexing all unlogged tables
> each time a standby is promoted. The index should just be empty and ready
> to use.

There's definitely something rather broken here. Investigating.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2015-03-26 15:13:41 +0100, Andres Freund wrote:
> On 2015-03-26 13:55:22 +0000, Thom Brown wrote:
>> I still, however, have a problem with the separate and original issue of:
>>
>> # insert into utest (thing) values ('moomoo');
>> ERROR: index "utest_pkey" contains unexpected zero page at block 0
>> HINT: Please REINDEX it.
>>
>> I don't see why the user should need to go re-indexing all unlogged tables
>> each time a standby is promoted. The index should just be empty and ready
>> to use.
>
> There's definitely something rather broken here. Investigating.

As far as I can see this has been broken at least since the introduction of fast promotion. WAL replay will update the init fork in shared memory, but it'll not be guaranteed to be flushed to disk when the reset happens. d3586fc8a et al. then also made it possible to hit the issue without fast promotion. To hit the issue there may not be a restartpoint (requiring a checkpoint on the primary) since the creation of the unlogged table.

I think the problem here is that the *primary* makes no such assumptions. Init forks are logged via stuff like:

    smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
              (char *) metapage, true);
    if (XLogIsNeeded())
        log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
                    BTREE_METAPAGE, metapage, false);

    /*
     * An immediate sync is required even if we xlog'd the page, because the
     * write did not go through shared_buffers and therefore a concurrent
     * checkpoint may have moved the redo pointer past our xlog record.
     */
    smgrimmedsync(index->rd_smgr, INIT_FORKNUM);

i.e. the data is written out directly to disk, circumventing shared_buffers. It's pretty bad that we don't do the same on the standby. For master I think we should just add a bit to the XLOG_FPI record saying the data should be forced out to disk. I'm less sure what's to be done in the back branches. Flushing every HEAP_NEWPAGE record isn't really an option.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
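[Editorial note: Andres's point, that a write bypassing the buffer manager is invisible to checkpoints and therefore needs its own immediate sync, can be modeled with a small standalone sketch. The flags and `toy_*` names are illustrative stand-ins, not PostgreSQL's actual bufmgr/smgr API.]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative state of one init-fork page */
typedef struct
{
    bool    in_shared_buffers;  /* dirty copy tracked by the buffer manager */
    bool    durable;            /* would survive a crash */
} ToyBufPage;

/* write through shared_buffers: a later checkpoint will flush it */
static void toy_buffered_write(ToyBufPage *p)
{
    p->in_shared_buffers = true;
}

/* a checkpoint flushes only what the buffer manager knows about */
static void toy_checkpoint(ToyBufPage *p)
{
    if (p->in_shared_buffers)
    {
        p->durable = true;
        p->in_shared_buffers = false;
    }
}

/* smgrwrite()-style direct write: invisible to the buffer manager */
static void toy_direct_write(ToyBufPage *p)
{
    p->in_shared_buffers = false;
}

/* smgrimmedsync()-style immediate flush */
static void toy_immedsync(ToyBufPage *p)
{
    p->durable = true;
}
```

In this model, a direct write followed only by a checkpoint never becomes durable, which is the standby's situation; the primary avoids it with the immediate sync.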
Hello,

At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund <andres@2ndquadrant.com> wrote in <20150326175024.GJ451@alap3.anarazel.de>:
> I think the problem here is that the *primary* makes no such
> assumptions. Init forks are logged via stuff like
>
>     smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
> ...
> i.e. the data is written out directly to disk, circumventing
> shared_buffers. It's pretty bad that we don't do the same on the
> standby. For master I think we should just add a bit to the XLOG_FPI
> record saying the data should be forced out to disk. I'm less sure
> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
> record isn't really an option.

The problem exists only for INIT_FORKNUM, so I suppose it is enough to check the fork number to decide whether to sync immediately. Specifically for this instance, syncing buffers of INIT_FORKNUM at the end of the XLOG_FPI block in xlog_redo fixed the problem.

Another (ugly!) solution would be syncing only the buffers that are for INIT_FORKNUM and BM_DIRTY in ResetUnloggedRelations(op = UNLOGGED_RELATION_INIT). This is a catch-all-at-once solution, though it is a kind of reversion of fast promotion. But the buffers to be synced here should be pretty few.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
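[Editorial note: Horiguchi-san's first suggestion can be sketched as a standalone toy. The names are hypothetical; the real change would live in xlog_redo's XLOG_FPI handling, and the fork numbering mirrors PostgreSQL's ForkNumber enum.]

```c
#include <assert.h>

typedef enum ToyForkNumber
{
    TOY_MAIN_FORKNUM = 0,
    TOY_FSM_FORKNUM,
    TOY_VISIBILITYMAP_FORKNUM,
    TOY_INIT_FORKNUM
} ToyForkNumber;

static int toy_fsync_calls = 0;

static void toy_flush_page(void)
{
    toy_fsync_calls++;
}

/* replay one full-page image; force it to disk only for init forks,
 * so the end-of-recovery copy of the init fork reads valid data */
static void toy_redo_fpi(ToyForkNumber forknum)
{
    /* restore the page into shared buffers (elided) ... */
    if (forknum == TOY_INIT_FORKNUM)
        toy_flush_page();
}
```

The point of gating on the fork number is that ordinary FPIs stay cheap; only the rare init-fork pages pay for an immediate sync.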
Michael Paquier wrote:
> On Thu, Mar 26, 2015 at 1:02 AM, Fabrízio de Royes Mello
> <fabriziomello@gmail.com> wrote:
>> I didn't find any other similar bug introduced by 85b506bb.
>>
>> Attached the original patch provided by Michael with some regression tests.
>
> Thanks for adding a test; this looks fine to me (I did some sanity
> checks and tutti quanti, for people wondering). On temporary tables
> this was failing with an error in md.c...

Yeah, I extended the test a bit to use a temp table too, and pushed. Thanks everybody.

--
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 27 March 2015 at 04:54, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund <andres@2ndquadrant.com> wrote in <20150326175024.GJ451@alap3.anarazel.de>:
>> I think the problem here is that the *primary* makes no such
>> assumptions. Init forks are logged via stuff like
>>     smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
>> ...
>> i.e. the data is written out directly to disk, circumventing
>> shared_buffers. It's pretty bad that we don't do the same on the
>> standby. For master I think we should just add a bit to the XLOG_FPI
>> record saying the data should be forced out to disk. I'm less sure
>> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
>> record isn't really an option.
>
> The problem exists only for INIT_FORKNUM, so I suppose it is enough
> to check the fork number to decide whether to sync immediately.
>
> Specifically for this instance, syncing buffers of INIT_FORKNUM
> at the end of the XLOG_FPI block in xlog_redo fixed the problem.
>
> Another (ugly!) solution would be syncing only the buffers that are
> for INIT_FORKNUM and BM_DIRTY in ResetUnloggedRelations(op =
> UNLOGGED_RELATION_INIT). This is a catch-all-at-once solution, though
> it is a kind of reversion of fast promotion. But the buffers to be
> synced here should be pretty few.

This bug still exists.

Thom
On Thu, Nov 19, 2015 at 7:19 AM, Thom Brown <thom@linux.com> wrote:
> On 27 March 2015 at 04:54, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> [...]
>
> This bug still exists.

Hmm. This probably should have been on the open items list. I didn't pay too much attention to this before because it seemed like Andres and Michael were all over it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 20, 2015 at 7:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 19, 2015 at 7:19 AM, Thom Brown <thom@linux.com> wrote:
>> This bug still exists.
>
> Hmm. This probably should have been on the open items list.

I have added an open item for 9.5. That's not a bug we want in the next GA release, and it's not limited to 9.5: it is easily reproducible in older stable branches as well.

> I didn't pay too much attention to this before because it seemed like
> Andres and Michael were all over it.

This completely fell off my radar; sorry about that. For back branches, I tend to agree with what Horiguchi-san mentioned upthread: we had better issue an unconditional fsync on the relation when INIT_FORKNUM is used for it when replaying an XLOG_FPI record. That would localize the impact. An example patch is attached that applies on top of REL9_4_STABLE. It's rather ugly, but it shows the idea and fixes the issue.

For master and perhaps 9.5, I would think that a dedicated WAL record type enforcing the fsync is the way to go. This special treatment would apply only to btree and spgist when they use INIT_FORKNUM, and we would not impact other relation types and code paths with this behavior. Thoughts?
--
Michael
Attachment
On Fri, Nov 20, 2015 at 4:11 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> For master and perhaps 9.5, I would think that a dedicated WAL record
> type enforcing the fsync is the way to go. This special treatment
> would apply only to btree and spgist when they use INIT_FORKNUM, and
> we would not impact other relation types and code paths with this
> behavior.

So, I have been looking at that, and finished with the attached to implement this idea. This way, it is possible to control at replay whether a relation should be synced to disk just after replaying an FPI or a set of FPIs. This makes btree and spgist init forks more consistent at replay with what is done on the primary by syncing them immediately, which is not a bad thing to me.

Now, and actually my last email has been misleading as well, this patch as well as the previous patch I sent for ~9.4 do not actually fix the initialization of indexes for unlogged tables after promoting a standby. Hence I guess that we are still missing a trick when reinitializing those relations at the end of recovery. It is a bit late here, so I am attaching a patch reflecting the progress I have made. Comments are welcome.
--
Michael
Attachment
On Sat, Nov 21, 2015 at 11:30 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Nov 20, 2015 at 4:11 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> For master and perhaps 9.5, I would think that a dedicated WAL record
>> type enforcing the fsync is the way to go. [...]
>
> So, I have been looking at that, and finished with the attached to
> implement this idea. This way, it is possible to control at replay
> whether a relation should be synced to disk just after replaying an
> FPI or a set of FPIs. This makes btree and spgist init forks more
> consistent at replay with what is done on the primary by syncing them
> immediately, which is not a bad thing to me.

Depending on the type of index used on an unlogged table, the failure is quite varied. With gist and btree, a promoted standby will complain about an empty page. With brin, the standby will complain with a floating-point error:

ERROR: 22P01: floating-point exception
DETAIL: An invalid floating-point operation was signaled. This probably means an out-of-range result or an invalid operation, such as division by zero.
LOCATION: FloatExceptionHandler, postgres.c:2702

Now with gin, the system becomes actually unresponsive, being stuck on PGSemaphoreLock and unable to answer signals:

frame #1: 0x000000010a29f7cf postgres`PGSemaphoreLock(sema=0x00000001138df118) + 63 at pg_sema.c:387
frame #2: 0x000000010a348e81 postgres`LWLockAcquire(lock=0x000000010acc5dc0, mode=LW_EXCLUSIVE) + 369 at lwlock.c:1026
frame #3: 0x000000010a3190b1 postgres`LockBuffer(buffer=208, mode=2) + 289 at bufmgr.c:3240
frame #4: 0x0000000109f341e1 postgres`ginHeapTupleFastInsert(ginstate=0x00007fff55d06a08, collector=0x00007fff55d069d8) + 849 at ginfast.c:309
frame #5: 0x0000000109f1383f postgres`gininsert(fcinfo=0x00007fff55d09010) + 431 at gininsert.c:531

I am still investigating a correct fix; looking at reinit.c, the code in charge of copying the init fork as the main fork for a relation at the end of recovery looks to be doing its job correctly...

Regards,
--
Michael
On Fri, Nov 27, 2015 at 3:42 PM, Michael Paquier wrote:
> I am still investigating a correct fix; looking at reinit.c, the
> code in charge of copying the init fork as the main fork for a
> relation at the end of recovery looks to be doing its job correctly...

Attached is a patch that fixes the issue for me in master and 9.5. Actually, in the last patch I forgot a call to smgrwrite to ensure that the INIT_FORKNUM is correctly synced to disk when those pages are replayed at recovery, letting the reset routines for unlogged relations do their job correctly. I have noticed as well that we need to do the same for gin and brin relations. In this case I think that we could limit the flush to unlogged relations; my patch does it unconditionally, though, to generalize the logic. Thoughts?
--
Michael
Attachment
Hello, I studied your latest patch. At Fri, 27 Nov 2015 16:59:20 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRoaCMhr4hjEgq4rCZ4GaCB-6=cH8b2U7K7T5-kBGC5bA@mail.gmail.com> > On Fri, Nov 27, 2015 at 3:42 PM, Michael Paquier wrote: > > I am still investigating for a correct fix, looking at reinit.c the > > code in charge of copying the init fork as the main fork for a > > relation at the end of recovery looks to be doing its job correctly... > > Attached is a patch that fixes the issue for me in master and 9.5. > Actually in the last patch I forgot a call to smgrwrite to ensure that > the INIT_FORKNUM is correctly synced to disk when those pages are > replayed at recovery, letting the reset routines for unlogged > relations do their job correctly. I have noticed as well that we need > to do the same for gin and brin relations. In this case I think that > we could limit the flush to unlogged relations, my patch does it > unconditionally though to generalize the logic. Thoughts? I feel quite uncomfortable that it solves a problem rooted in the nature of unlogged objects by arbitrary flagging that does not fully correspond to that nature. If we can deduce the necessity of the fsync from some intrinsic property, that would be preferable. In the current patch, is_sync for log_newpage is generally true for, and only for, INIT_FORKNUM pages. The exceptions, as far as I can see, are:
copy_relation_data: called with an arbitrary forknum, but it doesn't set is_fsync even when copying INIT_FORKNUM. (Is this not a problem?)
spgbuildempty, ginbuildempty: these emit two or three newpage logs at once, so only the last one has is_fsync set, for performance reasons.
And other anomalies are:
ginbuildempty, gistbuildempty: these functions don't seem to fsync immediately, but is_fsync is set for INIT_FORKNUM. Of course it wouldn't be a problem.
In short, it seems to me that the reason to choose using XLOG_FPI_FOR_SYNC here is only the performance of processing successive FPIs for INIT_FORKNUM.
INIT_FORKNUM is generated only for unlogged tables and their belongings. I suppose such successive fsyncs don't cause an observable performance drop, assuming that the number of unlogged tables and their belongings is not so high, especially with smarter storage. All we need to do is fsync just for INIT_FORKNUM's FPIs in that case. If the performance does matter even so, we can still fsync the last md-file when any WAL record other than an FPI for the INIT_FORK comes. (But this would be a bit complex..) By the way, I suppose that fsyncing only the last page in successive new pages can still theoretically cause this problem when the last page is not in the same file as the other pages. That cannot occur for INIT_FORKNUM files in reality, though :) Thoughts? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
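[The two positions in this exchange can be summarized as two hypothetical replay-side predicates. This is an illustrative sketch only; neither helper exists in PostgreSQL in this form.]

```
/* Alternative (a), Horiguchi's proposal: deduce the flush from the
 * fork itself; every restored init-fork FPI gets forced to disk, and
 * no new record type is needed. */
static bool
fpi_needs_flush_by_fork(ForkNumber forknum)
{
    return forknum == INIT_FORKNUM;
}

/* Alternative (b), the patch's approach: trust an explicit info code
 * (or flag) set by the master, which, unlike the startup process,
 * knows the relation's persistence at record-creation time. */
static bool
fpi_needs_flush_by_info(uint8 info)
{
    return info == XLOG_FPI_FOR_SYNC;
}
```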
On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, I studied your latest patch. Thanks! > I feel quite uncomfortable that it solves the problem from a kind > of nature of unlogged object by arbitrary flagging which is not > fully corresponds to the nature. If we can deduce the necessity > of fsync from some nature, it would be preferable. INIT_FORKNUM is not something only related to unlogged relations; indexes use it as well. If you look for example at BRIN indexes, which do not immediately sync their INIT_FORKNUM when the index is created, I think that we are still going to need a new flag to control the sync at WAL replay, because the startup process cannot know a relation's persistence, something we do know when the XLOG_FPI record is created. For BRIN indexes, we particularly want to not sync the INIT_FORKNUM when the relation is not an unlogged one. > In the current patch, is_sync for log_newpage is generally true > for and only for INIT_FORKNUM pages. Exceptions as far as I can > see are, Yep, that's not always the case. > copy_relation_data: called with arbitrary forknum but it doesn't > set is_fsync even for coying INIT_FORKNUM. (Is this not a > problem?) Oh. And actually, it seems that use_wal is broken there as well. I think that we would still want to issue an XLOG_FPI record for an unlogged relation should it be moved to a new tablespace, to copy its INIT_FORKNUM correctly to its new home. After moving an unlogged relation to a new tablespace, and after promoting a standby, the standby fails to start because of this error: FATAL: could not create file "pg_tblspc/16388/PG_9.6_201511071/16384/16389": File exists This could be fixed separately though. > spgbuildempty, ginbuildempty: these emits two or three newpage > logs at once so only the last one is set is_fsync for > performance reason. It is not a problem to fsync just at the last one.
Each one of them would be replayed either way, the last one triggering the sync, no? > In short, it seems to me that the reason to choose using > XLOG_FPI_FOR_SYNC here is only performance of processing > successive FPIs for INIT_FORKNUM. Yeah, there is a one-way link between this WAL record and INIT_FORKNUM. However, please note that having an INIT_FORKNUM does not always imply that a sync is wanted. copy_relation_data is an example of that. > INIT_FORKNUM is generated only for unlogged tables and their > belongings. I suppose such successive fsyncs doesn't cause > observable performance drop assuming that the number of unlogged > tables and belongings is not so high, especially with smarter > storages. All we should do is that just fsync only for > INIT_FORKNUM's FPIs for the case. If the performance does matter > even so, we still can fsync the last md-file when any wal record > other than FPI for INIT_FORK comes. (But this would be a bit > complex..) Hm. If a system uses a bunch of temporary relations with brin or other indexes included, I would not say so. For back branches we may have to do it unconditionally using INIT_FORKNUM, but having a control flag to have it done only for unlogged relations would alleviate that. -- Michael
Hello, At Tue, 1 Dec 2015 11:53:35 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSMENEK7nqmwGsiyLTSbrZNJKx80tBX3qF6cQsS49sjag@mail.gmail.com> > On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello, I studied your latest patch. > > Thanks! > > > I feel quite uncomfortable that it solves the problem from a kind > > of nature of unlogged object by arbitrary flagging which is not > > fully corresponds to the nature. If we can deduce the necessity > > of fsync from some nature, it would be preferable. > > INIT_FORKNUM is not something only related to unlogged relations, > indexes use them as well. And that's actually > If you look at for example BRIN indexes that do not sync immediately > their INIT_FORKNUM when index is created, I think that we still are > going to need a new flag to control the sync at WAL replay because > startup process cannot know a relation's persistence, thing that we > can know when the XLOG_FPI record is created. For BRIN indexes, we > want particularly to not sync the INIT_FORKNUM when the relation is > not an unlogged one. (The comment added in brinbuildempty looks wrong, since it actually doesn't fsync it immediately..) Hmm, I've already seen that, and given your explanation I wonder why brinbuildempty issues WAL for what does not need to be persistent at the moment. Isn't it breaking the agreement of Write-Ahead Logging? INIT_FORKNUM and unconditional fsync'ing would be equally tied, excluding that anomaly about WAL. (Except for successive newpages.) > > In the current patch, is_sync for log_newpage is generally true > > for and only for INIT_FORKNUM pages. Exceptions as far as I can > > see are, > > Yep, that's not always the case. > > > copy_relation_data: called with arbitrary forknum but it doesn't > > set is_fsync even for coying INIT_FORKNUM. (Is this not a > > problem?) > > Oh. And actually, it seems that use_wal is broken there as well. 
I > think that we would still want to issue a XLOG_FPI record for an > unlogged relation should it be moved to a new tablespace to copy its > INIT_FORKNUM correctly to its new home. Agreed. > After moving an unlogged relation to a new tablespace, and after > promoting a standby the standby fails to start because of this error: > FATAL: could not create file > "pg_tblspc/16388/PG_9.6_201511071/16384/16389": File exists > This could be fixed separately though. > > > spgbuildempty, ginbuildempty: these emits two or three newpage > > logs at once so only the last one is set is_fsync for > > performance reason. > > It doesn't matter to fsync just at the last one. Each one of them > would be replayed either way, the last one triggering the sync, no? Yes, it does (except for some unreal case below). It was just a confirmation and I didn't see this as a problem at all. Sorry for the ambiguous writing. > > In short, it seems to me that the reason to choose using > > XLOG_FPI_FOR_SYNC here is only performance of processing > > successive FPIs for INIT_FORKNUM. > > Yeah, there is a one-way link between this WAL record a INIT_FORKNUM. > However please note that having a INIT_FORKNUM does not always imply > that a sync is wanted. copy_relation_data is an example of that. As I wrote above, I suppose we should fix(?) the irregular relationship between WAL and the init fork of brin and so on. > > INIT_FORKNUM is generated only for unlogged tables and their > > belongings. I suppose such successive fsyncs doesn't cause > > observable performance drop assuming that the number of unlogged > > tables and belongings is not so high, especially with smarter > > storages. All we should do is that just fsync only for > > INIT_FORKNUM's FPIs for the case. If the performance does matter > > even so, we still can fsync the last md-file when any wal record > > other than FPI for INIT_FORK comes. (But this would be a bit > > complex..) > > Hm. 
If a system uses a bunch of temporary relations with brin index or > other included I would not say so. For back branches we may have to do > it unconditionally using INIT_FORKNUM, but having a control flag to > have it only done for unlogged relations would leverage that. It could, and should do so. And if we take such systems with a bunch of temp relations as significant (I agree with this), XLogRegisterBlock() looks to be able to register multiple blocks into a single WAL record, and we could use it to eliminate arbitrary flagging on individual FPI records. Would that be possible? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 1, 2015 at 3:06 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Tue, 1 Dec 2015 11:53:35 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSMENEK7nqmwGsiyLTSbrZNJKx80tBX3qF6cQsS49sjag@mail.gmail.com> >> On Tue, Dec 1, 2015 at 11:11 AM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Hello, I studied your latest patch. >> >> Thanks! >> >> > I feel quite uncomfortable that it solves the problem from a kind >> > of nature of unlogged object by arbitrary flagging which is not >> > fully corresponds to the nature. If we can deduce the necessity >> > of fsync from some nature, it would be preferable. >> >> INIT_FORKNUM is not something only related to unlogged relations, >> indexes use them as well. And that's actually >> If you look at for example BRIN indexes that do not sync immediately >> their INIT_FORKNUM when index is created, I think that we still are >> going to need a new flag to control the sync at WAL replay because >> startup process cannot know a relation's persistence, thing that we >> can know when the XLOG_FPI record is created. For BRIN indexes, we >> want particularly to not sync the INIT_FORKNUM when the relation is >> not an unlogged one. > > (The comment added in brinbuildempty looks wrong since it > actually doesn't fsync it immediately..) > > Hmm, I've already seen that, and having your explanation I wonder > why brinbuidempty issues WAL for what is not necessary to be > persistent at the mement. Isn't it breaking agreements about > Write Ahead Log? INIT_FORKNUM and unconditionally fsync'ing would > be equally tied excluding the anormally about WAL. (Except for > succeeding newpages.) Alvaro, your thoughts regarding those lines? When building an empty INIT_FORKNUM for a brin index, its data is saved into a shared buffer and not immediately synced to disk. Shouldn't that sync be done, at least for unlogged relations? 
>> > In short, it seems to me that the reason to choose using >> > XLOG_FPI_FOR_SYNC here is only performance of processing >> > successive FPIs for INIT_FORKNUM. >> >> Yeah, there is a one-way link between this WAL record a INIT_FORKNUM. >> However please note that having a INIT_FORKNUM does not always imply >> that a sync is wanted. copy_relation_data is an example of that. > > As I wrote above, I suppose we should fix(?) the irregular > relationship between WAL and init fork of brin and so. Yep. >> > INIT_FORKNUM is generated only for unlogged tables and their >> > belongings. I suppose such successive fsyncs doesn't cause >> > observable performance drop assuming that the number of unlogged >> > tables and belongings is not so high, especially with smarter >> > storages. All we should do is that just fsync only for >> > INIT_FORKNUM's FPIs for the case. If the performance does matter >> > even so, we still can fsync the last md-file when any wal record >> > other than FPI for INIT_FORK comes. (But this would be a bit >> > complex..) >> >> Hm. If a system uses a bunch of temporary relations with brin index or >> other included I would not say so. For back branches we may have to do >> it unconditionally using INIT_FORKNUM, but having a control flag to >> have it only done for unlogged relations would leverage that. > > It could, and should do so. And if we take such systems with > bunch of temp relations as significant (I agree with this), > XLogRegisterBlock() looks to be able to register multiple blocks > into single wal record and we could eliminate arbitrary flagging > on individual FPI records using it. Is it possible? I thought about using a BKPBLOCK flag, but all of them are already taken, if that's what you meant. It seems cheaper to do that at the record level... -- Michael
Hi, On 2015-11-20 16:11:15 +0900, Michael Paquier wrote: > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > index cc845d2..4883697 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -9503,6 +9503,14 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record) > data += sizeof(BkpBlock); > > RestoreBackupBlockContents(lsn, bkpb, data, false, false); > + > + if (bkpb.fork == INIT_FORKNUM) > + { > + SMgrRelation srel; > + srel = smgropen(bkpb.node, InvalidBackendId); > + smgrimmedsync(srel, INIT_FORKNUM); > + smgrclose(srel); > + } > } > else if (info == XLOG_BACKUP_END) > { A smgrwrite() instead of a smgrimmedsync() should be sufficient here. - Andres
On 2015-11-27 16:59:20 +0900, Michael Paquier wrote: > Attached is a patch that fixes the issue for me in master and 9.5. > Actually in the last patch I forgot a call to smgrwrite to ensure that > the INIT_FORKNUM is correctly synced to disk when those pages are > replayed at recovery, letting the reset routines for unlogged > relations do their job correctly. I have noticed as well that we need > to do the same for gin and brin relations. In this case I think that > we could limit the flush to unlogged relations, my patch does it > unconditionally though to generalize the logic. Thoughts? I think it's a good idea to limit this to unlogged relations. For a dataset that fits into shared_buffers and creates many relations, the additional disk writes could be noticeable. > diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c > index 99337b0..d7964ac 100644 > --- a/src/backend/access/brin/brin.c > +++ b/src/backend/access/brin/brin.c > @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS) > brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), > BRIN_CURRENT_VERSION); > MarkBufferDirty(metabuf); > - log_newpage_buffer(metabuf, false); > + /* > + * When this full page image is replayed, there is no guarantee that > + * this page will be present to disk when replayed particularly for > + * unlogged relations, hence enforce it to be flushed to disk. > + */ The grammar seems a bit off here. > + /* > + * Initialize and xlog metabuffer and root buffer. When those full > + * pages are replayed, it is not guaranteed that those relation > + * init forks will be flushed to disk after replaying them, hence > + * enforce those pages to be flushed to disk at replay, only the > + * last record will enforce a flush for performance reasons and > + * because it is actually unnecessary to do it multiple times. > + */ That comment needs some love too. I think it really only makes sense if you know the current state. 
There's also some language polishing needed. > @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS) > START_CRIT_SECTION(); > GinInitMetabuffer(MetaBuffer); > MarkBufferDirty(MetaBuffer); > - log_newpage_buffer(MetaBuffer, false); > + log_newpage_buffer(MetaBuffer, false, false); > GinInitBuffer(RootBuffer, GIN_LEAF); > MarkBufferDirty(RootBuffer); > - log_newpage_buffer(RootBuffer, false); > + log_newpage_buffer(RootBuffer, false, true); > END_CRIT_SECTION(); > Why isn't the metapage flushed to disk? > --- a/src/backend/access/spgist/spginsert.c > +++ b/src/backend/access/spgist/spginsert.c > @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS) > (char *) page, true); > if (XLogIsNeeded()) > log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, > - SPGIST_METAPAGE_BLKNO, page, false); > + SPGIST_METAPAGE_BLKNO, page, false, false); > > /* Likewise for the root page. */ > SpGistInitPage(page, SPGIST_LEAF); > @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS) > (char *) page, true); > if (XLogIsNeeded()) > log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, > - SPGIST_ROOT_BLKNO, page, true); > + SPGIST_ROOT_BLKNO, page, true, false); > I don't think it's correct to not flush in these cases. Yes, the primary does an smgrimmedsync() - but that's not done on the standby. > @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record) > * when checksums are enabled. There is no difference in handling > * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info > * code just to distinguish them for statistics purposes. > + * > + * XLOG_FPI_FOR_SYNC records are generated when a relation needs to > + * be sync'ed just after replaying a full page. This is important > + * to match the master behavior in the case where a page is written > + * directly to disk without going through shared buffers, particularly > + * when writing init forks for index relations. 
> */ How about FPI_FOR_SYNC records indicate that the page immediately needs to be written to disk, not just to shared buffers. This is important if the on-disk state is to be the authoritative, not the state in shared buffers. E.g. because on-disk files may later be copied directly. > diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c > index 0ddde72..eb22a51 100644 > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, > * space. > */ > if (use_wal) > - log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false); > + log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, > + false, false); Shouldn't this flush data if it's an init fork? Currently that's an academic distinction, given there'll never be a page, but it still seems prudent. > extern void InitXLogInsert(void); > diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h > index ad1eb4b..91445f1 100644 > --- a/src/include/catalog/pg_control.h > +++ b/src/include/catalog/pg_control.h > @@ -73,6 +73,7 @@ typedef struct CheckPoint > #define XLOG_END_OF_RECOVERY 0x90 > #define XLOG_FPI_FOR_HINT 0xA0 > #define XLOG_FPI 0xB0 > +#define XLOG_FPI_FOR_SYNC 0xC0 I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too ambiguous for my taste. How about either naming it XLOG_FPI_FLUSH or instead adding actual record data and a 'flags' field in there? I slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are different, XLOG_FPI_FOR_SYNC not so much. Greetings, Andres Freund
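[The second option suggested here, carrying a flags field in the record data instead of burning a new info code, could look roughly like this. This is a hypothetical layout, not the format that was ultimately committed.]

```
/* Hypothetical record data for XLOG_FPI, letting new behaviors be
 * added later without consuming another info code: */
typedef struct xl_fpi
{
    uint8       flags;          /* XL_FPI_* flags, see below */
} xl_fpi;

/* Replay must write the page straight to disk, not just into shared
 * buffers, because the on-disk file may later be copied directly
 * (e.g. init forks of unlogged relations at end of recovery). */
#define XL_FPI_FLUSH    0x01
```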
Andres Freund wrote: > On 2015-11-20 16:11:15 +0900, Michael Paquier wrote: >> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c >> index cc845d2..4883697 100644 >> --- a/src/backend/access/transam/xlog.c >> +++ b/src/backend/access/transam/xlog.c >> @@ -9503,6 +9503,14 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record) >> data += sizeof(BkpBlock); >> >> RestoreBackupBlockContents(lsn, bkpb, data, false, false); >> + >> + if (bkpb.fork == INIT_FORKNUM) >> + { >> + SMgrRelation srel; >> + srel = smgropen(bkpb.node, InvalidBackendId); >> + smgrimmedsync(srel, INIT_FORKNUM); >> + smgrclose(srel); >> + } >> } >> else if (info == XLOG_BACKUP_END) >> { > > A smgrwrite() instead of a smgrimmedsync() should be sufficient here. Check. On Fri, Dec 4, 2015 at 6:35 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-11-27 16:59:20 +0900, Michael Paquier wrote: >> Attached is a patch that fixes the issue for me in master and 9.5. >> Actually in the last patch I forgot a call to smgrwrite to ensure that >> the INIT_FORKNUM is correctly synced to disk when those pages are >> replayed at recovery, letting the reset routines for unlogged >> relations do their job correctly. I have noticed as well that we need >> to do the same for gin and brin relations. In this case I think that >> we could limit the flush to unlogged relations, my patch does it >> unconditionally though to generalize the logic. Thoughts? > > I think it's a good idea to limit this to unlogged relations. For a > dataset that fits into shared_buffers and creates many relations, the > additional disk writes could be noticeable. OK, then I have switched the patch this way to limit the flush to unlogged relations where needed. 
>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c >> index 99337b0..d7964ac 100644 >> --- a/src/backend/access/brin/brin.c >> +++ b/src/backend/access/brin/brin.c >> @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS) >> brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), >> BRIN_CURRENT_VERSION); >> MarkBufferDirty(metabuf); >> - log_newpage_buffer(metabuf, false); >> + /* >> + * When this full page image is replayed, there is no guarantee that >> + * this page will be present to disk when replayed particularly for >> + * unlogged relations, hence enforce it to be flushed to disk. >> + */ > > The grammar seems a bit off here. Check. >> + /* >> + * Initialize and xlog metabuffer and root buffer. When those full >> + * pages are replayed, it is not guaranteed that those relation >> + * init forks will be flushed to disk after replaying them, hence >> + * enforce those pages to be flushed to disk at replay, only the >> + * last record will enforce a flush for performance reasons and >> + * because it is actually unnecessary to do it multiple times. >> + */ > > That comment needs some love too. I think it really only makes sense if > you know the current state. There's also some language polishing needed. Check. >> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS) >> START_CRIT_SECTION(); >> GinInitMetabuffer(MetaBuffer); >> MarkBufferDirty(MetaBuffer); >> - log_newpage_buffer(MetaBuffer, false); >> + log_newpage_buffer(MetaBuffer, false, false); >> GinInitBuffer(RootBuffer, GIN_LEAF); >> MarkBufferDirty(RootBuffer); >> - log_newpage_buffer(RootBuffer, false); >> + log_newpage_buffer(RootBuffer, false, true); >> END_CRIT_SECTION(); >> > Why isn't the metapage flushed to disk? I was not sure if we should only flush only at the last page, as pages just before would be already replayed. Switched. 
>> --- a/src/backend/access/spgist/spginsert.c >> +++ b/src/backend/access/spgist/spginsert.c >> @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS) >> (char *) page, true); >> if (XLogIsNeeded()) >> log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, >> - SPGIST_METAPAGE_BLKNO, page, false); >> + SPGIST_METAPAGE_BLKNO, page, false, false); >> >> /* Likewise for the root page. */ >> SpGistInitPage(page, SPGIST_LEAF); >> @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS) >> (char *) page, true); >> if (XLogIsNeeded()) >> log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, >> - SPGIST_ROOT_BLKNO, page, true); >> + SPGIST_ROOT_BLKNO, page, true, false); >> > > I don't think it's correct to not flush in these cases. Yes, the primary > does an smgrimmedsync() - but that's not done on the standby. OK. Switched. >> @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record) >> * when checksums are enabled. There is no difference in handling >> * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info >> * code just to distinguish them for statistics purposes. >> + * >> + * XLOG_FPI_FOR_SYNC records are generated when a relation needs to >> + * be sync'ed just after replaying a full page. This is important >> + * to match the master behavior in the case where a page is written >> + * directly to disk without going through shared buffers, particularly >> + * when writing init forks for index relations. >> */ > > How about > > FPI_FOR_SYNC records indicate that the page immediately needs to be > written to disk, not just to shared buffers. This is important if the > on-disk state is to be the authoritative, not the state in shared > buffers. E.g. because on-disk files may later be copied directly. OK. 
>> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c >> index 0ddde72..eb22a51 100644 >> --- a/src/backend/commands/tablecmds.c >> +++ b/src/backend/commands/tablecmds.c >> @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, >> * space. >> */ >> if (use_wal) >> - log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false); >> + log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, >> + false, false); > > Shouldn't this flush data if it's an init fork? Currently that's an > academic distinction, given there'll never be a page, but it still seems > prudent. Yeah, it should do that for all INIT_FORKNUM pages, because we don't know the page type here. Note that use_wal is broken as well. We had better log and flush the INIT_FORKNUM as well for unlogged relations. >> extern void InitXLogInsert(void); >> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h >> index ad1eb4b..91445f1 100644 >> --- a/src/include/catalog/pg_control.h >> +++ b/src/include/catalog/pg_control.h >> @@ -73,6 +73,7 @@ typedef struct CheckPoint >> #define XLOG_END_OF_RECOVERY 0x90 >> #define XLOG_FPI_FOR_HINT 0xA0 >> #define XLOG_FPI 0xB0 >> +#define XLOG_FPI_FOR_SYNC 0xC0 > > > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or > instead adding actual record data and a 'flags' field in there? I > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are > different, XLOG_FPI_FOR_SYNC not so much. Let's go for XLOG_FPI_FLUSH. Attached is an updated patch. For back branches, I am still not sure what would be a good idea though. I don't see any other option than unconditionally flushing the INIT_FORKNUM when replaying an FPI... But that would impact some systems severely. Regards, -- Michael
Attachment
On 2015-12-04 17:00:13 +0900, Michael Paquier wrote: > Andres Freud wrote: > >> @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS) > >> START_CRIT_SECTION(); > >> GinInitMetabuffer(MetaBuffer); > >> MarkBufferDirty(MetaBuffer); > >> - log_newpage_buffer(MetaBuffer, false); > >> + log_newpage_buffer(MetaBuffer, false, false); > >> GinInitBuffer(RootBuffer, GIN_LEAF); > >> MarkBufferDirty(RootBuffer); > >> - log_newpage_buffer(RootBuffer, false); > >> + log_newpage_buffer(RootBuffer, false, true); > >> END_CRIT_SECTION(); > >> > > Why isn't the metapage flushed to disk? > > I was not sure if we should only flush only at the last page, as pages > just before would be already replayed. Switched. Uh, that's not how it works! The earlier pages would just be in shared buffers. Neither smgrwrite, nor smgrimmedsync does anything about that! > >> extern void InitXLogInsert(void); > >> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h > >> index ad1eb4b..91445f1 100644 > >> --- a/src/include/catalog/pg_control.h > >> +++ b/src/include/catalog/pg_control.h > >> @@ -73,6 +73,7 @@ typedef struct CheckPoint > >> #define XLOG_END_OF_RECOVERY 0x90 > >> #define XLOG_FPI_FOR_HINT 0xA0 > >> #define XLOG_FPI 0xB0 > >> +#define XLOG_FPI_FOR_SYNC 0xC0 > > > > > > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too > > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or > > instead adding actual record data and a 'flags' field in there? I > > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are > > different, XLOG_FPI_FOR_SYNC not so much. > > Let's go for XLOG_FPI_FLUSH. I think the other way is a bit better, because we can add new flags without changing the WAL format. 
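[With this correction applied, ginbuildempty() would request a flush at replay for both pages, since a later record's flush does not write out an earlier page that sits only in shared buffers. This sketch follows the patch's log_newpage_buffer(buffer, page_std, flush) signature; it is illustrative, not the committed code.]

```
/* Sketch: both the metapage and the root page are marked for flush at
 * replay. Flushing only the last record would leave the metapage
 * dirty in shared buffers only, so it would never reach the on-disk
 * init fork that gets copied at end of recovery. */
START_CRIT_SECTION();
GinInitMetabuffer(MetaBuffer);
MarkBufferDirty(MetaBuffer);
log_newpage_buffer(MetaBuffer, false, true);    /* flush meta too */
GinInitBuffer(RootBuffer, GIN_LEAF);
MarkBufferDirty(RootBuffer);
log_newpage_buffer(RootBuffer, false, true);    /* flush root */
END_CRIT_SECTION();
```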
> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c > index 99337b0..b646101 100644 > --- a/src/backend/access/brin/brin.c > +++ b/src/backend/access/brin/brin.c > @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS) > brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), > BRIN_CURRENT_VERSION); > MarkBufferDirty(metabuf); > - log_newpage_buffer(metabuf, false); > + > + /* > + * For unlogged relations, this page should be immediately flushed > + * to disk after being replayed. This is necessary to ensure that the > + * initial on-disk state of unlogged relations is preserved when > + * they get reset at the end of recovery. > + */ > + log_newpage_buffer(metabuf, false, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > END_CRIT_SECTION(); Maybe write the last sentence as '... as the on disk files are copied directly at the end of recovery.'? > @@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state) > MAIN_FORKNUM, > state->rs_blockno, > state->rs_buffer, > - true); > + true, > + false); > RelationOpenSmgr(state->rs_new_rel); > > PageSetChecksumInplace(state->rs_buffer, state->rs_blockno); > @@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup) > MAIN_FORKNUM, > state->rs_blockno, > page, > - true); > + true, > + false); Did you verify that that's ok when a unlogged table is clustered/vacuum full'ed? > @@ -181,6 +183,9 @@ xlog_identify(uint8 info) > case XLOG_FPI_FOR_HINT: > id = "FPI_FOR_HINT"; > break; > + case XLOG_FPI_FLUSH: > + id = "FPI_FOR_SYNC"; > + break; > } Old string. > --- a/src/backend/access/transam/xloginsert.c > +++ b/src/backend/access/transam/xloginsert.c > @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std) > * If the page follows the standard page layout, with a PageHeader and unused > * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows > * the unused space to be left out from the WAL record, making it smaller. 
> + * > + * If 'is_flush' is set to TRUE, relation will be requested to flush > + * immediately its data at replay after replaying this full page image. > */ s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the OS immediately after replaying the record'? Greetings, Andres Freund
On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-04 17:00:13 +0900, Michael Paquier wrote: >> Andres Freud wrote: >> >> extern void InitXLogInsert(void); >> >> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h >> >> index ad1eb4b..91445f1 100644 >> >> --- a/src/include/catalog/pg_control.h >> >> +++ b/src/include/catalog/pg_control.h >> >> @@ -73,6 +73,7 @@ typedef struct CheckPoint >> >> #define XLOG_END_OF_RECOVERY 0x90 >> >> #define XLOG_FPI_FOR_HINT 0xA0 >> >> #define XLOG_FPI 0xB0 >> >> +#define XLOG_FPI_FOR_SYNC 0xC0 >> > >> > >> > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too >> > ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or >> > instead adding actual record data and a 'flags' field in there? I >> > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are >> > different, XLOG_FPI_FOR_SYNC not so much. >> >> Let's go for XLOG_FPI_FLUSH. > > I think the other way is a bit better, because we can add new flags > without changing the WAL format. Hm. On the contrary, I think that it would make more sense to have a flag as well for FOR_HINT honestly, those are really the same operations, and FOR_HINT is just here for statistic purposes. >> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c >> index 99337b0..b646101 100644 >> --- a/src/backend/access/brin/brin.c >> +++ b/src/backend/access/brin/brin.c >> @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS) >> brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), >> BRIN_CURRENT_VERSION); >> MarkBufferDirty(metabuf); >> - log_newpage_buffer(metabuf, false); >> + >> + /* >> + * For unlogged relations, this page should be immediately flushed >> + * to disk after being replayed. This is necessary to ensure that the >> + * initial on-disk state of unlogged relations is preserved when >> + * they get reset at the end of recovery. 
>> + */ >> + log_newpage_buffer(metabuf, false, >> + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); >> END_CRIT_SECTION(); > > Maybe write the last sentence as '... as the on disk files are copied > directly at the end of recovery.'? Check. >> @@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state) >> MAIN_FORKNUM, >> state->rs_blockno, >> state->rs_buffer, >> - true); >> + true, >> + false); >> RelationOpenSmgr(state->rs_new_rel); >> >> PageSetChecksumInplace(state->rs_buffer, state->rs_blockno); >> @@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup) >> MAIN_FORKNUM, >> state->rs_blockno, >> page, >> - true); >> + true, >> + false); > > Did you verify that that's ok when a unlogged table is clustered/vacuum > full'ed? Yep. >> @@ -181,6 +183,9 @@ xlog_identify(uint8 info) >> case XLOG_FPI_FOR_HINT: >> id = "FPI_FOR_HINT"; >> break; >> + case XLOG_FPI_FLUSH: >> + id = "FPI_FOR_SYNC"; >> + break; >> } > > Old string. Yeah, that's now completely removed. >> --- a/src/backend/access/transam/xloginsert.c >> +++ b/src/backend/access/transam/xloginsert.c >> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std) >> * If the page follows the standard page layout, with a PageHeader and unused >> * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows >> * the unused space to be left out from the WAL record, making it smaller. >> + * >> + * If 'is_flush' is set to TRUE, relation will be requested to flush >> + * immediately its data at replay after replaying this full page image. >> */ > > s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the > OS immediately after replaying the record'? s/OS/stable storage? -- Michael
Attachment
On 2015-12-03 22:09:43 +0100, Andres Freund wrote: > On 2015-11-20 16:11:15 +0900, Michael Paquier wrote: > > + if (bkpb.fork == INIT_FORKNUM) > > + { > > + SMgrRelation srel; > > + srel = smgropen(bkpb.node, InvalidBackendId); > > + smgrimmedsync(srel, INIT_FORKNUM); > > + smgrclose(srel); > > + } > > A smgrwrite() instead of a smgrimmedsync() should be sufficient here. More importantly, the smgrimmedsync() won't actually achieve anything - RestoreBackupBlockContents() will just have written the data into shared buffers. smgrimmedsync() doesn't flush that. And further, just flushing the buffer directly to disk using smgrwrite(), without reflecting that in the in-memory state, doesn't seem prudent. At the very least we'll write all those blocks twice. It also likely misses dealing with checksums and such. What I think we really ought to do instead is to use "proper" buffer functionality, e.g. flush the buffer via FlushBuffer(). It seems slightly better not to directly expose FlushBuffer() and instead add a tiny wrapper. Couldn't come up with a grand name tho. For me the attached, preliminary, patch fixes the problem in master; previous branches ought to look mostly similar, except the flush moved to RestoreBackupBlockContents/RestoreBackupBlock. Does anybody have a better idea? Suitable for the back-branches? I'm kinda wondering if it wouldn't have been better to go through shared buffers in ResetUnloggedRelationsInDbspaceDir() instead of using copy_file(). Regards, Andres
Attachment
Hi, On 2015-12-04 21:57:54 +0900, Michael Paquier wrote: > On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <andres@anarazel.de> wrote: > >> Let's go for XLOG_FPI_FLUSH. > > > > I think the other way is a bit better, because we can add new flags > > without changing the WAL format. > > Hm. On the contrary, I think that it would make more sense to have a > flag as well for FOR_HINT honestly, those are really the same > operations, and FOR_HINT is just here for statistic purposes. I don't think it's all that much the same operation. And WRT statistics purpose: Being able to easily differentiate FOR_HINT is important for pg_xlogdump --stats, but not at all for XLOG_FPI_FLUSH. > >> --- a/src/backend/access/transam/xloginsert.c > >> +++ b/src/backend/access/transam/xloginsert.c > >> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std) > >> * If the page follows the standard page layout, with a PageHeader and unused > >> * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows > >> * the unused space to be left out from the WAL record, making it smaller. > >> + * > >> + * If 'is_flush' is set to TRUE, relation will be requested to flush > >> + * immediately its data at replay after replaying this full page image. > >> */ > > > > s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the > > OS immediately after replaying the record'? > > s/OS/stable storage? It's not flushed to stable storage here. Just to the OS. 
> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c > index 99337b0..fff48ab 100644 > --- a/src/backend/access/brin/brin.c > +++ b/src/backend/access/brin/brin.c > @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS) > brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), > BRIN_CURRENT_VERSION); > MarkBufferDirty(metabuf); > - log_newpage_buffer(metabuf, false); > + > + /* > + * For unlogged relations, this page should be immediately flushed > + * to disk after being replayed. This is necessary to ensure that the > + * initial on-disk state of unlogged relations is preserved as the > + * on-disk files are copied directly at the end of recovery. > + */ > + log_newpage_buffer(metabuf, false, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > END_CRIT_SECTION(); > > UnlockReleaseBuffer(metabuf); > diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c > index f876f62..572fe20 100644 > --- a/src/backend/access/brin/brin_pageops.c > +++ b/src/backend/access/brin/brin_pageops.c > @@ -865,7 +865,7 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer) > page = BufferGetPage(buffer); > brin_page_init(page, BRIN_PAGETYPE_REGULAR); > MarkBufferDirty(buffer); > - log_newpage_buffer(buffer, true); > + log_newpage_buffer(buffer, true, false); > END_CRIT_SECTION(); > > /* > diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c > index 49e9185..17c168a 100644 > --- a/src/backend/access/gin/gininsert.c > +++ b/src/backend/access/gin/gininsert.c > @@ -450,14 +450,22 @@ ginbuildempty(PG_FUNCTION_ARGS) > ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL); > LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE); > > - /* Initialize and xlog metabuffer and root buffer. */ > + /* > + * Initialize and xlog metabuffer and root buffer. For unlogged > + * relations, those pages need to be immediately flushed to disk > + * after being replayed. 
This is necessary to ensure that the > + * initial on-disk state of unlogged relations is preserved when > + * they get reset at the end of recovery. > + */ > START_CRIT_SECTION(); > GinInitMetabuffer(MetaBuffer); > MarkBufferDirty(MetaBuffer); > - log_newpage_buffer(MetaBuffer, false); > + log_newpage_buffer(MetaBuffer, false, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > GinInitBuffer(RootBuffer, GIN_LEAF); > MarkBufferDirty(RootBuffer); > - log_newpage_buffer(RootBuffer, false); > + log_newpage_buffer(RootBuffer, false, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > END_CRIT_SECTION(); > > /* Unlock and release the buffers. */ > diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c > index 53bccf6..6a20031 100644 > --- a/src/backend/access/gist/gist.c > +++ b/src/backend/access/gist/gist.c > @@ -84,7 +84,8 @@ gistbuildempty(PG_FUNCTION_ARGS) > START_CRIT_SECTION(); > GISTInitBuffer(buffer, F_LEAF); > MarkBufferDirty(buffer); > - log_newpage_buffer(buffer, true); > + log_newpage_buffer(buffer, true, > + index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED); > END_CRIT_SECTION(); > I might be missing something here - but isn't it pretty much guaranteed that all these are unlogged relations? Otherwise *buildempty() wouldn't have been called, right? > else if (info == XLOG_BACKUP_END) > { > @@ -178,9 +183,6 @@ xlog_identify(uint8 info) > case XLOG_FPI: > id = "FPI"; > break; > - case XLOG_FPI_FOR_HINT: > - id = "FPI_FOR_HINT"; > - break; > } Really don't want to do that. > @@ -9391,14 +9394,34 @@ xlog_redo(XLogReaderState *record) > * resource manager needs to generate conflicts, it has to define a > * separate WAL record type and redo routine. > * > - * XLOG_FPI_FOR_HINT records are generated when a page needs to be > - * WAL- logged because of a hint bit update. They are only generated > - * when checksums are enabled. 
There is no difference in handling > - * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info > - * code just to distinguish them for statistics purposes. > + * Records flagged with 'for_hint_bits' are generated when a page needs > + * to be WAL- logged because of a hint bit update. They are only > + * generated when checksums are enabled. There is no difference in > + * handling records when this flag is set, it is used for statistics > + * purposes. > + * > + * Records flagged with 'is_flush' indicate that the page immediately > + * needs to be written to disk, not just to shared buffers. This is > + * important if the on-disk state is to be the authoritative, not the > + * state in shared buffers. E.g. because on-disk files may later be > + * copied directly. > */ > if (XLogReadBufferForRedo(record, 0, &buffer) != BLK_RESTORED) > elog(ERROR, "unexpected XLogReadBufferForRedo result when restoring backup block"); > + > + if (xlrec.is_flush) > + { > + RelFileNode rnode; > + ForkNumber forknum; > + BlockNumber blkno; > + SMgrRelation srel; > + > + (void) XLogRecGetBlockTag(record, 0, &rnode, &forknum, &blkno); > + srel = smgropen(rnode, InvalidBackendId); > + smgrwrite(srel, forknum, blkno, BufferGetPage(buffer), false); > + smgrclose(srel); > + } That'd leave the dirty bit set on the buffer... Greetings, Andres Freund
On Tue, Dec 8, 2015 at 10:09 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-04 21:57:54 +0900, Michael Paquier wrote: >> On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <andres@anarazel.de> wrote: >> >> Let's go for XLOG_FPI_FLUSH. >> > >> > I think the other way is a bit better, because we can add new flags >> > without changing the WAL format. >> >> Hm. On the contrary, I think that it would make more sense to have a >> flag as well for FOR_HINT honestly, those are really the same >> operations, and FOR_HINT is just here for statistic purposes. > > I don't think it's all that much the same operation. And WRT statistics > purpose: Being able to easily differentiate FOR_HINT is important for > pg_xlogdump --stats, but not at all for XLOG_FPI_FLUSH. OK. Switched back to using XLOG_FPI_FLUSH. >> [stuff about unlogged relations in code] > > I might be missing something here - but isn't it pretty much guaranteed > that all these are unlogged relations? Otherwise *buildempty() wouldn't > have been called, right? Yep, that's ensured in index.c when calling ambuildempty. Let's flush it unconditionally then in those code paths. >> else if (info == XLOG_BACKUP_END) >> { >> @@ -178,9 +183,6 @@ xlog_identify(uint8 info) >> case XLOG_FPI: >> id = "FPI"; >> break; >> - case XLOG_FPI_FOR_HINT: >> - id = "FPI_FOR_HINT"; >> - break; >> } > > Really don't want to do that. OK, added back. >> @@ -9391,14 +9394,34 @@ xlog_redo(XLogReaderState *record) >> * resource manager needs to generate conflicts, it has to define a >> * separate WAL record type and redo routine. >> * >> - * XLOG_FPI_FOR_HINT records are generated when a page needs to be >> - * WAL- logged because of a hint bit update. They are only generated >> - * when checksums are enabled. There is no difference in handling >> - * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info >> - * code just to distinguish them for statistics purposes. 
>> + * Records flagged with 'for_hint_bits' are generated when a page needs >> + * to be WAL- logged because of a hint bit update. They are only >> + * generated when checksums are enabled. There is no difference in >> + * handling records when this flag is set, it is used for statistics >> + * purposes. >> + * >> + * Records flagged with 'is_flush' indicate that the page immediately >> + * needs to be written to disk, not just to shared buffers. This is >> + * important if the on-disk state is to be the authoritative, not the >> + * state in shared buffers. E.g. because on-disk files may later be >> + * copied directly. >> */ >> if (XLogReadBufferForRedo(record, 0, &buffer) != BLK_RESTORED) >> elog(ERROR, "unexpected XLogReadBufferForRedo result when restoring backup block"); >> + >> + if (xlrec.is_flush) >> + { >> + RelFileNode rnode; >> + ForkNumber forknum; >> + BlockNumber blkno; >> + SMgrRelation srel; >> + >> + (void) XLogRecGetBlockTag(record, 0, &rnode, &forknum, &blkno); >> + srel = smgropen(rnode, InvalidBackendId); >> + smgrwrite(srel, forknum, blkno, BufferGetPage(buffer), false); >> + smgrclose(srel); >> + } > > That'd leave the dirty bit set on the buffer... OK, I have used an equivalent of FlushBuffer, FlushSingleBuffer being a copycat of your previous patch... I am attaching as well a patch for ~9.4, which uses XLOG_HEAP_NEWPAGE instead. -- Michael
Attachment
On Tue, Dec 8, 2015 at 9:57 PM, Andres Freund <andres@anarazel.de> wrote: > For me the attached, preliminary, patch, fixes the problem in master; > previous branches ought to look mostly similar, except the flush moved > to RestoreBackupBlockContents/RestoreBackupBlock. > Does anybody have a better idea? Suitable for the back-branches? Not really, I am afraid. > I'm kinda wondering if it wouldn't have been better to go through shared > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using > copy_file(). For deployment with large shared_buffers settings, wouldn't that be actually more costly than the current way of doing? We would need to go through all the buffers at least once and look for the INIT_FORKNUM present to flush them. -- Michael
On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: > > I'm kinda wondering if it wouldn't have been better to go through shared > > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using > > copy_file(). > > For deployment with large shared_buffers settings, wouldn't that be > actually more costly than the current way of doing? We would need to > go through all the buffers at least once and look for the INIT_FORKNUM > present to flush them. We could just check the file sizes on disk, and then check the contents of all the pages for each file.
On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: >> > I'm kinda wondering if it wouldn't have been better to go through shared >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using >> > copy_file(). >> >> For deployment with large shared_buffers settings, wouldn't that be >> actually more costly than the current way of doing? We would need to >> go through all the buffers at least once and look for the INIT_FORKNUM >> present to flush them. > > We could just check the file sizes on disk, and the check for the > contents of all the pages for each file. By doing it at replay, the flushes are spread across time. And by doing it at the end of recovery, all the flushes would be grouped. Do you think that's fine? -- Michael
On 2015-12-09 19:36:11 +0900, Michael Paquier wrote: > On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: > >> > I'm kinda wondering if it wouldn't have been better to go through shared > >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using > >> > copy_file(). > >> > >> For deployment with large shared_buffers settings, wouldn't that be > >> actually more costly than the current way of doing? We would need to > >> go through all the buffers at least once and look for the INIT_FORKNUM > >> present to flush them. > > > > We could just check the file sizes on disk, and the check for the > > contents of all the pages for each file. > > By doing it at replay, the flushes are spread across time. And by > doing it at the end of recovery, all the flushes would be grouped. Do > you think that's fine? The point is that we'd need no flushes, because the data would come directly from shared buffers...
On Wed, Dec 9, 2015 at 8:04 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-09 19:36:11 +0900, Michael Paquier wrote: >> On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote: >> > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: >> >> > I'm kinda wondering if it wouldn't have been better to go through shared >> >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using >> >> > copy_file(). >> >> >> >> For deployment with large shared_buffers settings, wouldn't that be >> >> actually more costly than the current way of doing? We would need to >> >> go through all the buffers at least once and look for the INIT_FORKNUM >> >> present to flush them. >> > >> > We could just check the file sizes on disk, and the check for the >> > contents of all the pages for each file. >> >> By doing it at replay, the flushes are spread across time. And by >> doing it at the end of recovery, all the flushes would be grouped. Do >> you think that's fine? > > The point is that we'd no flushes, because the data would come directly > from shared buffers... Oh, OK. I didn't read though your lines correctly. So you basically mean that we would look at the init files that are on disk, and check if they are empty. If they are, we simply use XLogReadBufferExtended to fetch the INIT_FORKNUM content and fill in another buffer for the MAIN_FORKNUM. More or less right? -- Michael
On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: > Oh, OK. I didn't read though your lines correctly. So you basically > mean that we would look at the init files that are on disk, and check > if they are empty. If they are, we simply use XLogReadBufferExtended > to fetch the INIT_FORKNUM content and fill in another buffer for the > MAIN_FORKNUM. More or less right? We'd not just do so if they're empty, we'd just generally copy the file via shared buffers, instead of copy_file(). But we'd get the file size from the filesystem (which is fine, we make sure it is correct during replay).
On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: >> Oh, OK. I didn't read though your lines correctly. So you basically >> mean that we would look at the init files that are on disk, and check >> if they are empty. If they are, we simply use XLogReadBufferExtended >> to fetch the INIT_FORKNUM content and fill in another buffer for the >> MAIN_FORKNUM. More or less right? > > We would not just do so if they're empty, we would just generally copy the file > via shared buffers, instead of copy_file(). But we'd get the file size > from the filesystem (which is fine, we make sure it is correct during > replay). So, this suggestion is basically implementing the reverse operation of GetRelationPath() to be able to rebuild a RelFileNode from scratch and then look at the shared buffers needed. Isn't it a bit brittle for back-branches? RelFileNode stuff is available easily through records when replaying individual FPIs, but not really in this code path. -- Michael
On December 10, 2015 5:02:27 AM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote: >On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund <andres@anarazel.de> >wrote: >> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: >>> Oh, OK. I didn't read though your lines correctly. So you basically >>> mean that we would look at the init files that are on disk, and >check >>> if they are empty. If they are, we simply use XLogReadBufferExtended >>> to fetch the INIT_FORKNUM content and fill in another buffer for the >>> MAIN_FORKNUM. More or less right? >> >> We would not just do so if they're empty, we would just generally >copy the file >> via shared buffers, instead of copy_file(). But we'd get the file >size >> from the filesystem (which is fine, we make sure it is correct during >> replay). >So, this suggestion is basically implementing the reverse operation of >GetRelationPath() to be able to rebuild a RelFileNode from scratch and >then look at the shared buffers needed. Isn't it a bit brittle for >back-branches? RelFileNode stuff is available easily through records >when replaying individually FPIs, but not really in this code path. Oh, sorry. I'm definitely not thinking about doing this for the back branches. That was more musing about a way to optimize this. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
On Thu, Dec 10, 2015 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote: > On December 10, 2015 5:02:27 AM GMT+01:00, Michael Paquier <michael.paquier@gmail.com> wrote: >>On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund <andres@anarazel.de> >>wrote: >>> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: >>>> Oh, OK. I didn't read though your lines correctly. So you basically >>>> mean that we would look at the init files that are on disk, and >>check >>>> if they are empty. If they are, we simply use XLogReadBufferExtended >>>> to fetch the INIT_FORKNUM content and fill in another buffer for the >>>> MAIN_FORKNUM. More or less right? >>> >>> We would not just do so if they're empty, we would just generally >>copy the file >>> via shared buffers, instead of copy_file(). But we'd get the file >>size >>> from the filesystem (which is fine, we make sure it is correct during >>> replay). >> >>So, this suggestion is basically implementing the reverse operation of >>GetRelationPath() to be able to rebuild a RelFileNode from scratch and >>then look at the shared buffers needed. Isn't it a bit brittle for >>back-branches? RelFileNode stuff is available easily through records >>when replaying individually FPIs, but not really in this code path. > > Oh, sorry. In definitely not thinking about doing this for the back branches. That was more musing about a way to optimizethis. OK sure. Yeah that may be something worth investigating. The reverse engineering of GetRelationPath would be interesting perhaps for some utilities. So, do we go for something like the patch you attached in 20151208125716.GS4934@alap3.anarazel.de for master and 9.5, and for ~9.4 we use the one I wrote in CAB7nPqSxErpZJ+BZ-mfopzFZP5pAbiE9jWBUcJy6qaYertt4uw@mail.gmail.com? Note that in both cases the patches are not complete, we need to fix as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages are logged all the time. 
I am also thinking that the patch changing the WAL format of XLOG_FPI is overdoing it: we know that all the INIT_FORKNUM pages are generated for unlogged relations, so it is fine to flush INIT_FORKNUM buffers unconditionally at replay. -- Michael
> So, do we go for something like the patch you attached in > 20151208125716.GS4934@alap3.anarazel.de for master and 9.5, and for > ~9.4 we use the one I wrote in > CAB7nPqSxErpZJ+BZ-mfopzFZP5pAbiE9jWBUcJy6qaYertt4uw@mail.gmail.com? I'm more thinking of using something like my patch for all branches. Why would we want to go for the more complicated approach in the more distant branches? > Note that in both cases the patches are not complete, we need to fix > as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages > are logged all the time. Agreed. It's probably better treated as an entirely different - pretty ugly - bug though. I mean it's not some issue of a race during replay, it's entirely missing WAL logging for SET TABLESPACE of unlogged relations. Andres
On Thu, Dec 10, 2015 at 8:56 PM, Andres Freund <andres@anarazel.de> wrote: >> So, do we go for something like the patch you attached in >> 20151208125716.GS4934@alap3.anarazel.de for master and 9.5, and for >> ~9.4 we use the one I wrote in >> CAB7nPqSxErpZJ+BZ-mfopzFZP5pAbiE9jWBUcJy6qaYertt4uw@mail.gmail.com? > > I'm more thinking of using something like my patch for all branches. Why > would we want to go for the more complicated approach in the more > distant branches? That's not what I meant: I don't wish to do the complicated approach either anymore. I sent two patches in the mail mentioned above: one for master/9.5 that used the approach of changing the WAL format, and a second aimed at 9.4 and older versions that is close to what you sent. It looks like you did not look at the second patch, 20151209_replay_unlogged_94.patch, which does some stuff with XLOG_HEAP_NEWPAGE to fix the issue. >> Note that in both cases the patches are not complete, we need to fix >> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages >> are logged all the time. > > Agreed. It's probably better treated as an entirely different - pretty > ugly - bug though. I mean it's not some issue of a race during replay, > it's entirely missing WAL logging for SET TABLESPACE of unlogged > relations. Okidoki. -- Michael
On Thu, Dec 10, 2015 at 9:07 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Dec 10, 2015 at 8:56 PM, Andres Freund <andres@anarazel.de> wrote: >>> So, do we go for something like the patch you attached in >>> 20151208125716.GS4934@alap3.anarazel.de for master and 9.5, and for >>> ~9.4 we use the one I wrote in >>> CAB7nPqSxErpZJ+BZ-mfopzFZP5pAbiE9jWBUcJy6qaYertt4uw@mail.gmail.com? >> >> I'm more thinking of using something like my patch for all branches. Why >> would we want to go for the more complicated approach in the more >> distant branches? > > That's not what I think it meant: I don't wish to do the complicated > approach either anymore. I sent two patches on the mail mentioned > above: one for master/9.5 that used the approach of changing WAL, and > a second aimed for 9.4 and old versions that is close to what you > sent. It looks that you did not look at the second patch, named > 20151209_replay_unlogged_94.patch that does some stuff with > XLOG_HEAP_NEWPAGE to fix the issue. > >>> Note that in both cases the patches are not complete, we need to fix >>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages >>> are logged all the time. >> >> Agreed. It's probably better treated as an entirely different - pretty >> ugly - bug though. I mean it's not some issue of a race during replay, >> it's entirely missing WAL logging for SET TABLESPACE of unlogged >> relations. > > Okidoki. In short: should I send patches for all those things or are you on it? It seems that we are on the same page: using the simple approach, with XLOG_FPI that enforces the flushes for 9.5/master and XLOG_HEAP_NEWPAGE that does the same for ~9.4. For the second issue, I would just need to extract the fix from one of the patches upthread. -- Michael
On 2015-12-10 21:10:57 +0900, Michael Paquier wrote: > In short: should I send patches for all those things or are you on it? I'm on it. I don't think the new name you gave the function, and the new comment, are really an improvement. We already have 'SyncOneBuffer' (unusable for our purpose unfortunately), so going for Single here doesn't seem like a good idea. Andres
On Thu, Dec 10, 2015 at 9:13 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-10 21:10:57 +0900, Michael Paquier wrote: >> In short: should I send patches for all those things or are you on it? > > I'm on it. I don't think the new name you gave the function, and the new > comment, are really an improvement. We already have 'SyncOneBuffer' > (unusable for our purpose unfortunately), so going for Single here > doesn't seem like a good idea. OK, no problem. Thanks. -- Michael
Hi Michael, Robert, On 2015-12-10 21:10:57 +0900, Michael Paquier wrote: > On Thu, Dec 10, 2015 at 9:07 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > >>> Note that in both cases the patches are not complete, we need to fix > >>> as well copy_relation_data@tablecmds.c so as the INIT_FORKNUM pages > >>> are logged all the time. > >> > >> Agreed. It's probably better treated as an entirely different - pretty > >> ugly - bug though. I mean it's not some issue of a race during replay, > >> it's entirely missing WAL logging for SET TABLESPACE of unlogged > >> relations. > > For the second issue, I would just need to extract the fix from one of > the patches upthread. I've pushed the fix for the promotion-related issue. I'm afraid that the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger than that, though. Robert, to catch you up: The issue, discovered by Michael somewhere in this thread, is that copy_relation_data(), used by SET TABLESPACE, does not deal with unlogged tables. Michael, what your earlier patch did was basically @@ -9892,9 +9892,14 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, /* * We need to log the copied data in WAL iff WAL archiving/streaming is - * enabled AND it's a permanent relation. - */ - use_wal = XLogIsNeeded() && relpersistence == RELPERSISTENCE_PERMANENT; + * enabled AND it's a permanent relation. Unlogged relations need to have + * their init fork logged as well, to ensure a consistent on-disk state + * when reset at the end of recovery. + */ + use_wal = XLogIsNeeded() && + (relpersistence == RELPERSISTENCE_PERMANENT || + (relpersistence == RELPERSISTENCE_UNLOGGED && + forkNum == INIT_FORKNUM)); ... nblocks = smgrnblocks(src, forkNum); for (blkno = 0; blkno < nblocks; blkno++) { ...
if (use_wal) log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page); But that's not sufficient for a bunch of reasons: First, and that's the big one, that approach doesn't guarantee that the file will be created in the new tablespace if the file does not have any blocks. Like e.g. the heap relation itself. This will lead to errors like: ERROR: 58P01: could not open file "pg_tblspc/16384/PG_9.6_201511071/12384/16438": No such file or LOCATION: mdopen, md.c:602 The real problem there imo isn't that the copy_relation_data() doesn't deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a unlogged table specific codepath like heap_create_with_catalog() has. A second problem is that the smgrimmedsync() in copy_relation_data() isn't called for the init fork of unlogged relations, even if it needs to. As ATExecSetTableSpace() is also used for indices, the problem exists there as well. Yuck. Greetings, Andres Freund
On Mon, Nov 30, 2015 at 9:53 PM, Michael Paquier <michael.paquier@gmail.com> wrote: >> I feel quite uncomfortable that it solves the problem from a kind >> of nature of unlogged object by arbitrary flagging which is not >> fully corresponds to the nature. If we can deduce the necessity >> of fsync from some nature, it would be preferable. > > INIT_FORKNUM is not something only related to unlogged relations, > indexes use them as well. Eh, what? Indexes use them if they are indexes on unlogged tables, but they'd better not use them in any other situation. Otherwise bad things are going to happen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund <andres@anarazel.de> wrote: > I've, pushed the fix for the promotion related issue. I'm afraid that > the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger > than it though. I think that I would have preferred to fix this by flushing unlogged buffers in bulk at the end of recovery, rather than by flushing them as they are generated. This approach seems needlessly inefficient. Maybe it doesn't matter, but you made the same sort of decision with the find_multixact_start() fix: flushing sooner instead of working harder to make the delayed flush safe. I think that's a bad direction in general, and that it will bite us. Nonetheless, I've been unresponsive on this thread, and it's certainly better to have fixed the bug in a way that isn't what I would prefer than to have not fixed it. > Robert, to catch you up: The isssue, discovered by Michael somewhere in > this thread, is that copy_relation_data(), used by SET TABLESPACE, does > not deal with unlogged tables. Hmm. [ ... ] > But that's not sufficient for a bunch of reasons: > > First, and that's the big one, that approach doesn't guarantee that the > file will be created in the new tablespace if the file does not have any > blocks. Like e.g. the heap relation itself. This will lead to errors > like: > ERROR: 58P01: could not open file "pg_tblspc/16384/PG_9.6_201511071/12384/16438": No such file or > LOCATION: mdopen, md.c:602 > > The real problem there imo isn't that the copy_relation_data() doesn't > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a > unlogged table specific codepath like heap_create_with_catalog() has. It looks to me like somewhere we need to do log_smgrcreate(..., INIT_FORKNUM) in the unlogged table case. RelationCreateStorage() skips this for the main forknum of an unlogged table, which seems OK, but there's nothing that even attempts it for the init fork, which does not seem OK. 
I guess that logic should happen in ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). > A second problem is that the smgrimmedsync() in copy_relation_data() > isn't called for the init fork of unlogged relations, even if it needs > to. That seems easy enough to fix. Can't we just sync all forks at that location for permanent relations, and additionally sync the init fork only for unlogged relations? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-12-10 12:19:12 -0500, Robert Haas wrote: > On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund <andres@anarazel.de> wrote: > > I've, pushed the fix for the promotion related issue. I'm afraid that > > the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger > > than it though. > > I think that I would have preferred to fix this by flushing unlogged > buffers in bulk at the end of recovery, rather than by flushing them > as they are generated. This approach seems needlessly inefficient. We touched on that somewhere in the thread - having to scan all of shared buffers isn't free either, and it'd mean that promotion would take longer because it'd do all the writes at once. As we don't fsync them during the flush itself, and as init forks don't ever get rewritten, I don't think it makes any actual difference? The total number of writes to the OS is the same, no? > Maybe it doesn't matter, but you made the same sort of decision with > the find_multixact_start() fix: flushing sooner instead of working > harder to make the delayed flush safe. I think that's a bad direction > in general, and that it will bite us. Working harder often means added complexity, and complexity for things executed relatively infrequently has the tendency to bite. I don't think that's a simple tradeoff. E.g. in the find_multixact_start() case the flushing isn't just about "itself", it's also about interactions with SlruScanDirectory. Requiring that to also check in-memory buffers wouldn't be entirely trivial... And all that for an option that's essentially executed infrequently. > > The real problem there imo isn't that the copy_relation_data() doesn't > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a > > unlogged table specific codepath like heap_create_with_catalog() has. > > It looks to me like somewhere we need to do log_smgrcreate(..., > INIT_FORKNUM) in the unlogged table case. Yes. 
> RelationCreateStorage() > skips this for the main forknum of an unlogged table, which seems OK, > but there's nothing that even attempts it for the init fork, which > does not seem OK. We unfortunately can't trivially delegate that work to RelationCreateStorage(). E.g. heap_create() documents that only the main fork is created :( > I guess that logic should happen in > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). Looks like it's the easiest place. It sounds worthwhile to check that other locations rewriting tables, e.g. cluster/vacuum full/reindex are safe. > > A second problem is that the smgrimmedsync() in copy_relation_data() > > isn't called for the init fork of unlogged relations, even if it needs > > to. > > That seems easy enough to fix. Can't we just sync all forks at that > location for permanent relations, and additionally sync the init fork > only for unlogged relations? Yes, it's not hard, I just wanted to mention it so it doesn't get forgotten. - Andres
On 2015-12-10 18:36:32 +0100, Andres Freund wrote: > On 2015-12-10 12:19:12 -0500, Robert Haas wrote: > > On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund <andres@anarazel.de> wrote: > > > I've pushed the fix for the promotion related issue. I'm afraid that > > > the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger > > > than I thought. > > > > I think that I would have preferred to fix this by flushing unlogged > > buffers in bulk at the end of recovery, rather than by flushing them > > as they are generated. This approach seems needlessly inefficient. > > We touched on that somewhere in the thread - having to scan all of > shared buffers isn't free either, and it'd mean that promotion would > take longer because it'd do all the writes at once. As we don't fsync > them during the flush itself, and as init forks don't ever get > rewritten, I don't think it makes any actual difference? The total > number of writes to the OS is the same, no? Oh, and the primary, in most places, immediately does an smgrimmedsync() after creating the init fork anyway. So in comparison to that a plain smgrwrite() during replay is nearly free.
Hi, On 2015-12-10 18:36:32 +0100, Andres Freund wrote: > On 2015-12-10 12:19:12 -0500, Robert Haas wrote: > > > The real problem there imo isn't that the copy_relation_data() doesn't > > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a > > > unlogged table specific codepath like heap_create_with_catalog() has. > > > > It looks to me like somewhere we need to do log_smgrcreate(..., > > INIT_FORKNUM) in the unlogged table case. > > Yes. > > > RelationCreateStorage() > > skips this for the main forknum of an unlogged table, which seems OK, > > but there's nothing that even attempts it for the init fork, which > > does not seem OK. > > We unfortunately can't trivially delegate that work to > RelationCreateStorage(). E.g. heap_create() documents that only the main > fork is created :( > > > I guess that logic should happen in > > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). > > Looks like it's the easiest place. > > > A second problem is that the smgrimmedsync() in copy_relation_data() > > > isn't called for the init fork of unlogged relations, even if it needs > > > to. Here's a patch doing that. It's not yet fully polished, but I wanted to get it out, because I noticed one thing: In ATExecSetTableSpace(), for !main forks, we currently call smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT relations. That seems a bit odd to me. It currently seems to be without further consequence because, if there's actual data in the fork, we'll just create the relation in _mdfd_getseg(); or we can cope with the relation not being there. But to me that feels wrong. It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT, not just INIT_FORKNUM. What do you guys think? > It sounds worthwhile to check that other locations rewriting tables, > e.g. cluster/vacuum full/reindex are safe. Seems to be ok, on a first glance. Andres
Attachment
On Fri, Dec 11, 2015 at 1:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Nov 30, 2015 at 9:53 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >>> I feel quite uncomfortable that it solves the problem from a kind >>> of nature of unlogged object by arbitrary flagging which is not >>> fully corresponds to the nature. If we can deduce the necessity >>> of fsync from some nature, it would be preferable. >> >> INIT_FORKNUM is not something only related to unlogged relations, >> indexes use them as well. > > Eh, what? > > Indexes use them if they are indexes on unlogged tables, but they'd > better not use them in any other situation. Otherwise bad things are > going to happen. Yes, this was badly formulated, caused by my lack of knowledge of unlogged tables; I think I got it now :) Why don't we actually put some asserts in those code paths to state that INIT_FORKNUM-specific code can only be used for unlogged relations? Just a thought... -- Michael
Hello, At Thu, 10 Dec 2015 20:27:01 +0100, Andres Freund <andres@anarazel.de> wrote in <20151210192701.GC11331@alap3.anarazel.de> > > > > A second problem is that the smgrimmedsync() in copy_relation_data() > > > > isn't called for the init fork of unlogged relations, even if it needs > > > > to. > > Here's a patch doing that. It's not yet fully polished, but I wanted to > get it out, because I noticed one thing: > > In ATExecSetTableSpace(), for !main forks, we currently call > smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT > relations. That seems a bit odd to me. It currently seems to be without > further consequence because, if there's actual data in the fork, we'll > just create the relation in _mdfd_getseg(); or we can cope with the > relation not being there. But to me that feels wrong. FWIW I agree with that. > It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT, > not just INIT_FORKNUM. What do you guys think? What it is doing seems reasonable to me, but copying_initfork doesn't seem to be necessary. Triggering both log_newpage() and smgrimmedsync() based on use_wal, with its value computed to take INIT_FORKNUM into account, would be more descriptive (more readable, in other words). > > It sounds worthwhile to check that other locations rewriting tables, > > e.g. cluster/vacuum full/reindex are safe. > > Seems to be ok, on a first glance. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Dec 11, 2015 at 1:32 AM, Andres Freund <andres@anarazel.de> wrote: > I've pushed the fix for the promotion related issue. Thanks! It is great to see this issue addressed. -- Michael
On Fri, Dec 11, 2015 at 4:27 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2015-12-10 18:36:32 +0100, Andres Freund wrote: >> On 2015-12-10 12:19:12 -0500, Robert Haas wrote: >> > > The real problem there imo isn't that the copy_relation_data() doesn't >> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a >> > > unlogged table specific codepath like heap_create_with_catalog() has. >> > >> > It looks to me like somewhere we need to do log_smgrcreate(..., >> > INIT_FORKNUM) in the unlogged table case. >> >> Yes. >> >> > RelationCreateStorage() >> > skips this for the main forknum of an unlogged table, which seems OK, >> > but there's nothing that even attempts it for the init fork, which >> > does not seem OK. >> >> We unfortunately can't trivially delegate that work to >> RelationCreateStorage(). E.g. heap_create() documents that only the main >> fork is created :( >> >> > I guess that logic should happen in >> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). >> >> Looks like it's the easiest place. > >> > > A second problem is that the smgrimmedsync() in copy_relation_data() >> > > isn't called for the init fork of unlogged relations, even if it needs >> > > to. > > Here's a patch doing that. It's not yet fully polished, but I wanted to > get it out, because I noticed one thing: > > In ATExecSetTableSpace(), for !main forks, we currently call > smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT > relations. That seems a bit odd to me. It currently seems to be without > further consequence because, if there's actual data in the fork, we'll > just create the relation in _mdfd_getseg(); or we can cope with the > relation not being there. But to me that feels wrong. > > It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT, > not just INIT_FORKNUM. What do you guys think? This fixes the problem in my environment. 
+ if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+     (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+      forkNum == INIT_FORKNUM))
+     log_smgrcreate(&newrnode, forkNum);
There should be an XLogIsNeeded() check as well. Removing the check on RELPERSISTENCE_UNLOGGED is fine as well... Not mandatory though :)
+ * The init fork for an unlogged relation in many respects has to be
+ * treated the same as normal relation, changes need to be WAL logged and
+ * it needs to be synced to disk.
+ */
+ copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
+     forkNum == INIT_FORKNUM;
Here as well just a check on INIT_FORKNUM would be fine. >> It sounds worthwhile to check that other locations rewriting tables, >> e.g. cluster/vacuum full/reindex are safe. > > Seems to be ok, on a first glance. Yeah. REINDEX relies on index_build to recreate what it should... The others are looking fine as well. I have tested it in any case, and the files produced are consistent on the standby and its master. -- Michael
On 2015-12-11 15:43:24 +0900, Kyotaro HORIGUCHI wrote: > What it is doing seems reasonable to me, but copying_initfork doesn't > seem to be necessary. Triggering both log_newpage() and smgrimmedsync() > based on use_wal, with its value computed to take INIT_FORKNUM into > account, would be more descriptive (more readable, in other words). The smgrimmedsync() has a different condition: it doesn't, and may not, check for XLogIsNeeded(). Andres
On Fri, Dec 11, 2015 at 4:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Dec 11, 2015 at 4:27 AM, Andres Freund <andres@anarazel.de> wrote: >> On 2015-12-10 18:36:32 +0100, Andres Freund wrote: >>> On 2015-12-10 12:19:12 -0500, Robert Haas wrote: >>> > > The real problem there imo isn't that the copy_relation_data() doesn't >>> > > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a >>> > > unlogged table specific codepath like heap_create_with_catalog() has. >>> > >>> > It looks to me like somewhere we need to do log_smgrcreate(..., >>> > INIT_FORKNUM) in the unlogged table case. >>> >>> Yes. >>> >>> > RelationCreateStorage() >>> > skips this for the main forknum of an unlogged table, which seems OK, >>> > but there's nothing that even attempts it for the init fork, which >>> > does not seem OK. >>> >>> We unfortunately can't trivially delegate that work to >>> RelationCreateStorage(). E.g. heap_create() documents that only the main >>> fork is created :( >>> >>> > I guess that logic should happen in >>> > ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false). >>> >>> Looks like it's the easiest place. >> >>> > > A second problem is that the smgrimmedsync() in copy_relation_data() >>> > > isn't called for the init fork of unlogged relations, even if it needs >>> > > to. >> >> Here's a patch doing that. It's not yet fully polished, but I wanted to >> get it out, because I noticed one thing: >> >> In ATExecSetTableSpace(), for !main forks, we currently call >> smgrcreate(), but not log_smgrcreate(). Even for PERSISTENT >> relations. That seems a bit odd to me. It currently seems to be without >> further consequence because, if there's actual data in the fork, we'll >> just create the relation in _mdfd_getseg(); or we can cope with the >> relation not being there. But to me that feels wrong. >> >> It seems better to do the log_smgrcreate() for RELPERSISTENCE_PERMANENT, >> not just INIT_FORKNUM. 
>> What do you guys think?
>
> This fixes the problem in my environment.
>
> + if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
> +     (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> +      forkNum == INIT_FORKNUM))
> +     log_smgrcreate(&newrnode, forkNum);
> There should be an XLogIsNeeded() check as well. Removing the check on
> RELPERSISTENCE_UNLOGGED is fine as well... Not mandatory though :)
>
> + * The init fork for an unlogged relation in many respects has to be
> + * treated the same as normal relation, changes need to be WAL logged and
> + * it needs to be synced to disk.
> + */
> + copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
> +     forkNum == INIT_FORKNUM;
> Here as well just a check on INIT_FORKNUM would be fine.
Should we consider this bug a 9.5 blocker? I feel uneasy about releasing a new major version while knowing it has bugs, and this one is not cool, so I have added it to the list of open items. We know the problem and there is a patch, so this is definitely solvable. -- Michael
On 2015-12-12 20:49:52 +0900, Michael Paquier wrote: > Should we consider this bug a 9.5 blocker? I don't think so. But either way, I'm right now working on getting it fixed anyway.
On 2015-12-11 16:54:45 +0900, Michael Paquier wrote:
> + if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
> +     (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> +      forkNum == INIT_FORKNUM))
> +     log_smgrcreate(&newrnode, forkNum);
> There should be an XLogIsNeeded() check as well.
RelationCreateStorage(), which creates the main fork, has no such check. Seems better to stay symmetric, even if it's not strictly necessary.
> Removing the check on RELPERSISTENCE_UNLOGGED is fine as well... Not mandatory though :)
I've gone back and forth on this; I can't really convince myself either way. We might end up reusing init forks for 'global temporary tables' or somesuch, so being a bit stricter seems slightly better. Anyway, it's of no actual consequence for now. Andres
On 2015-12-12 12:52:21 +0100, Andres Freund wrote: > On 2015-12-12 20:49:52 +0900, Michael Paquier wrote: > > Should we consider this bug a 9.5 blocker? > > I don't think so. But either way, I'm right now working on getting it > fixed anyway. And done. Took a bit longer than predicted - I had to adjust my test scripts to cope with < 9.4 not having tablespace mapping. Rather annoying. It'd be really convenient if tablespaces relative to the main data directory were supported...
Michael Paquier <michael.paquier@gmail.com> writes: > Should we consider this bug a 9.5 blocker? I feel uneasy with the fact > of releasing a new major version knowing that we know some bugs on it, > and this one is not cool so I have added it in the list of open items. > We know the problem and there is a patch, so this is definitely > solvable. FWIW, general project policy has been that pre-existing bugs are not release blockers as such. However, if we have a fix in hand that we would like to get some field testing on, it's certainly desirable to push it out first in a beta or RC release, rather than having it first hit the field in stable-branch updates. So we might delay a beta/RC to make sure such a fix is available for testing. But that's not quite the same thing as a release blocker. To my mind, a release blocker is something that would make it undesirable for users to upgrade to the new release compared to the previous branch, so bugs the branches have in common are never that. regards, tom lane
On Sat, Dec 12, 2015 at 10:31 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-12 12:52:21 +0100, Andres Freund wrote: >> On 2015-12-12 20:49:52 +0900, Michael Paquier wrote: >> > Should we consider this bug a 9.5 blocker? >> >> I don't think so. But either way, I'm right now working on getting it >> fixed anyway. > > And done. Took a bit longer than predicted - I had to adjust my test > scripts to cope with < 9.4 not having tablespace mapping. Rather > annoying. Thanks! Cool to see this thread finally addressed. > It'd be really convenient if tablespaces relative to the main data > directory were supported... I got bitten at some point as well by the fact that tablespaces created after a base backup is taken have their location map to the same point for a master and a standby :) -- Michael