Thread: [HACKERS] hash index on unlogged tables doesn't behave as expected

[HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

03 July 2017, 06:42:01

While discussing the behavior of hash indexes with Bruce in the nearby
thread [1], it has been noticed that hash index on unlogged tables
doesn't behave as expected.  Prior to 10, it has the different set of
problems (mainly because hash indexes are not WAL-logged) which were
discussed on that thread [1], however when I checked, it doesn't work
even for 10.  Below are steps to reproduce the problem.

1. Setup master and standby
2. On the master, create unlogged table and hash index.
   2A.  Create unlogged table t1(c1 int);
   2B.  Create hash index idx_t1_hash on t1 using hash(c1);
3. On Standby, try selecting data,
   select * from t1;
   ERROR:  cannot access temporary or unlogged relations during recovery
---Till here everything works as expected
4. Promote standby to master (I have just stopped the standby and
master and removed recovery.conf file from the standby database
location) and try starting the new master, it
gives below error and doesn't get started.
FATAL:  could not create file "base/12700/16387": File exists

The basic issue was that the WAL logging for Create Index operation
was oblivion of the fact that for unlogged tables only INIT forks need
to be logged.  Another point which we need to consider is that while
replaying the WAL for the create index operation, we need to flush the
buffer if it is for init fork.  This needs to be done only for pages
that can be part of init fork file (like metapage, bitmappage).
Attached patch fixes the issue.

Another approach to fix the issue could be that always log complete
pages for the create index operation on unlogged tables (in
hashbuildempty).  However, the logic for initial hash index pages
needs us to perform certain other actions (like update metapage after
the creation of bitmappage) which make it difficult to follow that
approach.

I think this should be considered a PostgreSQL 10 open item.


[1] - https://www.postgresql.org/message-id/20170630005634.GA4448%40momjian.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

fix_unlogged_hash_index_issue_v1.patch

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Ashutosh Sharma

Date:

03 July 2017, 12:54:30

Hi,

On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> While discussing the behavior of hash indexes with Bruce in the nearby
> thread [1], it has been noticed that hash index on unlogged tables
> doesn't behave as expected. Prior to 10, it has the different set of
> problems (mainly because hash indexes are not WAL-logged) which were
> discussed on that thread [1], however when I checked, it doesn't work
> even for 10. Below are steps to reproduce the problem.
>
> 1. Setup master and standby
> 2. On the master, create unlogged table and hash index.
> 2A. Create unlogged table t1(c1 int);
> 2B. Create hash index idx_t1_hash on t1 using hash(c1);
> 3. On Standby, try selecting data,
> select * from t1;
> ERROR: cannot access temporary or unlogged relations during recovery
> ---Till here everything works as expected
> 4. Promote standby to master (I have just stopped the standby and
> master and removed recovery.conf file from the standby database
> location) and try starting the new master, it
> gives below error and doesn't get started.
> FATAL: could not create file "base/12700/16387": File exists
>
> The basic issue was that the WAL logging for Create Index operation
> was oblivion of the fact that for unlogged tables only INIT forks need
> to be logged. Another point which we need to consider is that while
> replaying the WAL for the create index operation, we need to flush the
> buffer if it is for init fork. This needs to be done only for pages
> that can be part of init fork file (like metapage, bitmappage).
> Attached patch fixes the issue.
>
> Another approach to fix the issue could be that always log complete
> pages for the create index operation on unlogged tables (in
> hashbuildempty). However, the logic for initial hash index pages
> needs us to perform certain other actions (like update metapage after
> the creation of bitmappage) which make it difficult to follow that
> approach.
>

Thanks for sharing the steps to reproduce the issue in detail. I could easily reproduce this issue. I also had a quick look into the patch and the fix looks fine to me. However, it would be good if we can add at least one test for unlogged table with hash index in the regression test suite.

Apart from that i have tested the patch and the patch seems to be working fine. Here are the steps that i have followed to verify the fix,

With Patch:
========
postgres=# SELECT pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/14915/16384
(1 row)

postgres=# SELECT pg_relation_filepath('idx_t1_hash');
pg_relation_filepath
----------------------
base/14915/16387
(1 row)

Master:
=====
[ashu@localhost bin]$ ls -l ../master/base/14915/1638*
-rw-------. 1 ashu ashu 18128896 Jul 3 14:29 ../master/base/14915/16384
-rw-------. 1 ashu ashu 24576 Jul 3 14:29 ../master/base/14915/16384_fsm
-rw-------. 1 ashu ashu 0 Jul 3 14:28 ../master/base/14915/16384_init
-rw-------. 1 ashu ashu 22339584 Jul 3 14:29 ../master/base/14915/16387
-rw-------. 1 ashu ashu 32768 Jul 3 14:28 ../master/base/14915/16387_init

Standby:
======
[ashu@localhost bin]$ ls -l ../standby/base/14915/1638*
-rw-------. 1 ashu ashu 0 Jul 3 14:28 ../standby/base/14915/16384_init
-rw-------. 1 ashu ashu 32768 Jul 3 14:28 ../standby/base/14915/16387_init

Without patch:
==========
postgres=# SELECT pg_relation_filepath('t1');
pg_relation_filepath
----------------------
base/14915/16384
(1 row)

postgres=# SELECT pg_relation_filepath('idx_t1_hash');
pg_relation_filepath
----------------------
base/14915/16387
(1 row)

Master:
=====
[ashu@localhost bin]$ ls -l ../master/base/14915/1638*
-rw-------. 1 ashu ashu 18128896 Jul 3 14:36 ../master/base/14915/16384
-rw-------. 1 ashu ashu 24576 Jul 3 14:36 ../master/base/14915/16384_fsm
-rw-------. 1 ashu ashu 0 Jul 3 14:35 ../master/base/14915/16384_init
-rw-------. 1 ashu ashu 22339584 Jul 3 14:36 ../master/base/14915/16387
-rw-------. 1 ashu ashu 32768 Jul 3 14:35 ../master/base/14915/16387_init

Standby:
======
[ashu@localhost bin]$ ls -l ../standby/base/14915/1638*
-rw-------. 1 ashu ashu 0 Jul 3 14:35 ../standby/base/14915/16384_init
-rw-------. 1 ashu ashu 73728 Jul 3 14:35 ../standby/base/14915/16387
-rw-------. 1 ashu ashu 24576 Jul 3 14:35 ../standby/base/14915/16387_init

As seen from the test results above, it is clear that without patch, both main fork and init fork were getting WAL logged as the create hash index WAL logging was being done without checking forkNum type.

_hash_init()
{
................
................
log_newpage(&rel->rd_node,
forkNum,
blkno,
BufferGetPage(buf),
true);
_hash_relbuf(rel, buf);
....................
....................
}

I have also ran the WAL consistency check tool to verify the things and didn't find any issues. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

03 July 2017, 13:40:44

On Mon, Jul 3, 2017 at 3:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> The basic issue was that the WAL logging for Create Index operation
>> was oblivion of the fact that for unlogged tables only INIT forks need
>> to be logged.  Another point which we need to consider is that while
>> replaying the WAL for the create index operation, we need to flush the
>> buffer if it is for init fork.  This needs to be done only for pages
>> that can be part of init fork file (like metapage, bitmappage).
>> Attached patch fixes the issue.
>>
>> Another approach to fix the issue could be that always log complete
>> pages for the create index operation on unlogged tables (in
>> hashbuildempty).  However, the logic for initial hash index pages
>> needs us to perform certain other actions (like update metapage after
>> the creation of bitmappage) which make it difficult to follow that
>> approach.
>>
>
> Thanks for sharing the steps to reproduce the issue in detail. I could
> easily reproduce this issue. I also had a quick look into the patch and the
> fix looks fine to me. However, it would be good if we can add at least one
> test for unlogged table with hash index in the regression test suite.
>

There is already such a test, see create_index.sql.
CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id
int4_ops);

Do you have something else in mind?

I think the problem mentioned in this thread can be caught only if we
promote the standby and restart it.  We might be able to write it
using TAP framework, but I think such a test should exist for other
index types as well or in general for unlogged tables.  I am not
opposed to writing such a test if you and or others feel so.

> Apart from that i have tested the patch and the patch seems to be working
> fine.

Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Ashutosh Sharma

Date:

03 July 2017, 14:44:34

Hi,

On Mon, Jul 3, 2017 at 4:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 3, 2017 at 3:24 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > On Mon, Jul 3, 2017 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> The basic issue was that the WAL logging for Create Index operation
> >> was oblivion of the fact that for unlogged tables only INIT forks need
> >> to be logged.  Another point which we need to consider is that while
> >> replaying the WAL for the create index operation, we need to flush the
> >> buffer if it is for init fork.  This needs to be done only for pages
> >> that can be part of init fork file (like metapage, bitmappage).
> >> Attached patch fixes the issue.
> >>
> >> Another approach to fix the issue could be that always log complete
> >> pages for the create index operation on unlogged tables (in
> >> hashbuildempty).  However, the logic for initial hash index pages
> >> needs us to perform certain other actions (like update metapage after
> >> the creation of bitmappage) which make it difficult to follow that
> >> approach.
> >>
> >
> > Thanks for sharing the steps to reproduce the issue in detail. I could
> > easily reproduce this issue. I also had a quick look into the patch and the
> > fix looks fine to me. However, it would be good if we can add at least one
> > test for unlogged table with hash index in the regression test suite.
> >
>
> There is already such a test, see create_index.sql.
> CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
> CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id
> int4_ops);
>
> Do you have something else in mind?

Yes, that's what i had in my mind. But, good to know that we already
have a test-case with hash index created on an unlogged table.

>
>
> I think the problem mentioned in this thread can be caught only if we
> promote the standby and restart it.  We might be able to write it
> using TAP framework, but I think such a test should exist for other
> index types as well or in general for unlogged tables.  I am not
> opposed to writing such a test if you and or others feel so.

I think, it would be a good thing to add such test cases using TAP
framework but as you said, it would be good to take others opinion as
well on this. Thanks.

>
> > Apart from that i have tested the patch and the patch seems to be working
> > fine.
>
> Thanks.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Michael Paquier

Date:

04 July 2017, 10:53:51

On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> There is already such a test, see create_index.sql.
> CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
> CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id
> int4_ops);
>
> Do you have something else in mind?
>
> I think the problem mentioned in this thread can be caught only if we
> promote the standby and restart it.  We might be able to write it
> using TAP framework, but I think such a test should exist for other
> index types as well or in general for unlogged tables.  I am not
> opposed to writing such a test if you and or others feel so.

There have been many requests for something like that. This overlaps a
lot with the existing make check though, so a buildfarm module using
wal_consistency_checking may be a better idea than something in core.

>> Apart from that i have tested the patch and the patch seems to be working
>> fine.
>
> Thanks.

It seems to me that this qualifies as an open item for 10. WAL-logging
is new for hash tables. Amit, could you add one?

+   use_wal = RelationNeedsWAL(rel) ||
+             (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+              forkNum == INIT_FORKNUM);
In access AMs, empty() routines are always only called for unlogged
relations, so you could relax that to check for INIT_FORKNUM only.

Looking at the patch, I think that you are right to do the logging
directly in _hash_init() and not separately in hashbuildempty().
Compared to other relations the init data is more dynamic.

+       /*
+        * Force the on-disk state of init forks to always be in sync with the
+        * state in shared buffers.  See XLogReadBufferForRedoExtended.
+        */
+       XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
+       if (forknum == INIT_FORKNUM)
+               FlushOneBuffer(metabuf);
I find those forced syncs a bit surprising. The buffer is already
marked as dirty, so even if there is a concurrent checkpoint they
would be flushed. If recovery comes again here, then they would just
be replayed. I am unclear why XLogReadBufferForRedoExtended is not
enough to replay a FPW of that.
-- 
Michael

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

04 July 2017, 12:19:26

On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> It seems to me that this qualifies as an open item for 10. WAL-logging
> is new for hash tables. Amit, could you add one?
>

Added.

> +   use_wal = RelationNeedsWAL(rel) ||
> +             (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> +              forkNum == INIT_FORKNUM);
> In access AMs, empty() routines are always only called for unlogged
> relations, so you could relax that to check for INIT_FORKNUM only.
>

Yeah, but _hash_init() is an exposed API, so I am not sure it is a
good idea to make such an assumption at that level of code.  Now, as
there is no current user which wants to use it any other way, we can
make such an assumption, but does it serve any purpose?

> Looking at the patch, I think that you are right to do the logging
> directly in _hash_init() and not separately in hashbuildempty().
> Compared to other relations the init data is more dynamic.
>
> +       /*
> +        * Force the on-disk state of init forks to always be in sync with the
> +        * state in shared buffers.  See XLogReadBufferForRedoExtended.
> +        */
> +       XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
> +       if (forknum == INIT_FORKNUM)
> +               FlushOneBuffer(metabuf);
> I find those forced syncs a bit surprising. The buffer is already
> marked as dirty, so even if there is a concurrent checkpoint they
> would be flushed.
>

What if there is no concurrent checkpoint activity and the server is
shutdown?  Remember this replay happens on standby and we will just
try to perform Restartpoint and there is no guarantee that it will
flush the buffers.  If the buffers are not flushed at shutdown, then
the Init fork will be empty on restart and the hash index will be
corrupt.

> If recovery comes again here, then they would just
> be replayed. I am unclear why XLogReadBufferForRedoExtended is not
> enough to replay a FPW of that.
>

Where does FPW come into the picture for a replay of metapage or bitmappage?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Noah Misch

Date:

06 July 2017, 06:02:06

On Mon, Jul 03, 2017 at 09:12:01AM +0530, Amit Kapila wrote:
> While discussing the behavior of hash indexes with Bruce in the nearby
> thread [1], it has been noticed that hash index on unlogged tables
> doesn't behave as expected.  Prior to 10, it has the different set of
> problems (mainly because hash indexes are not WAL-logged) which were
> discussed on that thread [1], however when I checked, it doesn't work
> even for 10.  Below are steps to reproduce the problem.
> 
> 1. Setup master and standby
> 2. On the master, create unlogged table and hash index.
>    2A.  Create unlogged table t1(c1 int);
>    2B.  Create hash index idx_t1_hash on t1 using hash(c1);
> 3. On Standby, try selecting data,
>    select * from t1;
>    ERROR:  cannot access temporary or unlogged relations during recovery
> ---Till here everything works as expected
> 4. Promote standby to master (I have just stopped the standby and
> master and removed recovery.conf file from the standby database
> location) and try starting the new master, it
> gives below error and doesn't get started.
> FATAL:  could not create file "base/12700/16387": File exists
> 
> The basic issue was that the WAL logging for Create Index operation
> was oblivion of the fact that for unlogged tables only INIT forks need
> to be logged.  Another point which we need to consider is that while
> replaying the WAL for the create index operation, we need to flush the
> buffer if it is for init fork.  This needs to be done only for pages
> that can be part of init fork file (like metapage, bitmappage).
> Attached patch fixes the issue.
> 
> Another approach to fix the issue could be that always log complete
> pages for the create index operation on unlogged tables (in
> hashbuildempty).  However, the logic for initial hash index pages
> needs us to perform certain other actions (like update metapage after
> the creation of bitmappage) which make it difficult to follow that
> approach.
> 
> I think this should be considered a PostgreSQL 10 open item.
> 
> 
> [1] - https://www.postgresql.org/message-id/20170630005634.GA4448%40momjian.us

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

Re: [HACKERS] hash index on unlogged tables doesn't behave asexpected

From

Kyotaro HORIGUCHI

Date:

06 July 2017, 12:54:47

Hello,

At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+SYqCmA7ioTBpJHcO-B-rf8A=N9Gr1-RP3RhwecB5E-A@mail.gmail.com>
> On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > It seems to me that this qualifies as an open item for 10. WAL-logging
> > is new for hash tables. Amit, could you add one?
> >
> 
> Added.
> 
> > +   use_wal = RelationNeedsWAL(rel) ||
> > +             (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> > +              forkNum == INIT_FORKNUM);
> > In access AMs, empty() routines are always only called for unlogged
> > relations, so you could relax that to check for INIT_FORKNUM only.
> >
> 
> Yeah, but _hash_init() is an exposed API, so I am not sure it is a
> good idea to make such an assumption at that level of code.  Now, as
> there is no current user which wants to use it any other way, we can
> make such an assumption, but does it serve any purpose?

Check for INIT_FORKNUM appears both accompanied and not
accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
is the convention here.

The difference of the two is an init fork of TEMP
relations. However I believe that init fork is by definition a
part of an unlogged relation, it seems to me that it ought not to
be wal-logged if we had it. From this viewpoint, the middle line
makes sense.

> > Looking at the patch, I think that you are right to do the logging
> > directly in _hash_init() and not separately in hashbuildempty().
> > Compared to other relations the init data is more dynamic.
> >
> > +       /*
> > +        * Force the on-disk state of init forks to always be in sync with the
> > +        * state in shared buffers.  See XLogReadBufferForRedoExtended.
> > +        */
> > +       XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
> > +       if (forknum == INIT_FORKNUM)
> > +               FlushOneBuffer(metabuf);
> > I find those forced syncs a bit surprising. The buffer is already
> > marked as dirty, so even if there is a concurrent checkpoint they
> > would be flushed.
> >
> 
> What if there is no concurrent checkpoint activity and the server is
> shutdown?  Remember this replay happens on standby and we will just
> try to perform Restartpoint and there is no guarantee that it will
> flush the buffers.  If the buffers are not flushed at shutdown, then
> the Init fork will be empty on restart and the hash index will be
> corrupt.

XLogReadBufferForRedoExtended() automatically force the flush for
a XLOG record on init fork that having FPW imaeges. And
_hash_init seems to have emitted such a XLOG record using
log_newpage.

> > If recovery comes again here, then they would just
> > be replayed. I am unclear why XLogReadBufferForRedoExtended is not
> > enough to replay a FPW of that.
> >
> 
> Where does FPW come into the picture for a replay of metapage or bitmappage?

Since required FPW seems to be emitted and the flush should be
done automatically, the issuer side (_hash_init) may attach wrong
FPW images if hash_xlog_init_meta_page requires additional
flushes. But I don't see such a flaw there. I think Michael wants
to say such a kind of thing.

By the way the comment of the function ReadBufferWithoutRelcache
has the following sentense.

| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay.  If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.

and does

| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,                         mode, strategy,
&hit);

This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] hash index on unlogged tables doesn't behave asexpected

From

Kyotaro HORIGUCHI

Date:

06 July 2017, 13:08:55

FWIW..

At Thu, 06 Jul 2017 18:54:47 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170706.185447.256482539.horiguchi.kyotaro@lab.ntt.co.jp>
> > > +       /*
> > > +        * Force the on-disk state of init forks to always be in sync with the
> > > +        * state in shared buffers.  See XLogReadBufferForRedoExtended.
> > > +        */
> > > +       XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
> > > +       if (forknum == INIT_FORKNUM)
> > > +               FlushOneBuffer(metabuf);
> > > I find those forced syncs a bit surprising. The buffer is already
> > > marked as dirty, so even if there is a concurrent checkpoint they
> > > would be flushed.
> > >
> > 
> > What if there is no concurrent checkpoint activity and the server is
> > shutdown?  Remember this replay happens on standby and we will just
> > try to perform Restartpoint and there is no guarantee that it will
> > flush the buffers.  If the buffers are not flushed at shutdown, then
> > the Init fork will be empty on restart and the hash index will be
> > corrupt.
> 
> XLogReadBufferForRedoExtended() automatically force the flush for
> a XLOG record on init fork that having FPW imaeges. And
> _hash_init seems to have emitted such a XLOG record using
> log_newpage.
> 
> > > If recovery comes again here, then they would just
> > > be replayed. I am unclear why XLogReadBufferForRedoExtended is not
> > > enough to replay a FPW of that.
> > >
> > 
> > Where does FPW come into the picture for a replay of metapage or bitmappage?
> 
> Since required FPW seems to be emitted and the flush should be
> done automatically, the issuer side (_hash_init) may attach wrong

"may attach" -> "may have attached"

> FPW images if hash_xlog_init_meta_page requires additional
> flushes. But I don't see such a flaw there. I think Michael wants
> to say such a kind of thing.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

06 July 2017, 14:29:59

On Thu, Jul 6, 2017 at 3:24 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+SYqCmA7ioTBpJHcO-B-rf8A=N9Gr1-RP3RhwecB5E-A@mail.gmail.com>
>> On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > It seems to me that this qualifies as an open item for 10. WAL-logging
>> > is new for hash tables. Amit, could you add one?
>> >
>>
>> Added.
>>
>> > +   use_wal = RelationNeedsWAL(rel) ||
>> > +             (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
>> > +              forkNum == INIT_FORKNUM);
>> > In access AMs, empty() routines are always only called for unlogged
>> > relations, so you could relax that to check for INIT_FORKNUM only.
>> >
>>
>> Yeah, but _hash_init() is an exposed API, so I am not sure it is a
>> good idea to make such an assumption at that level of code.  Now, as
>> there is no current user which wants to use it any other way, we can
>> make such an assumption, but does it serve any purpose?
>
> Check for INIT_FORKNUM appears both accompanied and not
> accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
> is the convention here.
>
> The difference of the two is an init fork of TEMP
> relations. However I believe that init fork is by definition a
> part of an unlogged relation, it seems to me that it ought not to
> be wal-logged if we had it. From this viewpoint, the middle line
> makes sense.
>

What is the middle line?  Are you okay with a current check or do you
see any problem with it or do you prefer to write it differently?

>
>
>> > Looking at the patch, I think that you are right to do the logging
>> > directly in _hash_init() and not separately in hashbuildempty().
>> > Compared to other relations the init data is more dynamic.
>> >
>> > +       /*
>> > +        * Force the on-disk state of init forks to always be in sync with the
>> > +        * state in shared buffers.  See XLogReadBufferForRedoExtended.
>> > +        */
>> > +       XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
>> > +       if (forknum == INIT_FORKNUM)
>> > +               FlushOneBuffer(metabuf);
>> > I find those forced syncs a bit surprising. The buffer is already
>> > marked as dirty, so even if there is a concurrent checkpoint they
>> > would be flushed.
>> >
>>
>> What if there is no concurrent checkpoint activity and the server is
>> shutdown?  Remember this replay happens on standby and we will just
>> try to perform Restartpoint and there is no guarantee that it will
>> flush the buffers.  If the buffers are not flushed at shutdown, then
>> the Init fork will be empty on restart and the hash index will be
>> corrupt.
>
> XLogReadBufferForRedoExtended() automatically force the flush for
> a XLOG record on init fork that having FPW imaeges. And
> _hash_init seems to have emitted such a XLOG record using
> log_newpage.
>

Sure, but log_newpage is not used for metapage and bitmappage.

>> > If recovery comes again here, then they would just
>> > be replayed. I am unclear why XLogReadBufferForRedoExtended is not
>> > enough to replay a FPW of that.
>> >
>>
>> Where does FPW come into the picture for a replay of metapage or bitmappage?
>
> Since required FPW seems to be emitted and the flush should be
> done automatically, the issuer side (_hash_init) may attach wrong
> FPW images if hash_xlog_init_meta_page requires additional> flushes. But I don't see such a flaw there. I think
Michaelwants
 
> to say such a kind of thing.
>

I am not able to understand what you want to say, but I think you
don't see any problem with the patch as such.

>
>
> By the way the comment of the function ReadBufferWithoutRelcache
> has the following sentense.
>

I don't think we need anything with respect to this patch because
ReadBufferWithoutRelcache is used in case of FPW replay of INIT_FORK
pages as well.  Do you have something else in mind which I am not able
to see?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Robert Haas

Date:

08 July 2017, 06:11:06

On Wed, Jul 5, 2017 at 11:02 PM, Noah Misch <noah@leadboat.com> wrote:
> [Action required within three days.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.

I don't intend to rush this in before the beta2 wrap, although some
other committer is welcome to do so if they feel confident in the fix.
I will update again by July 17th.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Robert Haas

Date:

08 July 2017, 06:22:33

On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Check for INIT_FORKNUM appears both accompanied and not
> accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
> is the convention here.

Checking only for INIT_FORKNUM seems logical to me.  Also checking for
RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit.  I
guess Amit copied the test from ATExecSetTablespace, which does it as
he did, but it seems unnecessarily long-winded.

> The difference of the two is an init fork of TEMP
> relations. However I believe that init fork is by definition a
> part of an unlogged relation, it seems to me that it ought not to
> be wal-logged if we had it. From this viewpoint, the middle line
> makes sense.

Actually, the init fork of an unlogged relation *must* be WAL-logged.
All other forks are removed on a crash (and the main fork is copied
anew from the init fork).  But the init fork itself must survive -
therefore it, and only it, must be WAL-logged and thus durable.

> By the way the comment of the function ReadBufferWithoutRelcache
> has the following sentense.
>
> | * NB: At present, this function may only be used on permanent relations, which
> | * is OK, because we only use it during XLOG replay.  If in the future we
> | * want to use it on temporary or unlogged relations, we could pass additional
> | * parameters.
>
> and does
>
> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
>                                                          mode, strategy, &hit);
>
> This surely works since BufferAlloc recognizes INIT_FORK. But
> some adjustment may be needed around here.

Yeah, it should probably mention that the init fork of an unlogged
relation is also OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

08 July 2017, 06:36:13

On Sat, Jul 8, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Check for INIT_FORKNUM appears both accompanied and not
>> accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
>> is the convention here.
>
> Checking only for INIT_FORKNUM seems logical to me.  Also checking for
> RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit.  I
> guess Amit copied the test from ATExecSetTablespace, which does it as
> he did, but it seems unnecessarily long-winded.
>

Okay.  If you and Michael feel the check that way is better, I will
change and submit the patch.

>> The difference of the two is an init fork of TEMP
>> relations. However I believe that init fork is by definition a
>> part of an unlogged relation, it seems to me that it ought not to
>> be wal-logged if we had it. From this viewpoint, the middle line
>> makes sense.
>
> Actually, the init fork of an unlogged relation *must* be WAL-logged.
> All other forks are removed on a crash (and the main fork is copied
> anew from the init fork).  But the init fork itself must survive -
> therefore it, and only it, must be WAL-logged and thus durable.
>
>> By the way the comment of the function ReadBufferWithoutRelcache
>> has the following sentense.
>>
>> | * NB: At present, this function may only be used on permanent relations, which
>> | * is OK, because we only use it during XLOG replay.  If in the future we
>> | * want to use it on temporary or unlogged relations, we could pass additional
>> | * parameters.
>>
>> and does
>>
>> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
>>                                                          mode, strategy, &hit);
>>
>> This surely works since BufferAlloc recognizes INIT_FORK. But
>> some adjustment may be needed around here.
>
> Yeah, it should probably mention that the init fork of an unlogged
> relation is also OK.
>

I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, but
let me know if you think that we should add such a comment in this
patch itself.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Robert Haas

Date:

08 July 2017, 06:38:20

On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think we should do that as a separate patch (I can write the same as
> well) because that is not new behavior introduced by this patch, ...

True, although maybe hash indexes are the only way that happens today?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

08 July 2017, 06:53:02

On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think we should do that as a separate patch (I can write the same as
>> well) because that is not new behavior introduced by this patch, ...
>
> True, although maybe hash indexes are the only way that happens today?
>

No, I think it will happen for all other indexes as well.  Basically,
we log new page for init forks of other indexes and then while
restoring that full page image, it will fall through the same path.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

08 July 2017, 14:11:27

On Sat, Jul 8, 2017 at 9:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Jul 8, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Check for INIT_FORKNUM appears both accompanied and not
>>> accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
>>> is the convention here.
>>
>> Checking only for INIT_FORKNUM seems logical to me.  Also checking for
>> RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit.  I
>> guess Amit copied the test from ATExecSetTablespace, which does it as
>> he did, but it seems unnecessarily long-winded.
>>
>
> Okay.  If you and Michael feel the check that way is better, I will
> change and submit the patch.
>

Changed as per suggestion.

>>> By the way the comment of the function ReadBufferWithoutRelcache
>>> has the following sentense.
>>>
>>> | * NB: At present, this function may only be used on permanent relations, which
>>> | * is OK, because we only use it during XLOG replay.  If in the future we
>>> | * want to use it on temporary or unlogged relations, we could pass additional
>>> | * parameters.
>>>
>>> and does
>>>
>>> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
>>>                                                          mode, strategy, &hit);
>>>
>>> This surely works since BufferAlloc recognizes INIT_FORK. But
>>> some adjustment may be needed around here.
>>
>> Yeah, it should probably mention that the init fork of an unlogged
>> relation is also OK.
>>
>
> I think we should do that as a separate patch (I can write the same as
> well) because that is not new behavior introduced by this patch, but
> let me know if you think that we should add such a comment in this
> patch itself.
>

Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] hash index on unlogged tables doesn't behave asexpected

From

Kyotaro HORIGUCHI

Date:

10 July 2017, 08:09:24

At Sat, 8 Jul 2017 16:41:27 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+-DUto+MyeNdLE9P9u8G3Fv6n+SOjPSqmPSw6ashhPjw@mail.gmail.com>
> On Sat, Jul 8, 2017 at 9:06 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Jul 8, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>> Check for INIT_FORKNUM appears both accompanied and not
> >>> accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
> >>> is the convention here.
> >>
> >> Checking only for INIT_FORKNUM seems logical to me.  Also checking for
> >> RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit.  I
> >> guess Amit copied the test from ATExecSetTablespace, which does it as
> >> he did, but it seems unnecessarily long-winded.
> >>
> >
> > Okay.  If you and Michael feel the check that way is better, I will
> > change and submit the patch.
> >
> 
> Changed as per suggestion.


> On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I think we should do that as a separate patch (I can write the same as
> >> well) because that is not new behavior introduced by this patch, ...
> >
> > True, although maybe hash indexes are the only way that happens today?
> >
> 
> No, I think it will happen for all other indexes as well.  Basically,
> we log new page for init forks of other indexes and then while
> restoring that full page image, it will fall through the same path.

(Sorry, I didn't noticed that hash metapage is not using log_newpgae)

For example, (bt|gin|gist|spgist|brin)buildempty are using
log_newpage to log INIT_FORK metapages. I believe that the xlog
flow from log_newpage to XLogReadBufferForRedoExtended is
developed so that pages in init forks are surely flushed during
recovery, so we should fix it instead adding other flushes if the
path doesn't work. Am I looking wrong place? (I think it is
working.)

If I'm understanding correctly here, hashbild and hashbuildempty
should be refactored using the same structure with other *build
and *buildempty functions. Specifically metapages initialization
subroutines (_hash_init or _bt_initmetapage or SpGistInitMetapage
or...)  does only on-memory initialization and does not emit WAL,
then *build and *buildempty emits WAL in their required way.


> >>> By the way the comment of the function ReadBufferWithoutRelcache
> >>> has the following sentense.
> >>>
> >>> | * NB: At present, this function may only be used on permanent relations, which
> >>> | * is OK, because we only use it during XLOG replay.  If in the future we
> >>> | * want to use it on temporary or unlogged relations, we could pass additional
> >>> | * parameters.
> >>>
> >>> and does
> >>>
> >>> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
> >>>                                                          mode, strategy, &hit);
> >>>
> >>> This surely works since BufferAlloc recognizes INIT_FORK. But
> >>> some adjustment may be needed around here.
> >>
> >> Yeah, it should probably mention that the init fork of an unlogged
> >> relation is also OK.
> >>
> >
> > I think we should do that as a separate patch (I can write the same as
> > well) because that is not new behavior introduced by this patch, but
> > let me know if you think that we should add such a comment in this
> > patch itself.
> >
> 
> Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.

+ * NB: At present, this function may only be used on permanent relations and
+ * init fork of an unlogged relation, which is OK, because we only use it
+ * during XLOG replay.  If in the future we want to use it on temporary or
+ * unlogged relations, we could pass additional parameters.

*I* think the target of the funcion is permanent relations and
init forks, not unlogged relations. And I'd like to have an
additional comment about RELPERSISTENCE_PERMANENT. Something like
the attached.

reagards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 15795b0..cc3980c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -673,10 +673,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum, *
ReadBufferWithoutRelcache-- like ReadBufferExtended, but doesn't require *        a relcache entry for the relation. *
 
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on main fork pages of
+ * permanent relations and init fork pages, which is OK, because we only use
+ * it during XLOG replay.  If in the future we want to use it on temporary or
+ * unlogged relations, we could pass additional parameters. */BufferReadBufferWithoutRelcache(RelFileNode rnode,
ForkNumberforkNum,
 
@@ -689,6 +689,10 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,    Assert(InRecovery);
+    /*
+     * Pages of init fork are permanent so RELPERSISTENCE_PERMANENT is ok
+     * here.
+     */    return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
mode,strategy, &hit);}

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

10 July 2017, 12:28:13

On Mon, Jul 10, 2017 at 10:39 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Sat, 8 Jul 2017 16:41:27 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+-DUto+MyeNdLE9P9u8G3Fv6n+SOjPSqmPSw6ashhPjw@mail.gmail.com>
>> On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >> I think we should do that as a separate patch (I can write the same as
>> >> well) because that is not new behavior introduced by this patch, ...
>> >
>> > True, although maybe hash indexes are the only way that happens today?
>> >
>>
>> No, I think it will happen for all other indexes as well.  Basically,
>> we log new page for init forks of other indexes and then while
>> restoring that full page image, it will fall through the same path.
>
> (Sorry, I didn't noticed that hash metapage is not using log_newpgae)
>

The bitmappage is also not using log_newpage.

> For example, (bt|gin|gist|spgist|brin)buildempty are using
> log_newpage to log INIT_FORK metapages. I believe that the xlog
> flow from log_newpage to XLogReadBufferForRedoExtended is
> developed so that pages in init forks are surely flushed during
> recovery, so we should fix it instead adding other flushes if the
> path doesn't work. Am I looking wrong place? (I think it is
> working.)
>

You are looking at right place.

> If I'm understanding correctly here, hashbild and hashbuildempty
> should be refactored using the same structure with other *build
> and *buildempty functions. Specifically metapages initialization
> subroutines (_hash_init or _bt_initmetapage or SpGistInitMetapage
> or...)  does only on-memory initialization and does not emit WAL,
> then *build and *buildempty emits WAL in their required way.
>

I have considered this way of doing as well, read my first e-mail [1]
in this thread "Another approach to fix the issue ....".  It is not
that we can't change the code of hashbuildempty to make it log new
pages for all kind of pages (metapage, bitmappage and datapages), but
I am not sure if there is a value in going in that direction.  If we
have to go in that direction, we need to hard code some of the values
like hashm_nmaps and hashm_mapp in metapage rather than initializing
them after bitmappage creation.  Before going in that direction, I
think we should also take opinion from other people especially some
committer as we might need to maintain two different ways of doing
almost the same thing.

I am also not 100% comfortable with adding flush at two new places,
but that makes the code for fix simpler and fundamentally there is no
problem in doing so.

There is a small downside to always logging new page which is that it
will always log full page image in WAL which has the potential to
increase WAL volume if there are many unlogged tables and indexes.
Now considering the init fork generally has very less pages, it is not
a big concern, but still avoiding full page image has some value.

>
>> >>> By the way the comment of the function ReadBufferWithoutRelcache
>> >>> has the following sentense.
>> >>>
>> >>> | * NB: At present, this function may only be used on permanent relations, which
>> >>> | * is OK, because we only use it during XLOG replay.  If in the future we
>> >>> | * want to use it on temporary or unlogged relations, we could pass additional
>> >>> | * parameters.
>> >>>
>> >>> and does
>> >>>
>> >>> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
>> >>>                                                          mode, strategy, &hit);
>> >>>
>> >>> This surely works since BufferAlloc recognizes INIT_FORK. But
>> >>> some adjustment may be needed around here.
>> >>
>> >> Yeah, it should probably mention that the init fork of an unlogged
>> >> relation is also OK.
>> >>
>> >
>> > I think we should do that as a separate patch (I can write the same as
>> > well) because that is not new behavior introduced by this patch, but
>> > let me know if you think that we should add such a comment in this
>> > patch itself.
>> >
>>
>> Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.
>
> + * NB: At present, this function may only be used on permanent relations and
> + * init fork of an unlogged relation, which is OK, because we only use it
> + * during XLOG replay.  If in the future we want to use it on temporary or
> + * unlogged relations, we could pass additional parameters.
>
> *I* think the target of the funcion is permanent relations and
> init forks, not unlogged relations. And I'd like to have an
> additional comment about RELPERSISTENCE_PERMANENT. Something like
> the attached.
>

Okay, let's leave this for committer to decide.

[1] - https://www.postgresql.org/message-id/CAA4eK1JpcMsEtOL_J7WODumeEfyrPi7FPYHeVdS7fyyrCrgp4w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave asexpected

From

Kyotaro HORIGUCHI

Date:

10 July 2017, 12:57:40

Hi,

At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
> On Mon, Jul 10, 2017 at 10:39 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Sat, 8 Jul 2017 16:41:27 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+-DUto+MyeNdLE9P9u8G3Fv6n+SOjPSqmPSw6ashhPjw@mail.gmail.com>
> >> On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> > On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> >> I think we should do that as a separate patch (I can write the same as
> >> >> well) because that is not new behavior introduced by this patch, ...
> >> >
> >> > True, although maybe hash indexes are the only way that happens today?
> >> >
> >>
> >> No, I think it will happen for all other indexes as well.  Basically,
> >> we log new page for init forks of other indexes and then while
> >> restoring that full page image, it will fall through the same path.
> >
> > (Sorry, I didn't noticed that hash metapage is not using log_newpgae)
> >
> 
> The bitmappage is also not using log_newpage.
> 
> > For example, (bt|gin|gist|spgist|brin)buildempty are using
> > log_newpage to log INIT_FORK metapages. I believe that the xlog
> > flow from log_newpage to XLogReadBufferForRedoExtended is
> > developed so that pages in init forks are surely flushed during
> > recovery, so we should fix it instead adding other flushes if the
> > path doesn't work. Am I looking wrong place? (I think it is
> > working.)
> >
> 
> You are looking at right place.
> 
> > If I'm understanding correctly here, hashbild and hashbuildempty
> > should be refactored using the same structure with other *build
> > and *buildempty functions. Specifically metapages initialization
> > subroutines (_hash_init or _bt_initmetapage or SpGistInitMetapage
> > or...)  does only on-memory initialization and does not emit WAL,
> > then *build and *buildempty emits WAL in their required way.
> >
> 
> I have considered this way of doing as well, read my first e-mail [1]
> in this thread "Another approach to fix the issue ....".  It is not
> that we can't change the code of hashbuildempty to make it log new
> pages for all kind of pages (metapage, bitmappage and datapages), but
> I am not sure if there is a value in going in that direction.  If we
> have to go in that direction, we need to hard code some of the values
> like hashm_nmaps and hashm_mapp in metapage rather than initializing
> them after bitmappage creation.  Before going in that direction, I
> think we should also take opinion from other people especially some
> committer as we might need to maintain two different ways of doing
> almost the same thing.

Thanks for the explanation and the pointer (to the start of this
thread.. sorry).

> I am also not 100% comfortable with adding flush at two new places,
> but that makes the code for fix simpler and fundamentally there is no
> problem in doing so.

I agree that the patch would be simpler. Ok, I am satisfied with
an additional comment for _hash_init and hash_xlog_init_meta_page
that describes the reason that _hash_init doesn't/can't use
log_newpage and thus requires explicit flushes. Something like
the description in [1] would be enough.

> There is a small downside to always logging new page which is that it
> will always log full page image in WAL which has the potential to
> increase WAL volume if there are many unlogged tables and indexes.
> Now considering the init fork generally has very less pages, it is not
> a big concern, but still avoiding full page image has some value.

Perhaps more effective mechanism will be developed at the time.

> >> >>> By the way the comment of the function ReadBufferWithoutRelcache
> >> >>> has the following sentense.
> >> >>>
> >> >>> | * NB: At present, this function may only be used on permanent relations, which
> >> >>> | * is OK, because we only use it during XLOG replay.  If in the future we
> >> >>> | * want to use it on temporary or unlogged relations, we could pass additional
> >> >>> | * parameters.
> >> >>>
> >> >>> and does
> >> >>>
> >> >>> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
> >> >>>                                                          mode, strategy, &hit);
> >> >>>
> >> >>> This surely works since BufferAlloc recognizes INIT_FORK. But
> >> >>> some adjustment may be needed around here.
> >> >>
> >> >> Yeah, it should probably mention that the init fork of an unlogged
> >> >> relation is also OK.
> >> >>
> >> >
> >> > I think we should do that as a separate patch (I can write the same as
> >> > well) because that is not new behavior introduced by this patch, but
> >> > let me know if you think that we should add such a comment in this
> >> > patch itself.
> >> >
> >>
> >> Attached a separate patch to adjust the comment atop ReadBufferWithoutRelcache.
> >
> > + * NB: At present, this function may only be used on permanent relations and
> > + * init fork of an unlogged relation, which is OK, because we only use it
> > + * during XLOG replay.  If in the future we want to use it on temporary or
> > + * unlogged relations, we could pass additional parameters.
> >
> > *I* think the target of the funcion is permanent relations and
> > init forks, not unlogged relations. And I'd like to have an
> > additional comment about RELPERSISTENCE_PERMANENT. Something like
> > the attached.
> >
> 
> Okay, let's leave this for committer to decide.

Agreed, thanks.

> [1] - https://www.postgresql.org/message-id/CAA4eK1JpcMsEtOL_J7WODumeEfyrPi7FPYHeVdS7fyyrCrgp4w%40mail.gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

10 July 2017, 16:07:34

On Mon, Jul 10, 2017 at 3:27 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hi,
>
> At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
>
>> I am also not 100% comfortable with adding flush at two new places,
>> but that makes the code for fix simpler and fundamentally there is no
>> problem in doing so.
>
> I agree that the patch would be simpler. Ok, I am satisfied with
> an additional comment for _hash_init and hash_xlog_init_meta_page
> that describes the reason that _hash_init doesn't/can't use
> log_newpage and thus requires explicit flushes. Something like
> the description in [1] would be enough.
>

I have modified the comment in hash_xlog_init_meta_page and a
corresponding function for bitmap page.  However, I think adding
anything about not using log_newpage in _hash_init doesn't sound good
to me.  I think you have suggested it so that we don't forget the
reason for not using log_newpage, but I think that is overkill.  Let
me know if you have any other concerns?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

fix_unlogged_hash_index_issue_v3.patch

Re: [HACKERS] hash index on unlogged tables doesn't behave asexpected

From

Kyotaro HORIGUCHI

Date:

11 July 2017, 04:32:08

At Mon, 10 Jul 2017 18:37:34 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1KGK9pvW=Hn5yd51weEqWxGiKFC8HG8Bs1Ls1Pvfo5kwQ@mail.gmail.com>
> On Mon, Jul 10, 2017 at 3:27 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hi,
> >
> > At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
> >
> >> I am also not 100% comfortable with adding flush at two new places,
> >> but that makes the code for fix simpler and fundamentally there is no
> >> problem in doing so.
> >
> > I agree that the patch would be simpler. Ok, I am satisfied with
> > an additional comment for _hash_init and hash_xlog_init_meta_page
> > that describes the reason that _hash_init doesn't/can't use
> > log_newpage and thus requires explicit flushes. Something like
> > the description in [1] would be enough.
> >
> 
> I have modified the comment in hash_xlog_init_meta_page and a
> corresponding function for bitmap page.  However, I think adding
> anything about not using log_newpage in _hash_init doesn't sound good
> to me.  I think you have suggested it so that we don't forget the
> reason for not using log_newpage, but I think that is overkill.  Let
> me know if you have any other concerns?

The modified comment looks perfect. Thanks!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Michael Paquier

Date:

14 July 2017, 15:32:04

(catching up finally with this thread)

On Mon, Jul 10, 2017 at 11:57 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
>> I am also not 100% comfortable with adding flush at two new places,
>> but that makes the code for fix simpler and fundamentally there is no
>> problem in doing so.
>
> I agree that the patch would be simpler. Ok, I am satisfied with
> an additional comment for _hash_init and hash_xlog_init_meta_page
> that describes the reason that _hash_init doesn't/can't use
> log_newpage and thus requires explicit flushes. Something like
> the description in [1] would be enough.

It seems to me that we should really consider logging a full page of
the bitmap and meta pages for init forks as this is the common
practice used by all the other AMs. I would go as far as spreading a
method similar to ginbuildempty() to all the AMs as delayed
checkpoints guarantee that buffers are correctly flushed when marked
as dirty.
-- 
Michael

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

15 July 2017, 05:50:32

On Fri, Jul 14, 2017 at 6:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> (catching up finally with this thread)
>
> On Mon, Jul 10, 2017 at 11:57 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
>>> I am also not 100% comfortable with adding flush at two new places,
>>> but that makes the code for fix simpler and fundamentally there is no
>>> problem in doing so.
>>
>> I agree that the patch would be simpler. Ok, I am satisfied with
>> an additional comment for _hash_init and hash_xlog_init_meta_page
>> that describes the reason that _hash_init doesn't/can't use
>> log_newpage and thus requires explicit flushes. Something like
>> the description in [1] would be enough.
>
> It seems to me that we should really consider logging a full page of
> the bitmap and meta pages for init forks as this is the common
> practice used by all the other AMs.
>

I think if we really want to go in that direction then it is better to
write code separately for hashbuildempty, rather than adding special
purpose logic in _hash_init for init forks.   As to the comparison
with other indexes, the logic of hash index is slightly tricky (as I
have already explained in email up thread) as compared to other
indexes, so we should not force ourselves to do that way if it can be
integrated with logged index creation path.  I am not against doing
that way as it has some merit, but the advantage seems to be thin.
Let's not argue endlessly on this point because it is not that I have
not considered it doing the way you are saying (in fact I have
mentioned that in my first e-mail).  Let us wait for an opinion from
others and or from committers.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Ashutosh Sharma

Date:

15 July 2017, 07:27:56

On Sat, Jul 15, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 14, 2017 at 6:02 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> (catching up finally with this thread)
>>
>> On Mon, Jul 10, 2017 at 11:57 AM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> At Mon, 10 Jul 2017 14:58:13 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1JYyO5Hcxx4rP0jb=JmMC4qvY1YvG9UvkwNr5qrojsOPw@mail.gmail.com>
>>>> I am also not 100% comfortable with adding flush at two new places,
>>>> but that makes the code for fix simpler and fundamentally there is no
>>>> problem in doing so.
>>>
>>> I agree that the patch would be simpler. Ok, I am satisfied with
>>> an additional comment for _hash_init and hash_xlog_init_meta_page
>>> that describes the reason that _hash_init doesn't/can't use
>>> log_newpage and thus requires explicit flushes. Something like
>>> the description in [1] would be enough.
>>
>> It seems to me that we should really consider logging a full page of
>> the bitmap and meta pages for init forks as this is the common
>> practice used by all the other AMs.
>>
>
> I think if we really want to go in that direction then it is better to
> write code separately for hashbuildempty, rather than adding special
> purpose logic in _hash_init for init forks.   As to the comparison
> with other indexes, the logic of hash index is slightly tricky (as I
> have already explained in email up thread) as compared to other
> indexes, so we should not force ourselves to do that way if it can be
> integrated with logged index creation path.  I am not against doing
> that way as it has some merit, but the advantage seems to be thin.
> Let's not argue endlessly on this point because it is not that I have
> not considered it doing the way you are saying (in fact I have
> mentioned that in my first e-mail).  Let us wait for an opinion from
> others and or from committers.
>

I do agree with Amit. I think hash index is slightly trickier (in
terms of how it is organised) than other indexes and that could be the
reason for maintaining common code for hashbuild and hashbuildempty.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Michael Paquier

Date:

15 July 2017, 11:25:43

On Sat, Jul 15, 2017 at 6:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I do agree with Amit. I think hash index is slightly trickier (in
> terms of how it is organised) than other indexes and that could be the
> reason for maintaining common code for hashbuild and hashbuildempty.

Well, you both and Robert worked more on this code for PG10 than I
did, so I am fine to rely on your judgement for the final result.
Still I find this special handling quite surprising. All other AMs
just always log FPWs for the init fork pages so I'd rather not break
this treaty, but that's one against the world as things stand
currently on this thread ;)
-- 
Michael

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Robert Haas

Date:

17 July 2017, 19:02:06

On Sat, Jul 15, 2017 at 4:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Jul 15, 2017 at 6:27 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> I do agree with Amit. I think hash index is slightly trickier (in
>> terms of how it is organised) than other indexes and that could be the
>> reason for maintaining common code for hashbuild and hashbuildempty.
>
> Well, you both and Robert worked more on this code for PG10 than I
> did, so I am fine to rely on your judgement for the final result.
> Still I find this special handling quite surprising. All other AMs
> just always log FPWs for the init fork pages so I'd rather not break
> this treaty, but that's one against the world as things stand
> currently on this thread ;)

It seems to me that this area might benefit from a broader rethink.
It's not very nice to impose a restriction that init forks can only be
constructed using log_newpage(); on the other hand, it is also not
very nice that Amit's patch needs to recapitulate the mysterious
incantation used by XLogReadBufferForRedoExtended in several more
places.  The basic problem here is that it would be really easy for
the next person who adds an index AM to make the exact same mistake
that Amit made here and that I failed to spot during code review.  It
would be nice to come up with some more generic solution to the
problem rather than relying on each AM to do insert this
FlushOneBuffer() incantation in each place where it's needed.  I think
there are probably several ways to do that; one idea might be to flush
all init-fork buffers in bulk at the end of recovery.

However, it doesn't really seem urgent to tinker with that; while I
think the fact that existing AMs use log_newpage() to populate the
init fork is mostly just luck, it might well be 10 or 20 years before
somebody adds another AM that happens to use any other technique.
Moreover, at the moment, we're trying to get a release out the door,
and to me that argues for keeping structural changes to a minimum.
Amit's patch seems like a pretty surgical fix to this problem with
minimal chance of breaking anything new, and that seems like the right
kind of fix for July.  So I plan to commit what Amit proposed (with a
few grammar corrections).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Michael Paquier

Date:

17 July 2017, 19:51:13

On Mon, Jul 17, 2017 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems to me that this area might benefit from a broader rethink.
> It's not very nice to impose a restriction that init forks can only be
> constructed using log_newpage(); on the other hand, it is also not
> very nice that Amit's patch needs to recapitulate the mysterious
> incantation used by XLogReadBufferForRedoExtended in several more
> places.  The basic problem here is that it would be really easy for
> the next person who adds an index AM to make the exact same mistake
> that Amit made here and that I failed to spot during code review.  It
> would be nice to come up with some more generic solution to the
> problem rather than relying on each AM to do insert this
> FlushOneBuffer() incantation in each place where it's needed.  I think
> there are probably several ways to do that; one idea might be to flush
> all init-fork buffers in bulk at the end of recovery.

Things are centralized in XLogReadBufferForRedoExtended() for init
fork flushes, which is not something bad in itself as it is the unique
place doing this work normally for other AMs. And I have to admit that
the current coding for hash index WAL has the advantage to create less
WAL quantity as the bitmap and metadata pages get initialized using
the data of the records themselves. One idea I can think of would be
to create a README in src/backend/access for index AMs that summarizes
a basic set of recommendations for each routine used.

> However, it doesn't really seem urgent to tinker with that; while I
> think the fact that existing AMs use log_newpage() to populate the
> init fork is mostly just luck, it might well be 10 or 20 years before
> somebody adds another AM that happens to use any other technique.
> Moreover, at the moment, we're trying to get a release out the door,
> and to me that argues for keeping structural changes to a minimum.
> Amit's patch seems like a pretty surgical fix to this problem with
> minimal chance of breaking anything new, and that seems like the right
> kind of fix for July.  So I plan to commit what Amit proposed (with a
> few grammar corrections).

No objections to that.
-- 
Michael

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

18 July 2017, 05:18:48

On Mon, Jul 17, 2017 at 10:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jul 17, 2017 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It seems to me that this area might benefit from a broader rethink.
>> It's not very nice to impose a restriction that init forks can only be
>> constructed using log_newpage(); on the other hand, it is also not
>> very nice that Amit's patch needs to recapitulate the mysterious
>> incantation used by XLogReadBufferForRedoExtended in several more
>> places.  The basic problem here is that it would be really easy for
>> the next person who adds an index AM to make the exact same mistake
>> that Amit made here and that I failed to spot during code review.  It
>> would be nice to come up with some more generic solution to the
>> problem rather than relying on each AM to do insert this
>> FlushOneBuffer() incantation in each place where it's needed.  I think
>> there are probably several ways to do that; one idea might be to flush
>> all init-fork buffers in bulk at the end of recovery.
>
> Things are centralized in XLogReadBufferForRedoExtended() for init
> fork flushes, which is not something bad in itself as it is the unique
> place doing this work normally for other AMs. And I have to admit that
> the current coding for hash index WAL has the advantage to create less
> WAL quantity as the bitmap and metadata pages get initialized using
> the data of the records themselves. One idea I can think of would be
> to create a README in src/backend/access for index AMs that summarizes
> a basic set of recommendations for each routine used.
>

We already have quite a decent amount of information about indexes in
our docs [1][2].  We might want to extend that as well.

>> However, it doesn't really seem urgent to tinker with that; while I
>> think the fact that existing AMs use log_newpage() to populate the
>> init fork is mostly just luck, it might well be 10 or 20 years before
>> somebody adds another AM that happens to use any other technique.
>> Moreover, at the moment, we're trying to get a release out the door,
>> and to me that argues for keeping structural changes to a minimum.
>> Amit's patch seems like a pretty surgical fix to this problem with
>> minimal chance of breaking anything new, and that seems like the right
>> kind of fix for July.  So I plan to commit what Amit proposed (with a
>> few grammar corrections).
>

Thanks.  Do you have any suggestion for back-branches?  As of now, it
fails badly with below kind of error:

test=> SELECT * FROM t_u_hash;
ERROR:  could not open file "base/16384/16392": No such file or directory

It is explained in another thread [3] where it has been found that the
reason for such an error is that hash indexes are not WAL logged prior
to 10.  Now, we can claim that we don't recommend hash indexes to be
used prior to 10 in production, so such an error is okay even if there
is no crash has happened in the system.

[1] - https://www.postgresql.org/docs/devel/static/indexam.html
[2] - https://www.postgresql.org/docs/devel/static/index-functions.html
[3] - https://www.postgresql.org/message-id/20170630005634.GA4448%40momjian.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Michael Paquier

Date:

18 July 2017, 09:33:07

On Tue, Jul 18, 2017 at 4:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Thanks.  Do you have any suggestion for back-branches?  As of now, it
> fails badly with below kind of error:
>
> test=> SELECT * FROM t_u_hash;
> ERROR:  could not open file "base/16384/16392": No such file or directory
>
> It is explained in another thread [3] where it has been found that the
> reason for such an error is that hash indexes are not WAL logged prior
> to 10.  Now, we can claim that we don't recommend hash indexes to be
> used prior to 10 in production, so such an error is okay even if there
> is no crash has happened in the system.

There are a couple of approaches:
1) Marking such indexes as invalid at recovery and log information
about the switch done.
2) Error at creation of hash indexes on unlogged tables.
3) Leave it as-is, because there is already a WARNING at creation.
I don't mind seeing 3) per the amount of work done lately to support
WAL on hash indexes.
-- 
Michael

Re: [HACKERS] hash index on unlogged tables doesn't behave asexpected

From

Kyotaro HORIGUCHI

Date:

21 September 2017, 13:44:25

Hello,

Following a bit older thread.

At Tue, 18 Jul 2017 08:33:07 +0200, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSQDmz+PKewNN9r_7jC4WKf9f31Gkf=DzVGA3q+GsgJEQ@mail.gmail.com>
> On Tue, Jul 18, 2017 at 4:18 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks.  Do you have any suggestion for back-branches?  As of now, it
> > fails badly with below kind of error:
> >
> > test=> SELECT * FROM t_u_hash;
> > ERROR:  could not open file "base/16384/16392": No such file or directory
> >
> > It is explained in another thread [3] where it has been found that the
> > reason for such an error is that hash indexes are not WAL logged prior
> > to 10.  Now, we can claim that we don't recommend hash indexes to be
> > used prior to 10 in production, so such an error is okay even if there
> > is no crash has happened in the system.
> 
> There are a couple of approaches:
> 1) Marking such indexes as invalid at recovery and log information
> about the switch done.
> 2) Error at creation of hash indexes on unlogged tables.
> 3) Leave it as-is, because there is already a WARNING at creation.
> I don't mind seeing 3) per the amount of work done lately to support
> WAL on hash indexes.

I overlooked that but (3) is true as long as the table is
*logged* one.

postgres=# create table test (id int primary key, v text);
postgres=# create index on test using hash (id);
WARNING:  hash indexes are not WAL-logged and their use is discouraged

But not for for unlogged tables.

postgres=# create unlogged table test (id int primary key, v text);
postgres=# create index on test using hash (id);
postgres=# (no warning)

And fails on promotion in the same way.

postgres=# select * from test;
ERROR:  could not open file "base/13324/16446": No such file or directory

indexcmds.c@965:503
>   if (strcmp(accessMethodName, "hash") == 0 &&
>     RelationNeedsWAL(rel))
>     ereport(WARNING,
>         (errmsg("hash indexes are not WAL-logged and their use is discouraged")));

Using !RelationUsesLocalBuffers instead fixes that and the
attached patch is for 9.6. I'm a bit unconfident on the usage of
logical meaning of the macro but what it does fits there.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
***************
*** 501,507 **** DefineIndex(Oid relationId,     amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);      if
(strcmp(accessMethodName,"hash") == 0 &&
 
!         RelationNeedsWAL(rel))         ereport(WARNING,                 (errmsg("hash indexes are not WAL-logged and
theiruse is discouraged"))); 
 
--- 501,507 ----     amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);      if (strcmp(accessMethodName,
"hash")== 0 &&
 
!         !RelationUsesLocalBuffers(rel))         ereport(WARNING,                 (errmsg("hash indexes are not
WAL-loggedand their use is discouraged")));  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Amit Kapila

Date:

21 September 2017, 15:59:56

On Thu, Sep 21, 2017 at 4:14 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> postgres=# create table test (id int primary key, v text);
> postgres=# create index on test using hash (id);
> WARNING:  hash indexes are not WAL-logged and their use is discouraged
>
> But not for for unlogged tables.
>
> postgres=# create unlogged table test (id int primary key, v text);
> postgres=# create index on test using hash (id);
> postgres=# (no warning)
>
> And fails on promotion in the same way.
>
> postgres=# select * from test;
> ERROR:  could not open file "base/13324/16446": No such file or directory
>
> indexcmds.c@965:503
>>   if (strcmp(accessMethodName, "hash") == 0 &&
>>     RelationNeedsWAL(rel))
>>     ereport(WARNING,
>>         (errmsg("hash indexes are not WAL-logged and their use is discouraged")));
>
> Using !RelationUsesLocalBuffers instead fixes that and the
> attached patch is for 9.6. I'm a bit unconfident on the usage of
> logical meaning of the macro but what it does fits there.
>


I think giving an error message like "hash indexes are not WAL-logged
and .." for unlogged tables doesn't seem like a good behavior.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Robert Haas

Date:

21 September 2017, 23:19:27

On Sep 21, 2017, at 8:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:.
> I think giving an error message like "hash indexes are not WAL-logged
> and .." for unlogged tables doesn't seem like a good behavior.

+1. This seems like deliberate behavior, not a bug.

...Robert


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] hash index on unlogged tables doesn't behave asexpected

From

Kyotaro HORIGUCHI

Date:

22 September 2017, 03:16:35

At Thu, 21 Sep 2017 16:19:27 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<694CB417-EF2C-4760-863B-AEC4530C21E3@gmail.com>
> On Sep 21, 2017, at 8:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:.
> > I think giving an error message like "hash indexes are not WAL-logged
> > and .." for unlogged tables doesn't seem like a good behavior.
> 
> +1. This seems like deliberate behavior, not a bug.

Though I don't see it's bug and agree that the message is not
proper, currently we can create hash indexes without no warning
on unlogged tables and it causes a problem with replication.

The point here is that if we leave the behavior (failure on the
standby) for the reason that we see a warning on index creation,
a similar messages ought to be for unlogged tables.

Otherwise, our decision will be another option.

(4) Though we won't not see a warning on hash index creation on unlogged tables, it seems to have been a problem and
won'tmind.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

From

Robert Haas

Date:

22 September 2017, 03:35:01

On Thu, Sep 21, 2017 at 8:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Though I don't see it's bug and agree that the message is not
> proper, currently we can create hash indexes without no warning
> on unlogged tables and it causes a problem with replication.

That's true, but I don't believe it's a sufficient reason to make a change.

Before 84aa8ba128a08e6fdebb2497c7a79ebf18093e12 (2014), we didn't
issue a warning about hash indexes in any case whatsoever; we relied
on people reading the documentation to find out about the limitations
of hash indexes.  They can still do that in any cases that the warning
doesn't adequately cover.  I really don't think it's worth kibitzing
the cases where this message is emitted in released branches, or the
text of the message, just as we didn't back-patch the message itself
into older releases that are still supported.  We need a compelling
reason to change things in stable branches, and the fact that a
warning message added in 2014 doesn't cover every limitation of a
pre-1996 hash index implementation is not an emergency.  Let's save
back-patching for actual bugs, or we'll forever be fiddling with
things in stable branches that would be better left alone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] hash index on unlogged tables doesn't behave asexpected

From

Kyotaro HORIGUCHI

Date:

22 September 2017, 05:19:52

At Thu, 21 Sep 2017 20:35:01 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmobXYq1ht8R76RTvun0pY85-=Oov8EY2Fv8nhNnM7Gdzxg@mail.gmail.com>
> On Thu, Sep 21, 2017 at 8:16 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Though I don't see it's bug and agree that the message is not
> > proper, currently we can create hash indexes without no warning
> > on unlogged tables and it causes a problem with replication.
> 
> That's true, but I don't believe it's a sufficient reason to make a change.
> 
> Before 84aa8ba128a08e6fdebb2497c7a79ebf18093e12 (2014), we didn't
> issue a warning about hash indexes in any case whatsoever; we relied
> on people reading the documentation to find out about the limitations
> of hash indexes.  They can still do that in any cases that the warning
> doesn't adequately cover.  I really don't think it's worth kibitzing
> the cases where this message is emitted in released branches, or the
> text of the message, just as we didn't back-patch the message itself
> into older releases that are still supported.  We need a compelling
> reason to change things in stable branches, and the fact that a
> warning message added in 2014 doesn't cover every limitation of a
> pre-1996 hash index implementation is not an emergency.  Let's save
> back-patching for actual bugs, or we'll forever be fiddling with
> things in stable branches that would be better left alone.

Sorry for annoying you and thank you. I agree with that after
just knowing the reason is not precisely (3) (we already have
WARNING for the problematic ops).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers