Home > mailing lists

Re: standby recovery fails (tablespace related) (tentative patch and discussion) - Mailing list pgsql-hackers

From	Paul Guo
Subject	Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Date	April 28, 2019 07:33:13
Msg-id	CAEET0ZFCQp5JMOS7Tx9v1-4F3pESbpFAYmWyFxhQ=6+YBEHiuA@mail.gmail.com Whole thread
In response to	Re: standby recovery fails (tablespace related) (tentative patchand discussion) (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses	Re: standby recovery fails (tablespace related) (tentative patch and discussion)
List	pgsql-hackers

Tree view

On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Mmm. I posted to wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>
> > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > create redo error, but I suspect some other kind of redo, which depends on
> > the files under the directory (they are not copied since the directory is
> > not created) and also cannot be covered by the invalid page mechanism,
> > could fail. Thanks.
>
> If recovery starts from just after tablespace creation, that's
> simple. The Symlink to the removed tablespace is already removed
> in the case. Hence server innocently create files directly under
> pg_tblspc, not in the tablespace. Finally all files that were
> supposed to be created in the removed tablespace are removed
> later in recovery.
>
> If recovery starts from recalling page in a file that have been
> in the tablespace, XLogReadBufferExtended creates one (perhaps
> directly in pg_tblspc as described above) and the files are
> removed later in recoery the same way to above. This case doen't
> cause FATAL/PANIC during recovery even in master.
>
> XLogReadBufferExtended@xlogutils.c
> | * Create the target file if it doesn't already exist. This lets us cope
> | * if the replay sequence contains writes to a relation that is later
> | * deleted. (The original coding of this routine would instead suppress
> | * the writes, but that seems like it risks losing valuable data if the
> | * filesystem loses an inode during a crash. Better to write the data
> | * until we are actually told to delete the file.)
>
> So buffered access cannot be a problem for the reason above. The
> remaining possible issue is non-buffered access to files in
> removed tablespaces. This is what I mentioned upthread:
>
> me> but I haven't checked this covers all places where need the same
> me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resources works on buffer is not affected.

SMGR:
- Both CREATE and TRUNCATE seems fine.

TBLSPC:
- We don't nest tablespace directories. No Problem.

I don't find a similar case.

I took some time in digging into the related code. It seems that ignoring if the dst directory cannot be created directly

should be fine since smgr redo code creates tablespace path finally by calling TablespaceCreateDbspace().

What's more, I found some more issues.

1) The below error message is actually misleading.

2019-04-17 14:52:14.951 CST [23030] FATAL: could not create directory "pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547

That should be due to dbase_desc(). It could be simply fixed following the code logic in GetDatabasePath().

2) It seems that src directory could be missing then dbase_redo()->copydir() could error out. For example,

\!rm -rf /tmp/tbspace1
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;

Let's say, the standby finishes all replay but redo lsn on pg_control is still the point at 'alter database', and then

kill postgres, then in theory when startup, dbase_redo()->copydir() will ERROR since 'drop tablespace tbs1'

has removed the directories (and symlink) of tbs1. Below simple code change could fix that.

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c

index 9707afabd9..7d755c759e 100644

--- a/src/backend/commands/dbcommands.c

+++ b/src/backend/commands/dbcommands.c

@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)

FlushDatabaseBuffers(xlrec->src_db_id);

+ /*

+ * It is possible that the source directory is missing if

+ * we are re-replaying the xlog while subsequent xlogs

+ * drop the tablespace in previous replaying. For this

+ * we just skip.

+ */

+ if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))

+ return;

* Copy this subdirectory to the new location

If we want to fix the issue by ignoring the dst path create failure, I do not think we should do

that in copydir() since copydir() seems to be a common function. I'm not sure whether it is

used by some extensions or not. If no maybe we should move the dst patch create logic

out of copydir().

Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace() to replace

the code block includes a lot of get_parent_directory(), MakePGDirectory(), etc even it

is not fixing a bug since pg_mkdir_p() code change seems to be more graceful and simpler.

Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to be error-prone

along with postgre evolving since they are hard to test and also we are not easy to think out

various potential bad cases. Is it possible that we should do real checkpoint (flush & update

redo lsn) when seeing checkpoint xlogs for these operations? This will slow down standby

but master also does this and also these operations are not usual, espeically it seems that it

does not slow down wal receiving usually?

pgsql-hackers by date:

From: Peter Geoghegan
Date: 28 April 2019, 01:36:22
Subject: Re: Improve search for missing parent downlinks in amcheck

From: Sascha Kuhl
Date: 28 April 2019, 13:59:15
Subject: Data streaming between different databases

Re: standby recovery fails (tablespace related) (tentative patch and discussion) - Mailing list pgsql-hackers

Previous

Next