Re: standby recovery fails (tablespace related) (tentative patch and discussion) - Mailing list pgsql-hackers
From | Paul Guo |
---|---|
Subject | Re: standby recovery fails (tablespace related) (tentative patch and discussion) |
Date | |
Msg-id | CAEET0ZFCQp5JMOS7Tx9v1-4F3pESbpFAYmWyFxhQ=6+YBEHiuA@mail.gmail.com Whole thread Raw |
In response to | Re: standby recovery fails (tablespace related) (tentative patchand discussion) (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
|
List | pgsql-hackers |
On Wed, Apr 24, 2019 at 4:14 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Mmm. I posted to wrong thread. Sorry.
At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190423.163949.36763221.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo <pguo@pivotal.io> wrote in <CAEET0ZEcwz57z2yfWRds43b3TfQPPDSWmbjGmD43xRxLT41NDg@mail.gmail.com>
> > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > create redo error, but I suspect some other kind of redo, which depends on
> > the files under the directory (they are not copied since the directory is
> > not created) and also cannot be covered by the invalid page mechanism,
> > could fail. Thanks.
>
> If recovery starts from just after tablespace creation, that's
> simple. The Symlink to the removed tablespace is already removed
> in the case. Hence server innocently create files directly under
> pg_tblspc, not in the tablespace. Finally all files that were
> supposed to be created in the removed tablespace are removed
> later in recovery.
>
> If recovery starts from recalling page in a file that have been
> in the tablespace, XLogReadBufferExtended creates one (perhaps
> directly in pg_tblspc as described above) and the files are
> removed later in recoery the same way to above. This case doen't
> cause FATAL/PANIC during recovery even in master.
>
> XLogReadBufferExtended@xlogutils.c
> | * Create the target file if it doesn't already exist. This lets us cope
> | * if the replay sequence contains writes to a relation that is later
> | * deleted. (The original coding of this routine would instead suppress
> | * the writes, but that seems like it risks losing valuable data if the
> | * filesystem loses an inode during a crash. Better to write the data
> | * until we are actually told to delete the file.)
>
> So buffered access cannot be a problem for the reason above. The
> remaining possible issue is non-buffered access to files in
> removed tablespaces. This is what I mentioned upthread:
>
> me> but I haven't checked this covers all places where need the same
> me> treatment.
RM_DBASE_ID is fixed by the patch.
XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
- are not relevant.
HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
- Resources works on buffer is not affected.
SMGR:
- Both CREATE and TRUNCATE seems fine.
TBLSPC:
- We don't nest tablespace directories. No Problem.
I don't find a similar case.
I took some time in digging into the related code. It seems that ignoring if the dst directory cannot be created directly
should be fine since smgr redo code creates tablespace path finally by calling TablespaceCreateDbspace().
What's more, I found some more issues.
1) The below error message is actually misleading.
2019-04-17 14:52:14.951 CST [23030] CONTEXT: WAL redo at 0/3011650 for Database/CREATE: copy dir 1663/1 to 65546/65547
That should be due to dbase_desc(). It could be simply fixed following the code logic in GetDatabasePath().
2) It seems that src directory could be missing then dbase_redo()->copydir() could error out. For example,
\!rm -rf /tmp/tbspace1
\!mkdir /tmp/tbspace1
\!rm -rf /tmp/tbspace2
\!mkdir /tmp/tbspace2
create tablespace tbs1 location '/tmp/tbspace1';
create tablespace tbs2 location '/tmp/tbspace2';
create database db1 tablespace tbs1;
alter database db1 set tablespace tbs2;
drop tablespace tbs1;
Let's say, the standby finishes all replay but redo lsn on pg_control is still the point at 'alter database', and then
kill postgres, then in theory when startup, dbase_redo()->copydir() will ERROR since 'drop tablespace tbs1'
has removed the directories (and symlink) of tbs1. Below simple code change could fix that.
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9707afabd9..7d755c759e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2114,6 +2114,15 @@ dbase_redo(XLogReaderState *record)
*/
FlushDatabaseBuffers(xlrec->src_db_id);
+ /*
+ * It is possible that the source directory is missing if
+ * we are re-replaying the xlog while subsequent xlogs
+ * drop the tablespace in previous replaying. For this
+ * we just skip.
+ */
+ if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ return;
+
/*
* Copy this subdirectory to the new location
*
If we want to fix the issue by ignoring the dst path create failure, I do not think we should do
that in copydir() since copydir() seems to be a common function. I'm not sure whether it is
used by some extensions or not. If no maybe we should move the dst patch create logic
out of copydir().
Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace() to replace
the code block includes a lot of get_parent_directory(), MakePGDirectory(), etc even it
is not fixing a bug since pg_mkdir_p() code change seems to be more graceful and simpler.
Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to be error-prone
along with postgre evolving since they are hard to test and also we are not easy to think out
various potential bad cases. Is it possible that we should do real checkpoint (flush & update
redo lsn) when seeing checkpoint xlogs for these operations? This will slow down standby
but master also does this and also these operations are not usual, espeically it seems that it
does not slow down wal receiving usually?
pgsql-hackers by date: