Re: standby recovery fails (tablespace related) (tentative patchand discussion) - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: standby recovery fails (tablespace related) (tentative patchand discussion)
Date
Msg-id 20190507.154711.254360772.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: standby recovery fails (tablespace related) (tentative patch and discussion)  (Paul Guo <pguo@pivotal.io>)
Responses Re: standby recovery fails (tablespace related) (tentative patch and discussion)
List pgsql-hackers
Hi.

At Tue, 30 Apr 2019 14:33:47 +0800, Paul Guo <pguo@pivotal.io> wrote in
<CAEET0ZGhmDKrq7JJu2rLLqcJBR8pA4OYrKsirZ5Ft8-deG1e8A@mail.gmail.com>
> I updated the original patch to

It's reasonable not to touch copydir.

> 1) skip copydir() if either src path or dst parent path is missing in
> dbase_redo(). Both missing cases seem to be possible. For the src path
> missing case, mkdir_p() is meaningless. It seems that moving the directory
> existence check step to dbase_redo() has less impact on other code.

Nice catch.


+      if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+      {

This patch is allowing missing source and destination directory
even in consistent state. I don't think it is safe.



+        ereport(WARNING,
+            (errmsg("directory \"%s\" for copydir() does not exists."
+                "It is possibly expected. Skip copydir().",
+                parent_path)));

This message seems unfriendly to users, or it seems like an elog
message. How about something like this. The same can be said for
the source directory.

| WARNING:  skipped creating database directory: "%s"
| DETAIL:  The tabelspace %u may have been removed just before crash.

# I'm not confident in this at all:(

> 2) Fixed dbase_desc(). Now the xlog output looks correct.
> 
> rmgr: Database    len (rec/tot):     42/    42, tx:        486, lsn:
> 0/016386A8, prev 0/01638630, desc: CREATE copy dir base/1 to
> pg_tblspc/16384/PG_12_201904281/16386
> 
> rmgr: Database    len (rec/tot):     34/    34, tx:        487, lsn:
> 0/01638EB8, prev 0/01638E40, desc: DROP dir
> pg_tblspc/16384/PG_12_201904281/16386

WAL records don't convey such information. The previous
description seems right to me.

> I'm not familiar with the TAP test details previously. I learned a lot
> about how to test such case from Kyotaro's patch series.👍

Yeah, good to hear.

> On Sun, Apr 28, 2019 at 3:33 PM Paul Guo <pguo@pivotal.io> wrote:
> > If we want to fix the issue by ignoring the dst path create failure, I do
> > not think we should do
> > that in copydir() since copydir() seems to be a common function. I'm not
> > sure whether it is
> > used by some extensions or not. If no maybe we should move the dst patch
> > create logic
> > out of copydir().

Agreed to this.

> > Also I'd suggest we should use pg_mkdir_p() in TablespaceCreateDbspace()
> > to replace
> > the code block includes a lot of
> > get_parent_directory(), MakePGDirectory(), etc even it
> > is not fixing a bug since pg_mkdir_p() code change seems to be more
> > graceful and simpler.

But I don't agree to this. pg_mkdir_p goes above two-parents up,
which would be unwanted here.

> > Whatever ignore mkdir failure or mkdir_p, I found that these steps seem to
> > be error-prone
> > along with postgre evolving since they are hard to test and also we are
> > not easy to think out
> > various potential bad cases. Is it possible that we should do real
> > checkpoint (flush & update
> > redo lsn) when seeing checkpoint xlogs for these operations? This will
> > slow down standby
> > but master also does this and also these operations are not usual,
> > espeically it seems that it
> > does not slow down wal receiving usually?

That dramatically slows recovery (not replication) if databases
are created and deleted frequently. That wouldn't be acceptable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Next
From: Amit Langote
Date:
Subject: Re: findTargetlistEntrySQL92() and COLLATE clause