Re: standby recovery fails (tablespace related) (tentative patch and discussion) - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Date
Msg-id 20220715.163059.337322153376727951.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: standby recovery fails (tablespace related) (tentative patch and discussion)  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
List pgsql-hackers
At Thu, 14 Jul 2022 23:47:40 +0200, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> Here's a couple of fixups.  0001 is the same as before.  In 0002 I think

Thanks!

+         if (!S_ISLNK(st.st_mode))
+ #else
+         if (!pgwin32_is_junction(path))
+ #endif
+             elog(ignore_invalid_pages ? WARNING : PANIC,
+                  "real directory found in pg_tblspc directory: %s", de->d_name);

A regular file with an oid-name also causes this error. Doesn't
something like "unexpected non-(sym)link entry..." work?

> CheckTablespaceDirectory ends up easier to read if we split out the test
> for validity of the link.  Looking at that again, I think we don't need
> to piggyback on ignore_invalid_pages, which is already a stretch, so
> let's not -- instead we can use allow_in_place_tablespaces if users need
> a workaround.  So that's 0003 (this bit needs more than zero docs,
> however.)

The result of 0003 looks good.

0002:
+is_path_tslink(const char *path)

What the "ts" of tslink stands for? If it stands for tablespace, the
function is not specific for table spaces. We already have 

+                    errmsg("could not stat file \"%s\": %m", path));

I'm not sure we need such correctness, but what is failing there is
lstat.  I found similar codes in two places in backend and one place
in frontend. So couldn't it be moved to /common and have a more
generic name?

-    dir = AllocateDir(tblspc_path);
-    while ((de = ReadDir(dir, tblspc_path)) != NULL)
+    dir = AllocateDir("pg_tblspc");
+    while ((de = ReadDir(dir, "pg_tblspc")) != NULL)

xlog.c uses the macro XLOGDIR. Why don't we define TBLSPCDIR?

-        for (p = de->d_name; *p && isdigit(*p); p++);
-        if (*p)
+        if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
             continue;

The pattern "strspn != strlen" looks kind of remote, or somewhat
pedantic..

+        char    path[MAXPGPATH + 10];
..
-        snprintf(path, MAXPGPATH, "%s/%s", tblspc_path, de->d_name);
+        snprintf(path, sizeof(path), "pg_tblspc/%s", de->d_name);

I don't think we need the extra 10 bytes. A bit paranoic, but we can
check the return value to confirm the d_name is fully stored in the
buffer.

> 0004 is straightforward: let's check for bad directories before logging
> about consistent state.

I was about to write a comment to do this when looking 0001.

> After all this, I'm not sure what to think of dbase_redo.  At line 3102,
> is the directory supposed to exist or not?  I'm confused as to what is
> the expected state at that point.  I rewrote this, but now I think my
> rewrite continues to be confusing, so I'll have to think more about it.

I'm not sure l3102 exactly points, but haven't we chosen to create
everything required to keep recovery going, whether it is supposed to
exist or not?

> Another aspect are the tests.  Robert described a scenario where the
> previously committed version of this patch created trouble.  Do we have
> a test case to cover that problematic case?  I think we should strive to
> cover it, if possible.

I counldn't recall that clearly and failed to dig out from the thread,
but doesn't the "creating everything needed" strategy naturally save
that case?  We could add that test, but it seems to me a little
cumbersome to confirm the test correctly detect that case..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: "shiy.fnst@fujitsu.com"
Date:
Subject: RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
Next
From: Fujii Masao
Date:
Subject: Re: Backup command and functions can cause assertion failure and segmentation fault