Thread: Re: pgsql: Modify pg_basebackup to use a new COPY subprotocol for base back

Re: pgsql: Modify pg_basebackup to use a new COPY subprotocol for base back

From
Robert Haas
Date:
On Tue, Jan 18, 2022 at 1:51 PM Robert Haas <rhaas@postgresql.org> wrote:
> Modify pg_basebackup to use a new COPY subprotocol for base backups.

Andres pointed out to me that longfin is sad:

2022-01-18 14:52:35.484 EST [82470:4] LOG:  server process (PID 82487)
was terminated by signal 4: Illegal instruction: 4
2022-01-18 14:52:35.484 EST [82470:5] DETAIL:  Failed process was
running: BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,
CHECKPOINT 'fast',  MANIFEST 'yes',  TARGET 'client')

Unfortunately, I can't reproduce this locally, even with COPT=-Wall
-Werror -fno-omit-frame-pointer -fsanitize-trap=alignment
-Wno-deprecated-declarations -DWRITE_READ_PARSE_PLAN_TREES
-DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS.

Tom, any chance you can get a stack trace?

--
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> Andres pointed out to me that longfin is sad:

> 2022-01-18 14:52:35.484 EST [82470:4] LOG:  server process (PID 82487)
> was terminated by signal 4: Illegal instruction: 4

> Tom, any chance you can get a stack trace?

Hmm, I'd assumed that was just a cosmic ray or something.
I'll check if it reproduces, though.

            regards, tom lane



Re: pgsql: Modify pg_basebackup to use a new COPY subprotocol for base back

From
Robert Haas
Date:
On Tue, Jan 18, 2022 at 4:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Andres pointed out to me that longfin is sad:
>
> > 2022-01-18 14:52:35.484 EST [82470:4] LOG:  server process (PID 82487)
> > was terminated by signal 4: Illegal instruction: 4
>
> > Tom, any chance you can get a stack trace?
>
> Hmm, I'd assumed that was just a cosmic ray or something.
> I'll check if it reproduces, though.

Thomas pointed out to me that thorntail also failed, and that it
included a backtrace. Unfortunately it's not somewhat confusing. The
innermost frame is:

#0  0x00000100006319a4 in bbsink_archive_contents (len=<optimized
out>, sink=<optimized out>) at
/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/replication/basebackup.c:1672
1672 return true;

Line 1672 of basebackup.c  is indeed "return true" but we're inside of
sendFile(), not bbsink_archive_contents(). However,
bbsink_archive_contents() is an inline function so maybe the failure
is misattributed. I wonder whether the "sink" pointer in that function
is somehow not valid ... but I don't know how that would happen, or
why it would happen only on this machine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



I wrote:
>> Tom, any chance you can get a stack trace?

> Hmm, I'd assumed that was just a cosmic ray or something.

My mistake: it's failing because of -fsanitize=alignment.

Here's the stack trace:

  * frame #0: 0x000000010885dfd0 postgres`sendFile(sink=0x00007fdedf071cb0, readfilename="./global/4178",
tarfilename="global/4178",statbuf=0x00007ffee77dfaf8, missing_ok=true, dboid=0, manifest=0x00007ffee77e2780,
spcoid=0x0000000000000000)at basebackup.c:1552:10 
    frame #1: 0x000000010885cb7f postgres`sendDir(sink=0x00007fdedf071cb0, path="./global", basepathlen=1,
sizeonly=false,tablespaces=0x00007fdedf072718, sendtblspclinks=true, manifest=0x00007ffee77e2780,
spcoid=0x0000000000000000)at basebackup.c:1354:12 
    frame #2: 0x000000010885ca6b postgres`sendDir(sink=0x00007fdedf071cb0, path=".", basepathlen=1, sizeonly=false,
tablespaces=0x00007fdedf072718,sendtblspclinks=true, manifest=0x00007ffee77e2780, spcoid=0x0000000000000000) at
basebackup.c:1346:13
    frame #3: 0x00000001088595be postgres`perform_base_backup(opt=0x00007ffee77e2e68, sink=0x00007fdedf071cb0) at
basebackup.c:352:5
    frame #4: 0x0000000108856b0b postgres`SendBaseBackup(cmd=0x00007fdedf05b510) at basebackup.c:932:3
    frame #5: 0x00000001088711c8 postgres`exec_replication_command(cmd_string="BASE_BACKUP ( LABEL 'pg_basebackup base
backup', PROGRESS,  CHECKPOINT 'fast',  MANIFEST 'yes',  TARGET 'client')") at walsender.c:1734:4 [opt] 
    frame #6: 0x00000001088dd61e postgres`PostgresMain(dbname=<unavailable>, username=<unavailable>) at
postgres.c:4494:12[opt] 

It failed at

-> 1552                                 if (!PageIsNew(page) && PageGetLSN(page) < sink->bbs_state->startptr)

and the problem is evidently that the page pointer isn't nicely aligned:

(lldb) p page
(char *) $4 = 0x00007fdeded7e041 ""

(I checked the "sink" data structure too for luck, but it seems fine.)

I see that thorntail has now also fallen over, presumably for
the same reason.

            regards, tom lane



Robert Haas <robertmhaas@gmail.com> writes:
> Unfortunately, I can't reproduce this locally, even with COPT=-Wall
> -Werror -fno-omit-frame-pointer -fsanitize-trap=alignment
> -Wno-deprecated-declarations -DWRITE_READ_PARSE_PLAN_TREES
> -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS.

Now that I re-read what you did, I believe you need both of

-fsanitize=alignment -fsanitize-trap=alignment

to enable those traps to happen.  That seems to be the case with
Apple's clang, anyway.

            regards, tom lane



Re: pgsql: Modify pg_basebackup to use a new COPY subprotocol for base back

From
Andres Freund
Date:
On 2022-01-18 17:12:00 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Unfortunately, I can't reproduce this locally, even with COPT=-Wall
> > -Werror -fno-omit-frame-pointer -fsanitize-trap=alignment
> > -Wno-deprecated-declarations -DWRITE_READ_PARSE_PLAN_TREES
> > -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS.
>
> Now that I re-read what you did, I believe you need both of
>
> -fsanitize=alignment -fsanitize-trap=alignment
>
> to enable those traps to happen.  That seems to be the case with
> Apple's clang, anyway.

FWIW, I can reproduce it on linux, but only if I -fno-sanitize-recover instead
of -fsanitize-trap=alignment. That then also produces a nicer explanation of
the problem:

/home/andres/src/postgresql/src/backend/replication/basebackup.c:1552:10: runtime error: member access within
misalignedaddress 0x000002b9ce09 for type 'PageHeaderData' (aka 'struct PageHeaderData'), which requires 4 byte
alignment
0x000002b9ce09: note: pointer points here
 00 00 00  64 00 00 00 00 c8 ad 0c  01 c5 1b 00 00 48 00 f0  1f f0 1f 04 20 00 00 00  00 62 31 05 00
              ^
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
/home/andres/src/postgresql/src/backend/replication/basebackup.c:1552:10in
 
2022-01-18 17:36:17.746 PST [1448756] LOG:  server process (PID 1448774) exited with exit code 1
2022-01-18 17:36:17.746 PST [1448756] DETAIL:  Failed process was running: BASE_BACKUP ( LABEL 'pg_basebackup base
backup', PROGRESS,  CHECKPOINT 'fast',  MANIFEST 'yes',  TARGET 'client')
 

The problem originates in bbsink_copystream_begin_backup()...

Greetings,

Andres Freund



Re: pgsql: Modify pg_basebackup to use a new COPY subprotocol for base back

From
Robert Haas
Date:
On Tue, Jan 18, 2022 at 5:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now that I re-read what you did, I believe you need both of
>
> -fsanitize=alignment -fsanitize-trap=alignment
>
> to enable those traps to happen.  That seems to be the case with
> Apple's clang, anyway.

Ah, I guess I copied and pasted the options wrong, or something.
Anyway, I have an idea how to fix this. I didn't realize that we were
going to read from the bbsink's buffer like this, and it's not
properly aligned for that. I'll jigger things around to fix that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: Modify pg_basebackup to use a new COPY subprotocol for base back

From
Robert Haas
Date:
On Tue, Jan 18, 2022 at 8:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Ah, I guess I copied and pasted the options wrong, or something.
> Anyway, I have an idea how to fix this. I didn't realize that we were
> going to read from the bbsink's buffer like this, and it's not
> properly aligned for that. I'll jigger things around to fix that.

Here's a patch. I'm still not able to reproduce the problem either
with the flags you propose (which don't cause a failure) or the ones
which Andres suggests (which make clang bitterly unhappy) or the ones
clang says I should use instead of the ones Andres suggests (which
make initdb fall over, so we never even get to the point of attempting
anything related to the code this patch modified).

Here's a patch, based in part on some off-list discussion with Andres.
I believe Andres has already confirmed that this fix works, but it
wouldn't hurt if Tom wants to verify it also.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment
Robert Haas <robertmhaas@gmail.com> writes:
> Here's a patch, based in part on some off-list discussion with Andres.
> I believe Andres has already confirmed that this fix works, but it
> wouldn't hurt if Tom wants to verify it also.

WFM too --- at least, pg_basebackup's "make check" passes now.

            regards, tom lane



Re: pgsql: Modify pg_basebackup to use a new COPY subprotocol for base back

From
Robert Haas
Date:
On Tue, Jan 18, 2022 at 9:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Here's a patch, based in part on some off-list discussion with Andres.
> > I believe Andres has already confirmed that this fix works, but it
> > wouldn't hurt if Tom wants to verify it also.
>
> WFM too --- at least, pg_basebackup's "make check" passes now.

Thanks for checking. Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com