Thread: Re: pgsql: Modify pg_basebackup to use a new COPY subprotocol for base back
On Tue, Jan 18, 2022 at 1:51 PM Robert Haas <rhaas@postgresql.org> wrote:
> Modify pg_basebackup to use a new COPY subprotocol for base backups.

Andres pointed out to me that longfin is sad:

2022-01-18 14:52:35.484 EST [82470:4] LOG: server process (PID 82487) was terminated by signal 4: Illegal instruction: 4
2022-01-18 14:52:35.484 EST [82470:5] DETAIL: Failed process was running: BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, CHECKPOINT 'fast', MANIFEST 'yes', TARGET 'client')

Unfortunately, I can't reproduce this locally, even with COPT=-Wall -Werror -fno-omit-frame-pointer -fsanitize-trap=alignment -Wno-deprecated-declarations -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS.

Tom, any chance you can get a stack trace?

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> Andres pointed out to me that longfin is sad:
> 2022-01-18 14:52:35.484 EST [82470:4] LOG: server process (PID 82487)
> was terminated by signal 4: Illegal instruction: 4
> Tom, any chance you can get a stack trace?

Hmm, I'd assumed that was just a cosmic ray or something. I'll check if it reproduces, though.

regards, tom lane
On Tue, Jan 18, 2022 at 4:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Andres pointed out to me that longfin is sad:
>
> > 2022-01-18 14:52:35.484 EST [82470:4] LOG: server process (PID 82487)
> > was terminated by signal 4: Illegal instruction: 4
>
> > Tom, any chance you can get a stack trace?
>
> Hmm, I'd assumed that was just a cosmic ray or something.
> I'll check if it reproduces, though.

Thomas pointed out to me that thorntail also failed, and that it included a backtrace. Unfortunately, it's somewhat confusing. The innermost frame is:

#0 0x00000100006319a4 in bbsink_archive_contents (len=<optimized out>, sink=<optimized out>) at /home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/replication/basebackup.c:1672
1672 return true;

Line 1672 of basebackup.c is indeed "return true", but we're inside of sendFile(), not bbsink_archive_contents(). However, bbsink_archive_contents() is an inline function, so maybe the failure is misattributed. I wonder whether the "sink" pointer in that function is somehow not valid ... but I don't know how that would happen, or why it would happen only on this machine.

--
Robert Haas
EDB: http://www.enterprisedb.com
I wrote:
>> Tom, any chance you can get a stack trace?
> Hmm, I'd assumed that was just a cosmic ray or something.

My mistake: it's failing because of -fsanitize=alignment. Here's the stack trace:

* frame #0: 0x000000010885dfd0 postgres`sendFile(sink=0x00007fdedf071cb0, readfilename="./global/4178", tarfilename="global/4178", statbuf=0x00007ffee77dfaf8, missing_ok=true, dboid=0, manifest=0x00007ffee77e2780, spcoid=0x0000000000000000) at basebackup.c:1552:10
  frame #1: 0x000000010885cb7f postgres`sendDir(sink=0x00007fdedf071cb0, path="./global", basepathlen=1, sizeonly=false, tablespaces=0x00007fdedf072718, sendtblspclinks=true, manifest=0x00007ffee77e2780, spcoid=0x0000000000000000) at basebackup.c:1354:12
  frame #2: 0x000000010885ca6b postgres`sendDir(sink=0x00007fdedf071cb0, path=".", basepathlen=1, sizeonly=false, tablespaces=0x00007fdedf072718, sendtblspclinks=true, manifest=0x00007ffee77e2780, spcoid=0x0000000000000000) at basebackup.c:1346:13
  frame #3: 0x00000001088595be postgres`perform_base_backup(opt=0x00007ffee77e2e68, sink=0x00007fdedf071cb0) at basebackup.c:352:5
  frame #4: 0x0000000108856b0b postgres`SendBaseBackup(cmd=0x00007fdedf05b510) at basebackup.c:932:3
  frame #5: 0x00000001088711c8 postgres`exec_replication_command(cmd_string="BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, CHECKPOINT 'fast', MANIFEST 'yes', TARGET 'client')") at walsender.c:1734:4 [opt]
  frame #6: 0x00000001088dd61e postgres`PostgresMain(dbname=<unavailable>, username=<unavailable>) at postgres.c:4494:12 [opt]

It failed at

-> 1552 if (!PageIsNew(page) && PageGetLSN(page) < sink->bbs_state->startptr)

and the problem is evidently that the page pointer isn't nicely aligned:

(lldb) p page
(char *) $4 = 0x00007fdeded7e041 ""

(I checked the "sink" data structure too for luck, but it seems fine.)

I see that thorntail has now also fallen over, presumably for the same reason.

regards, tom lane
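To illustrate the diagnosis in the message above: the address lldb printed for "page" ends in 0x41, so it is odd and cannot satisfy the alignment that a struct of 32-bit fields normally requires. The following is only a minimal sketch of that arithmetic. DemoPageHeader is a hypothetical stand-in (not PostgreSQL's PageHeaderData), and is_aligned is a helper invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a page header made of 32-bit fields. */
typedef struct DemoPageHeader
{
    uint32_t lsn_hi;
    uint32_t lsn_lo;
} DemoPageHeader;

/*
 * Return true if addr is a multiple of align.
 * align must be a nonzero power of two.
 */
static inline bool
is_aligned(uint64_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (uint64_t) (align - 1)) == 0;
}
```

Passing the reported value 0x00007fdeded7e041 together with _Alignof(DemoPageHeader) yields false on common platforms, while the address one byte lower (0x...e040) yields true.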
Robert Haas <robertmhaas@gmail.com> writes:
> Unfortunately, I can't reproduce this locally, even with COPT=-Wall
> -Werror -fno-omit-frame-pointer -fsanitize-trap=alignment
> -Wno-deprecated-declarations -DWRITE_READ_PARSE_PLAN_TREES
> -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS.

Now that I re-read what you did, I believe you need both of

-fsanitize=alignment -fsanitize-trap=alignment

to enable those traps to happen. That seems to be the case with Apple's clang, anyway.

regards, tom lane
On 2022-01-18 17:12:00 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Unfortunately, I can't reproduce this locally, even with COPT=-Wall
> > -Werror -fno-omit-frame-pointer -fsanitize-trap=alignment
> > -Wno-deprecated-declarations -DWRITE_READ_PARSE_PLAN_TREES
> > -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS.
>
> Now that I re-read what you did, I believe you need both of
>
> -fsanitize=alignment -fsanitize-trap=alignment
>
> to enable those traps to happen. That seems to be the case with
> Apple's clang, anyway.

FWIW, I can reproduce it on Linux, but only if I use -fno-sanitize-recover instead of -fsanitize-trap=alignment. That then also produces a nicer explanation of the problem:

/home/andres/src/postgresql/src/backend/replication/basebackup.c:1552:10: runtime error: member access within misaligned address 0x000002b9ce09 for type 'PageHeaderData' (aka 'struct PageHeaderData'), which requires 4 byte alignment
0x000002b9ce09: note: pointer points here
00 00 00 64 00 00 00 00 c8 ad 0c 01 c5 1b 00 00 48 00 f0 1f f0 1f 04 20 00 00 00 00 62 31 05 00 ^
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/andres/src/postgresql/src/backend/replication/basebackup.c:1552:10 in
2022-01-18 17:36:17.746 PST [1448756] LOG: server process (PID 1448774) exited with exit code 1
2022-01-18 17:36:17.746 PST [1448756] DETAIL: Failed process was running: BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, CHECKPOINT 'fast', MANIFEST 'yes', TARGET 'client')

The problem originates in bbsink_copystream_begin_backup()...

Greetings,

Andres Freund
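The sanitizer report above says a PageHeaderData member was read through a pointer that did not meet the type's 4-byte alignment. As a general illustration only, and explicitly a different technique from whatever fix this thread settles on, one well-defined way to read a struct from an arbitrary byte offset is to memcpy it into a properly aligned local variable first. DemoHeader and read_header_unaligned are hypothetical names for this sketch and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout; not PostgreSQL's PageHeaderData. */
typedef struct DemoHeader
{
    uint32_t hi;
    uint32_t lo;
} DemoHeader;

/*
 * Copy a header out of a byte buffer at any offset.
 * memcpy has no alignment requirement on its source, so this is
 * well defined even when src is an odd address; dereferencing
 * (const DemoHeader *) src directly would not be.
 */
static inline DemoHeader
read_header_unaligned(const unsigned char *src)
{
    DemoHeader hdr;

    assert(src != NULL);
    memcpy(&hdr, src, sizeof(hdr));
    return hdr;
}
```

The cost is one small copy per read. Aligning the buffer itself avoids the copy, which matters when the same code path inspects every page of a base backup.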
On Tue, Jan 18, 2022 at 5:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Now that I re-read what you did, I believe you need both of
>
> -fsanitize=alignment -fsanitize-trap=alignment
>
> to enable those traps to happen. That seems to be the case with
> Apple's clang, anyway.

Ah, I guess I copied and pasted the options wrong, or something. Anyway, I have an idea how to fix this. I didn't realize that we were going to read from the bbsink's buffer like this, and it's not properly aligned for that. I'll jigger things around to fix that.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 8:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Ah, I guess I copied and pasted the options wrong, or something.
> Anyway, I have an idea how to fix this. I didn't realize that we were
> going to read from the bbsink's buffer like this, and it's not
> properly aligned for that. I'll jigger things around to fix that.

Here's a patch, based in part on some off-list discussion with Andres. I'm still not able to reproduce the problem either with the flags you propose (which don't cause a failure) or the ones which Andres suggests (which make clang bitterly unhappy) or the ones clang says I should use instead of the ones Andres suggests (which make initdb fall over, so we never even get to the point of attempting anything related to the code this patch modified).

I believe Andres has already confirmed that this fix works, but it wouldn't hurt if Tom wants to verify it also.

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
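Robert's description is that the sink's buffer was not aligned for the reads that sendFile() performs on it. The patch itself is not reproduced in this thread, so the following is only a sketch of one way such a buffer could be laid out, under the assumption that the sink needs a single message-type byte immediately before the payload. DemoMsgBuffer, demo_msgbuffer_init, the use of malloc, and the 'd' type byte in the usage note are all hypothetical choices for this example, not the committed code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical buffer: one type byte directly before an aligned payload. */
typedef struct DemoMsgBuffer
{
    char *raw;      /* start of the allocation; pass this to free() */
    char *msg;      /* points at the type byte (payload - 1) */
    char *payload;  /* maximally aligned start of the payload */
} DemoMsgBuffer;

/*
 * Allocate payload_len bytes of payload preceded by one type byte.
 * Instead of allocating len + 1 and using raw + 1 as the payload
 * (which leaves the payload on an odd address), reserve a whole
 * alignment unit in front and put the type byte in its last slot.
 * Returns 0 on success, -1 if the allocation fails.
 */
static inline int
demo_msgbuffer_init(DemoMsgBuffer *mb, size_t payload_len, char type)
{
    const size_t pad = _Alignof(max_align_t);

    assert(mb != NULL);
    mb->raw = malloc(pad + payload_len);
    if (mb->raw == NULL)
        return -1;

    /* malloc results are max-aligned, so raw + pad is max-aligned too. */
    mb->payload = mb->raw + pad;
    mb->msg = mb->payload - 1;
    mb->msg[0] = type;
    return 0;
}
```

A caller might run demo_msgbuffer_init(&mb, 8192, 'd'), fill mb.payload, and then send the contiguous range starting at mb.msg. The type byte and payload stay adjacent, so no extra copy is needed, at the cost of a few padding bytes per buffer.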
Robert Haas <robertmhaas@gmail.com> writes:
> Here's a patch, based in part on some off-list discussion with Andres.
> I believe Andres has already confirmed that this fix works, but it
> wouldn't hurt if Tom wants to verify it also.

WFM too --- at least, pg_basebackup's "make check" passes now.

regards, tom lane
On Tue, Jan 18, 2022 at 9:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Here's a patch, based in part on some off-list discussion with Andres.
> > I believe Andres has already confirmed that this fix works, but it
> > wouldn't hurt if Tom wants to verify it also.
>
> WFM too --- at least, pg_basebackup's "make check" passes now.

Thanks for checking. Committed.

--
Robert Haas
EDB: http://www.enterprisedb.com