On Sat, Aug 24, 2013 at 1:46 PM, <harukat@sraoss.co.jp> wrote:
> The following bug has been logged on the website:
>
> Bug reference: 8397
> Logged by: TAKATSUKA Haruka
> Email address: harukat@sraoss.co.jp
> PostgreSQL version: 9.2.4
> Operating system: Linux (CentOS6)
> Description:
>
> Hi.
>
>
> I report a small bug.
> pg_basebackup -x from new standby server sometimes causes Segmentation
> fault.
>
>
> (1) create new standby server dir by pg_basebackup without -x
> (2) start new standby server
> (3) pg_basebackup from new standby server with -x
> (!) when new standby has no WAL files in pg_xlog,
> new standby's wal sender crash
>
>
> new standby server's core file:
>
>
> Core was generated by `postgres: wal sender process postgres ::1(55210)
> sending backup "pg_basebackup'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x0000003b7368ac66 in __rawmemchr_sse2 () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install
> glibc-2.12-1.107.el6.x86_64 libxml2-2.7.6-4.el6.x86_64
> zlib-1.2.3-27.el6.x86_64
> (gdb) bt
> #0 0x0000003b7368ac66 in __rawmemchr_sse2 () from /lib64/libc.so.6
> #1 0x0000003b73675990 in _IO_str_init_static_internal () from
> /lib64/libc.so.6
> #2 0x0000003b73669935 in vsscanf () from /lib64/libc.so.6
> #3 0x0000003b736639a8 in sscanf () from /lib64/libc.so.6
> #4 0x0000000000622351 in perform_base_backup (opt=0x7fffc2e22300,
> tblspcdir=0xd424c0) at basebackup.c:304
> #5 0x0000000000622c50 in SendBaseBackup (cmd=<value optimized out>)
> at basebackup.c:558
> #6 0x000000000061f5b0 in HandleReplicationCommand () at walsender.c:482
> #7 WalSndHandshake () at walsender.c:257
> #8 WalSenderMain () at walsender.c:181
> #9 0x0000000000650b12 in PostgresMain (argc=1, argv=<value optimized out>,
> dbname=0xc82a90 "", username=0xc82a70 "postgres") at postgres.c:3715
> #10 0x000000000060c4f1 in BackendRun () at postmaster.c:3614
> #11 BackendStartup () at postmaster.c:3304
> #12 ServerLoop () at postmaster.c:1367
> #13 0x000000000060f031 in PostmasterMain (argc=<value optimized out>,
> argv=<value optimized out>) at postmaster.c:1127
> #14 0x00000000005ae140 in main (argc=5, argv=0xc80bb0) at main.c:199
>
>
>
>
> ./backend/replication/basebackup.c:304
> XLogFromFileName(walFiles[0], &tli, &logid, &logseg);
>
>
> In this case, nWalFiles = 0 and walFiles[] palloced zero size.
>
>
> Though pg_basebackup does not have to work in this rare case,
> we should insert something like "if (nWalFiles <= 0) ereport(...);".

Yes, we definitely need better error checking there - a crash is never
the right answer.

Does this happen only when you take a backup "really quickly" after
setting up the new standby, or is there some scenario later in its
lifetime when it can happen? In the first case, throwing a hard error
seems quite reasonable, but if it's repeatable, perhaps there is
something better we can do?

Also, while we definitely need a sanity check at this point, might it
be worth putting a second check earlier in the process as well? AFAICT
this error gets thrown only after all the data has already been sent.
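
As a rough illustration of the kind of guard being discussed, here is a
minimal standalone sketch, not the actual fix. first_wal_segment() is a
hypothetical name used only for this example; it assumes the 24-hex-digit
WAL file naming (timeline, log id, segment) used in 9.2. In the server
itself the failure branch would presumably raise an error via ereport()
rather than return a code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the parsing done around basebackup.c:304.
 * Returns 0 on success, -1 if the list is empty or the name is malformed.
 */
int
first_wal_segment(char **walFiles, int nWalFiles,
                  unsigned int *tli, unsigned int *logid,
                  unsigned int *logseg)
{
    assert(tli != NULL && logid != NULL && logseg != NULL);

    /* The guard asked for in the report: never read walFiles[0]
     * when no WAL files were collected. */
    if (walFiles == NULL || nWalFiles < 1)
        return -1;

    /* Assumed name layout: three 8-digit hex fields, 24 characters. */
    if (strlen(walFiles[0]) != 24 ||
        sscanf(walFiles[0], "%08X%08X%08X", tli, logid, logseg) != 3)
        return -1;

    return 0;
}
```

With a check like this the empty-list case becomes an ordinary error
path instead of a read of an uninitialized pointer. It does not address
the second point above: by the time this code runs, the data has already
been sent.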
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/