Re: Failed to delete old ReorderBuffer spilled files - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: Failed to delete old ReorderBuffer spilled files
Date
Msg-id 20180105145338.geiwbicz2t6s67e7@alvherre.pgsql
In response to Re: Failed to delete old ReorderBuffer spilled files  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Thomas Munro wrote:
> On Wed, Nov 22, 2017 at 12:27 AM, atorikoshi
> <torikoshi_atsushi_z2@lab.ntt.co.jp> wrote:
> > [set_final_lsn_2.patch]
> 
> Hi Torikoshi-san,
> 
> FYI "make check" in contrib/test_decoding fails a couple of isolation
> tests, one with an assertion failure for my automatic patch tester[1].
> Same result on my laptop:
> 
> test ondisk_startup           ... FAILED (test process exited with exit code 1)
> test concurrent_ddl_dml       ... FAILED (test process exited with exit code 1)
> 
> TRAP: FailedAssertion("!(!dlist_is_empty(head))", File:
> "../../../../src/include/lib/ilist.h", Line: 458)

I also saw crashes a couple of times while testing this patch, but they
were several completely different crashes.  I have not been able to
reproduce the crash you show, though I've run this in 9.4 and master
many times.

For example, I got a backtrace that looks like this in 9.6:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f19ccb913fa in __GI_abort () at abort.c:89
#2  0x000055e7511f451b in errfinish (dummy=<optimized out>)
    at /pgsql/source/REL9_6_STABLE/src/backend/utils/error/elog.c:557
#3  0x000055e750ed732b in XLogFileInit (logsegno=1, 
    use_existent=use_existent@entry=0x7ffdbc34ab6f "\001\002", use_lock=use_lock@entry=1 '\001')
    at /pgsql/source/REL9_6_STABLE/src/backend/access/transam/xlog.c:3023
#4  0x000055e750edb227 in XLogWrite (WriteRqst=..., flexible=flexible@entry=0 '\000')
    at /pgsql/source/REL9_6_STABLE/src/backend/access/transam/xlog.c:2258
#5  0x000055e750ee162d in XLogBackgroundFlush ()
    at /pgsql/source/REL9_6_STABLE/src/backend/access/transam/xlog.c:2894

Then, in 9.4, I saw this one:

creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... FATAL:  could not open directory "pg_logical/snapshots": No such file or directory
STATEMENT:  CREATE DATABASE template0;
    
WARNING:  could not remove file or directory "base/12148": No such file or directory
WARNING:  some useless files may be left behind in old database directory "base/12148"
FATAL:  could not access status of transaction 0
DETAIL:  Could not open file "pg_clog/0000": No such file or directory.
child process exited with exit code 1

What this indicates to me is that perhaps the test harness is doing
stupid things such as running two servers concurrently in the same
datadir, so that they overwrite one another's files.  If I take out the
"-j2" from the make invocation, the problem no longer reproduces.

Therefore, I'm going to push this patch shortly because clearly this
problem is not its fault.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

