Thread: 9.2 recovery/startup problems

9.2 recovery/startup problems

From

Jeff Janes

Date:

26 November 2014, 05:30:52

Using both 9.2.9 and 9_2_STABLE 9b468bcec15f1ea7433d4, I get a fairly reproducible startup failure.

What I was doing is restore a database from a base backup and roll it forward with a recovery.conf until it completes and starts up. Then I truncate an unlogged table and start repopulating it with a large slow 'insert into ...select from' query.

While that is running, if I reboot the server, when it restarts it does not come back up initially.

This is what I get in the log from the attempted restart:

PST LOG: database system was interrupted; last known up at 2014-11-25 15:40:33 PST

PST LOG: database system was not properly shut down; automatic recovery in progress

PST LOG: redo starts at 84/EF000080

PST LOG: record with zero length at 84/EF09AE18

PST LOG: redo done at 84/EF09AD28

PST LOG: last completed transaction was at log time 2014-11-25 15:42:09.173599-08

PST LOG: checkpoint starting: end-of-recovery immediate

PST LOG: checkpoint complete: wrote 103 buffers (0.2%); 0 transaction log file(s) added, 246 removed, 7 recycled; write=0.002 s, sync=0.020 s, total=0.526 s; sync files=51, longest=0.003 s, average=0.000 s

PST FATAL: could not create file "base/16416/59288": File exists

PST LOG: startup process (PID 2472) exited with exit code 1

PST LOG: aborting startup due to startup process failure

oid2name doesn't show me any 59288, so I think it is the new copy of the unlogged table which is being created at the moment of the reboot.

I can't distribute a tarball of this particular database. How would I go about debugging this? Should I track down the source of the FATAL and convert it to a PANIC so I can get a core dump to look at?

A second attempt to start up the server completes successfully.

Cheers,

Jeff

Re: 9.2 recovery/startup problems

From

Alvaro Herrera

Date:

26 November 2014, 13:06:05

Jeff Janes wrote:

> This is what I get in the log from the attempted restart:
> 
> PST LOG:  database system was interrupted; last known up at 2014-11-25
> 15:40:33 PST
> PST LOG:  database system was not properly shut down; automatic recovery in
> progress
> PST LOG:  redo starts at 84/EF000080
> PST LOG:  record with zero length at 84/EF09AE18
> PST LOG:  redo done at 84/EF09AD28
> PST LOG:  last completed transaction was at log time 2014-11-25
> 15:42:09.173599-08
> PST LOG:  checkpoint starting: end-of-recovery immediate
> PST LOG:  checkpoint complete: wrote 103 buffers (0.2%); 0 transaction log
> file(s) added, 246 removed, 7 recycled; write=0.002 s, sync=0.020 s,
> total=0.526 s; sync files=51, longest=0.003 s, average=0.000 s
> PST FATAL:  could not create file "base/16416/59288": File exists

Maybe fire up pg_xlogdump to see what xlog record references that relfilenode.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: 9.2 recovery/startup problems

From

Jeff Janes

Date:

26 November 2014, 19:29:15

On Wed, Nov 26, 2014 at 5:06 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Jeff Janes wrote:

> This is what I get in the log from the attempted restart:
>
> PST LOG: database system was interrupted; last known up at 2014-11-25
> 15:40:33 PST
> PST LOG: database system was not properly shut down; automatic recovery in
> progress
> PST LOG: redo starts at 84/EF000080
> PST LOG: record with zero length at 84/EF09AE18
> PST LOG: redo done at 84/EF09AD28
> PST LOG: last completed transaction was at log time 2014-11-25
> 15:42:09.173599-08
> PST LOG: checkpoint starting: end-of-recovery immediate
> PST LOG: checkpoint complete: wrote 103 buffers (0.2%); 0 transaction log
> file(s) added, 246 removed, 7 recycled; write=0.002 s, sync=0.020 s,
> total=0.526 s; sync files=51, longest=0.003 s, average=0.000 s
> PST FATAL: could not create file "base/16416/59288": File exists

Maybe fire up pg_xlogdump to see what xlog record references that relfilenode.

pg_xlogdump doesn't exist yet in 9.2 (or can I use a newer one against the older files?).

Immediately after the reboot, base/16416/59288 exists but is empty.

After the failed start up attempt, base/16416/59288 and base/16416/59288_init both exist and are empty, with base/16416/59288 having its pre-startup-attempt timestamp.

So it seems like the code that copies the init fork over the main fork at the end of crash recovery is doing something wrong in this case. But I don't understand why it then succeeds at starting up the next time I try it.

If I use Snaga's xlogdump (which I'm not sure is entirely correct under 9.2), the end of the stream looks like this:

[cur:85/1008BD88, xid:4547590, rmid:10(Heap), len/tot_len:174/206, info:64, prev:85/1008BD58] hot_update: s/d/r:1663/16416/12636 block 16 off 41 to block 16 off 43

[cur:85/1008BE58, xid:4547590, rmid:2(Storage), len/tot_len:16/48, info:16, prev:85/1008BD88] create rel: s/d/r:1663/16416/59288

[cur:85/1008BE88, xid:4547590, rmid:8(Standby), len/tot_len:16/48, info:0, prev:85/1008BE58] standby

[page:70, xlp_info:5, xlp_tli:2, xlp_pageaddr:85/1008C000] XLP_FIRST_IS_CONTRECORD XLP_BKP_REMOVABLE

[cur:85/1008BEB8, xid:4547590, rmid:10(Heap), len/tot_len:28/7160, info:72, prev:85/1008BE88] hot_update: s/d/r:1663/16416/12636 block 30 off 32 to block 30 off 33

[cur:85/1008BEB8, xid:4547590, rmid:10(Heap), len/tot_len:28/7160, info:72, prev:85/1008BE88] bkpblock[1]: s/d/r:1663/16416/12636 blk:30 hole_off/len:156/1116

Bogus page magic number 0000 at offset 8E000

Thanks,

Jeff

Re: 9.2 recovery/startup problems

From

Andres Freund

Date:

26 November 2014, 21:03:37

Hi,

On 2014-11-26 11:29:09 -0800, Jeff Janes wrote:
> On Wed, Nov 26, 2014 at 5:06 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
> 
> > Jeff Janes wrote:
> >
> > > This is what I get in the log from the attempted restart:
> > >
> > > PST LOG:  database system was interrupted; last known up at 2014-11-25
> > > 15:40:33 PST
> > > PST LOG:  database system was not properly shut down; automatic recovery
> > in
> > > progress
> > > PST LOG:  redo starts at 84/EF000080
> > > PST LOG:  record with zero length at 84/EF09AE18
> > > PST LOG:  redo done at 84/EF09AD28
> > > PST LOG:  last completed transaction was at log time 2014-11-25
> > > 15:42:09.173599-08
> > > PST LOG:  checkpoint starting: end-of-recovery immediate
> > > PST LOG:  checkpoint complete: wrote 103 buffers (0.2%); 0 transaction
> > log
> > > file(s) added, 246 removed, 7 recycled; write=0.002 s, sync=0.020 s,
> > > total=0.526 s; sync files=51, longest=0.003 s, average=0.000 s
> > > PST FATAL:  could not create file "base/16416/59288": File exists

Any chance you can reproduce this in a reproducible fashion?

> > Maybe fire up pg_xlogdump to see what xlog record references that
> > relfilenode.
> >
> 
> pg_xlogdump doesn't exist yet in 9.2  (or can I use a newer one against the
> older files?).

Nope, you can't :(

> Immediately after the reboot, base/16416/59288 exists but is empty.
> 
> After the failed start up attempt, base/16416/59288 and
> base/16416/59288_init both exist and are empty, with base/16416/59288
> having its pre-startup-attempt timestamp.
> 
> So it seems like the code that copies the init fork over the main fork at
> the end of crash recovery is doing something wrong in this case.  But I
> don't understand why it then succeeds at starting up the next time I try it.
> 
> 
> If I use Snaga's xlogdump (which I'm not sure is entirely correct under
> 9.2), the end of the stream looks like this:
> 
> [cur:85/1008BD88, xid:4547590, rmid:10(Heap), len/tot_len:174/206, info:64,
> prev:85/1008BD58] hot_update: s/d/r:1663/16416/12636 block 16 off 41 to
> block 16 off 43
> [cur:85/1008BE58, xid:4547590, rmid:2(Storage), len/tot_len:16/48, info:16,
> prev:85/1008BD88] create rel: s/d/r:1663/16416/59288
> [cur:85/1008BE88, xid:4547590, rmid:8(Standby), len/tot_len:16/48, info:0,
> prev:85/1008BE58] standby
> [page:70, xlp_info:5, xlp_tli:2, xlp_pageaddr:85/1008C000]
> XLP_FIRST_IS_CONTRECORD XLP_BKP_REMOVABLE
> [cur:85/1008BEB8, xid:4547590, rmid:10(Heap), len/tot_len:28/7160, info:72,
> prev:85/1008BE88] hot_update: s/d/r:1663/16416/12636 block 30 off 32 to
> block 30 off 33
> [cur:85/1008BEB8, xid:4547590, rmid:10(Heap), len/tot_len:28/7160, info:72,
> prev:85/1008BE88] bkpblock[1]: s/d/r:1663/16416/12636 blk:30
> hole_off/len:156/1116

Hm. It looks like the problem here might be an ordering one. Notice that
there seems to be a record for the main relation, but not the init fork
in the excerpt.

It wonder if ExcecuteTruncate() does things in the wrong order.        /*         * Need the full transaction-safe
pushups.        *         * Create a new empty storage file for the relation, and assign it         * as the
relfilenodevalue. The old storage file is scheduled for         * deletion at commit.         */
RelationSetNewRelfilenode(rel,RecentXmin, minmulti);        if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
          heap_create_init_fork(rel);
 

Arguably that can cause problems if the node is promoted between the
RelationSetNewRelfilenode() and the heap_create_init_fork(). On the
other hand, the fork should essentially be 'invisible' in that case as
the transaction won't commit...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: 9.2 recovery/startup problems

From

Michael Paquier

Date:

26 November 2014, 23:48:12

On Thu, Nov 27, 2014 at 4:29 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> pg_xlogdump doesn't exist yet in 9.2  (or can I use a newer one against the
> older files?).
Not sure if pg_xlogdump would work (9.5 not for sure, 9.4 should
partially, 9.3 has better chances), but you could try this one as well
that is compatible with 9.2:
https://github.com/snaga/xlogdump
-- 
Michael

Re: 9.2 recovery/startup problems

From

Jeff Janes

Date:

27 November 2014, 00:13:54

On Tue, Nov 25, 2014 at 9:30 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Using both 9.2.9 and 9_2_STABLE 9b468bcec15f1ea7433d4, I get a fairly reproducible startup failure.

What I was doing is restore a database from a base backup and roll it forward with a recovery.conf until it completes and starts up. Then I truncate an unlogged table and start repopulating it with a large slow 'insert into ...select from' query.

While that is running, if I reboot the server, when it restarts it does not come back up initially.

This is what I get in the log from the attempted restart:

PST LOG: database system was interrupted; last known up at 2014-11-25 15:40:33 PST
PST LOG: database system was not properly shut down; automatic recovery in progress
PST LOG: redo starts at 84/EF000080
PST LOG: record with zero length at 84/EF09AE18
PST LOG: redo done at 84/EF09AD28
PST LOG: last completed transaction was at log time 2014-11-25 15:42:09.173599-08
PST LOG: checkpoint starting: end-of-recovery immediate
PST LOG: checkpoint complete: wrote 103 buffers (0.2%); 0 transaction log file(s) added, 246 removed, 7 recycled; write=0.002 s, sync=0.020 s, total=0.526 s; sync files=51, longest=0.003 s, average=0.000 s
PST FATAL: could not create file "base/16416/59288": File exists
PST LOG: startup process (PID 2472) exited with exit code 1
PST LOG: aborting startup due to startup process failure

oid2name doesn't show me any 59288, so I think it is the new copy of the unlogged table which is being created at the moment of the reboot.

59288 is not the new (uncommitted) unlogged table itself, but is the new (uncommitted) toast table associated with that unlogged table.

immediately before the 'sudo /sbin/reboot', while the 'truncate unlogged_table; insert into unlogged_table from select...' is running, the two files below exist and have zero size.

base/16416/59288_init

base/16416/59288

Immediately after the system has come back up, the file base/16416/59288_init no longer exists.

I can't reproduce this behavior without the reboot. It seems to depend on the sequencing and timing of the signals which the kernel delivers to the running processes during the reboot.

If I do a pg_ctl stop -mf, then both files go away. If I do a pg_ctl stop -mi, then neither goes away. It is only with the /sbin/reboot that I get the fatal combination of _init being gone but the other still present.

Then once the system is back and I restart postmaster, the first pass through ResetUnloggedRelationsInDbspaceDir doesn't remove base/16416/59288, because base/16416/59288_init doesn't exist to trigger the deletion. On the 2nd pass through, base/16416/59288_init exists because it was created by WAL replay, and the copy fails because it is expecting base/16416/59288 to have already been cleaned up by the first pass.

Should ResetUnloggedRelationsInDbspaceDir on the second pass attempt to unlink the target immediately before the copy (line 338, in 9.2) to accommodate cases like this?

Cheers,

Jeff

Re: 9.2 recovery/startup problems

From

Robert Haas

Date:

02 December 2014, 15:41:20

On Wed, Nov 26, 2014 at 7:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> If I do a pg_ctl stop -mf, then both files go away.  If I do a pg_ctl stop
> -mi, then neither goes away.  It is only with the /sbin/reboot that I get
> the fatal combination of _init being gone but the other still present.

Eh?  That sounds wonky.

I mean, reboot normally kills processes with SIGTERM or SIGKILL, in
which case I'd expect the outcome to match what you get with pg_ctl
stop -mf or pg_ctl stop -mi.  The only way I can see that you'd get a
different behavior is if you did a hard reboot (like echo b >
/proc/sysrq-trigger); if that changes things, then we might have a
missing-fsync bug.  How is that reboot managing to leave the main fork
behind while losing the init fork?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: 9.2 recovery/startup problems

From

Jeff Janes

Date:

02 December 2014, 16:54:18

On Tue, Dec 2, 2014 at 7:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 26, 2014 at 7:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> If I do a pg_ctl stop -mf, then both files go away. If I do a pg_ctl stop
> -mi, then neither goes away. It is only with the /sbin/reboot that I get
> the fatal combination of _init being gone but the other still present.

Eh? That sounds wonky.

I mean, reboot normally kills processes with SIGTERM or SIGKILL, in
which case I'd expect the outcome to match what you get with pg_ctl
stop -mf or pg_ctl stop -mi. The only way I can see that you'd get a
different behavior is if you did a hard reboot (like echo b >
/proc/sysrq-trigger); if that changes things, then we might have a
missing-fsync bug. How is that reboot managing to leave the main fork
behind while losing the init fork?

During abort processing after getting a SIGTERM, the back end truncates 59288 to zero size, and unlinks all the other files (including 59288_init). The actual removal of 59288 is left until the checkpoint. So if you SIGTERM the backend, then take down the server uncleanly before the next checkpoint completes, you are left with just 59288.

Here is the strace:

open("base/16416/59288", O_RDWR) = 8

ftruncate(8, 0) = 0

close(8) = 0

unlink("base/16416/59288.1") = -1 ENOENT (No such file or directory)

unlink("base/16416/59288_fsm") = -1 ENOENT (No such file or directory)

unlink("base/16416/59288_vm") = -1 ENOENT (No such file or directory)

unlink("base/16416/59288_init") = 0

unlink("base/16416/59288_init.1") = -1 ENOENT (No such file or directory)

Cheers,

Jeff

Re: 9.2 recovery/startup problems

From

Robert Haas

Date:

02 December 2014, 22:02:18

On Tue, Dec 2, 2014 at 11:54 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> During abort processing after getting a SIGTERM, the back end truncates
> 59288 to zero size, and unlinks all the other files (including 59288_init).
> The actual removal of 59288 is left until the checkpoint.  So if you SIGTERM
> the backend, then take down the server uncleanly before the next checkpoint
> completes, you are left with just 59288.
>
> Here is the strace:
>
> open("base/16416/59288", O_RDWR)        = 8
> ftruncate(8, 0)                         = 0
> close(8)                                = 0
> unlink("base/16416/59288.1")            = -1 ENOENT (No such file or
> directory)
> unlink("base/16416/59288_fsm")          = -1 ENOENT (No such file or
> directory)
> unlink("base/16416/59288_vm")           = -1 ENOENT (No such file or
> directory)
> unlink("base/16416/59288_init")         = 0
> unlink("base/16416/59288_init.1")       = -1 ENOENT (No such file or
> directory)

Hmm, that's not good.

I guess we can either adopt your suggestion of adjusting
ResetUnloggedRelationsInDbspaceDir() to cope with the possibility that
the situation has changed during recovery, or else figure out how to
be more stringent about the order in which forks get removed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company