12.2: Why do my Redo Logs disappear?? - Mailing list pgsql-general

From Peter
Subject 12.2: Why do my Redo Logs disappear??
Date
Msg-id 20200609000253.GA21670@gate.oper.dinoex.org
Whole thread Raw
Responses Re: 12.2: Why do my Redo Logs disappear??
Re: 12.2: Why do my Redo Logs disappear??
List pgsql-general
Hi all,
  this is a 12.2 Release on FreeBSD 11.3.

I am doing RedoLog Archiving according to Docs Chapter 25.1. 

During the last week I have lost 4 distinct Redo Logs; they are
not in the backup.

Loosing a RedoLog is very bad, because there is no redundancy,
loosing a single one of them makes the timeline disappear and it
will only reappear after another Base Backup. Very very bad.

So I did analyze the matter.

There are three ways to restart a Berkeley machine:

1. Cowboy it (aka pull-the-plug). This makes sure that everything is
   certainly dead immediately, and, given all hard- and software is
   well designed, nothing bad should happen.

2. Shut it down. This is the official means, and it takes very long,
   because each and every applications are signalled and given time to
   bring up whatever objections they may have.

   In this case "pg_ctl stop" will be invoked with whatever options the
   sysop has configured, and postgres will copy out a full log into
   archive before terminating.

3. Halt/Reboot it, like this:
   https://www.freebsd.org/cgi/man.cgi?query=reboot&sektion=8&manpath=FreeBSD+11.3-RELEASE
   This is considered more safe than pull-the-plug, and still fast.
   Applications are killed without much waiting, but all disk buffers
   are flushed to permanent storage and filesystems closed.
   
   In this case, it seems,  Postgres will delete the current log
   without archiving it. :(

   What precisely happens (according to the OS sources) during reboot
   is this: processes will be sent SIGTERM, and after some 2-10
   seconds followed by SIGKILL.

--------------------------------------

Lets have a closer look:

We did a regular shutdown at 17:09, and then we did a reboot at 19:24.

Here is the content of the staging area (where the logs are
copied to and accumulated until it is worth to run a backup job):

# dir arch/
total 240
drwx------  2 postgres  postgres         5 Jun  8 17:09 .
drwxr-xr-x  6 postgres  postgres         7 Jun  8 17:09 ..
-rw-------  1 postgres  postgres  16777216 Jun  8 09:38 0000000100000017000000FC.ok
-rw-------  1 postgres  postgres  16777216 Jun  8 10:48 0000000100000017000000FD.ok
-rw-------  1 postgres  postgres  16777216 Jun  8 17:09 0000000100000017000000FE.ok

And here is the pg_wal directory:

# dir data12/pg_wal/
total 89256
drwx------   3 postgres  postgres        10 Jun  8 19:28 .
drwx------  19 postgres  postgres        23 Jun  8 19:28 ..
-rw-------   1 postgres  postgres       335 Jun  7 07:36 0000000100000017000000EF.00000060.backup
-rw-------   1 postgres  postgres  16777216 Jun  8 19:38 000000010000001800000000
-rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000001
-rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000002
-rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000003
-rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000004
-rw-------   1 postgres  postgres  16777216 Jun  7 07:36 000000010000001800000005
drwx------   2 postgres  postgres         3 Jun  8 17:09 archive_status
# dir data12/pg_wal/archive_status
total 23
drwx------  2 postgres  postgres   3 Jun  8 17:09 .
drwx------  3 postgres  postgres  10 Jun  8 19:28 ..
-rw-------  1 postgres  postgres   0 Jun  7 07:36 0000000100000017000000EF.00000060.backup.done

Now where the hell is my "FF" log ???

Lets check syslog - this was the full shutdown at 17:09:

Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [10-1] :[] LOG:  00000: received fast shutdown request
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [10-2] :[] LOCATION:  pmdie, postmaster.c:2780
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [11-1] :[] LOG:  00000: aborting any active transactions
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [11-2] :[] LOCATION:  pmdie, postmaster.c:2813
Jun  8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-1] :[] DEBUG:  00000: logical replication launcher shutting
down
Jun  8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-2] :[] LOCATION:  ProcessInterrupts, postgres.c:2981
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [12-1] :[] LOG:  00000: background worker "logical replication
launcher"(PID 73549) exited with exit code 1
 
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [12-2] :[] LOCATION:  LogChildExit, postmaster.c:3657
Jun  8 17:09:38 <local0.info> admn pg-bck[73544]: [13-1] :[] LOG:  00000: shutting down
Jun  8 17:09:38 <local0.info> admn pg-bck[73544]: [13-2] :[] LOCATION:  ShutdownXLOG, xlog.c:8321
Jun  8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck invoked pg_wal/0000000100000017000000FE
0000000100000017000000FE
Jun  8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck pg_wal/0000000100000017000000FE 0000000100000017000000FE
returns0
 
Jun  8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-1] :[] DEBUG:  00000: archived write-ahead log file
"0000000100000017000000FE"
Jun  8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-2] :[] LOCATION:  pgarch_archiveXlog, pgarch.c:675
Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [14-1] :[] LOG:  00000: checkpoint starting: shutdown immediate
Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [14-2] :[] LOCATION:  LogCheckpointStart, xlog.c:8362
Jun  8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-1] :[] DEBUG:  00000: performing replication slot checkpoint
Jun  8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-2] :[] LOCATION:  CheckPointReplicationSlots, slot.c:1078
Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [16-1] :[] LOG:  00000: checkpoint complete: wrote 0 buffers (0.0%);
0WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=10.131 s; sync files=0, longest=0.000 s,
average=0.000s; distance=16383 kB, estimate=16383 kB
 
Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [16-2] :[] LOCATION:  LogCheckpointEnd, xlog.c:8435
Jun  8 17:09:55 <local0.info> admn pg-bck[73534]: [13-1] :[] LOG:  00000: database system is shut down
Jun  8 17:09:55 <local0.info> admn pg-bck[73534]: [13-2] :[] LOCATION:  UnlinkLockFiles, miscinit.c:859

So, "FE" was copied out okay, and we have it in the staging area.

And here is the relevant stuff from the following startup - here we
consequentially are within the "FF" log:

Jun  8 17:15:38 <local0.info> admn pg-bck[6366]: [8-1] :[] LOG:  00000: database system was shut down at 2020-06-08
17:09:55CEST
 
Jun  8 17:15:38 <local0.info> admn pg-bck[6366]: [8-2] :[] LOCATION:  StartupXLOG, xlog.c:6242
Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-1] :[] DEBUG:  00000: checkpoint record is at 17/FF000024
Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-2] :[] LOCATION:  StartupXLOG, xlog.c:6532
Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [10-1] :[] DEBUG:  00000: redo record is at 17/FF000024; shutdown
true
----------------------------------------

From the fast reboot @ 19:24 noting at all is logged.

At the following startup, we see that we are still within the "FF"
log:

Jun  8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-1] :[] DEBUG:  00000: registering background worker "logical
replicationlauncher"
 
Jun  8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-2] :[] LOCATION:  RegisterBackgroundWorker, bgworker.c:854
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [2-1] :[] LOG:  00000: starting PostgreSQL 12.2 on
i386-portbld-freebsd11.3,compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
 
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [2-2] :[] LOCATION:  PostmasterMain, postmaster.c:997
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [3-1] :[] LOG:  00000: listening on IPv4 address "0.0.0.0", port 5433
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [3-2] :[] LOCATION:  StreamServerPort, pqcomm.c:590
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [4-1] :[] LOG:  00000: listening on Unix socket "/tmp/.s.PGSQL.5433"
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [4-2] :[] LOCATION:  StreamServerPort, pqcomm.c:585
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-1] :[] LOG:  00000: ending log output to stderr
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-2] :[] HINT:  Future log output will go to log destination
"syslog".
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-3] :[] LOCATION:  PostmasterMain, postmaster.c:1297
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [6-1] :[] LOG:  XX000: could not send test message on socket for
statisticscollector: Permission denied
 
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [6-2] :[] LOCATION:  pgstat_init, pgstat.c:486
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [7-1] :[] LOG:  00000: trying another address for the statistics
collector
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [7-2] :[] LOCATION:  pgstat_init, pgstat.c:418
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [8-1] :[] LOG:  00000: database system was interrupted; last known up
at2020-06-08 17:25:38 CEST
 
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [8-2] :[] LOCATION:  StartupXLOG, xlog.c:6267
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-1] :[] DEBUG:  00000: checkpoint record is at 17/FF01BFE8
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-2] :[] LOCATION:  StartupXLOG, xlog.c:6532
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-1] :[] DEBUG:  00000: redo record is at 17/FF01BFB4; shutdown
false
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-2] :[] LOCATION:  StartupXLOG, xlog.c:6609
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-1] :[] DEBUG:  00000: next transaction ID: 18955154; next OID:
145913
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-2] :[] LOCATION:  StartupXLOG, xlog.c:6613
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-1] :[] DEBUG:  00000: next MultiXactId: 1; next MultiXactOffset:
0
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-2] :[] LOCATION:  StartupXLOG, xlog.c:6617
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-1] :[] DEBUG:  00000: oldest unfrozen transaction ID: 479, in
database13777
 
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-2] :[] LOCATION:  StartupXLOG, xlog.c:6620
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-1] :[] DEBUG:  00000: oldest MultiXactId: 1, in database 1
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-2] :[] LOCATION:  StartupXLOG, xlog.c:6623
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-1] :[] DEBUG:  00000: commit timestamp Xid oldest/newest: 0/0
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-2] :[] LOCATION:  StartupXLOG, xlog.c:6626
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-1] :[] DEBUG:  00000: transaction ID wrap limit is 2147484126,
limitedby database with OID 13777
 
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-2] :[] LOCATION:  SetTransactionIdLimit, varsup.c:408
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-1] :[] DEBUG:  00000: MultiXactId wrap limit is 2147483648,
limitedby database with OID 1
 
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-2] :[] LOCATION:  SetMultiXactIdLimit, multixact.c:2267
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-1] :[] DEBUG:  00000: starting up replication slots
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-2] :[] LOCATION:  StartupReplicationSlots, slot.c:1114
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [19-1] :[] LOG:  00000: database system was not properly shut down;
automaticrecovery in progress
 
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [19-2] :[] LOCATION:  StartupXLOG, xlog.c:6764
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-1] :[] DEBUG:  00000: resetting unlogged relations: cleanup 1
init0
 
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-2] :[] LOCATION:  ResetUnloggedRelations, reinit.c:55
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [21-1] :[] LOG:  00000: redo starts at 17/FF01BFB4
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [21-2] :[] LOCATION:  StartupXLOG, xlog.c:7035
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [22-1] :[] LOG:  00000: redo done at 17/FF01C098
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [22-2] :[] LOCATION:  StartupXLOG, xlog.c:7297
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-1] :[] DEBUG:  00000: resetting unlogged relations: cleanup 0
init1
 
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-2] :[] LOCATION:  ResetUnloggedRelations, reinit.c:55
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [24-1] :[] LOG:  00000: checkpoint starting: end-of-recovery
immediate
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [24-2] :[] LOCATION:  LogCheckpointStart, xlog.c:8362
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-1] :[] DEBUG:  00000: performing replication slot checkpoint
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-2] :[] LOCATION:  CheckPointReplicationSlots, slot.c:1078
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [26-1] :[] LOG:  00000: checkpoint complete: wrote 0 buffers (0.0%); 0
WALfile(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0, longest=0.000 s,
average=0.000s; distance=16272 kB, estimate=16272 kB
 
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [26-2] :[] LOCATION:  LogCheckpointEnd, xlog.c:8435
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-1] :[] DEBUG:  00000: MultiXactId wrap limit is 2147483648,
limitedby database with OID 1
 
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-2] :[] LOCATION:  SetMultiXactIdLimit, multixact.c:2267
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-1] :[] DEBUG:  00000: MultiXact member stop limit is now
4294914944based on MultiXact 1
 
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-2] :[] LOCATION:  SetOffsetVacuumLimit, multixact.c:2630

There is no further information about anything concerning the logs,
but at this point in time "FF" has disappeared. It was NOT copied out,
we can see the timestamp on the pg_wal/archive_status still being at
17:09.

Nothing except postgres is supposed to write/delete anything within
the data tree, and the RedoLog.bck script doesn't delete anything at all.

Another cute question would be: The "FE" log was successfully copied
out at 17:09:45. The checkpoint at 17:09:55 then says "1 removed" -
this should be the "FE", because the one before was written seven
hours earlier (see the "ls" above) and should by long be gone.

But then the checkpoint at 19:28:24 again says "1 removed". What was
removed there? It is unlikely to be the "FE", and the "FF" would be
currently in use - and the archive_status directory was not written
since 17:09. But the "FF" has disappeared. So what is going on here?

cheerio,
PMc



pgsql-general by date:

Previous
From: Alistair Johnson
Date:
Subject: Re: Is it possible to use keywords (date units) in a function definition?
Next
From: Martin Gainty
Date:
Subject: Is it possible to use keywords (date units) in a function definition?