Re: 12.2: Why do my Redo Logs disappear?? - Mailing list pgsql-general

From Adrian Klaver
Subject Re: 12.2: Why do my Redo Logs disappear??
Date
Msg-id d105ee7a-19d5-3207-6311-a1dde37f95f5@aklaver.com
Whole thread Raw
In response to 12.2: Why do my Redo Logs disappear??  (Peter <pmc@citylink.dinoex.sub.org>)
Responses Re: Something else about Redo Logs disappearing  (Peter <pmc@citylink.dinoex.sub.org>)
List pgsql-general
On 6/8/20 5:02 PM, Peter wrote:
> Hi all,
>    this is a 12.2 Release on FreeBSD 11.3.
> 
> I am doing RedoLog Archiving according to Docs Chapter 25.1.

There is no ReDo logging, there is WAL logging.

What docs, because section 25.1 in the Postgres docs is :

https://www.postgresql.org/docs/12/backup-dump.html

25.1. SQL Dump

> 
> During the last week I have lost 4 distinct Redo Logs; they are
> not in the backup.
> 
> Loosing a RedoLog is very bad, because there is no redundancy,
> loosing a single one of them makes the timeline disappear and it
> will only reappear after another Base Backup. Very very bad.
> 
> So I did analyze the matter.
> 
> There are three ways to restart a Berkeley machine:
> 
> 1. Cowboy it (aka pull-the-plug). This makes sure that everything is
>     certainly dead immediately, and, given all hard- and software is
>     well designed, nothing bad should happen.
> 
> 2. Shut it down. This is the official means, and it takes very long,
>     because each and every applications are signalled and given time to
>     bring up whatever objections they may have.
> 
>     In this case "pg_ctl stop" will be invoked with whatever options the
>     sysop has configured, and postgres will copy out a full log into
>     archive before terminating.
> 
> 3. Halt/Reboot it, like this:
>     https://www.freebsd.org/cgi/man.cgi?query=reboot&sektion=8&manpath=FreeBSD+11.3-RELEASE
>     This is considered more safe than pull-the-plug, and still fast.
>     Applications are killed without much waiting, but all disk buffers
>     are flushed to permanent storage and filesystems closed.
>     
>     In this case, it seems,  Postgres will delete the current log
>     without archiving it. :(
> 
>     What precisely happens (according to the OS sources) during reboot
>     is this: processes will be sent SIGTERM, and after some 2-10
>     seconds followed by SIGKILL.

https://www.postgresql.org/docs/12/server-shutdown.html
"
Important

It is best not to use SIGKILL to shut down the server. Doing so will 
prevent the server from releasing shared memory and semaphores. 
Furthermore, SIGKILL kills the postgres process without letting it relay 
the signal to its subprocesses, so it might be necessary to kill the 
individual subprocesses by hand as well.

To terminate an individual session while allowing other sessions to 
continue, use pg_terminate_backend() (see Table 9.83) or send a SIGTERM 
signal to the child process associated with the session."

What is RedoLog.bck?

> 
> --------------------------------------
> 
> Lets have a closer look:
> 
> We did a regular shutdown at 17:09, and then we did a reboot at 19:24.
> 
> Here is the content of the staging area (where the logs are
> copied to and accumulated until it is worth to run a backup job):
> 
> # dir arch/
> total 240
> drwx------  2 postgres  postgres         5 Jun  8 17:09 .
> drwxr-xr-x  6 postgres  postgres         7 Jun  8 17:09 ..
> -rw-------  1 postgres  postgres  16777216 Jun  8 09:38 0000000100000017000000FC.ok
> -rw-------  1 postgres  postgres  16777216 Jun  8 10:48 0000000100000017000000FD.ok
> -rw-------  1 postgres  postgres  16777216 Jun  8 17:09 0000000100000017000000FE.ok
> 
> And here is the pg_wal directory:
> 
> # dir data12/pg_wal/
> total 89256
> drwx------   3 postgres  postgres        10 Jun  8 19:28 .
> drwx------  19 postgres  postgres        23 Jun  8 19:28 ..
> -rw-------   1 postgres  postgres       335 Jun  7 07:36 0000000100000017000000EF.00000060.backup
> -rw-------   1 postgres  postgres  16777216 Jun  8 19:38 000000010000001800000000
> -rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000001
> -rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000002
> -rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000003
> -rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000004
> -rw-------   1 postgres  postgres  16777216 Jun  7 07:36 000000010000001800000005
> drwx------   2 postgres  postgres         3 Jun  8 17:09 archive_status
> # dir data12/pg_wal/archive_status
> total 23
> drwx------  2 postgres  postgres   3 Jun  8 17:09 .
> drwx------  3 postgres  postgres  10 Jun  8 19:28 ..
> -rw-------  1 postgres  postgres   0 Jun  7 07:36 0000000100000017000000EF.00000060.backup.done
> 
> Now where the hell is my "FF" log ???
> 
> Lets check syslog - this was the full shutdown at 17:09:
> 
> Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [10-1] :[] LOG:  00000: received fast shutdown request
> Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [10-2] :[] LOCATION:  pmdie, postmaster.c:2780
> Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [11-1] :[] LOG:  00000: aborting any active transactions
> Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [11-2] :[] LOCATION:  pmdie, postmaster.c:2813
> Jun  8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-1] :[] DEBUG:  00000: logical replication launcher shutting
down
> Jun  8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-2] :[] LOCATION:  ProcessInterrupts, postgres.c:2981
> Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [12-1] :[] LOG:  00000: background worker "logical replication
launcher"(PID 73549) exited with exit code 1
 
> Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [12-2] :[] LOCATION:  LogChildExit, postmaster.c:3657
> Jun  8 17:09:38 <local0.info> admn pg-bck[73544]: [13-1] :[] LOG:  00000: shutting down
> Jun  8 17:09:38 <local0.info> admn pg-bck[73544]: [13-2] :[] LOCATION:  ShutdownXLOG, xlog.c:8321
> Jun  8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck invoked pg_wal/0000000100000017000000FE
0000000100000017000000FE
> Jun  8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck pg_wal/0000000100000017000000FE
0000000100000017000000FEreturns 0
 
> Jun  8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-1] :[] DEBUG:  00000: archived write-ahead log file
"0000000100000017000000FE"
> Jun  8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-2] :[] LOCATION:  pgarch_archiveXlog, pgarch.c:675
> Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [14-1] :[] LOG:  00000: checkpoint starting: shutdown immediate
> Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [14-2] :[] LOCATION:  LogCheckpointStart, xlog.c:8362
> Jun  8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-1] :[] DEBUG:  00000: performing replication slot checkpoint
> Jun  8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-2] :[] LOCATION:  CheckPointReplicationSlots, slot.c:1078
> Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [16-1] :[] LOG:  00000: checkpoint complete: wrote 0 buffers
(0.0%);0 WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=10.131 s; sync files=0,
longest=0.000s, average=0.000 s; distance=16383 kB, estimate=16383 kB
 
> Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [16-2] :[] LOCATION:  LogCheckpointEnd, xlog.c:8435
> Jun  8 17:09:55 <local0.info> admn pg-bck[73534]: [13-1] :[] LOG:  00000: database system is shut down
> Jun  8 17:09:55 <local0.info> admn pg-bck[73534]: [13-2] :[] LOCATION:  UnlinkLockFiles, miscinit.c:859
> 
> So, "FE" was copied out okay, and we have it in the staging area.
> 
> And here is the relevant stuff from the following startup - here we
> consequentially are within the "FF" log:
> 
> Jun  8 17:15:38 <local0.info> admn pg-bck[6366]: [8-1] :[] LOG:  00000: database system was shut down at 2020-06-08
17:09:55CEST
 
> Jun  8 17:15:38 <local0.info> admn pg-bck[6366]: [8-2] :[] LOCATION:  StartupXLOG, xlog.c:6242
> Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-1] :[] DEBUG:  00000: checkpoint record is at 17/FF000024
> Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-2] :[] LOCATION:  StartupXLOG, xlog.c:6532
> Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [10-1] :[] DEBUG:  00000: redo record is at 17/FF000024; shutdown
true
> ----------------------------------------
> 
>>From the fast reboot @ 19:24 noting at all is logged.
> 
> At the following startup, we see that we are still within the "FF"
> log:
> 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-1] :[] DEBUG:  00000: registering background worker "logical
replicationlauncher"
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-2] :[] LOCATION:  RegisterBackgroundWorker, bgworker.c:854
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [2-1] :[] LOG:  00000: starting PostgreSQL 12.2 on
i386-portbld-freebsd11.3,compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
 
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [2-2] :[] LOCATION:  PostmasterMain, postmaster.c:997
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [3-1] :[] LOG:  00000: listening on IPv4 address "0.0.0.0", port
5433
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [3-2] :[] LOCATION:  StreamServerPort, pqcomm.c:590
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [4-1] :[] LOG:  00000: listening on Unix socket
"/tmp/.s.PGSQL.5433"
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [4-2] :[] LOCATION:  StreamServerPort, pqcomm.c:585
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-1] :[] LOG:  00000: ending log output to stderr
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-2] :[] HINT:  Future log output will go to log destination
"syslog".
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-3] :[] LOCATION:  PostmasterMain, postmaster.c:1297
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [6-1] :[] LOG:  XX000: could not send test message on socket for
statisticscollector: Permission denied
 
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [6-2] :[] LOCATION:  pgstat_init, pgstat.c:486
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [7-1] :[] LOG:  00000: trying another address for the statistics
collector
> Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [7-2] :[] LOCATION:  pgstat_init, pgstat.c:418
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [8-1] :[] LOG:  00000: database system was interrupted; last known
upat 2020-06-08 17:25:38 CEST
 
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [8-2] :[] LOCATION:  StartupXLOG, xlog.c:6267
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-1] :[] DEBUG:  00000: checkpoint record is at 17/FF01BFE8
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-2] :[] LOCATION:  StartupXLOG, xlog.c:6532
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-1] :[] DEBUG:  00000: redo record is at 17/FF01BFB4; shutdown
false
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-2] :[] LOCATION:  StartupXLOG, xlog.c:6609
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-1] :[] DEBUG:  00000: next transaction ID: 18955154; next OID:
145913
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-2] :[] LOCATION:  StartupXLOG, xlog.c:6613
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-1] :[] DEBUG:  00000: next MultiXactId: 1; next
MultiXactOffset:0
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-2] :[] LOCATION:  StartupXLOG, xlog.c:6617
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-1] :[] DEBUG:  00000: oldest unfrozen transaction ID: 479, in
database13777
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-2] :[] LOCATION:  StartupXLOG, xlog.c:6620
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-1] :[] DEBUG:  00000: oldest MultiXactId: 1, in database 1
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-2] :[] LOCATION:  StartupXLOG, xlog.c:6623
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-1] :[] DEBUG:  00000: commit timestamp Xid oldest/newest: 0/0
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-2] :[] LOCATION:  StartupXLOG, xlog.c:6626
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-1] :[] DEBUG:  00000: transaction ID wrap limit is 2147484126,
limitedby database with OID 13777
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-2] :[] LOCATION:  SetTransactionIdLimit, varsup.c:408
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-1] :[] DEBUG:  00000: MultiXactId wrap limit is 2147483648,
limitedby database with OID 1
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-2] :[] LOCATION:  SetMultiXactIdLimit, multixact.c:2267
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-1] :[] DEBUG:  00000: starting up replication slots
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-2] :[] LOCATION:  StartupReplicationSlots, slot.c:1114
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [19-1] :[] LOG:  00000: database system was not properly shut down;
automaticrecovery in progress
 
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [19-2] :[] LOCATION:  StartupXLOG, xlog.c:6764
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-1] :[] DEBUG:  00000: resetting unlogged relations: cleanup 1
init0
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-2] :[] LOCATION:  ResetUnloggedRelations, reinit.c:55
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [21-1] :[] LOG:  00000: redo starts at 17/FF01BFB4
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [21-2] :[] LOCATION:  StartupXLOG, xlog.c:7035
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [22-1] :[] LOG:  00000: redo done at 17/FF01C098
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [22-2] :[] LOCATION:  StartupXLOG, xlog.c:7297
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-1] :[] DEBUG:  00000: resetting unlogged relations: cleanup 0
init1
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-2] :[] LOCATION:  ResetUnloggedRelations, reinit.c:55
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [24-1] :[] LOG:  00000: checkpoint starting: end-of-recovery
immediate
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [24-2] :[] LOCATION:  LogCheckpointStart, xlog.c:8362
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-1] :[] DEBUG:  00000: performing replication slot checkpoint
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-2] :[] LOCATION:  CheckPointReplicationSlots, slot.c:1078
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [26-1] :[] LOG:  00000: checkpoint complete: wrote 0 buffers (0.0%);
0WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0, longest=0.000 s,
average=0.000s; distance=16272 kB, estimate=16272 kB
 
> Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [26-2] :[] LOCATION:  LogCheckpointEnd, xlog.c:8435
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-1] :[] DEBUG:  00000: MultiXactId wrap limit is 2147483648,
limitedby database with OID 1
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-2] :[] LOCATION:  SetMultiXactIdLimit, multixact.c:2267
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-1] :[] DEBUG:  00000: MultiXact member stop limit is now
4294914944based on MultiXact 1
 
> Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-2] :[] LOCATION:  SetOffsetVacuumLimit, multixact.c:2630
> 
> There is no further information about anything concerning the logs,
> but at this point in time "FF" has disappeared. It was NOT copied out,
> we can see the timestamp on the pg_wal/archive_status still being at
> 17:09.
> 
> Nothing except postgres is supposed to write/delete anything within
> the data tree, and the RedoLog.bck script doesn't delete anything at all.
> 
> Another cute question would be: The "FE" log was successfully copied
> out at 17:09:45. The checkpoint at 17:09:55 then says "1 removed" -
> this should be the "FE", because the one before was written seven
> hours earlier (see the "ls" above) and should by long be gone.
> 
> But then the checkpoint at 19:28:24 again says "1 removed". What was
> removed there? It is unlikely to be the "FE", and the "FF" would be
> currently in use - and the archive_status directory was not written
> since 17:09. But the "FF" has disappeared. So what is going on here?
> 
> cheerio,
> PMc
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



pgsql-general by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Is it possible to use keywords (date units) in a function definition?
Next
From: Peter Geoghegan
Date:
Subject: Re: 12.2: Why do my Redo Logs disappear??