12.2: Why do my Redo Logs disappear?? - Mailing list pgsql-general
From | Peter
---|---
Subject | 12.2: Why do my Redo Logs disappear??
Date |
Msg-id | 20200609000253.GA21670@gate.oper.dinoex.org
Responses | Re: 12.2: Why do my Redo Logs disappear??, Re: 12.2: Why do my Redo Logs disappear??
List | pgsql-general
Hi all,

this is a 12.2 release on FreeBSD 11.3. I am doing RedoLog archiving according to Docs Chapter 25.1.

During the last week I have lost 4 distinct RedoLogs; they are not in the backup. Losing a RedoLog is very bad, because there is no redundancy: losing a single one of them makes the timeline disappear, and it will only reappear after another base backup. Very, very bad.

So I did analyze the matter. There are three ways to restart a Berkeley machine:

1. Cowboy it (aka pull-the-plug). This makes sure that everything is certainly dead immediately, and, given all hard- and software is well designed, nothing bad should happen.

2. Shut it down. This is the official means, and it takes very long, because each and every application is signalled and given time to bring up whatever objections it may have. In this case "pg_ctl stop" will be invoked with whatever options the sysop has configured, and Postgres will copy out a full log into the archive before terminating.

3. Halt/reboot it, like this:
   https://www.freebsd.org/cgi/man.cgi?query=reboot&sektion=8&manpath=FreeBSD+11.3-RELEASE
   This is considered safer than pull-the-plug, and still fast. Applications are killed without much waiting, but all disk buffers are flushed to permanent storage and the filesystems are closed.

In this last case, it seems, Postgres will delete the current log without archiving it. :(

What precisely happens during a reboot (according to the OS sources) is this: processes are sent SIGTERM, followed after some 2-10 seconds by SIGKILL.

--------------------------------------

Let's have a closer look. We did a regular shutdown at 17:09, and then we did a reboot at 19:24.

Here is the content of the staging area (where the logs are copied to and accumulated until it is worth running a backup job):

# dir arch/
total 240
drwx------  2 postgres  postgres         5 Jun  8 17:09 .
drwxr-xr-x  6 postgres  postgres         7 Jun  8 17:09 ..
-rw-------  1 postgres  postgres  16777216 Jun  8 09:38 0000000100000017000000FC.ok
-rw-------  1 postgres  postgres  16777216 Jun  8 10:48 0000000100000017000000FD.ok
-rw-------  1 postgres  postgres  16777216 Jun  8 17:09 0000000100000017000000FE.ok

And here is the pg_wal directory:

# dir data12/pg_wal/
total 89256
drwx------   3 postgres  postgres        10 Jun  8 19:28 .
drwx------  19 postgres  postgres        23 Jun  8 19:28 ..
-rw-------   1 postgres  postgres       335 Jun  7 07:36 0000000100000017000000EF.00000060.backup
-rw-------   1 postgres  postgres  16777216 Jun  8 19:38 000000010000001800000000
-rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000001
-rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000002
-rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000003
-rw-------   1 postgres  postgres  16777216 Jun  7 07:17 000000010000001800000004
-rw-------   1 postgres  postgres  16777216 Jun  7 07:36 000000010000001800000005
drwx------   2 postgres  postgres         3 Jun  8 17:09 archive_status

# dir data12/pg_wal/archive_status
total 23
drwx------  2 postgres  postgres   3 Jun  8 17:09 .
drwx------  3 postgres  postgres  10 Jun  8 19:28 ..
-rw-------  1 postgres  postgres   0 Jun  7 07:36 0000000100000017000000EF.00000060.backup.done

Now where the hell is my "FF" log ???
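(For reference on the setup: archiving follows the documented archive_command pattern, i.e. something like archive_command = '.../RedoLog.bck %p %f', where the script copies each finished segment into the staging area and tags it with the ".ok" suffix seen above. I won't paste the real RedoLog.bck here; the following is only a rough sketch of such a script, with the staging path made up:)

#!/bin/sh
# Sketch of an archive script invoked by Postgres as: RedoLog.bck %p %f
#   $1 = path of the WAL segment relative to the data directory (%p)
#   $2 = bare segment file name (%f)
# The archiver runs this with the data directory as working directory and
# considers the segment archived only if the script exits with status 0.
set -e
dst="/usr/local/arch/$2.ok"    # staging area; path and ".ok" suffix are assumptions
test ! -f "$dst"               # never overwrite an already archived copy
cp "$1" "$dst"                 # a real script would also fsync the copy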
Let's check syslog - this was the full shutdown at 17:09:

Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [10-1] :[] LOG:  00000: received fast shutdown request
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [10-2] :[] LOCATION:  pmdie, postmaster.c:2780
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [11-1] :[] LOG:  00000: aborting any active transactions
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [11-2] :[] LOCATION:  pmdie, postmaster.c:2813
Jun  8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-1] :[] DEBUG:  00000: logical replication launcher shutting down
Jun  8 17:09:38 <local0.debug> admn pg-bck[73549]: [10-2] :[] LOCATION:  ProcessInterrupts, postgres.c:2981
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [12-1] :[] LOG:  00000: background worker "logical replication launcher" (PID 73549) exited with exit code 1
Jun  8 17:09:38 <local0.info> admn pg-bck[73534]: [12-2] :[] LOCATION:  LogChildExit, postmaster.c:3657
Jun  8 17:09:38 <local0.info> admn pg-bck[73544]: [13-1] :[] LOG:  00000: shutting down
Jun  8 17:09:38 <local0.info> admn pg-bck[73544]: [13-2] :[] LOCATION:  ShutdownXLOG, xlog.c:8321
Jun  8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck invoked pg_wal/0000000100000017000000FE 0000000100000017000000FE
Jun  8 17:09:45 <local0.info> admn pg-bck[82223]: RedoLog.bck pg_wal/0000000100000017000000FE 0000000100000017000000FE returns 0
Jun  8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-1] :[] DEBUG:  00000: archived write-ahead log file "0000000100000017000000FE"
Jun  8 17:09:45 <local0.debug> admn pg-bck[73547]: [8-2] :[] LOCATION:  pgarch_archiveXlog, pgarch.c:675
Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [14-1] :[] LOG:  00000: checkpoint starting: shutdown immediate
Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [14-2] :[] LOCATION:  LogCheckpointStart, xlog.c:8362
Jun  8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-1] :[] DEBUG:  00000: performing replication slot checkpoint
Jun  8 17:09:55 <local0.debug> admn pg-bck[73544]: [15-2] :[] LOCATION:  CheckPointReplicationSlots, slot.c:1078
Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [16-1] :[] LOG:  00000: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=10.131 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB
Jun  8 17:09:55 <local0.info> admn pg-bck[73544]: [16-2] :[] LOCATION:  LogCheckpointEnd, xlog.c:8435
Jun  8 17:09:55 <local0.info> admn pg-bck[73534]: [13-1] :[] LOG:  00000: database system is shut down
Jun  8 17:09:55 <local0.info> admn pg-bck[73534]: [13-2] :[] LOCATION:  UnlinkLockFiles, miscinit.c:859

So, "FE" was copied out okay, and we have it in the staging area.

And here is the relevant stuff from the following startup - here, consequently, we are within the "FF" log:

Jun  8 17:15:38 <local0.info> admn pg-bck[6366]: [8-1] :[] LOG:  00000: database system was shut down at 2020-06-08 17:09:55 CEST
Jun  8 17:15:38 <local0.info> admn pg-bck[6366]: [8-2] :[] LOCATION:  StartupXLOG, xlog.c:6242
Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-1] :[] DEBUG:  00000: checkpoint record is at 17/FF000024
Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [9-2] :[] LOCATION:  StartupXLOG, xlog.c:6532
Jun  8 17:15:38 <local0.debug> admn pg-bck[6366]: [10-1] :[] DEBUG:  00000: redo record is at 17/FF000024; shutdown true

----------------------------------------

From the fast reboot @ 19:24, nothing at all is logged.
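(Side note on cross-checking: besides grepping syslog, the archiver's own bookkeeping can be consulted; segments handed to the archiver but not yet archived carry a .ready marker in pg_wal/archive_status, and successfully archived ones are renamed to .done. A generic sketch, using the port from the startup log below:)

# What the archiver believes it last archived, and its last failure, if any:
psql -p 5433 -At -c "SELECT last_archived_wal, last_archived_time, last_failed_wal, last_failed_time FROM pg_stat_archiver;"

# Pending segments show up as *.ready, archived ones as *.done:
ls -l data12/pg_wal/archive_status/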
At the following startup, we see that we are still within the "FF" log:

Jun  8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-1] :[] DEBUG:  00000: registering background worker "logical replication launcher"
Jun  8 19:28:24 <local0.debug> admn pg-bck[6465]: [1-2] :[] LOCATION:  RegisterBackgroundWorker, bgworker.c:854
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [2-1] :[] LOG:  00000: starting PostgreSQL 12.2 on i386-portbld-freebsd11.3, compiled by gcc9 (FreeBSD Ports Collection) 9.3.0, 32-bit
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [2-2] :[] LOCATION:  PostmasterMain, postmaster.c:997
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [3-1] :[] LOG:  00000: listening on IPv4 address "0.0.0.0", port 5433
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [3-2] :[] LOCATION:  StreamServerPort, pqcomm.c:590
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [4-1] :[] LOG:  00000: listening on Unix socket "/tmp/.s.PGSQL.5433"
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [4-2] :[] LOCATION:  StreamServerPort, pqcomm.c:585
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-1] :[] LOG:  00000: ending log output to stderr
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-2] :[] HINT:  Future log output will go to log destination "syslog".
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [5-3] :[] LOCATION:  PostmasterMain, postmaster.c:1297
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [6-1] :[] LOG:  XX000: could not send test message on socket for statistics collector: Permission denied
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [6-2] :[] LOCATION:  pgstat_init, pgstat.c:486
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [7-1] :[] LOG:  00000: trying another address for the statistics collector
Jun  8 19:28:24 <local0.info> admn pg-bck[6465]: [7-2] :[] LOCATION:  pgstat_init, pgstat.c:418
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [8-1] :[] LOG:  00000: database system was interrupted; last known up at 2020-06-08 17:25:38 CEST
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [8-2] :[] LOCATION:  StartupXLOG, xlog.c:6267
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-1] :[] DEBUG:  00000: checkpoint record is at 17/FF01BFE8
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [9-2] :[] LOCATION:  StartupXLOG, xlog.c:6532
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-1] :[] DEBUG:  00000: redo record is at 17/FF01BFB4; shutdown false
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [10-2] :[] LOCATION:  StartupXLOG, xlog.c:6609
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-1] :[] DEBUG:  00000: next transaction ID: 18955154; next OID: 145913
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [11-2] :[] LOCATION:  StartupXLOG, xlog.c:6613
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-1] :[] DEBUG:  00000: next MultiXactId: 1; next MultiXactOffset: 0
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [12-2] :[] LOCATION:  StartupXLOG, xlog.c:6617
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-1] :[] DEBUG:  00000: oldest unfrozen transaction ID: 479, in database 13777
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [13-2] :[] LOCATION:  StartupXLOG, xlog.c:6620
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-1] :[] DEBUG:  00000: oldest MultiXactId: 1, in database 1
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [14-2] :[] LOCATION:  StartupXLOG, xlog.c:6623
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-1] :[] DEBUG:  00000: commit timestamp Xid oldest/newest: 0/0
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [15-2] :[] LOCATION:  StartupXLOG, xlog.c:6626
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-1] :[] DEBUG:  00000: transaction ID wrap limit is 2147484126, limited by database with OID 13777
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [16-2] :[] LOCATION:  SetTransactionIdLimit, varsup.c:408
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-1] :[] DEBUG:  00000: MultiXactId wrap limit is 2147483648, limited by database with OID 1
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [17-2] :[] LOCATION:  SetMultiXactIdLimit, multixact.c:2267
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-1] :[] DEBUG:  00000: starting up replication slots
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [18-2] :[] LOCATION:  StartupReplicationSlots, slot.c:1114
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [19-1] :[] LOG:  00000: database system was not properly shut down; automatic recovery in progress
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [19-2] :[] LOCATION:  StartupXLOG, xlog.c:6764
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-1] :[] DEBUG:  00000: resetting unlogged relations: cleanup 1 init 0
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [20-2] :[] LOCATION:  ResetUnloggedRelations, reinit.c:55
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [21-1] :[] LOG:  00000: redo starts at 17/FF01BFB4
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [21-2] :[] LOCATION:  StartupXLOG, xlog.c:7035
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [22-1] :[] LOG:  00000: redo done at 17/FF01C098
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [22-2] :[] LOCATION:  StartupXLOG, xlog.c:7297
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-1] :[] DEBUG:  00000: resetting unlogged relations: cleanup 0 init 1
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [23-2] :[] LOCATION:  ResetUnloggedRelations, reinit.c:55
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [24-1] :[] LOG:  00000: checkpoint starting: end-of-recovery immediate
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [24-2] :[] LOCATION:  LogCheckpointStart, xlog.c:8362
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-1] :[] DEBUG:  00000: performing replication slot checkpoint
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [25-2] :[] LOCATION:  CheckPointReplicationSlots, slot.c:1078
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [26-1] :[] LOG:  00000: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16272 kB, estimate=16272 kB
Jun  8 19:28:24 <local0.info> admn pg-bck[6467]: [26-2] :[] LOCATION:  LogCheckpointEnd, xlog.c:8435
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-1] :[] DEBUG:  00000: MultiXactId wrap limit is 2147483648, limited by database with OID 1
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [27-2] :[] LOCATION:  SetMultiXactIdLimit, multixact.c:2267
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-1] :[] DEBUG:  00000: MultiXact member stop limit is now 4294914944 based on MultiXact 1
Jun  8 19:28:24 <local0.debug> admn pg-bck[6467]: [28-2] :[] LOCATION:  SetOffsetVacuumLimit, multixact.c:2630

There is no further information about anything concerning the logs, but at this point in time "FF" has disappeared. It was NOT copied out; we can see that the timestamp on pg_wal/archive_status is still at 17:09. Nothing except Postgres is supposed to write or delete anything within the data tree, and the RedoLog.bck script doesn't delete anything at all.

Another cute question would be: the "FE" log was successfully copied out at 17:09:45.
The checkpoint at 17:09:55 then says "1 removed" - this should be the "FE", because the one before was written seven hours earlier (see the "ls" above) and should long since be gone. But then the checkpoint at 19:28:24 again says "1 removed". What was removed there? It is unlikely to be the "FE", and the "FF" would be the one currently in use - and the archive_status directory has not been written to since 17:09. But the "FF" has disappeared.

So what is going on here?

cheerio,
PMc
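PS: For anyone wanting to cross-check which segment the cluster is currently writing into, and where the control file places the latest checkpoint and its REDO record, something along these lines should do (a generic sketch, not specific to this box):

# Current write position, mapped to a segment file name (server must be running):
psql -p 5433 -At -c "SELECT pg_walfile_name(pg_current_wal_lsn());"

# Checkpoint / REDO locations straight from the control file (works offline too):
pg_controldata data12 | egrep 'REDO|checkpoint location'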