Thread: [BUGS] Standby corruption after master is restarted
One of our standby servers in a testing environment at InnoGames broke today. The master and the other standby server are working fine. I noticed that it happened right after the master had restarted. It looks to me like the WAL file is corrupted. I wanted to check whether it could be a bug. These were in the log file:

> 2017-08-24 12:19:51 UTC [20351-2] LOG: replication terminated by primary server
> 2017-08-24 12:19:51 UTC [20351-3] DETAIL: End of WAL reached on timeline 1 at 0/9B9F5FF0.
> 2017-08-24 12:19:51 UTC [20351-4] FATAL: could not send end-of-streaming message to primary: no COPY in progress
> 2017-08-24 12:19:51 UTC [26673-7] FATAL: invalid memory alloc request size 1298694144
> 2017-08-24 12:19:51 UTC [26672-3] LOG: startup process (PID 26673) exited with exit code 1
> 2017-08-24 12:19:51 UTC [26672-4] LOG: terminating any other active server processes

I tried to start it again to get more logs:

> 2017-08-24 14:07:46 UTC [28861-12698] LOCATION: KnownAssignedXidsRemovePreceding, procarray.c:3738
> 2017-08-24 14:07:46 UTC [28861-12699] DEBUG: 00000: transaction ID wrap limit is 2147484807, limited by database with OID 1
> 2017-08-24 14:07:46 UTC [28861-12700] CONTEXT: xlog redo at 0/9B9F5F80 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> 2017-08-24 14:07:46 UTC [28861-12701] LOCATION: SetTransactionIdLimit, varsup.c:347
> 2017-08-24 14:07:46 UTC [28861-12702] DEBUG: 00000: prune KnownAssignedXids to 5869245
> 2017-08-24 14:07:46 UTC [28861-12703] CONTEXT: xlog redo at 0/9B9F5F80 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> 2017-08-24 14:07:46 UTC [28861-12704] LOCATION: KnownAssignedXidsRemovePreceding, procarray.c:3738
> 2017-08-24 14:07:46 UTC [28861-12705] FATAL: XX000: invalid memory alloc request size 1298694144
> 2017-08-24 14:07:46 UTC [28861-12706] LOCATION: palloc_extended, mcxt.c:961

This is from pg_controldata:

> pg_control version number: 960
> Catalog version number: 201608131
> Database system identifier: 6423674571818176107
> Database cluster state: in archive recovery
> pg_control last modified: Thu 24 Aug 2017 02:01:42 PM UTC
> Latest checkpoint location: 0/9B9C1188
> Prior checkpoint location: 0/9B9C1188
> Latest checkpoint's REDO location: 0/9B9BEC60
> Latest checkpoint's REDO WAL file: 00000001000000000000009B
> Latest checkpoint's TimeLineID: 1
> Latest checkpoint's PrevTimeLineID: 1
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID: 0:5868410
> Latest checkpoint's NextOID: 28526
> Latest checkpoint's NextMultiXactId: 1
> Latest checkpoint's NextMultiOffset: 0
> Latest checkpoint's oldestXID: 1160
> Latest checkpoint's oldestXID's DB: 1
> Latest checkpoint's oldestActiveXID: 5868410
> Latest checkpoint's oldestMultiXid: 1
> Latest checkpoint's oldestMulti's DB: 1
> Latest checkpoint's oldestCommitTsXid: 0
> Latest checkpoint's newestCommitTsXid: 0
> Time of latest checkpoint: Thu 24 Aug 2017 12:02:56 PM UTC
> Fake LSN counter for unlogged rels: 0/1
> Minimum recovery ending location: 0/9B9E13C0
> Min recovery ending loc's timeline: 1
> Backup start location: 0/0
> Backup end location: 0/0
> End-of-backup record required: no
> wal_level setting: replica
> wal_log_hints setting: on
> max_connections setting: 400
> max_worker_processes setting: 8
> max_prepared_xacts setting: 0
> max_locks_per_xact setting: 64
> track_commit_timestamp setting: off
> Maximum data alignment: 8
> Database block size: 8192
> Blocks per segment of large relation: 131072
> WAL block size: 8192
> Bytes per WAL segment: 16777216
> Maximum length of identifiers: 64
> Maximum columns in an index: 32
> Maximum size of a TOAST chunk: 1996
> Size of a large-object chunk: 2048
> Date/time type storage: 64-bit integers
> Float4 argument passing: by value
> Float8 argument passing: by value
> Data page checksum version: 1

I tried pg_xlogdump and it failed on the spot:

> rmgr: XLOG        len (rec/tot): 106/ 106, tx: 0, lsn: 0/9B9F5F80, prev 0/9B9F5F48, desc: CHECKPOINT_SHUTDOWN redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> pg_xlogdump: FATAL: error in WAL record at 0/9B9F5F80: unexpected pageaddr 0/979F6000 in log segment 00000000000000000000009B, offset 10444800

This is from the master at the same place:

> rmgr: XLOG        len (rec/tot): 106/ 106, tx: 0, lsn: 0/9B9F5F80, prev 0/9B9F5F48, desc: CHECKPOINT_SHUTDOWN redo 0/9B9F5F80; tli 1; prev tli 1; fpw true; xid 0:5869245; oid 20756; multi 1; offset 0; oldest xid 1160 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> rmgr: XLOG        len (rec/tot): 49/ 2805, tx: 0, lsn: 0/9B9F5FF0, prev 0/9B9F5F80, desc: FPI_FOR_HINT, blkref #0: rel 1663/16396/16398 blk 0 FPW
> rmgr: Heap        len (rec/tot): 54/ 5046, tx: 5869245, lsn: 0/9B9F6B00, prev 0/9B9F5FF0, desc: INSERT off 54, blkref #0: rel 1663/16396/16427 blk 64203 FPW
> rmgr: Btree       len (rec/tot): 53/ 7961, tx: 5869245, lsn: 0/9B9F7EB8, prev 0/9B9F6B00, desc: INSERT_LEAF off 281, blkref #0: rel 1663/16396/16446 blk 21754 FPW
> rmgr: Transaction len (rec/tot): 34/ 34, tx: 5869245, lsn: 0/9B9F9DF0, prev 0/9B9F7EB8, desc: COMMIT 2017-08-24 12:19:59.560138 UTC

I still have the database available.
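For context on the FATAL above: PostgreSQL's regular allocator refuses any single request larger than MaxAllocSize (just under 1 GB), which is why a garbage record length surfaces as "invalid memory alloc request size". A standalone sketch mirroring the limit check from src/include/utils/memutils.h (the wrapper program itself is illustrative only, not PostgreSQL code):

    #include <stdio.h>
    #include <stddef.h>

    /* Same values as MaxAllocSize / AllocSizeIsValid() in memutils.h */
    #define MaxAllocSize ((size_t) 0x3fffffff)   /* 1 gigabyte - 1 */
    #define AllocSizeIsValid(size) ((size_t) (size) <= MaxAllocSize)

    int
    main(void)
    {
        size_t request = 1298694144;   /* value from the FATAL message above */

        /* palloc() and friends reject anything above MaxAllocSize */
        printf("MaxAllocSize: %zu\n", (size_t) MaxAllocSize);
        printf("request %zu is %s\n", request,
               AllocSizeIsValid(request) ? "acceptable"
                                         : "an invalid memory alloc request");
        return 0;
    }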
Hi,

On 2017-08-24 16:50:59 +0200, Emre Hasegeli wrote:
> One of our standby servers in a testing environment at InnoGames broke
> today. The master and the other standby server are working fine. I
> noticed that it happened right after the master had restarted. It
> looks to me like the WAL file is corrupted. I wanted to check whether
> it could be a bug. These were in the log file:

To clarify: Is fsync=off used anywhere, and was this an OS level / hardware level restart? And just on the master, not a standby?

Greetings,

Andres Freund
> To clarify: Is fsync=off used anywhere

No.

> and was this an OS level / hardware level restart?

It was only a Postgres restart.

> And just on the master, not a standby?

Yes, only the master.
We experienced this issue again, this time in production. The primary instance was in a loop of being killed by the Linux OOM killer and restarted at one-minute intervals. The corruption only happened on one of the two standbys; the primary and the other standby have no problems. Only the primary was killed and restarted, the standbys were not. There weren't any unusual settings, and "fsync" was not disabled. Here is the information I collected.

The logs at the time the standby broke:

> 2018-03-28 14:00:30 UTC [3693-67] LOG: invalid resource manager ID 39 at 1DFB/D43BE688
> 2018-03-28 14:00:30 UTC [25347-1] LOG: started streaming WAL from primary at 1DFB/D4000000 on timeline 5
> 2018-03-28 14:00:59 UTC [3748-357177] LOG: restartpoint starting: time
> 2018-03-28 14:01:23 UTC [25347-2] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
> 2018-03-28 14:01:24 UTC [3693-68] FATAL: invalid memory alloc request size 1916035072

And from the next try:

> 2018-03-28 14:02:15 UTC [26808-5] LOG: consistent recovery state reached at 1DFB/D6BDDFF8
> 2018-03-28 14:02:15 UTC [26808-6] FATAL: invalid memory alloc request size 191603507

The output of pg_controldata:

> pg_control version number: 942
> Catalog version number: 201510051
> Database system identifier: 6223303827582122631
> Database cluster state: in archive recovery
> pg_control last modified: 2018-03-28T15:25:51 UTC
> Latest checkpoint location: 1DFB/D1E17FD8
> Prior checkpoint location: 1DFB/D1E17FD8
> Latest checkpoint's REDO location: 1DFB/D1E17FD8
> Latest checkpoint's REDO WAL file: 0000000500001DFB000000D1
> Latest checkpoint's TimeLineID: 5
> Latest checkpoint's PrevTimeLineID: 5
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID: 2/3522301360
> Latest checkpoint's NextOID: 141900
> Latest checkpoint's NextMultiXactId: 3434
> Latest checkpoint's NextMultiOffset: 7985
> Latest checkpoint's oldestXID: 3329390243
> Latest checkpoint's oldestXID's DB: 16385
> Latest checkpoint's oldestActiveXID: 0
> Latest checkpoint's oldestMultiXid: 1
> Latest checkpoint's oldestMulti's DB: 16385
> Latest checkpoint's oldestCommitTsXid: 0
> Latest checkpoint's newestCommitTsXid: 0
> Time of latest checkpoint: 2018-03-28T13:55:52 UTC
> Fake LSN counter for unlogged rels: 0/1
> Minimum recovery ending location: 1DFB/D6BDDFF8
> Min recovery ending loc's timeline: 5
> Backup start location: 0/0
> Backup end location: 0/0
> End-of-backup record required: no
> wal_level setting: hot_standby
> wal_log_hints setting: on
> max_connections setting: 500
> max_worker_processes setting: 8
> max_prepared_xacts setting: 0
> max_locks_per_xact setting: 64
> track_commit_timestamp setting: off
> Maximum data alignment: 8
> Database block size: 8192
> Blocks per segment of large relation: 131072
> WAL block size: 8192
> Bytes per WAL segment: 16777216
> Maximum length of identifiers: 64
> Maximum columns in an index: 32
> Maximum size of a TOAST chunk: 1996
> Size of a large-object chunk: 2048
> Date/time type storage: 64-bit integers
> Float4 argument passing: by value
> Float8 argument passing: by value
> Data page checksum version: 0

Hexdump of the WAL file at the position on the standby:

0bddfc0 0002 0003 0004 0005 0006 0007 0008 0009
0bddfd0 000a 000b 000c 000d 000e 000f 0010 0011
0bddfe0 0012 0013 0014 0015 0016 0017 0000 0000
0bddff0 0000 0017 0000 0000 5f32 7234 4b55 7045
0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
0bde030 0765 7375 6111 646e 6f72 6469 370d 312e

Hexdump at the same position on the master:

0bddfc0 0002 0003 0004 0005 0006 0007 0008 0009
0bddfd0 000a 000b 000c 000d 000e 000f 0010 0011
0bddfe0 0012 0013 0014 0015 0016 0017 0000 0000
0bddff0 0000 0017 0000 0000 006a 0000 0000 0000
0bde000 d087 0005 0005 0000 e000 d6bd 1dfb 0000
0bde010 0062 0000 0000 0000 df90 d6bd 1dfb 0000
0bde020 0000 0000 63d3 aa61 50ff dff8 d6bd 1dfb
0bde030 0000 0005 0000 0005 0000 0001 0000 0002
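To tie the hexdump offsets above back to the LSNs reported by pg_controldata: with 16 MB WAL segments ("Bytes per WAL segment: 16777216" above), an LSN maps to a segment file name and a byte offset inside that file by simple division. A standalone sketch of that arithmetic for the "Minimum recovery ending location" 1DFB/D6BDDFF8 (plain C, no PostgreSQL headers; all constants are taken from the output above):

    #include <stdio.h>
    #include <stdint.h>

    #define WAL_SEG_SIZE    (16 * 1024 * 1024)              /* 16777216 */
    #define SEGS_PER_XLOGID (0x100000000ULL / WAL_SEG_SIZE) /* 256 */

    int
    main(void)
    {
        unsigned timeline = 5;
        uint64_t lsn = ((uint64_t) 0x1DFB << 32) | 0xD6BDDFF8; /* 1DFB/D6BDDFF8 */
        uint64_t segno  = lsn / WAL_SEG_SIZE;
        uint64_t offset = lsn % WAL_SEG_SIZE;

        /* Prints segment 0000000500001DFB000000D6 and offset 0xBDDFF8, which
         * is exactly where the standby and master hexdumps above start to
         * differ. */
        printf("segment file: %08X%08X%08X\n", timeline,
               (unsigned) (segno / SEGS_PER_XLOGID),
               (unsigned) (segno % SEGS_PER_XLOGID));
        printf("offset in segment: 0x%06llX (%llu)\n",
               (unsigned long long) offset, (unsigned long long) offset);
        return 0;
    }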
Hi Emre,

On 03/28/2018 07:50 PM, Emre Hasegeli wrote:
> We experienced this issue again, this time in production. The primary
> instance was in a loop of being killed by the Linux OOM killer and
> restarted at one-minute intervals. The corruption only happened on one
> of the two standbys; the primary and the other standby have no
> problems. Only the primary was killed and restarted, the standbys were
> not. There weren't any unusual settings, and "fsync" was not disabled.
> Here is the information I collected.
>

I've been trying to reproduce this by running a master with a couple of replicas, and randomly restarting the master (while pgbench is running). But so far no luck, so I guess something else is required to reproduce the behavior ...

> The logs at the time the standby broke:
>
>> 2018-03-28 14:00:30 UTC [3693-67] LOG: invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG: started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG: restartpoint starting: time
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
>> 2018-03-28 14:01:24 UTC [3693-68] FATAL: invalid memory alloc request size 1916035072
>
> And from the next try:
>
>> 2018-03-28 14:02:15 UTC [26808-5] LOG: consistent recovery state reached at 1DFB/D6BDDFF8
>> 2018-03-28 14:02:15 UTC [26808-6] FATAL: invalid memory alloc request size 191603507
>

In the initial report (from August 2017) you shared pg_xlogdump output showing that the corrupted WAL record is an FPI_FOR_HINT right after a CHECKPOINT_SHUTDOWN. Was it the same case this time?

BTW which versions are we talking about? I see the initial report mentioned catversion 201608131, this one mentions 201510051, so I'm guessing 9.6 and 9.5. Which minor versions?

Is the master under load (accepting writes) before the shutdown?

How was it restarted, actually? I see you're mentioning the OOM killer, so I guess "kill -9". What about the first report - was it the same case, or was it restarted "nicely" using pg_ctl?

Could the replica receive the WAL in some other way - say, from a WAL archive? What archive/restore commands do you use?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/14/2018 07:38 PM, Tomas Vondra wrote:
>
> How was it restarted, actually? I see you're mentioning the OOM killer,
> so I guess "kill -9". What about the first report - was it the same
> case, or was it restarted "nicely" using pg_ctl?
>

Also, which process gets killed by the OOM killer? A random backend, the walsender or some other process?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> In the initial report (from August 2017) you shared pg_xlogdump output
> showing that the corrupted WAL record is an FPI_FOR_HINT right after a
> CHECKPOINT_SHUTDOWN. Was it the same case this time?

No, this time the xlogdump on the master looks like this:

> rmgr: Heap2       len (rec/tot): 110/ 110, tx: 0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
> rmgr: Heap2       len (rec/tot): 100/ 100, tx: 0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
> rmgr: XLOG        len (rec/tot): 106/ 106, tx: 0, lsn: 1DFB/D6BDDFF8, prev 1DFB/D6BDDF90, desc: CHECKPOINT_SHUTDOWN redo 1DFB/D6BDDFF8; tli 5; prev tli 5; fpw true; xid 2/3522301709; oid 141900; multi 3434; offset 7985; oldest xid 3329390243 in DB 16385; oldest multi 1 in DB 16385; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
> rmgr: Heap        len (rec/tot): 56/ 2511, tx: 3522301709, lsn: 1DFB/D6BDE080, prev 1DFB/D6BDDFF8, desc: INSERT off 93, blkref #0: rel 1663/40796/40826 blk 0 FPW
> rmgr: Transaction len (rec/tot): 34/ 34, tx: 3522301709, lsn: 1DFB/D6BDEA50, prev 1DFB/D6BDE080, desc: COMMIT 2018-03-28 14:01:36.800811 UTC
> rmgr: Heap        len (rec/tot): 56/ 1257, tx: 3522301710, lsn: 1DFB/D6BDEA78, prev 1DFB/D6BDEA50, desc: INSERT off 41, blkref #0: rel 1663/40796/40820 blk 345119 FPW
> rmgr: Btree       len (rec/tot): 55/ 1857, tx: 3522301710, lsn: 1DFB/D6BDEF68, prev 1DFB/D6BDEA78, desc: INSERT_LEAF off 125, blkref #0: rel 1663/40796/40839 blk 107903 FPW
> rmgr: Transaction len (rec/tot): 34/ 34, tx: 3522301710, lsn: 1DFB/D6BDF6B0, prev 1DFB/D6BDEF68, desc: COMMIT 2018-03-28 14:01:37.361010 UTC

> BTW which versions are we talking about? I see the initial report
> mentioned catversion 201608131, this one mentions 201510051, so I'm
> guessing 9.6 and 9.5. Which minor versions?

The first incident was on 9.6.3, the second on 9.5.7.

> Is the master under load (accepting writes) before the shutdown?

Yes. Especially during the second incident it was being written to heavily. The master gets up to 10k INSERT transactions per second before being killed.

> How was it restarted, actually? I see you're mentioning the OOM killer,
> so I guess "kill -9". What about the first report - was it the same
> case, or was it restarted "nicely" using pg_ctl?

During the first incident the master was being restarted as part of an upgrade from 9.6.3 to 9.6.4. It should have been done nicely by the Debian upgrade scripts.

> Could the replica receive the WAL in some other way - say, from a WAL
> archive? What archive/restore commands do you use?

We have a repmgr + barman setup:

> archive_command = 'test $(psql -Atc "select failed_count from pg_stat_archiver") -gt 10 || rsync --quiet --contimeout=2 --timeout=2 %p BARMAN_HOST::barman/'
> restore_command = 'barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER %f %p'
> Also, which process gets killed by the OOM killer? A random backend,
> the walsender or some other process?

It was always a SELECT query that was killed. I assume it was because of the GIN index we have with a huge pending list. Postgres was "terminating other connections because of crash of another server process".
On 04/16/2018 10:55 AM, Emre Hasegeli wrote:
>> In the initial report (from August 2017) you shared pg_xlogdump output
>> showing that the corrupted WAL record is an FPI_FOR_HINT right after a
>> CHECKPOINT_SHUTDOWN. Was it the same case this time?
>
> No, this time the xlogdump on the master looks like this:
>
>> rmgr: Heap2       len (rec/tot): 110/ 110, tx: 0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
>> rmgr: Heap2       len (rec/tot): 100/ 100, tx: 0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
>> rmgr: XLOG        len (rec/tot): 106/ 106, tx: 0, lsn: 1DFB/D6BDDFF8, prev 1DFB/D6BDDF90, desc: CHECKPOINT_SHUTDOWN redo 1DFB/D6BDDFF8; tli 5; prev tli 5; fpw true; xid 2/3522301709; oid 141900; multi 3434; offset 7985; oldest xid 3329390243 in DB 16385; oldest multi 1 in DB 16385; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
>> rmgr: Heap        len (rec/tot): 56/ 2511, tx: 3522301709, lsn: 1DFB/D6BDE080, prev 1DFB/D6BDDFF8, desc: INSERT off 93, blkref #0: rel 1663/40796/40826 blk 0 FPW
>> rmgr: Transaction len (rec/tot): 34/ 34, tx: 3522301709, lsn: 1DFB/D6BDEA50, prev 1DFB/D6BDE080, desc: COMMIT 2018-03-28 14:01:36.800811 UTC
>> rmgr: Heap        len (rec/tot): 56/ 1257, tx: 3522301710, lsn: 1DFB/D6BDEA78, prev 1DFB/D6BDEA50, desc: INSERT off 41, blkref #0: rel 1663/40796/40820 blk 345119 FPW
>> rmgr: Btree       len (rec/tot): 55/ 1857, tx: 3522301710, lsn: 1DFB/D6BDEF68, prev 1DFB/D6BDEA78, desc: INSERT_LEAF off 125, blkref #0: rel 1663/40796/40839 blk 107903 FPW
>> rmgr: Transaction len (rec/tot): 34/ 34, tx: 3522301710, lsn: 1DFB/D6BDF6B0, prev 1DFB/D6BDEF68, desc: COMMIT 2018-03-28 14:01:37.361010 UTC
>

OK, and on the standby? I suppose it breaks right after the CHECKPOINT_SHUTDOWN record, right?

>> BTW which versions are we talking about? I see the initial report
>> mentioned catversion 201608131, this one mentions 201510051, so I'm
>> guessing 9.6 and 9.5. Which minor versions?
>
> The first incident was on 9.6.3, the second on 9.5.7.
>
>> Is the master under load (accepting writes) before the shutdown?
>
> Yes. Especially during the second incident it was being written to
> heavily. The master gets up to 10k INSERT transactions per second
> before being killed.
>
>> How was it restarted, actually? I see you're mentioning the OOM killer,
>> so I guess "kill -9". What about the first report - was it the same
>> case, or was it restarted "nicely" using pg_ctl?
>
> During the first incident the master was being restarted as part of an
> upgrade from 9.6.3 to 9.6.4. It should have been done nicely by the
> Debian upgrade scripts.
>

Hmmm. I'm not sure what exactly the Debian upgrade script does (and IIRC it depends on the packages). I wonder if it does fast/smart/immediate.

>> Could the replica receive the WAL in some other way - say, from a WAL
>> archive? What archive/restore commands do you use?
>
> We have a repmgr + barman setup:
>
>> archive_command = 'test $(psql -Atc "select failed_count from pg_stat_archiver") -gt 10 || rsync --quiet --contimeout=2 --timeout=2 %p BARMAN_HOST::barman/'
>> restore_command = 'barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER %f %p'

And can you determine how the broken WAL got to the standby? Was it by directly fetching from the primary, or from the barman archive? Is it broken in the archive too?

I think one hypothesis is that the standby starts fetching a WAL segment from the primary, and while doing that it gets recycled and overwritten (so the standby gets bogus data somewhere half-way through). That should not be possible, of course, but perhaps there's something broken right after a restart? A race condition or something like that?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> OK, and on the standby? I suppose it breaks right after the
> CHECKPOINT_SHUTDOWN record, right?

Actually pg_xlogdump fails right at the SHUTDOWN record:

> rmgr: Heap2       len (rec/tot): 58/ 58, tx: 0, lsn: 1DFB/D6BDDEA0, prev 1DFB/D6BDDE60, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331538
> rmgr: Heap2       len (rec/tot): 58/ 58, tx: 0, lsn: 1DFB/D6BDDEE0, prev 1DFB/D6BDDEA0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331539
> rmgr: Heap2       len (rec/tot): 110/ 110, tx: 0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
> rmgr: Heap2       len (rec/tot): 100/ 100, tx: 0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
> pg_xlogdump: FATAL: error in WAL record at 1DFB/D6BDDF90: unexpected pageaddr 1DFB/66BDE000 in log segment 0000000000001DFB000000D6, offset 12443648

> Hmmm. I'm not sure what exactly the Debian upgrade script does (and
> IIRC it depends on the packages). I wonder if it does
> fast/smart/immediate.

I couldn't find it quickly.

> And can you determine how the broken WAL got to the standby? Was it by
> directly fetching from the primary, or from the barman archive? Is it
> broken in the archive too?

It must have been streamed from the primary. The log file was not in the archive during the incident; the good log file arrived there later. Logs like these are repeated on the standby before the error:

> 2018-03-28 14:00:30 UTC [3693-67] LOG: invalid resource manager ID 39 at 1DFB/D43BE688
> 2018-03-28 14:00:30 UTC [25347-1] LOG: started streaming WAL from primary at 1DFB/D4000000 on timeline 5
> 2018-03-28 14:00:59 UTC [3748-357177] LOG: restartpoint starting: time
> WARNING: terminating connection because of crash of another server process
> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2018-03-28 14:01:23 UTC [25347-2] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected

> I think one hypothesis is that the standby starts fetching a WAL segment
> from the primary, and while doing that it gets recycled and overwritten
> (so the standby gets bogus data somewhere half-way through). That should
> not be possible, of course, but perhaps there's something broken right
> after a restart? A race condition or something like that?

If I haven't missed anything in the hexdumps I posted, the logs of the master and the standby are the same up to the position shown by pg_controldata. I cannot understand how it can fail on something later. Is it possible that this is triggered if the replication is interrupted at a certain point?
On 04/16/2018 04:55 PM, Emre Hasegeli wrote:
>> OK, and on the standby? I suppose it breaks right after the
>> CHECKPOINT_SHUTDOWN record, right?
>
> Actually pg_xlogdump fails right at the SHUTDOWN record:
>
>> rmgr: Heap2       len (rec/tot): 58/ 58, tx: 0, lsn: 1DFB/D6BDDEA0, prev 1DFB/D6BDDE60, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331538
>> rmgr: Heap2       len (rec/tot): 58/ 58, tx: 0, lsn: 1DFB/D6BDDEE0, prev 1DFB/D6BDDEA0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331539
>> rmgr: Heap2       len (rec/tot): 110/ 110, tx: 0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331540
>> rmgr: Heap2       len (rec/tot): 100/ 100, tx: 0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN remxid 0, blkref #0: rel 1663/16385/141744 blk 331541
>> pg_xlogdump: FATAL: error in WAL record at 1DFB/D6BDDF90: unexpected pageaddr 1DFB/66BDE000 in log segment 0000000000001DFB000000D6, offset 12443648
>
>> Hmmm. I'm not sure what exactly the Debian upgrade script does (and
>> IIRC it depends on the packages). I wonder if it does
>> fast/smart/immediate.
>
> I couldn't find it quickly.
>
>> And can you determine how the broken WAL got to the standby? Was it by
>> directly fetching from the primary, or from the barman archive? Is it
>> broken in the archive too?
>
> It must have been streamed from the primary. The log file was not in
> the archive during the incident; the good log file arrived there later.
> Logs like these are repeated on the standby before the error:
>

OK, so the standby connects to the primary, gets a (corrupted) WAL segment from it, which then gets archived? That would contradict the theory of a race condition between recycling the WAL segment and copying it to the standby, because we never recycle segments that have not been archived yet (obviously). But I'm not sure that's the exact sequence of events, of course.

Can you check if the "incorrect" part of the WAL segment matches some previous segment? Verifying that shouldn't be very difficult (just cut a bunch of bytes using hexdump, compare to the incorrect data). Assuming you still have the WAL archive, of course. That would tell us that the corrupted part comes from an old recycled segment.

>> 2018-03-28 14:00:30 UTC [3693-67] LOG: invalid resource manager ID 39 at 1DFB/D43BE688
>> 2018-03-28 14:00:30 UTC [25347-1] LOG: started streaming WAL from primary at 1DFB/D4000000 on timeline 5
>> 2018-03-28 14:00:59 UTC [3748-357177] LOG: restartpoint starting: time
>> WARNING: terminating connection because of crash of another server process
>> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> HINT: In a moment you should be able to reconnect to the database and repeat your command.
>> 2018-03-28 14:01:23 UTC [25347-2] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected

Hmmm, I see you're using SSL. I don't think that could affect anything, but maybe I should try mimicking this aspect too.

>> I think one hypothesis is that the standby starts fetching a WAL segment
>> from the primary, and while doing that it gets recycled and overwritten
>> (so the standby gets bogus data somewhere half-way through). That should
>> not be possible, of course, but perhaps there's something broken right
>> after a restart? A race condition or something like that?
>
> If I haven't missed anything in the hexdumps I posted, the logs of the
> master and the standby are the same up to the position shown by
> pg_controldata. I cannot understand how it can fail on something later.
> Is it possible that this is triggered if the replication is interrupted
> at a certain point?
>

Which position? The controldata shows the location of the last checkpoint, which says where replay should start. Or am I missing something?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> Can you check if the "incorrect" part of the WAL segment matches some
> previous segment? Verifying that shouldn't be very difficult (just cut a
> bunch of bytes using hexdump, compare to the incorrect data). Assuming
> you still have the WAL archive, of course. That would tell us that the
> corrupted part comes from an old recycled segment.

I had found and saved the recycled WAL file from the archive after the incident. Here is the hexdump of it at the same position:

0bddfc0 3253 4830 616f 5034 5243 4d79 664f 6164
0bddfd0 3967 592d 7963 7967 5541 4a59 3066 4f50
0bddfe0 2d55 346e 4254 3559 6a4e 726b 4e30 6f52
0bddff0 3876 4751 4a38 5956 5f32 7234 4b55 7045
0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
0bde030 0765 7375 6111 646e 6f72 6469 370d 312e

If you compare it with the other two I have posted, you will notice that the corrupted file on the standby is a combination of the two. The data in it starts with the data on the master and continues with the data of the recycled file. The switch is at position 0bddff8, which is the position printed as "Minimum recovery ending location" by pg_controldata.

> Hmmm, I see you're using SSL. I don't think that could affect
> anything, but maybe I should try mimicking this aspect too.

This is the connection information. The master shows SSL compression as disabled, despite it being explicitly requested:

> primary_conninfo = 'host=MASTER_NODE port=5432 dbname=repmgr user=repmgr connect_timeout=10 sslcompression=1'
On 04/17/2018 10:55 AM, Emre Hasegeli wrote:
>> Can you check if the "incorrect" part of the WAL segment matches some
>> previous segment? Verifying that shouldn't be very difficult (just cut a
>> bunch of bytes using hexdump, compare to the incorrect data). Assuming
>> you still have the WAL archive, of course. That would tell us that the
>> corrupted part comes from an old recycled segment.
>
> I had found and saved the recycled WAL file from the archive after the
> incident. Here is the hexdump of it at the same position:
>
> 0bddfc0 3253 4830 616f 5034 5243 4d79 664f 6164
> 0bddfd0 3967 592d 7963 7967 5541 4a59 3066 4f50
> 0bddfe0 2d55 346e 4254 3559 6a4e 726b 4e30 6f52
> 0bddff0 3876 4751 4a38 5956 5f32 7234 4b55 7045
> 0bde000 d087 0005 0005 0000 e000 66bd 1dfb 0000
> 0bde010 1931 0000 0000 0000 5a43 7746 7166 6e34
> 0bde020 304e 764e 9c32 0158 5400 e709 0900 6f66
> 0bde030 0765 7375 6111 646e 6f72 6469 370d 312e
>
> If you compare it with the other two I have posted, you will notice
> that the corrupted file on the standby is a combination of the two. The
> data in it starts with the data on the master and continues with the
> data of the recycled file. The switch is at position 0bddff8, which is
> the position printed as "Minimum recovery ending location" by
> pg_controldata.
>

OK, this seems to confirm the theory that there's a race condition between segment recycling and replicating. It's likely limited to a short period after a crash, otherwise we'd probably see many more reports. But it's still just a hunch - someone needs to read through the code and check how it behaves in these situations. Not sure when I'll have time for that.

>> Hmmm, I see you're using SSL. I don't think that could affect
>> anything, but maybe I should try mimicking this aspect too.
>
> This is the connection information. The master shows SSL compression as
> disabled, despite it being explicitly requested:
>
>> primary_conninfo = 'host=MASTER_NODE port=5432 dbname=repmgr user=repmgr connect_timeout=10 sslcompression=1'

Hmmm, that seems like a separate issue. When you say "the master shows SSL compression is disabled", where do you see that?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> OK, this seems to confirm the theory that there's a race condition between
> segment recycling and replicating. It's likely limited to a short period
> after a crash, otherwise we'd probably see many more reports.

I am suspicious about this part. It cannot have happened twice to us and never to anybody else. Maybe people do not report it because it is easy to deal with the problem: you just delete the corrupted WAL file. Or maybe there is something special in our infrastructure.

Thank you for your help so far.

>> This is the connection information. The master shows SSL compression as
>> disabled, despite it being explicitly requested:

> Hmmm, that seems like a separate issue. When you say "the master shows SSL
> compression is disabled", where do you see that?

I thought it might be related at first, so I mentioned it. Then I found out from the pg_stat_ssl view on the server that it is actually disabled. Probably the OpenSSL configuration is disabling it.
Hello. I found how this occurs.

# added -hackers

I've seen a similar case caused by inadequate operation, but this time I found that it can happen under normal operation.

At Tue, 17 Apr 2018 17:14:05 +0200, Emre Hasegeli <emre@hasegeli.com> wrote in <CAE2gYzwB0+MM3L6-zdK0JKO_aXqHNCKU7VAiDJk_kKRRJ4B=yA@mail.gmail.com>
> > OK, this seems to confirm the theory that there's a race condition between
> > segment recycling and replicating. It's likely limited to a short period
> > after a crash, otherwise we'd probably see many more reports.
>
> I am suspicious about this part. It cannot have happened twice to us
> and never to anybody else. Maybe people do not report it because it
> is easy to deal with the problem: you just delete the corrupted WAL
> file. Or maybe there is something special in our infrastructure.
>
> Thank you for your help so far.
>
> >> This is the connection information. The master shows SSL compression as
> >> disabled, despite it being explicitly requested:
>
> > Hmmm, that seems like a separate issue. When you say "the master shows SSL
> > compression is disabled", where do you see that?
>
> I thought it might be related at first, so I mentioned it. Then I found
> out from the pg_stat_ssl view on the server that it is actually
> disabled. Probably the OpenSSL configuration is disabling it.

A segment is not cleared on recycling. walreceiver writes WAL record by record, so the startup process can see an arbitrary byte sequence after the last valid record when the replication connection is lost or the standby is restarted. The following scenario results in a similar situation.

1. Create a master and a standby and run them. This happens easily if
   wal_keep_segments is set large (20 or so) on the master and 0 on the
   standby.

2. Write enough WAL that recycling happens on the standby. Explicit
   checkpoints on the standby make it faster. Several rounds may be
   required before we see a recycled segment on the standby.

   maybe_loop {
     master:
       create table t (a int);
       insert into t (select a from generate_series(0, 150000) a);
       delete from t;
       checkpoint;
     standby:
       checkpoint;
       <check for recycled segments in pg_wal>
   }

3. Stop the master.

4. The standby starts to complain that the master is missing. At this
   point the standby complains with several kinds of failure. I saw
   'invalid record length' and 'incorrect prev-link' this time. I saw
   'invalid resource manager ID' when mixing records of different sizes.
   If XLogReadRecord sees a record with an impossibly large tot_length
   there, it causes the palloc failure and the startup process dies.

5. If you see 'zero length record', it's nothing interesting. Repeat 3
   and 4 to see another failure.

This doesn't seem to happen on the master side, since XLogWrite writes in units of pages, so after a crash redo always sees a zero-length record, a wrong magic number, or a wrong pageaddr at a page boundary. On the other hand, walreceiver writes in units of records, so redo can see an arbitrary byte sequence as xl_tot_len, xl_prev and xl_rmid while fetching the next record. Especially the first of them can cause a FATAL "invalid memory alloc" error.

I think this behavior is a bug. XLogReadRecord is considering the case, but palloc_extended() breaks it. So in the attached patch I add a new flag MCXT_ALLOC_NO_PARAMERR to palloc_extended(), and allocate_recordbuf() calls it with the new flag. That alone fixes the problem. In addition, the patch frees the read state buffer when facing erroneous records, since such records can leave a too-large buffer allocated.

After applying this patch, the behavior in this situation changes as follows.
> LOG: started streaming WAL from primary at 0/83000000 on timeline 1
> LOG: record length 1298694144 at 0/83C17B70 too long
> FATAL: terminating walreceiver process due to administrator command
> LOG: record length 1298694144 at 0/83C17B70 too long
> LOG: record length 1298694144 at 0/83C17B70 too long
> LOG: received promote request
> LOG: redo done at 0/83C17B38
> LOG: last completed transaction was at log time 2018-04-26 19:10:12.360253+09
> LOG: selected new timeline ID: 2
> LOG: archive recovery complete
> LOG: database system is ready to accept connections

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a86f3497e..02c26224fc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -165,7 +165,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	if (state->readRecordBuf)
 		pfree(state->readRecordBuf);
 	state->readRecordBuf =
-		(char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
+		(char *) palloc_extended(newSize,
+								 MCXT_ALLOC_NO_OOM | MCXT_ALLOC_NO_PARAMERR);
 	if (state->readRecordBuf == NULL)
 	{
 		state->readRecordBufSize = 0;
@@ -601,10 +602,19 @@ err:
 
 /*
  * Invalidate the xlogreader's read state to force a re-read.
+ *
+ * read state might hold too large buffer on reading invalid record so release
+ * the buffer as well.
  */
 void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
+	if (state->readRecordBuf != NULL)
+	{
+		pfree(state->readRecordBuf);
+		state->readRecordBuf = NULL;
+		state->readRecordBufSize = 0;
+	}
 	state->readSegNo = 0;
 	state->readOff = 0;
 	state->readLen = 0;
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index ebe0342f18..6e7f95d11f 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -996,7 +996,12 @@ palloc_extended(Size size, int flags)
 
 	if (((flags & MCXT_ALLOC_HUGE) != 0 && !AllocHugeSizeIsValid(size)) ||
 		((flags & MCXT_ALLOC_HUGE) == 0 && !AllocSizeIsValid(size)))
+	{
+		if (flags & MCXT_ALLOC_NO_PARAMERR)
+			return NULL;
+
 		elog(ERROR, "invalid memory alloc request size %zu", size);
+	}
 
 	context->isReset = false;
 
diff --git a/src/include/common/fe_memutils.h b/src/include/common/fe_memutils.h
index 458743dd40..5859ec087f 100644
--- a/src/include/common/fe_memutils.h
+++ b/src/include/common/fe_memutils.h
@@ -16,7 +16,8 @@
 #define MCXT_ALLOC_HUGE			0x01	/* allow huge allocation (> 1 GB) not
 										 * actually used for frontends */
 #define MCXT_ALLOC_NO_OOM		0x02	/* no failure if out-of-memory */
-#define MCXT_ALLOC_ZERO			0x04	/* zero allocated memory */
+#define MCXT_ALLOC_NO_PARAMERR	0x04	/* no failure on parameter error */
+#define MCXT_ALLOC_ZERO			0x08	/* zero allocated memory */
 
 /*
  * "Safe" memory allocation functions --- these exit(1) on failure
diff --git a/src/include/utils/palloc.h b/src/include/utils/palloc.h
index 781e948f69..eb1ecc9ee6 100644
--- a/src/include/utils/palloc.h
+++ b/src/include/utils/palloc.h
@@ -63,7 +63,8 @@ extern PGDLLIMPORT MemoryContext CurrentMemoryContext;
  */
 #define MCXT_ALLOC_HUGE			0x01	/* allow huge allocation (> 1 GB) */
 #define MCXT_ALLOC_NO_OOM		0x02	/* no failure if out-of-memory */
-#define MCXT_ALLOC_ZERO			0x04	/* zero allocated memory */
+#define MCXT_ALLOC_NO_PARAMERR	0x04	/* no failure on parameter error */
+#define MCXT_ALLOC_ZERO			0x08	/* zero allocated memory */
 
 /*
  * Fundamental memory-allocation operations (more are in utils/memutils.h)
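The mechanism described above can be cross-checked against the hexdumps posted earlier for the second incident: at the minimum-recovery LSN the standby's segment contains leftover ASCII from the recycled file, and reading its first four bytes as xl_tot_len, then rounding up to the WAL block size the way allocate_recordbuf() does, reproduces the reported allocation request exactly. A standalone sketch of that arithmetic (assuming little-endian byte order and XLOG_BLCKSZ = 8192, matching "WAL block size: 8192" in pg_controldata; illustrative code, not PostgreSQL source):

    #include <stdio.h>
    #include <stdint.h>

    #define XLOG_BLCKSZ 8192    /* "WAL block size: 8192" */

    int
    main(void)
    {
        /* Bytes at offset 0x0bddff8 of the corrupted standby segment:
         * hexdump words "5f32 7234" = bytes 32 5f 34 72 (ASCII "2_4r"),
         * i.e. recycled data interpreted as the next record's xl_tot_len. */
        uint32_t xl_tot_len = 0x72345f32;

        /* allocate_recordbuf() rounds the request up to a multiple of
         * XLOG_BLCKSZ before calling palloc_extended(). */
        uint32_t newSize = xl_tot_len;
        newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);

        printf("xl_tot_len = %u\n", (unsigned) xl_tot_len); /* 1916034866 */
        printf("rounded    = %u\n", (unsigned) newSize);    /* 1916035072, the FATAL value */
        return 0;
    }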
On Thu, Apr 26, 2018 at 07:53:04PM +0900, Kyotaro HORIGUCHI wrote:
> I think this behavior is a bug. XLogReadRecord is considering the
> case, but palloc_extended() breaks it. So in the attached patch I add
> a new flag MCXT_ALLOC_NO_PARAMERR to palloc_extended(), and
> allocate_recordbuf() calls it with the new flag. That alone fixes
> the problem. In addition, the patch frees the read state buffer when
> facing erroneous records, since such records can leave a too-large
> buffer allocated.

This problem is already discussed here:
https://commitfest.postgresql.org/18/1516/

And here is the thread:
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05

Tsunakawa-san and I discussed a couple of approaches. Extending palloc_extended() so that an incorrect length does not result in an error is also something that crossed my mind, but the length handling is different on the backend and the frontend, so I discarded the idea you are proposing here and instead relied on a check with AllocSizeIsValid, which gives a simpler patch:
https://www.postgresql.org/message-id/20180314052753.GA16179%40paquier.xyz

This has gotten no interest from committers yet, unfortunately, but I think that this is a real problem which should be back-patched :(
--
Michael
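For readers following along, the alternative Michael describes would presumably look something like the sketch below: validate the computed buffer size in allocate_recordbuf() itself and treat an implausible length as an invalid record, instead of teaching palloc_extended() a new flag. This is only an assumed shape based on the description above, not the actual patch from the linked thread; the surrounding function body follows the context visible in the diff earlier in this thread.

    /* Sketch only: refuse an implausibly large request before allocating, so
     * a bogus xl_tot_len read from a recycled segment is reported as an
     * invalid record instead of killing the startup process with a FATAL. */
    static bool
    allocate_recordbuf(XLogReaderState *state, uint32 reclength)
    {
        uint32      newSize = reclength;

        newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
        newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

    #ifndef FRONTEND
        /* In the backend, reject anything the allocator would refuse anyway. */
        if (!AllocSizeIsValid(newSize))
            return false;
    #endif

        if (state->readRecordBuf)
            pfree(state->readRecordBuf);
        state->readRecordBuf =
            (char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
        if (state->readRecordBuf == NULL)
        {
            state->readRecordBufSize = 0;
            return false;
        }

        state->readRecordBufSize = newSize;
        return true;
    }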
Thank you for pointing that out to me. Is there any way to know how a bug report has been concluded? Or should I search -hackers for a corresponding thread?

At Thu, 26 Apr 2018 21:13:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180426121348.GA2365@paquier.xyz>
> On Thu, Apr 26, 2018 at 07:53:04PM +0900, Kyotaro HORIGUCHI wrote:
> > I think this behavior is a bug. XLogReadRecord is considering the
> > case, but palloc_extended() breaks it. So in the attached patch I add
> > a new flag MCXT_ALLOC_NO_PARAMERR to palloc_extended(), and
> > allocate_recordbuf() calls it with the new flag. That alone fixes
> > the problem. In addition, the patch frees the read state buffer when
> > facing erroneous records, since such records can leave a too-large
> > buffer allocated.
>
> This problem is already discussed here:
> https://commitfest.postgresql.org/18/1516/
>
> And here is the thread:
> https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
>
> Tsunakawa-san and I discussed a couple of approaches. Extending
> palloc_extended() so that an incorrect length does not result in an
> error is also something that crossed my mind, but the length handling
> is different on the backend and the frontend, so I discarded the idea
> you are proposing here and instead relied on a check with
> AllocSizeIsValid, which gives a simpler patch:
> https://www.postgresql.org/message-id/20180314052753.GA16179%40paquier.xyz

Yeah, perhaps all of the approaches in that thread came to my mind, but I chose a different one. I'm fine with the approach in that thread.

> This has gotten no interest from committers yet, unfortunately, but I
> think that this is a real problem which should be back-patched :(

Several other WAL-related fixes are also waiting to be picked up..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Apr 27, 2018 at 09:49:08AM +0900, Kyotaro HORIGUCHI wrote:
> Thank you for pointing that out to me. Is there any way to know how a
> bug report has been concluded? Or should I search -hackers for
> a corresponding thread?

Keeping an eye on the list of patches for bugs in the CF app and on the list of open items is what I do now. For this particular issue my memory has simply served me well, as it is hard to tell that both are the same issue just by looking at the titles. Good thing I looked at your patch as well.

> At Thu, 26 Apr 2018 21:13:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180426121348.GA2365@paquier.xyz>
>> On Thu, Apr 26, 2018 at 07:53:04PM +0900, Kyotaro HORIGUCHI wrote:
>>> I think this behavior is a bug. XLogReadRecord is considering the
>>> case, but palloc_extended() breaks it. So in the attached patch I add
>>> a new flag MCXT_ALLOC_NO_PARAMERR to palloc_extended(), and
>>> allocate_recordbuf() calls it with the new flag. That alone fixes
>>> the problem. In addition, the patch frees the read state buffer when
>>> facing erroneous records, since such records can leave a too-large
>>> buffer allocated.
>>
>> This problem is already discussed here:
>> https://commitfest.postgresql.org/18/1516/
>>
>> And here is the thread:
>> https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
>>
>> Tsunakawa-san and I discussed a couple of approaches. Extending
>> palloc_extended() so that an incorrect length does not result in an
>> error is also something that crossed my mind, but the length handling
>> is different on the backend and the frontend, so I discarded the idea
>> you are proposing here and instead relied on a check with
>> AllocSizeIsValid, which gives a simpler patch:
>> https://www.postgresql.org/message-id/20180314052753.GA16179%40paquier.xyz
>
> Yeah, perhaps all of the approaches in that thread came to my mind, but
> I chose a different one. I'm fine with the approach in that thread.

Okay, cool.

>> This has gotten no interest from committers yet, unfortunately, but I
>> think that this is a real problem which should be back-patched :(
>
> Several other WAL-related fixes are also waiting to be picked up..

Yeah, simply ignoring corrupted 2PC files at redo is no fun, as is breaking the promise of replication slots. Let's just make sure that everything is properly tracked and listed; that's the least we can do.
--
Michael