Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs
From: hubert depesz lubaczewski
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Msg-id: Y25hQCKLnolQaFN9@depesz.com
In response to: Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (hubert depesz lubaczewski <depesz@depesz.com>)
List: pgsql-bugs
On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
> Hi,
> just an update. Due to procedures we will upgrade to 12.12 at the
> beginning of November. Then, I'll try to replicate the problem again,
> and we'll see how this will pan out.
> For next test will do my best to use debug2 log level.

So. I have an update. Out of the 8 servers that we prepared the upgrade for, one failed with the same/similar problem. Unfortunately, I don't know if this is the same server as before. Sorry.

Specifically, I now see this in the logs on the focal/pg14 side:

#v+
2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

On the source (bionic/pg12), I see:

#v+
=# select * from pg_replication_slots ;
 slot_name │  plugin  │ slot_type │ datoid │ database │ temporary │ active │ active_pid │  xmin  │ catalog_xmin │  restart_lsn   │ confirmed_flush_lsn
═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
 focal14   │ pgoutput │ logical   │  16607 │ canvas   │ f         │ f      │ [null]     │ [null] │   3241443528 │ 1039D/83825958 │ 1039D/96453F38
(1 row)
#v-

but Pg, like previously, lost some WAL files, this time 8 of them:

#v+
=$ sudo ls -l ~postgres/12/main/pg_wal | head
total 224403644
-rw------- 1 postgres postgres      349 Feb 11  2022 000000010000BDEF000000FF.00045F68.backup
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
...
#v-

This time we had debug2 enabled on both sides. It looks like we lost these segments: 000000010001039D00000083, 000000010001039D00000084, 000000010001039D00000085, 000000010001039D00000086, 000000010001039D00000087, 000000010001039D00000088, 000000010001039D00000089, and 000000010001039D0000008A.
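For reference, the segment the slot still needs can be derived from its restart_lsn; a minimal sketch of such a check, run on the source, using pg_walfile_name() for the LSN-to-filename mapping and the slot name shown above:

#v+
-- Sketch: map the slot's restart_lsn to the WAL segment file it points into,
-- so it can be compared with what is still present in pg_wal.
SELECT slot_name,
       restart_lsn,
       pg_walfile_name(restart_lsn) AS needed_wal_file
  FROM pg_replication_slots
 WHERE slot_name = 'focal14';
#v-

With restart_lsn at 1039D/83825958 this maps to 000000010001039D00000083, i.e. the first of the removed segments and the one the subscriber keeps asking for.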
It looks like the first time the replica logged the error was:

#v+
2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

In the Pg logs on the source I see:

#v+
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10 21:06:16 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10 21:06:21 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10 21:06:26 UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
#v-

So, the whole set of missing files was recycled at the same time.

One more thing that I have is a copy of pg_replication_slots info. This data is taken periodically, around every 1 minute. Each line, aside from the pg_replication_slots data, also has the output of pg_current_wal_lsn() and a timestamp.
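Each snapshot line corresponds roughly to a query like this (a sketch only; the actual collection script isn't included here):

#v+
-- Sketch of the periodic collection query, run on the source about once
-- per minute: a timestamp, the current WAL insert position, and the
-- full pg_replication_slots row for each slot.
SELECT now() AS "timestamp",
       pg_current_wal_lsn(),
       *
  FROM pg_replication_slots;
#v-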
The collected data looks like this:

#v+
timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
2022-11-10 20:50:04 UTC  103A6/C49D36F8      focal14    pgoutput  logical    16607   canvas    f          t       16794       \N    3241375739    1039D/65365F68  1039D/782908E8
2022-11-10 20:51:03 UTC  103A6/CC957148      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241375739    1039D/65365F68  1039D/782908E8
2022-11-10 20:52:02 UTC  103A6/D7348A30      focal14    pgoutput  logical    16607   canvas    f          t       21140       \N    3241422387    1039D/65365F68  1039D/78C94588
2022-11-10 20:53:01 UTC  103A6/EB24F408      focal14    pgoutput  logical    16607   canvas    f          t       21140       \N    3241422387    1039D/65365F68  1039D/7C8E7D78
2022-11-10 20:54:01 UTC  103A7/27A4778       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/65365F68  1039D/7C8E7D78
2022-11-10 20:55:00 UTC  103A7/1A635CD0      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/78833568  1039D/7F8BCD50
2022-11-10 20:55:59 UTC  103A7/2D5E6EE8      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/78833568  1039D/818B6990
2022-11-10 20:56:58 UTC  103A7/44548000      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/7E4CDF90  1039D/89E93620
2022-11-10 20:57:57 UTC  103A7/579F8000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/7E4CDF90  1039D/89E93620
2022-11-10 20:58:56 UTC  103A7/6B6C96D0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/7E4CDF90  1039D/89E93620
2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:09:46 UTC  103A8/233202E0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:10:45 UTC  103A8/2FBC9720      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
#v-

Now, I can keep this situation up for some time, if more debugging would be helpful, but I don't know for how long (disk space issues).

Is there anything in here that helps? Any other info I can dig out?

Best regards,

depesz