Thread: WAL segments removed from primary despite the fact that logical replication slot needs it.
WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
Hi,

We are still in the process of preparing upgrades from Ubuntu Bionic with Pg 12 to Focal with Pg 14.

And today (well, yesterday) we hit something that looks like a bug.

Jumping to the problem - WAL files needed by logical replication were removed from the primary, despite the slot being there.

Now, a bit more details.

Our replication is set up using the following method:

1. We create all tables and primary keys on pg14
2. a replication slot is created, with no tables
3. we add some tables to the replication slot, and wait for all tables to switch state to "ready"
4. rinse and repeat until all tables are in the replication set
5. once all is ready, we add the rest of the schema (indexes, constraints, functions, fkeys, and so on).

This works. Worked?

Yesterday we saw something that broke the replication.

First, data from the bionic side:

when | currentLSN | slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743431068 | CA09/BE23BC70 | CA09/C82B2A58
2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743451605 | CA09/C9DA89B0 | CA09/D3FA9898
2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743479469 | CA09/DD32CE50 | CA09/E65E7F80
2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743516374 | CA09/EDD61CB8 | CA09/F3A856D8
2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743635890 | CA0A/2157B8F8 | CA0A/2A4C48F8
2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743686305 | CA0A/38BE5AB0 | CA0A/3D2B0C20
2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 29207 | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:34:30 UTC | CA10/3B78E10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30

As you can see, around 2:20 pm replication stopped fetching data.

At this time, the pg14 logs don't show any problems whatsoever. pg_restore was doing its work with create indexes, and some queries like:

ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;

At 2:20:48 pm there was:

2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0

But these errors happened earlier too (minutes/hours earlier).

But then, 3 minutes later I got:

2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0

The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.

Unfortunately I don't know the content of pg_wal on bionic at the moment, but currently I see:

$ ls -l pg_wal/ | head
total 297534552
-rw------- 1 postgres postgres      349 Feb 11  2022 000000010000A1410000001B.00C81918.backup
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004D
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004E
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004F
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000050
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000051
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000052
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000053
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000054

So, 4 files are missing.

These were archived properly, and I tried to restore them from the archive and put them in pg_wal, but even then pg12 was rejecting attempts to connect to the focal14 slot with the same message about "requested WAL segment 000000010000CA0A00000049 has already been removed".

I know that at this moment we're hosed. This replication was removed, everything cleared (after all, we had already accumulated 300G of WAL, and we can't extend the FS forever, so we needed to clean up the situation).

But, the way I look at it, Pg12 shouldn't be removing these 4 WAL files, and the fact that it did means that either we did something wrong (what, though?), or there is some kind of bug in Pg.

Anyone have any idea on how to investigate/fix that?

Best regards,

depesz
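(For reference: the per-minute slot data shown above can be collected with a query along the lines of the sketch below, run on the primary. The exact script used in this setup isn't shown in the thread, so the column list is only an assumption based on the table's header.)

#v+
-- Minimal monitoring sketch: sample the slot state together with the
-- current WAL write position and a timestamp, e.g. once per minute.
SELECT now() AS "when",
       pg_current_wal_lsn() AS "currentLSN",
       s.slot_name, s.plugin, s.slot_type, s.datoid, s.database,
       s.temporary, s.active, s.active_pid, s.xmin, s.catalog_xmin,
       s.restart_lsn, s.confirmed_flush_lsn
FROM pg_replication_slots s;
#v-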
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > Hi, > We are still in process of preparing upgrades from Ubuntu Bionic with Pg > 12 to Focal with Pg 14. > > And today (well, yesterday) we hit something that looks like a bug. > > Jumping to problem - wal files needed by logical replication were > removed from primary, despite slot being there. > > Now, a bit more details. > > Our replication is setup using following method: > > 1. We create all tables and primary keys on pg14 > 2. replication slot is created, with no tables > 3. we add some tables to replication slot, and wait for all tables to > switch state to "ready" > 4. rinse and repeat until all tables are in replication set > 5. once all is ready, we add rest of schema (indexes, constraints, > functions, fkeys, and so on). > > This works. Worked? > > Yesterday we saw something that broke the replication. > > First data from bionic side: > when | currentLSN | slot_name │ plugin │ slot_type │ datoid │ database │ temporary │ active │ active_pid│ xmin │ catalog_xmin │ restart_lsn │ confirmed_flush_lsn > 2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743431068 | CA09/BE23BC70 | CA09/C82B2A58 > 2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743451605 | CA09/C9DA89B0 | CA09/D3FA9898 > 2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743479469 | CA09/DD32CE50 | CA09/E65E7F80 > 2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743516374 | CA09/EDD61CB8 | CA09/F3A856D8 > 2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743635890 | CA0A/2157B8F8 | CA0A/2A4C48F8 > 2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743686305 | CA0A/38BE5AB0 | CA0A/3D2B0C20 > 2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 6084 | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048 > 2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743749741 | CA0A/49ED91C8 | CA0A/4F6D3048 > 2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14 | pgoutput | logical | 16609 | dbname | f | t | 29207 | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | 
CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > 2022-10-04 14:34:30 UTC | CA10/3B78E10 | focal14 | pgoutput | logical | 16609 | dbname | f | f | \N | \N | 3743760874 | CA0A/49ED91C8 | CA0A/55987F30 > > As you can see around 2:20pm replication stopped fetching data. > Around that time, it seems the slot is also marked inactive probably due to walsender exit. > At this time, in pg14 logs don't show any problems what so ever. > Pg_restore was doing it's work with create indexes, and some queries > like: > ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity; > > At 2:20:48pm there was: > 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could notreceive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0 > > But these errors happened earlier too (minutes/hours earlier). > > But then, 3 minutes later I got: > > 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive datafrom WAL stream: SSL SYSCALL error: EOF detected > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request.",,,,,,,,,"","logical replication worker",,0 > 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replicationapply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0 > 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logicalreplication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0 > 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive datafrom WAL stream: ERROR: requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logicalreplication worker",,0 > 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logicalreplication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0 > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile on pg12/bionic. > This is quite strange and I am not able to see the reason why this can happen. The only way to debug this problem that comes to mind is to add more LOGS around the code that removes the WAL files. For example, we can try to print the value of minimumslotLSN (keep) and logSegNo in KeepLogSeg(). 
> Unfortunately I don't know the content of pg_wal on bionic at the > moment, but currently I see: > > $ ls -l pg_wal/| head > total 297534552 > -rw------- 1 postgres postgres 349 Feb 11 2022 000000010000A1410000001B.00C81918.backup > -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004D > -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004E > -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A0000004F > -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000050 > -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000051 > -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000052 > -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000053 > -rw------- 1 postgres postgres 16777216 Oct 4 13:29 000000010000CA0A00000054 > > So, 4 files are missing. > > These were archived properly, and I tried to restore them from archive, and put > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14 > slot with the same message about "requested WAL segment > 000000010000CA0A00000049 has already been removed" > I think you are seeing this behavior because we update the lastRemovedSegNo before removing files in RemoveOldXlogFiles() and then we use that to give the error you are seeing. > I know that at this moment we're hosed. This replication was removed, > everythbing cleared (after all, we accumulated already 300G of wal, and > we can't extend the FS forever, so we needed to cleanup the situation). > > But, the way I look at it, Pg12 shouldn't be removing these 4 wal files, > and the fact that it did means that either we did something wrong (what, > though?), or there is some kind of bug in Pg. > > Anyone has any idea on how to investigate/fix that? > As suggested above, I feel we need to add some LOGS to see what is going on unless you have a reproducible testcase. -- With Regards, Amit Kapila.
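(A side note on reading the numbers above: the segment file a slot's restart_lsn pins can be computed with pg_walfile_name(), which makes it easy to see that the file named in the error is exactly the one the slot still claims to need. Illustrative query only, assuming the default 16 MB segment size:)

#v+
-- restart_lsn CA0A/49ED91C8 maps to segment 000000010000CA0A00000049,
-- i.e. the very file reported as "already been removed".
SELECT slot_name,
       restart_lsn,
       pg_walfile_name(restart_lsn) AS oldest_needed_segment
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL;
#v-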
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Kyotaro Horiguchi
Date:
At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski > <depesz@depesz.com> wrote: > > At 2:20:48pm there was: > > 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could notreceive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0 > > > > But these errors happened earlier too (minutes/hours earlier). > > > > But then, 3 minutes later I got: > > > > 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive datafrom WAL stream: SSL SYSCALL error: EOF detected > > server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request.",,,,,,,,,"","logical replication worker",,0 > > 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replicationapply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0 > > 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logicalreplication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0 > > 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive datafrom WAL stream: ERROR: requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logicalreplication worker",,0 > > 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logicalreplication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0 > > > > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibile onpg12/bionic. > > > > This is quite strange and I am not able to see the reason why this can > happen. The only way to debug this problem that comes to mind is to > add more LOGS around the code that removes the WAL files. For example, > we can try to print the value of minimumslotLSN (keep) and logSegNo in > KeepLogSeg(). What I can see from the information at hand is that walsender is killed (not died) twice for some reasons then some "reserved" WAL files are lost. I don't see the reason for the second walsender (pid 29207) being killed but if the "has already been removed" log line above is the first one, the file removal is probably happend AT 14:23:10. (log is needed to know that.) Anyway, I didn't come up with a possibility other than of someone else having removed those files.. For example, virus scan softwares. If any such software kills processes that is grabbing the files that the software identified as malicious and is going to remove, that could be happen? > > So, 4 files are missing. > > > > These were archived properly, and I tried to restore them from archive, and put > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14 > > slot with the same message about "requested WAL segment > > 000000010000CA0A00000049 has already been removed" > > > > I think you are seeing this behavior because we update the > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and > then we use that to give the error you are seeing. lastRemovedSegNo is updated once per one segment of removal. Four files are lost in this case. 
> > I know that at this moment we're hosed. This replication was removed, > > everythbing cleared (after all, we accumulated already 300G of wal, and > > we can't extend the FS forever, so we needed to cleanup the situation). > > > > But, the way I look at it, Pg12 shouldn't be removing these 4 wal files, > > and the fact that it did means that either we did something wrong (what, > > though?), or there is some kind of bug in Pg. > > > > Anyone has any idea on how to investigate/fix that? > > > > As suggested above, I feel we need to add some LOGS to see what is > going on unless you have a reproducible testcase. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> This is quite strange and I am not able to see the reason why this can
> happen. The only way to debug this problem that comes to mind is to
> add more LOGS around the code that removes the WAL files. For example,
> we can try to print the value of minimumslotLSN (keep) and logSegNo in
> KeepLogSeg().

That would require changing pg sources, I think, recompiling, and retrying? And on the pg12 side?
I don't think we can do it, but I'll ask the boss.

> > Anyone have any idea on how to investigate/fix that?
> As suggested above, I feel we need to add some LOGS to see what is
> going on unless you have a reproducible testcase.

The problem is that we were trying to use this approach to upgrade the pg12 production db to pg14. Modifying it (pg12) will be tricky, as it would require a restart. Unless there is a way to add extra logging without a restart...

depesz
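(On the "extra logging without restart" question: log_min_messages does not require a server restart, only a reload, so additional logging can be enabled on the running pg12 primary without downtime. A sketch, assuming superuser access; note that cluster-wide debug2 output is very verbose:)

#v+
-- Raise logging to DEBUG2 without restarting:
ALTER SYSTEM SET log_min_messages = 'debug2';
SELECT pg_reload_conf();

-- ...and revert once the interesting window has been captured:
ALTER SYSTEM RESET log_min_messages;
SELECT pg_reload_conf();
#v-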
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Mon, Oct 17, 2022 at 03:53:07PM +0900, Kyotaro Horiguchi wrote:
> Anyway, I didn't come up with a possibility other than of someone else
> having removed those files.. For example, virus scan softwares. If
> any such software kills processes that is grabbing the files that the
> software identified as malicious and is going to remove, that could be
> happen?

We don't use virus scanning there. The pg12 server is on Ubuntu 18.04, used for the normal production workload, and pg14 is on Ubuntu Focal, with no load aside from physical streaming replication to 2 pg14 replicas.

Best regards,

depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski > > > So, 4 files are missing. > > > > > > These were archived properly, and I tried to restore them from archive, and put > > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14 > > > slot with the same message about "requested WAL segment > > > 000000010000CA0A00000049 has already been removed" > > > > > > > I think you are seeing this behavior because we update the > > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and > > then we use that to give the error you are seeing. > > lastRemovedSegNo is updated once per one segment of removal. Four > files are lost in this case. > I didn't understand your response. I was saying the one possible reason why even after restoring files from the archive the error appears is because of the lastRemovedSegNo related check in function CheckXLogRemoved() and we update its value while removing old xlog files. From this behavior, it appears that somehow the server has only removed those files even though the reason is not clear yet. -- With Regards, Amit Kapila.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote: > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibileon pg12/bionic. > > This is quite strange and I am not able to see the reason why this can > > happen. The only way to debug this problem that comes to mind is to > > add more LOGS around the code that removes the WAL files. For example, > > we can try to print the value of minimumslotLSN (keep) and logSegNo in > > KeepLogSeg(). > > That would require changing pg sources, I think, recompiling, and > retrying? > Yes. BTW, are you on the latest release of pg12, if not then you can once check the release notes to see if there is any related bug fix in the later releases? Is this problem reproducible? If so, can you find out why there are multiple time connection issues between walsender and walreceiver? -- With Regards, Amit Kapila.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Kyotaro Horiguchi
Date:
At Tue, 18 Oct 2022 16:51:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
> > > > So, 4 files are missing.
> > > >
> > > > These were archived properly, and I tried to restore them from archive, and put
> > > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> > > > slot with the same message about "requested WAL segment
> > > > 000000010000CA0A00000049 has already been removed"
> > > >
> > >
> > > I think you are seeing this behavior because we update the
> > > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
> > > then we use that to give the error you are seeing.
> >
> > lastRemovedSegNo is updated once per one segment of removal. Four
> > files are lost in this case.
> >
>
> I didn't understand your response. I was saying the one possible
> reason why even after restoring files from the archive the error
> appears is because of the lastRemovedSegNo related check in function
> CheckXLogRemoved() and we update its value while removing old xlog
> files. From this behavior, it appears that somehow the server has only
> removed those files even though the reason is not clear yet.

I meant that if PostgreSQL did something wrong there (something that I don't understand at all), the difference between lastRemovedSegNo and the last actually removed segment won't be more than 1. CheckXLogRemoved warns about a segment that is logically already removed but could still be physically opened. On the other hand, WalSndSegmentOpen emits the same error earlier than CheckXLogRemoved warns if the segment is actually ENOENT, regardless of the value of lastRemovedSegNo.

One point that bothers me is that the walsender seems to have been killed. If the file is removed while the walsender is working, the logical replication worker receives the error and emits "ERROR: could not receive data...has been removed" instead of being suddenly disconnected as in this case. Considering this together with the possibility that the segments were removed by someone else, I suspected virus scanners, but that has turned out to be false.. (I still don't know of any virus scanner that kills processes grabbing a suspected-malicious file.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
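(While trying to reproduce: the walsender disconnections discussed above can also be watched from the primary side. This is just an illustrative polling query, not something from the original setup; a row disappearing or its pid changing marks a walsender exit/reconnect that can then be lined up with the slot's active/active_pid history:)

#v+
-- Poll walsender state on the primary (pg12).
SELECT now() AS sampled_at,
       pid, application_name, state,
       sent_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
#v-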
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Wed, Oct 19, 2022 at 7:31 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > One point that bothers me is it seems that the walsender was > killed. If the file is removed while the walsender is working, logical > replication worker receives the error and emits "ERROR: could not > receive data...has been removed" instead of suddenly disconnected as > in this case. Considering this together with the possibility that the > segments were removed by someone else, I suspencted virus scanners but > that has found to be false.. > Yeah, as requested yesterday, maybe checking if this can be reproduced will help us to narrow down the cause. If this doesn't happen due to an application mistake it appears that after the slot becomes inactive during the time walsender was down, the server removes the required files but there is nothing in the code that suggests that such a thing can happen. Are there any more logs on publisher/subscriber which indicate any sort of error? -- With Regards, Amit Kapila.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Wed, Oct 19, 2022 at 03:16:59PM +0530, Amit Kapila wrote:
> Yeah, as requested yesterday, maybe checking if this can be reproduced
> will help us to narrow down the cause. If this doesn't happen due to
> an application mistake it appears that after the slot becomes inactive
> during the time walsender was down, the server removes the required
> files but there is nothing in the code that suggests that such a thing
> can happen.
> Are there any more logs on publisher/subscriber which indicate any
> sort of error?

This specific setup of replication was torn down; we couldn't keep it running. I can set it up once more, and with some luck it will break again. It will take a day, though.

Is there anything I can do to make it more debuggable, but that will not require a restart of pg12?

Best regards,

depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote: > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski > <depesz@depesz.com> wrote: > > > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote: > > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibileon pg12/bionic. > > > This is quite strange and I am not able to see the reason why this can > > > happen. The only way to debug this problem that comes to mind is to > > > add more LOGS around the code that removes the WAL files. For example, > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in > > > KeepLogSeg(). > > > > That would require changing pg sources, I think, recompiling, and > > retrying? > > > > Yes. BTW, are you on the latest release of pg12, if not then you can > once check the release notes to see if there is any related bug fix in > the later releases? > > Is this problem reproducible? If so, can you find out why there are > multiple time connection issues between walsender and walreceiver? I can try to redo it, but before I do - anything I could do to either side of replication to increase our chances at figuring out the underlying problem? I can't restart pg12, though. Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Wed, Oct 19, 2022 at 3:50 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote: > > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski > > <depesz@depesz.com> wrote: > > > > > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote: > > > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibileon pg12/bionic. > > > > This is quite strange and I am not able to see the reason why this can > > > > happen. The only way to debug this problem that comes to mind is to > > > > add more LOGS around the code that removes the WAL files. For example, > > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in > > > > KeepLogSeg(). > > > > > > That would require changing pg sources, I think, recompiling, and > > > retrying? > > > > > > > Yes. BTW, are you on the latest release of pg12, if not then you can > > once check the release notes to see if there is any related bug fix in > > the later releases? > > > > Is this problem reproducible? If so, can you find out why there are > > multiple time connection issues between walsender and walreceiver? > > I can try to redo it, but before I do - anything I could do to either > side of replication to increase our chances at figuring out the > underlying problem? I can't restart pg12, though. > One idea is to change log level to DEBUG2 so that we can print which files are removed by the server via RemoveOldXlogFiles() { ... elog(DEBUG2, "attempting to remove WAL segments older than log file %s", lastoff); ... } If we can do this then at the very least we can know whether the required files are removed by the server or by some external application. -- With Regards, Amit Kapila.
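(Relevant to the "can't restart pg12" constraint: whether a given setting needs a restart is visible in pg_settings.context; log_min_messages reports "superuser", so the DEBUG2 output can be turned on with a plain reload. An illustrative check:)

#v+
-- Any context other than 'postmaster' can be changed without a restart.
SELECT name, setting, context
FROM pg_settings
WHERE name IN ('log_min_messages', 'wal_keep_segments', 'max_wal_size');
#v-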
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Wed, Oct 19, 2022 at 04:14:28PM +0530, Amit Kapila wrote: > One idea is to change log level to DEBUG2 so that we can print which > files are removed by the server via > RemoveOldXlogFiles() > { > ... > elog(DEBUG2, "attempting to remove WAL segments older than log file %s", > lastoff); > ... > } > > If we can do this then at the very least we can know whether the > required files are removed by the server or by some external > application. That should be doable. Will take some time to setup test env, and run it. Will get back to you on this. Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
> On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> > > > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> > > This is quite strange and I am not able to see the reason why this can
> > > happen. The only way to debug this problem that comes to mind is to
> > > add more LOGS around the code that removes the WAL files. For example,
> > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
> > > KeepLogSeg().
> >
> > That would require changing pg sources, I think, recompiling, and
> > retrying?
> >
>
> Yes. BTW, are you on the latest release of pg12, if not then you can
> once check the release notes to see if there is any related bug fix in
> the later releases?

We're not. We're on 12.9, and there seem to be some WAL-related changes since then:
https://why-upgrade.depesz.com/show?from=12.9&to=12.12&keywords=wal*

I will talk with the responsible people about upgrading it, and then redo the test.

Best regards,

depesz
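(A quick way to confirm the exact minor release on each side before reading the release notes; server_version_num gives the numeric form, e.g. 120009 for 12.9:)

#v+
SELECT version(), current_setting('server_version_num');
#v-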
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Wed, Oct 19, 2022 at 01:11:06PM +0200, hubert depesz lubaczewski wrote: > On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote: > > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski > > <depesz@depesz.com> wrote: > > > > > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote: > > > > > Wal file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visibileon pg12/bionic. > > > > This is quite strange and I am not able to see the reason why this can > > > > happen. The only way to debug this problem that comes to mind is to > > > > add more LOGS around the code that removes the WAL files. For example, > > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in > > > > KeepLogSeg(). > > > > > > That would require changing pg sources, I think, recompiling, and > > > retrying? > > > > > > > Yes. BTW, are you on the latest release of pg12, if not then you can > > once check the release notes to see if there is any related bug fix in > > the later releases? > > We're not. We're on 12.9, and there seem to be some wal-related changes > since then: > https://why-upgrade.depesz.com/show?from=12.9&to=12.12&keywords=wal* > > will talk with responsible people about upgrading it, and then redoing > the test. Hi, just an update. Due to procedures we will upgrade to 12.12 at the beginning of November. Then, I'll try to replicate the problem again, and we'll see how this will pan out. For next test will do my best to use debug2 log level. Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
> Hi,
> just an update. Due to procedures we will upgrade to 12.12 at the
> beginning of November. Then, I'll try to replicate the problem again,
> and we'll see how this will pan out.
> For next test will do my best to use debug2 log level.

So. I have an update.

Out of 8 servers that we prepared the upgrade for, one failed with the same/similar problem. Unfortunately I don't know if this is the same server as before. Sorry.

Specifically, I now see, in the logs on the focal/pg14 side:

#v+
2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

On the source (bionic/pg12), I see:

#v+
=# select * from pg_replication_slots ;
 slot_name │ plugin   │ slot_type │ datoid │ database │ temporary │ active │ active_pid │ xmin   │ catalog_xmin │ restart_lsn    │ confirmed_flush_lsn
═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
 focal14   │ pgoutput │ logical   │ 16607  │ canvas   │ f         │ f      │ [null]     │ [null] │ 3241443528   │ 1039D/83825958 │ 1039D/96453F38
(1 row)
#v-

but Pg, like previously, lost some WAL files, this time 8 of them:

#v+
=$ sudo ls -l ~postgres/12/main/pg_wal | head
total 224403644
-rw------- 1 postgres postgres      349 Feb 11  2022 000000010000BDEF000000FF.00045F68.backup
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
...
#v-

This time we had debug2 enabled on both sides.

Looks like we lost segments: 000000010001039D00000083, 000000010001039D00000084, 000000010001039D00000085, 000000010001039D00000086, 000000010001039D00000087, 000000010001039D00000088, 000000010001039D00000089, and 000000010001039D0000008A.

It looks like the first time the replica logged the error was:

#v+
2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

In the Pg logs on the source I see:

#v+
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10 21:06:16 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10 21:06:21 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10 21:06:26 UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
#v-

So, the whole set of missing files was recycled at the same time.

One more thing that I have is a copy of the pg_replication_slots info. This data is taken periodically, around every 1 minute. Each line, aside from the pg_replication_slots data, also has the output of pg_current_wal_lsn() and a timestamp. Looks like this:

#v+
timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn
2022-11-10 20:50:04 UTC 103A6/C49D36F8 focal14 pgoutput logical 16607 canvas f t 16794 \N 3241375739 1039D/65365F68 1039D/782908E8
2022-11-10 20:51:03 UTC 103A6/CC957148 focal14 pgoutput logical 16607 canvas f f \N \N 3241375739 1039D/65365F68 1039D/782908E8
2022-11-10 20:52:02 UTC 103A6/D7348A30 focal14 pgoutput logical 16607 canvas f t 21140 \N 3241422387 1039D/65365F68 1039D/78C94588
2022-11-10 20:53:01 UTC 103A6/EB24F408 focal14 pgoutput logical 16607 canvas f t 21140 \N 3241422387 1039D/65365F68 1039D/7C8E7D78
2022-11-10 20:54:01 UTC 103A7/27A4778 focal14 pgoutput logical 16607 canvas f f \N \N 3241422387 1039D/65365F68 1039D/7C8E7D78
2022-11-10 20:55:00 UTC 103A7/1A635CD0 focal14 pgoutput logical 16607 canvas f t 27607 \N 3241422387 1039D/78833568 1039D/7F8BCD50
2022-11-10 20:55:59 UTC 103A7/2D5E6EE8 focal14 pgoutput logical 16607 canvas f t 27607 \N 3241422387 1039D/78833568 1039D/818B6990
2022-11-10 20:56:58 UTC 103A7/44548000 focal14 pgoutput logical 16607 canvas f t 27607 \N 3241422387 1039D/7E4CDF90 1039D/89E93620
2022-11-10 20:57:57 UTC 103A7/579F8000 focal14 pgoutput logical 16607 canvas f f \N \N 3241422387 1039D/7E4CDF90 1039D/89E93620
2022-11-10 20:58:56 UTC 103A7/6B6C96D0 focal14 pgoutput logical 16607 canvas f f \N \N 3241422387 1039D/7E4CDF90 1039D/89E93620
2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18
2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:09:46 UTC 103A8/233202E0 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-10 21:10:45 UTC 103A8/2FBC9720 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
#v-

Now, I can keep this situation up for some time, if more debugging will be helpful, but I don't know how long (disk space issues).

Is there anything in here that helps? Any other info I can dig out?

Best regards,

depesz
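(On the disk-space concern: how much WAL the slot is forcing the primary to keep can be tracked with pg_wal_lsn_diff() against the current write position - an illustrative query, not part of the original monitoring:)

#v+
-- Bytes of WAL retained because of each slot; helps decide how long the
-- broken state can be kept around for debugging.
SELECT slot_name,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
#v-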
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Fri, Nov 11, 2022 at 03:50:40PM +0100, hubert depesz lubaczewski wrote:
> #v+
> 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
> #v-

Sooo... the plot thickens.

Without any changes, manual rebuild, or anything else, yesterday the problem seems to have solved itself?!

In the logs on focal/pg14 I see:

#v+
2022-11-12 20:55:39.190 UTC,,,1897563,,6370084b.1cf45b,2,,2022-11-12 20:55:39 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

And this is *the last* such message.

On bionic/pg12, in the logs from pg_replication_slots, we have:

#v+
timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn
2022-11-12 20:51:00 UTC 1041E/D3A0E540 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:51:59 UTC 1041E/D89B6000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:52:58 UTC 1041E/E0547450 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:53:58 UTC 1041E/E59634F0 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:54:57 UTC 1041E/EBB50DE8 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
2022-11-12 20:55:55 UTC 1041E/FBBC3160 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241450490 1039D/9170B010 1039D/9B86EAF0
2022-11-12 20:56:55 UTC 1041F/FF34000 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241469432 1039D/A21B4598 1039D/A928A6D0
2022-11-12 20:57:54 UTC 1041F/277FAE30 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241480448 1039D/AD4C7A00 1039D/BA3FC840
2022-11-12 20:58:53 UTC 1041F/319A0000 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241505462 1039D/C5C32398 1039D/DF899428
2022-11-12 20:59:52 UTC 1041F/3A399688 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241527824 1039D/F3393280 1039E/AD9A740
...
2022-11-13 08:00:44 UTC 1043F/E464B738 focal14 pgoutput logical 16607 canvas f t 18626 \N 3292625412 1043F/E0F38E78 1043F/E4609628
#v-

I have no idea what has changed or why.

If it helps I can provide logs, but I would prefer to do it off list.

Also, while the problem with this cluster is "solved", I still have some 300 other clusters to upgrade, and at least one has hit the same problem today.

Best regards,

depesz
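(For the remaining clusters, a periodic check along these lines on each primary would flag the broken state early - a slot whose restart_lsn points at a segment that no longer exists in pg_wal. Purely illustrative, and it assumes the default 16 MB segment size:)

#v+
-- Flag slots whose oldest required segment is missing from pg_wal.
SELECT s.slot_name,
       s.restart_lsn,
       pg_walfile_name(s.restart_lsn) AS needed_segment,
       EXISTS (SELECT 1
               FROM pg_ls_waldir() w
               WHERE w.name = pg_walfile_name(s.restart_lsn)) AS segment_present
FROM pg_replication_slots s
WHERE s.restart_lsn IS NOT NULL;
#v-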
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Fri, Nov 11, 2022 at 8:20 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote: > > Hi, > > just an update. Due to procedures we will upgrade to 12.12 at the > > beginning of November. Then, I'll try to replicate the problem again, > > and we'll see how this will pan out. > > For next test will do my best to use debug2 log level. > > So. I have an update. > > Out of 8 servers that we prepared upgrade for, one failed with the > same/similar problem. > Unfortunately I don't know if this is the same server as before. Sorry. > > Specifically, I see now, in logs on focal/pg14 side: > > #v+ > 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive datafrom WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logicalreplication worker",,0 > #v- > > On source (bionic/pg12), I see: > > #v+ > =# select * from pg_replication_slots ; > slot_name │ plugin │ slot_type │ datoid │ database │ temporary │ active │ active_pid │ xmin │ catalog_xmin │ restart_lsn │ confirmed_flush_lsn > ═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════ > focal14 │ pgoutput │ logical │ 16607 │ canvas │ f │ f │ [null] │ [null] │ 3241443528 │ 1039D/83825958│ 1039D/96453F38 > (1 row) > #v- > > but pg, like previously, lost some wal files, this time 8 of them: > > #v+ > =$ sudo ls -l ~postgres/12/main/pg_wal | head > total 224403644 > -rw------- 1 postgres postgres 349 Feb 11 2022 000000010000BDEF000000FF.00045F68.backup > -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B > -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C > -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D > ... > #v- > > This time we had debug2 enabled on both sides. > > Looks like we lost segments: > 000000010001039D00000083, 000000010001039D00000084, > 000000010001039D00000085, 000000010001039D00000086, > 000000010001039D00000087, 000000010001039D00000088, > 000000010001039D00000089, and 000000010001039D0000008A. 
> > It looks that first time replica logged the error was: > > #v+ > 2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive datafrom WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logicalreplication worker",,0 > #v- > > In Pg logs on source I see: > > #v+ > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000082""",,,,,,,,,"" > 2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000083""",,,,,,,,,"" > 2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000084""",,,,,,,,,"" > 2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000085""",,,,,,,,,"" > 2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000086""",,,,,,,,,"" > 2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000087""",,,,,,,,,"" > 2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000088""",,,,,,,,,"" > 2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000089""",,,,,,,,,"" > 2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D0000008A""",,,,,,,,,"" > 2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10 21:06:16UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATIONSLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14" > 2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10 21:06:21UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATIONSLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14" > 2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10 21:06:26UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATIONSLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14" > #v- > > So, the whole set of missing files was recycled at the same time. > > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1 minute.Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp. Lookslike this: > > #v+ > timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn ... ... 
> 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 ... I think from the above two it is clear that the slot with restart_lsn 1039D/83825958 is active and we seem to have recycled corresponding segment. > > Now, I can keep this situation up for some time, if more debug will be helpful, > but I don't know how long (disk space issues). > > Is there anything in here that helps? Any other info I can dig out? > You may have the LOG for "attempting to remove WAL segments older than log file %s", if so, please share. Also on similar lines, I think we need to add some LOGs before calling RemoveOldXlogFiles() to know where our computation to remove files goes off. Something, like the attached, would be helpful but note I have added this quickly on my working branch to show you what I have in mind. You may need small changes based on the branch you are working on. -- With Regards, Amit Kapila.
Attachment
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
> > #v+
> > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
...
> > #v-
> >
> > So, the whole set of missing files was recycled at the same time.
> >
> > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1 minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp. Looks like this:
> >
> > #v+
> > timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn
> ...
> ...
> > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
> ...
>
> I think from the above two it is clear that the slot with restart_lsn
> 1039D/83825958 is active and we seem to have recycled corresponding
> segment.
>
> You may have the LOG for "attempting to remove WAL segments older than
> log file %s", if so, please share. Also on similar lines, I think we

Absolutely.

There is something weird happening:

#v+
2022-11-10 20:01:07.520 UTC,,,28925,,636378d6.70fd,4488667,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039B000000D9",,,,,,,,,""
2022-11-10 20:06:07.359 UTC,,,28925,,636378d6.70fd,4496840,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039B000000F8",,,,,,,,,""
2022-11-10 20:11:07.522 UTC,,,28925,,636378d6.70fd,4505099,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C00000027",,,,,,,,,""
2022-11-10 20:16:07.466 UTC,,,28925,,636378d6.70fd,4513331,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C00000054",,,,,,,,,""
2022-11-10 20:21:07.572 UTC,,,28925,,636378d6.70fd,4521466,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C0000008F",,,,,,,,,""
2022-11-10 20:26:07.384 UTC,,,28925,,636378d6.70fd,4529518,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C000000B9",,,,,,,,,""
2022-11-10 20:31:07.619 UTC,,,28925,,636378d6.70fd,4537868,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039C000000F8",,,,,,,,,""
2022-11-10 20:36:07.532 UTC,,,28925,,636378d6.70fd,4546079,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000002B",,,,,,,,,""
2022-11-10 20:41:07.603 UTC,,,28925,,636378d6.70fd,4554468,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000052",,,,,,,,,""
2022-11-10 20:46:07.609 UTC,,,28925,,636378d6.70fd,4562376,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000005F",,,,,,,,,""
2022-11-10 20:51:07.513 UTC,,,28925,,636378d6.70fd,4570523,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000005F",,,,,,,,,""
2022-11-10 20:56:07.727 UTC,,,28925,,636378d6.70fd,4578553,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000070",,,,,,,,,""
2022-11-10 21:01:07.509 UTC,,,28925,,636378d6.70fd,4586721,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000082",,,,,,,,,""
2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:11:07.402 UTC,,,28925,,636378d6.70fd,4602683,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:16:07.837 UTC,,,28925,,636378d6.70fd,4610577,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:21:07.571 UTC,,,28925,,636378d6.70fd,4618490,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:26:07.559 UTC,,,28925,,636378d6.70fd,4626365,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:31:07.649 UTC,,,28925,,636378d6.70fd,4634592,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:36:07.737 UTC,,,28925,,636378d6.70fd,4642829,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:41:07.278 UTC,,,28925,,636378d6.70fd,4650983,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:46:07.670 UTC,,,28925,,636378d6.70fd,4658973,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:51:07.579 UTC,,,28925,,636378d6.70fd,4666946,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:56:07.618 UTC,,,28925,,636378d6.70fd,4674922,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
#v-

If I'll add "recycled" lines we will get:

#v+
...
2022-11-10 21:01:07.509 UTC,,,28925,,636378d6.70fd,4586721,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D00000082",,,,,,,,,""
2022-11-10 21:01:07.512 UTC,,,28925,,636378d6.70fd,4586722,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000071""",,,,,,,,,""
2022-11-10 21:01:07.514 UTC,,,28925,,636378d6.70fd,4586723,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000072""",,,,,,,,,""
2022-11-10 21:01:07.515 UTC,,,28925,,636378d6.70fd,4586724,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000073""",,,,,,,,,""
2022-11-10 21:01:07.517 UTC,,,28925,,636378d6.70fd,4586725,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000074""",,,,,,,,,""
2022-11-10 21:01:07.517 UTC,,,28925,,636378d6.70fd,4586726,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000075""",,,,,,,,,""
2022-11-10 21:01:07.519 UTC,,,28925,,636378d6.70fd,4586727,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000076""",,,,,,,,,""
2022-11-10 21:01:07.520 UTC,,,28925,,636378d6.70fd,4586728,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000077""",,,,,,,,,""
2022-11-10 21:01:07.521 UTC,,,28925,,636378d6.70fd,4586729,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000078""",,,,,,,,,""
2022-11-10 21:01:07.522 UTC,,,28925,,636378d6.70fd,4586730,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000079""",,,,,,,,,""
2022-11-10 21:01:07.523 UTC,,,28925,,636378d6.70fd,4586731,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007A""",,,,,,,,,""
2022-11-10 21:01:07.524 UTC,,,28925,,636378d6.70fd,4586732,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007B""",,,,,,,,,""
2022-11-10 21:01:07.524 UTC,,,28925,,636378d6.70fd,4586733,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007C""",,,,,,,,,""
2022-11-10 21:01:07.525 UTC,,,28925,,636378d6.70fd,4586734,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007D""",,,,,,,,,""
2022-11-10 21:01:07.527 UTC,,,28925,,636378d6.70fd,4586735,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007E""",,,,,,,,,""
2022-11-10 21:01:07.528 UTC,,,28925,,636378d6.70fd,4586736,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000007F""",,,,,,,,,""
2022-11-10 21:01:07.529 UTC,,,28925,,636378d6.70fd,4586737,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000080""",,,,,,,,,""
2022-11-10 21:01:07.530 UTC,,,28925,,636378d6.70fd,4586738,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000081""",,,,,,,,,""
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
2022-11-10 21:11:07.402 UTC,,,28925,,636378d6.70fd,4602683,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:16:07.837 UTC,,,28925,,636378d6.70fd,4610577,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:21:07.571 UTC,,,28925,,636378d6.70fd,4618490,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:26:07.559 UTC,,,28925,,636378d6.70fd,4626365,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:31:07.649 UTC,,,28925,,636378d6.70fd,4634592,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:36:07.737 UTC,,,28925,,636378d6.70fd,4642829,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:41:07.278 UTC,,,28925,,636378d6.70fd,4650983,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:46:07.670 UTC,,,28925,,636378d6.70fd,4658973,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:51:07.579 UTC,,,28925,,636378d6.70fd,4666946,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
2022-11-10 21:56:07.618 UTC,,,28925,,636378d6.70fd,4674922,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
#v-

> need to add some LOGs before calling RemoveOldXlogFiles() to know
> where our computation to remove files goes off. Something, like the
> attached, would be helpful but note I have added this quickly on my
> working branch to show you what I have in mind. You may need small
> changes based on the branch you are working on.

That will be complicated.
Changing pg (recompile and rerun) requires careful preparation with our customers, which takes time.

Additionally, we are using precompiled binaries from pgdg, which makes the process slightly more involved.

And to top it off, as I mailed in the other message, the problem doesn't always happen. For the 8 clusters that I tried to prepare, the problem happened on two, and on one of them (this one, actually) it kinda resolved on its own?!

I know I'm asking a lot, but is there any chance we could set up a conference/shared screen session to try to get as much data as possible out of the situations we have now? Of course that would be paid consulting. If you're OK with it, please mail me off list, so we can arrange it.

Best regards,

depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Mon, Nov 14, 2022 at 5:47 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote: > > > #v+ > > > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000082""",,,,,,,,,"" > ... > > > #v- > > > > > > So, the whole set of missing files was recycled at the same time. > > > > > > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1 minute.Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp. Lookslike this: > > > > > > #v+ > > > timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn > > ... > > ... > > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > ... > > > > I think from the above two it is clear that the slot with restart_lsn > > 1039D/83825958 is active and we seem to have recycled corresponding > > segment. > > > > You may have the LOG for "attempting to remove WAL segments older than > > log file %s", if so, please share. Also on similar lines, I think we > > Absolutely. > > There is something weird happening: > What exactly weird you are seeing in this? To me, it appears as if the system due to some reason ignores an existing slot that has restart_lsn as 1039D/83825958. > > If I'll add "recycled" lines we will get: > > #v+ > ... > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-aheadlog file ""000000010001039D00000082""",,,,,,,,,"" > 2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to removeWAL segments older than log file 000000000001039D0000008A",,,,,,,,,"" > Actually, the last LOG was what I was not expecting because of the existence of a corresponding slot. This is what we have to find out how it has computed the value "000000000001039D0000008A" for the file to be removed. > > need to add some LOGs before calling RemoveOldXlogFiles() to know > > where our computation to remove files goes off. Something, like the > > attached, would be helpful but note I have added this quickly on my > > working branch to show you what I have in mind. You may need small > > changes based on the branch you are working on. > > That will be complicated. > > Changing pg (recompile, and rerun) requires careful preparation with our > customers. Which takes time. > > Additionally we are using precompiled binaries from pgdg, that makes the > process slightly more involved. > I can understand but OTOH it is not easy to proceed without more information. -- With Regards, Amit Kapila.
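For readers following along, the arithmetic that connects a restart_lsn to the WAL segment a slot still needs is easy to reproduce outside the server. The small standalone sketch below assumes the default 16MB wal_segment_size and timeline 1, and imitates the "%08X%08X%08X" layout of XLogFileName() rather than calling any server code; it shows that restart_lsn 1039D/83825958 lives in segment 000000010001039D00000083, the very file the subscriber reported as already removed, so a removal cutoff of 000000000001039D0000008A indeed lets go of WAL the slot still requires:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* default wal_segment_size; an assumption, not read from this cluster */
        const uint64_t seg_size = 16 * 1024 * 1024;
        const uint64_t segs_per_xlogid = 0x100000000ULL / seg_size;    /* 256 */

        uint64_t restart_lsn = ((uint64_t) 0x1039D << 32) | 0x83825958; /* 1039D/83825958 */
        uint64_t segno = restart_lsn / seg_size;
        uint32_t tli = 1;

        /* same layout as the server's XLogFileName(): TLI, then segno split in two */
        printf("%08X%08X%08X\n",
               tli,
               (uint32_t) (segno / segs_per_xlogid),
               (uint32_t) (segno % segs_per_xlogid));
        /* prints 000000010001039D00000083 */
        return 0;
    }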
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Sun, Nov 13, 2022 at 1:36 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Fri, Nov 11, 2022 at 03:50:40PM +0100, hubert depesz lubaczewski wrote: > > #v+ > > 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receivedata from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logicalreplication worker",,0 > > #v- > > Sooo... plot thickens. > > Without any changes, manual rebuild or anything, yesterday, the problem > seems to have solved itself?! > > In logs on focal/pg14 I see: > > #v+ > 2022-11-12 20:55:39.190 UTC,,,1897563,,6370084b.1cf45b,2,,2022-11-12 20:55:39 UTC,6/0,0,ERROR,08P01,"could not receivedata from WAL stream: ERROR: requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logicalreplication worker",,0 > #v- > > And this is *the last* such message. > > On bionic/pg12 we have in logs from pg_replication_slots: > > #v+ > timestamp pg_current_wal_lsn slot_name plugin slot_type datoid database temporary active active_pid xmin catalog_xmin restart_lsn confirmed_flush_lsn > 2022-11-12 20:51:00 UTC 1041E/D3A0E540 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-12 20:51:59 UTC 1041E/D89B6000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-12 20:52:58 UTC 1041E/E0547450 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-12 20:53:58 UTC 1041E/E59634F0 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-12 20:54:57 UTC 1041E/EBB50DE8 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > 2022-11-12 20:55:55 UTC 1041E/FBBC3160 focal14 pgoutput logical 16607 canvas f t 18626 \N 3241450490 1039D/9170B010 1039D/9B86EAF0 > ... ... From the last two lines above, it is clear why it started working. The restart_lsn has advanced from 1039D/83825958 to 1039D/9170B010 which means the system no longer needs data from 000000010001039D00000083. -- With Regards, Amit Kapila.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 06:41:59PM +0530, Amit Kapila wrote:
> From the last two lines above, it is clear why it started working. The
> restart_lsn has advanced from 1039D/83825958 to 1039D/9170B010 which
> means the system no longer needs data from 000000010001039D00000083.

Sure, but how could it advance if the wal wasn't there?

Best regards,

depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
> > There is something weird happening:
> What exactly weird you are seeing in this? To me, it appears as if the
> system due to some reason ignores an existing slot that has
> restart_lsn as 1039D/83825958.

The weird part for me is that it is trying to remove wal files older than the same "x" many times. But perhaps it just means that the code that does it doesn't check if it did the same previously.

Best regards,

depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Mon, Nov 14, 2022 at 7:03 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
> > > There is something weird happening:
> > What exactly weird you are seeing in this? To me, it appears as if the
> > system due to some reason ignores an existing slot that has
> > restart_lsn as 1039D/83825958.
>
> The weird part for me is that it is trying to remove wal files older
> than the same "x" many times.
>

I think that is okay because as per the checkpointer's computation it decides not to remove/replace any new WAL files. At this stage, I don't have any idea except getting the value of XLogGetReplicationSlotMinimumLSN() into one of the LOG prints. If you can't add all the LOGs I shared in the last patch, can you try to get the value of XLogGetReplicationSlotMinimumLSN() by appending it to the existing LOG "attempting to remove WAL segments older than log file .."?

--
With Regards,
Amit Kapila.
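A minimal sketch of the kind of logging change being discussed, assuming it sits next to the existing DEBUG2 message in RemoveOldXlogFiles() in xlog.c; the placement and message wording here are illustrative, not the patch that was attached earlier in the thread:

    /* sketch: also report the slot minimum the checkpointer used */
    XLogRecPtr  slotminlsn = XLogGetReplicationSlotMinimumLSN();

    XLogFileName(lastoff, 0, segno, wal_segment_size);
    elog(DEBUG2,
         "attempting to remove WAL segments older than log file %s "
         "(replication slot minimum LSN %X/%X)",
         lastoff,
         (uint32) (slotminlsn >> 32), (uint32) slotminlsn);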
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 09:52:05PM +0530, Amit Kapila wrote:
> On Mon, Nov 14, 2022 at 7:03 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
> > > > There is something weird happening:
> > > What exactly weird you are seeing in this? To me, it appears as if the
> > > system due to some reason ignores an existing slot that has
> > > restart_lsn as 1039D/83825958.
> >
> > The weird part for me is that it is trying to remove wal files older
> > than the same "x" many times.
> >
>
> I think that is okay because as per the checkpointer's computation it
> decides not to remove/replace any new WAL files. At this stage, I don't
> have any idea except getting the value of
> XLogGetReplicationSlotMinimumLSN() into one of the LOG prints. If you
> can't add all the LOGs I shared in the last patch, can you try to get
> the value of XLogGetReplicationSlotMinimumLSN() by appending it to the
> existing LOG "attempting to remove WAL segments older than log file
> .."?

If I get *any* patch approved, and a restart of the client db, then it doesn't matter how many LOGs there will be :)

Will get back to you as soon as I have more info.

Best regards,

depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18
> 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0
> 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
> 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38
> 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38

It happened twice that the walsender process was active for at least one minute while not advancing restart_lsn, and then exited. Could you share the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we can find out what the walsender processes were doing?

Also, you may have logs for "failed to increase restart lsn: proposed %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, please share them as well.

I've tried to reproduce this issue with a similar setting in a loop but could not.

Given that you have the log "attempting to remove WAL segments older than log file 000000000001039D0000008A", the last WAL file we need to retain, as calculated by KeepLogSeg(), was wrong.

KeepLogSeg() retreats the last segment that we need to retain while considering the replication slots' minimum LSN (and wal_keep_segments), but it seems not to have worked correctly here. The possibilities are (1) we got a wrong slots' minimum LSN that is newer than the slot's LSN (i.e., we retreated logSegNo, but it is still newer than the required WAL files), or (2) we got an invalid slots' minimum LSN (i.e., we didn't retreat at all). If we know what the RedoRecPtr was, we can calculate the _logSegNo that had not yet been retreated by KeepLogSeg().

BTW, in LogicalConfirmReceivedLocation() we set restart_lsn as follows:

    if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
        MyReplicationSlot->candidate_restart_valid <= lsn)
    {
        Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);

        MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
        MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
        MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
        updated_restart = true;
    }

Is it worth changing the assertion to elog(ERROR) so that we can detect that an invalid value is set as restart_lsn?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
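A sketch of the idea in that last question, purely illustrative and glossing over the fact that this block runs while the slot's spinlock is held (so raising an error there would need extra care in a real patch): the assertion becomes a runtime check that fires even in non-assert builds.

    if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
        MyReplicationSlot->candidate_restart_valid <= lsn)
    {
        /* sketch: hard error instead of Assert() */
        if (MyReplicationSlot->candidate_restart_lsn == InvalidXLogRecPtr)
            elog(ERROR, "invalid candidate restart LSN although candidate_restart_valid is %X/%X",
                 (uint32) (MyReplicationSlot->candidate_restart_valid >> 32),
                 (uint32) MyReplicationSlot->candidate_restart_valid);

        MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
        MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
        MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
        updated_restart = true;
    }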
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Tue, Nov 15, 2022 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski > <depesz@depesz.com> wrote: > > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > It happened twice that the walsender process was active for at least > one minute while not advancing restart_lsn and exited. Could you share > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we > can find out what the wal sender processes were doing? > > Also, you may have logs for "failed to increase restart lsn: proposed > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, > please share it as well. > > I've tried to reproduce this issue with a similar setting in a loop > but could not. > > Given that you have the log "attempting to remove WAL segments older > than log file 000000000001039D0000008A", the last WAL file we need to > retain calculated by KeepLogSeg() was wrong. > > KeepLogSeg() retreated the last segment that we need to retain while > considering replication slots' minimum LSN (and wal_keep_segments) but > it seemed not to work fine. The possibilities are (1) we got the wrong > slots' minimum LSN that is newer than a slot's LSN (ie, retreated > logSegNo but it's still newer than required WAL files) or (2) we got > an invalid slots's minimum LSN (ie, didn't retreated at all). If we > know what the RedoRecPtr was, we can calculate _logSegNo that has not > been retreated by KeepLogSeg(). > The other possibility is that the slot is not considered for minimum LSN computation (say in_use flag is temporarily set to false). 
> BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows:
>
>     if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
>         MyReplicationSlot->candidate_restart_valid <= lsn)
>     {
>         Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);
>
>         MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
>         MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
>         MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
>         updated_restart = true;
>     }
>
> Is it worth changing the assertion to elog(ERROR) so that we can
> detect that an invalid value is set to restart_lsn?
>

This is worth considering, but note that the caller of this has an assertion for invalid values being assigned to the candidate_restart_* variables. Also, wouldn't it then be shown as an invalid restart_lsn in pg_replication_slots?

--
With Regards,
Amit Kapila.
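To make the in_use possibility mentioned above concrete, this is roughly the loop that produces the minimum the checkpointer later reads via XLogGetReplicationSlotMinimumLSN(), paraphrased from ReplicationSlotsComputeRequiredLSN() in slot.c (details may differ slightly between branches). A slot whose in_use flag is momentarily false, or whose restart_lsn is InvalidXLogRecPtr, simply does not hold back the computed minimum:

    int         i;
    XLogRecPtr  min_required = InvalidXLogRecPtr;

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogRecPtr  restart_lsn;

        if (!s->in_use)
            continue;            /* slot not "in use" is simply ignored */

        SpinLockAcquire(&s->mutex);
        restart_lsn = s->data.restart_lsn;
        SpinLockRelease(&s->mutex);

        if (restart_lsn != InvalidXLogRecPtr &&
            (min_required == InvalidXLogRecPtr || restart_lsn < min_required))
            min_required = restart_lsn;
    }
    LWLockRelease(ReplicationSlotControlLock);

    XLogSetReplicationSlotMinimumLSN(min_required);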
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Tue, Nov 15, 2022 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski > <depesz@depesz.com> wrote: > > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > It happened twice that the walsender process was active for at least > one minute while not advancing restart_lsn and exited. Could you share > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we > can find out what the wal sender processes were doing? > > Also, you may have logs for "failed to increase restart lsn: proposed > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, > please share it as well. > > I've tried to reproduce this issue with a similar setting in a loop > but could not. > > Given that you have the log "attempting to remove WAL segments older > than log file 000000000001039D0000008A", the last WAL file we need to > retain calculated by KeepLogSeg() was wrong. > > KeepLogSeg() retreated the last segment that we need to retain while > considering replication slots' minimum LSN (and wal_keep_segments) but > it seemed not to work fine. The possibilities are (1) we got the wrong > slots' minimum LSN that is newer than a slot's LSN (ie, retreated > logSegNo but it's still newer than required WAL files) or (2) we got > an invalid slots's minimum LSN (ie, didn't retreated at all). If we > know what the RedoRecPtr was, we can calculate _logSegNo that has not > been retreated by KeepLogSeg(). > I was checking but didn't find if we have RedoRecPtr-related information available. I think it could be valuable to have RedoRecPtr and probably replicationSlotMinLSN, lastRemovedSegNo, etc. kind of information available via some view (either in pg_stat_bgwriter or a similar view) to debug this kind of issue. -- With Regards, Amit Kapila.
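Until something like that exists in core, a tiny debugging extension could expose at least two of those values. The following is a hypothetical sketch only: the function names and packaging are made up, and it assumes the xlog.h declarations of GetRedoRecPtr() and XLogGetReplicationSlotMinimumLSN() are available to extension code and that the module is built the usual PGXS way.

    #include "postgres.h"

    #include "access/xlog.h"
    #include "fmgr.h"
    #include "utils/pg_lsn.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(debug_slot_minimum_lsn);
    PG_FUNCTION_INFO_V1(debug_redo_lsn);

    /* minimum restart_lsn across slots, as cached for WAL retention */
    Datum
    debug_slot_minimum_lsn(PG_FUNCTION_ARGS)
    {
        PG_RETURN_LSN(XLogGetReplicationSlotMinimumLSN());
    }

    /* redo pointer of the last checkpoint */
    Datum
    debug_redo_lsn(PG_FUNCTION_ARGS)
    {
        PG_RETURN_LSN(GetRedoRecPtr());
    }

    /*
     * Matching SQL declarations in the extension script, e.g.:
     *   CREATE FUNCTION debug_slot_minimum_lsn() RETURNS pg_lsn
     *     AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
     *   CREATE FUNCTION debug_redo_lsn() RETURNS pg_lsn
     *     AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
     */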
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18
> > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0
> > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
> > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38
> > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
>
> It happened twice that the walsender process was active for at least
> one minute while not advancing restart_lsn and exited. Could you share
> the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> can find out what the wal sender processes were doing?

That will be ~ 270MB. Not sure how that would fly, plus that's an awful lot of logs to check to make sure I don't leak any sensitive info.

But if I remove the lines generated by walsenders to the pg12 replicas (streaming replication to bionic/pg12 replicas), the logfile is only ~ 3MB. Still a bit too much, as there can be some queries with sensitive info.

Perhaps I can grep something out?

> Also, you may have logs for "failed to increase restart lsn: proposed
> %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> please share it as well.

Found one in the time frame you mentioned:

2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"

> I've tried to reproduce this issue with a similar setting in a loop
> but could not.

It seems to be rather elusive, which is why I'm kinda "happy" that, with a good deal of luck, I can replicate it.

This question might be 100% stupid, sorry if it is - any chance that the fact that we're on arm64 has something to do with it?

Best regards,

depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote: > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski > > <depesz@depesz.com> wrote: > > > > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > It happened twice that the walsender process was active for at least > > one minute while not advancing restart_lsn and exited. Could you share > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we > > can find out what the wal sender processes were doing? > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully > lot of logs to check if I don't leak any sensitive info. > > But, if I'll remove lines that are generated by walsender to pg12 > replicas (streaming replication to bionic/pg12 replicas), the logfile is > only ~ 3MB. Still a bit too much, as there can be some queries with > sensitive info. > > Perhaps I can grep something out? > > > Also, you may have logs for "failed to increase restart lsn: proposed > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, > > please share it as well. > > Found one in the time frame you mentioned: > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" > Thanks! LSN 1039D/8B5773D8 seems to be related to this issue. If we advance slot's restart_lsn to this LSN, we remove WAL files older than 000000000001039D0000008A. 
In LogicalIncreaseRestartDecodingForSlot(), since "current_lsn (1039D/9170B010) < slot->data.confirmed_flush (1039D/91F41B50)", we executed the following part and called LogicalConfirmReceivedLocation():

    else if (current_lsn <= slot->data.confirmed_flush)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;

        /* our candidate can directly be used */
        updated_lsn = true;
    }

Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code:

    if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
        MyReplicationSlot->candidate_restart_valid <= lsn)
    {
        Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);

        MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;
        MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
        MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
        updated_restart = true;
    }

So you should have logs for "updated xmin: %u restart: %u"; could you check?

After the above code, we do:

    SpinLockRelease(&MyReplicationSlot->mutex);

    /* first write new xmin to disk, so we know what's up after a crash */
    if (updated_xmin || updated_restart)
    {
        ReplicationSlotMarkDirty();
        ReplicationSlotSave();
        elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin, updated_restart);
    }

Is something like the following scenario possible?

1. The walsender updates the slot's restart_lsn and releases the spinlock (the slot is not yet saved to disk).
2. Someone updates the slots' minimum restart_lsn (note that the slot's restart_lsn in memory is already updated).
3. The checkpointer removes WAL files older than the minimum restart_lsn calculated at step 2.
4. The walsender restarts for some reason (or the server crashes).
5. After the restart, the walsender acquires the slot, but the slot's restart_lsn is the old one.
6. The WAL files required to restart decoding have already been removed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Tue, Nov 15, 2022 at 6:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski > <depesz@depesz.com> wrote: > > > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote: > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski > > > <depesz@depesz.com> wrote: > > > > > > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > > > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > > > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > > It happened twice that the walsender process was active for at least > > > one minute while not advancing restart_lsn and exited. Could you share > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we > > > can find out what the wal sender processes were doing? > > > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully > > lot of logs to check if I don't leak any sensitive info. > > > > But, if I'll remove lines that are generated by walsender to pg12 > > replicas (streaming replication to bionic/pg12 replicas), the logfile is > > only ~ 3MB. Still a bit too much, as there can be some queries with > > sensitive info. > > > > Perhaps I can grep something out? > > > > > Also, you may have logs for "failed to increase restart lsn: proposed > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, > > > please share it as well. > > > > Found one in the time frame you mentioned: > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" > > > > Thanks! > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance > slot's restart_lsn to this LSN, we remove WAL files older than > 000000000001039D0000008A. 
> > In LogicalIncreaseRestartDecodingForSlot(), since > "current_lsn(1039D/9170B010) < > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following > part and called LogicalConfirmReceivedLocation(): > > else if (current_lsn <= slot->data.confirmed_flush) > { > slot->candidate_restart_valid = current_lsn; > slot->candidate_restart_lsn = restart_lsn; > > /* our candidate can directly be used */ > updated_lsn = true; > } > > Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code: > > if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr && > MyReplicationSlot->candidate_restart_valid <= lsn) > { > Assert(MyReplicationSlot->candidate_restart_lsn != > InvalidXLogRecPtr); > > MyReplicationSlot->data.restart_lsn = > MyReplicationSlot->candidate_restart_lsn; > MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr; > MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr; > updated_restart = true; > } > > So you should have logs for "updated xmin: %u restart: %u", could you check it? > > After the above codes, we do: > > SpinLockRelease(&MyReplicationSlot->mutex); > > /* first write new xmin to disk, so we know what's up after a crash */ > if (updated_xmin || updated_restart) > { > ReplicationSlotMarkDirty(); > ReplicationSlotSave(); > elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin, > updated_restart); > } > > Is something like the following scenario possible to happen? > > 1. wal sender updates slot's restart_lsn and releases the spin lock > (not saved in the disk yet) > 2. someone updates slots' minimum restart_lsn (note that slot's > restart_lsn in memory is already updated). > 3. checkpointer removes WAL files older than the minimum restart_lsn > calculated at step 2. > 4. wal sender restarts for some reason (or server crashed). > 5. after the restarts, wal sender acquires the slot but the slot's > restart_lsn is old. > 6. WAL files required for restart the decoding are already removed. I'm not sure that the reported issue happened in the above scenario but I've confirmed this scenario happened on my environment by adding sleeps and killing the wal sender process. After the server restarts, pg_replication_slots.restart_lsn retreated from 0/5390F78 to 0/1659FC0, and the walsender raised an error "requested WAL segment 000000010000000000000001 has already been removed". Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski > <depesz@depesz.com> wrote: > > > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote: > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski > > > <depesz@depesz.com> wrote: > > > > > > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > > > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > > > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > > It happened twice that the walsender process was active for at least > > > one minute while not advancing restart_lsn and exited. Could you share > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we > > > can find out what the wal sender processes were doing? > > > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully > > lot of logs to check if I don't leak any sensitive info. > > > > But, if I'll remove lines that are generated by walsender to pg12 > > replicas (streaming replication to bionic/pg12 replicas), the logfile is > > only ~ 3MB. Still a bit too much, as there can be some queries with > > sensitive info. > > > > Perhaps I can grep something out? > > > > > Also, you may have logs for "failed to increase restart lsn: proposed > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, > > > please share it as well. > > > > Found one in the time frame you mentioned: > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" > > > > Thanks! > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance > slot's restart_lsn to this LSN, we remove WAL files older than > 000000000001039D0000008A. 
> > In LogicalIncreaseRestartDecodingForSlot(), since > "current_lsn(1039D/9170B010) < > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following > part and called LogicalConfirmReceivedLocation(): > > else if (current_lsn <= slot->data.confirmed_flush) > { > slot->candidate_restart_valid = current_lsn; > slot->candidate_restart_lsn = restart_lsn; > > /* our candidate can directly be used */ > updated_lsn = true; > } > If this would have been executed in LogicalIncreaseRestartDecodingForSlot(), then the values displayed in the above DEBUG messages "current candidate 1039D/83825958, current after 1039D/8B5773D8" should be the same as proposed and after "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing something? -- With Regards, Amit Kapila.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Wed, Nov 16, 2022 at 1:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski > > <depesz@depesz.com> wrote: > > > > > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote: > > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski > > > > <depesz@depesz.com> wrote: > > > > > > > > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > > > > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > > > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > > > > It happened twice that the walsender process was active for at least > > > > one minute while not advancing restart_lsn and exited. Could you share > > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we > > > > can find out what the wal sender processes were doing? > > > > > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully > > > lot of logs to check if I don't leak any sensitive info. > > > > > > But, if I'll remove lines that are generated by walsender to pg12 > > > replicas (streaming replication to bionic/pg12 replicas), the logfile is > > > only ~ 3MB. Still a bit too much, as there can be some queries with > > > sensitive info. > > > > > > Perhaps I can grep something out? > > > > > > > Also, you may have logs for "failed to increase restart lsn: proposed > > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, > > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, > > > > please share it as well. > > > > > > Found one in the time frame you mentioned: > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" > > > > > > > Thanks! > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance > > slot's restart_lsn to this LSN, we remove WAL files older than > > 000000000001039D0000008A. 
> > > > In LogicalIncreaseRestartDecodingForSlot(), since > > "current_lsn(1039D/9170B010) < > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following > > part and called LogicalConfirmReceivedLocation(): > > > > else if (current_lsn <= slot->data.confirmed_flush) > > { > > slot->candidate_restart_valid = current_lsn; > > slot->candidate_restart_lsn = restart_lsn; > > > > /* our candidate can directly be used */ > > updated_lsn = true; > > } > > > > If this would have been executed in > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in > the above DEBUG messages "current candidate 1039D/83825958, current > after 1039D/8B5773D8" should be the same as proposed and after > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing > something? Oh, you're right. Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal to or greater than 1039D/8B5773D8 at that time but slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Nov 16, 2022 at 1:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski > > > <depesz@depesz.com> wrote: > > > > > > > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote: > > > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski > > > > > <depesz@depesz.com> wrote: > > > > > > > > > > > > 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18 > > > > > > 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > > > 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > > > 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0 > > > > > > 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0 > > > > > > 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > > 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > > 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > > 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > > 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38 > > > > > > > > > > It happened twice that the walsender process was active for at least > > > > > one minute while not advancing restart_lsn and exited. Could you share > > > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we > > > > > can find out what the wal sender processes were doing? > > > > > > > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully > > > > lot of logs to check if I don't leak any sensitive info. > > > > > > > > But, if I'll remove lines that are generated by walsender to pg12 > > > > replicas (streaming replication to bionic/pg12 replicas), the logfile is > > > > only ~ 3MB. Still a bit too much, as there can be some queries with > > > > sensitive info. > > > > We are primarily interested in LOGs that display lsn values. Can you try to get that or if possible remove sensitive queries information and then send the remaining LOGs? > > > > Perhaps I can grep something out? > > > > > > > > > Also, you may have logs for "failed to increase restart lsn: proposed > > > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X, > > > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so, > > > > > please share it as well. 
> > > > Found one in the time frame you mentioned:
> > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> > > >
> > >
> > > Thanks!
> > >
> > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > > slot's restart_lsn to this LSN, we remove WAL files older than
> > > 000000000001039D0000008A.
> > >
> > > In LogicalIncreaseRestartDecodingForSlot(), since
> > > "current_lsn (1039D/9170B010) <
> > > slot->data.confirmed_flush (1039D/91F41B50)", we executed the following
> > > part and called LogicalConfirmReceivedLocation():
> > >
> > >     else if (current_lsn <= slot->data.confirmed_flush)
> > >     {
> > >         slot->candidate_restart_valid = current_lsn;
> > >         slot->candidate_restart_lsn = restart_lsn;
> > >
> > >         /* our candidate can directly be used */
> > >         updated_lsn = true;
> > >     }
> > >
> >
> > If this would have been executed in
> > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> > the above DEBUG messages "current candidate 1039D/83825958, current
> > after 1039D/8B5773D8" should be the same as proposed and after
> > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am I missing
> > something?
>
> Oh, you're right.
>
> Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
> to or greater than 1039D/8B5773D8 at that time but
> slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
>

Yes, that is weird, but it would have been more obvious if the same LOG had also printed slot->data.restart_lsn. This means that the slot's 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't figure out yet how that can happen, but if it does happen, then the slot's restart_lsn can retreat in LogicalConfirmReceivedLocation(), because we don't check whether the slot's candidate_restart_lsn is less than its restart_lsn before assigning it in the line MyReplicationSlot->data.restart_lsn = MyReplicationSlot->candidate_restart_lsn;. I think that should be checked. Note that LogicalConfirmReceivedLocation() can be called from ProcessStandbyReplyMessage(), so once a wrong candidate_restart_lsn is set, it can be assigned to restart_lsn from other code paths as well.

I am not able to think of how 'candidate_restart_lsn' could be set to an LSN value prior to 'restart_lsn'. For that, we probably need more LOGs. I hope that depesz can share more information from the LOGs, as that might give us some clues.

--
With Regards,
Amit Kapila.
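A sketch of the extra check suggested above, illustrative only and not a tested fix: refuse to let data.restart_lsn move backwards even if a stale candidate slipped in, while still consuming the candidate values.

    if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
        MyReplicationSlot->candidate_restart_valid <= lsn)
    {
        Assert(MyReplicationSlot->candidate_restart_lsn != InvalidXLogRecPtr);

        /* never move restart_lsn backwards */
        if (MyReplicationSlot->candidate_restart_lsn >=
            MyReplicationSlot->data.restart_lsn)
        {
            MyReplicationSlot->data.restart_lsn =
                MyReplicationSlot->candidate_restart_lsn;
            updated_restart = true;
        }

        /* consume the candidate either way */
        MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
        MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
    }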
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > Found one in the time frame you mentioned: > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-1021:03:23 UTC,7/0,0,DEBUG,00000,"failedto increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958,current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" > > > > > > > > > > > > > Thanks! > > > > > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance > > > > slot's restart_lsn to this LSN, we remove WAL files older than > > > > 000000000001039D0000008A. > > > > > > > > In LogicalIncreaseRestartDecodingForSlot(), since > > > > "current_lsn(1039D/9170B010) < > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following > > > > part and called LogicalConfirmReceivedLocation(): > > > > > > > > else if (current_lsn <= slot->data.confirmed_flush) > > > > { > > > > slot->candidate_restart_valid = current_lsn; > > > > slot->candidate_restart_lsn = restart_lsn; > > > > > > > > /* our candidate can directly be used */ > > > > updated_lsn = true; > > > > } > > > > > > > > > > If this would have been executed in > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in > > > the above DEBUG messages "current candidate 1039D/83825958, current > > > after 1039D/8B5773D8" should be the same as proposed and after > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing > > > something? > > > > Oh, you're right. > > > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal > > to or greater than 1039D/8B5773D8 at that time but > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird. > > > > Yes, that is weird but it had been a bit obvious if the same LOG would > have printed slot->data.restart_lsn. This means that somehow slot's > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't > figure out yet how that can happen but if that happens then the slot's > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we > don't check if slot's candidate_restart_lsn is lesser than its > restart_lsn before assigning the same in line > MyReplicationSlot->data.restart_lsn = > MyReplicationSlot->candidate_restart_lsn;. I think that should be > checked. Note that we call LogicalConfirmReceivedLocation() can be > called from ProcessStandbyReplyMessage(), so once the wrong > candidate_restart_lsn is set, it can be assigned to restart_lsn from > other code paths as well. > > I am not able to think how 'candidate_restart_lsn' can be set to an > LSN value prior to 'restart_lsn'. > In the below part of the code, we use the value of 'last_serialized_snapshot' for restart_lsn. else if (txn == NULL && builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr && builder->last_serialized_snapshot != InvalidXLogRecPtr) LogicalIncreaseRestartDecodingForSlot(lsn, builder->last_serialized_snapshot); Now, say, after restart, we start reading from slot's restart_lsn which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has the last_seriealized_snapshot set to 1039D/83825958. If that happens, then in LogicalIncreaseRestartDecodingForSlot, we can set these values to slot's candidate_*_lsn variables. 
If this happens, then the next time LogicalConfirmReceivedLocation() is called, the slot's restart_lsn will be moved back. Once it has moved back, yet another restart will lead to this problem. Does this theory make sense? -- With Regards, Amit Kapila.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Wed, Nov 16, 2022 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > Found one in the time frame you mentioned: > > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-1021:03:23 UTC,7/0,0,DEBUG,00000,"failedto increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958,current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance > > > > > slot's restart_lsn to this LSN, we remove WAL files older than > > > > > 000000000001039D0000008A. > > > > > > > > > > In LogicalIncreaseRestartDecodingForSlot(), since > > > > > "current_lsn(1039D/9170B010) < > > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following > > > > > part and called LogicalConfirmReceivedLocation(): > > > > > > > > > > else if (current_lsn <= slot->data.confirmed_flush) > > > > > { > > > > > slot->candidate_restart_valid = current_lsn; > > > > > slot->candidate_restart_lsn = restart_lsn; > > > > > > > > > > /* our candidate can directly be used */ > > > > > updated_lsn = true; > > > > > } > > > > > > > > > > > > > If this would have been executed in > > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in > > > > the above DEBUG messages "current candidate 1039D/83825958, current > > > > after 1039D/8B5773D8" should be the same as proposed and after > > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing > > > > something? > > > > > > Oh, you're right. > > > > > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal > > > to or greater than 1039D/8B5773D8 at that time but > > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird. > > > > > > > Yes, that is weird but it had been a bit obvious if the same LOG would > > have printed slot->data.restart_lsn. This means that somehow slot's > > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't > > figure out yet how that can happen but if that happens then the slot's > > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we > > don't check if slot's candidate_restart_lsn is lesser than its > > restart_lsn before assigning the same in line > > MyReplicationSlot->data.restart_lsn = > > MyReplicationSlot->candidate_restart_lsn;. I think that should be > > checked. Note that we call LogicalConfirmReceivedLocation() can be > > called from ProcessStandbyReplyMessage(), so once the wrong > > candidate_restart_lsn is set, it can be assigned to restart_lsn from > > other code paths as well. > > > > I am not able to think how 'candidate_restart_lsn' can be set to an > > LSN value prior to 'restart_lsn'. > > > > In the below part of the code, we use the value of > 'last_serialized_snapshot' for restart_lsn. > else if (txn == NULL && > builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr && > builder->last_serialized_snapshot != InvalidXLogRecPtr) > LogicalIncreaseRestartDecodingForSlot(lsn, > builder->last_serialized_snapshot); > > Now, say, after restart, we start reading from slot's restart_lsn > which is 1039D/8B5773D8. 
At this LSN, we restored a snapshot that has > the last_seriealized_snapshot set to 1039D/83825958. I don't get this part. You meant that the snapshot we restored at 1039D/8B5773D8 has last_serialized_snapshot set to 1039D/83825958? I think we don't update last_serialized_snapshot when restoring the snapshot. Also, if we restore the snapshot after restart, we don't execute the above code since SnapBuildFindSnapshot() returns false, no? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Thu, Nov 17, 2022 at 6:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Nov 16, 2022 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > Found one in the time frame you mentioned: > > > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-1021:03:23 UTC,7/0,0,DEBUG,00000,"failedto increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958,current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance > > > > > > slot's restart_lsn to this LSN, we remove WAL files older than > > > > > > 000000000001039D0000008A. > > > > > > > > > > > > In LogicalIncreaseRestartDecodingForSlot(), since > > > > > > "current_lsn(1039D/9170B010) < > > > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following > > > > > > part and called LogicalConfirmReceivedLocation(): > > > > > > > > > > > > else if (current_lsn <= slot->data.confirmed_flush) > > > > > > { > > > > > > slot->candidate_restart_valid = current_lsn; > > > > > > slot->candidate_restart_lsn = restart_lsn; > > > > > > > > > > > > /* our candidate can directly be used */ > > > > > > updated_lsn = true; > > > > > > } > > > > > > > > > > > > > > > > If this would have been executed in > > > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in > > > > > the above DEBUG messages "current candidate 1039D/83825958, current > > > > > after 1039D/8B5773D8" should be the same as proposed and after > > > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing > > > > > something? > > > > > > > > Oh, you're right. > > > > > > > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal > > > > to or greater than 1039D/8B5773D8 at that time but > > > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird. > > > > > > > > > > Yes, that is weird but it had been a bit obvious if the same LOG would > > > have printed slot->data.restart_lsn. This means that somehow slot's > > > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't > > > figure out yet how that can happen but if that happens then the slot's > > > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we > > > don't check if slot's candidate_restart_lsn is lesser than its > > > restart_lsn before assigning the same in line > > > MyReplicationSlot->data.restart_lsn = > > > MyReplicationSlot->candidate_restart_lsn;. I think that should be > > > checked. Note that we call LogicalConfirmReceivedLocation() can be > > > called from ProcessStandbyReplyMessage(), so once the wrong > > > candidate_restart_lsn is set, it can be assigned to restart_lsn from > > > other code paths as well. > > > > > > I am not able to think how 'candidate_restart_lsn' can be set to an > > > LSN value prior to 'restart_lsn'. > > > > > > > In the below part of the code, we use the value of > > 'last_serialized_snapshot' for restart_lsn. 
> > else if (txn == NULL && > > builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr && > > builder->last_serialized_snapshot != InvalidXLogRecPtr) > > LogicalIncreaseRestartDecodingForSlot(lsn, > > builder->last_serialized_snapshot); > > > > Now, say, after restart, we start reading from slot's restart_lsn > > which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has > > the last_seriealized_snapshot set to 1039D/83825958. > > I don't get this part. You meant that the snapshot we restored at > 1039D/8B5773D8 has last_serialized_snapshot set to 1039D/83825958? > Yes. > I > think we don't update last_serialized_snapshot when restoring the > snapshot. >

Right, I missed the point that even though it is present in the serialized snapshot, we don't update it in the builder while restoring. So this theory won't work.

-- With Regards, Amit Kapila.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Andres Freund
Date:
Hi, On 2022-11-11 15:50:40 +0100, hubert depesz lubaczewski wrote: > Out of 8 servers that we prepared upgrade for, one failed with the > same/similar problem.

Perhaps a stupid question - are you using max_slot_wal_keep_size? And what's your wal_segment_size? And wal_keep_size?

I'm a bit confused by the changes commit c6550776394e made to KeepLogSeg() etc.

Before:

    /* then check whether slots limit removal further */
    if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    {
        XLogSegNo   slotSegNo;

        XLByteToSeg(keep, slotSegNo, wal_segment_size);

        if (slotSegNo <= 0)
            segno = 1;
        else if (slotSegNo < segno)
            segno = slotSegNo;
    }

Current:

    keep = XLogGetReplicationSlotMinimumLSN();
    if (keep != InvalidXLogRecPtr)
    {
        XLByteToSeg(keep, segno, wal_segment_size);

        /* Cap by max_slot_wal_keep_size ... */
        if (max_slot_wal_keep_size_mb >= 0)
        {
            uint64      slot_keep_segs;

            slot_keep_segs =
                ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);

            if (currSegNo - segno > slot_keep_segs)
                segno = currSegNo - slot_keep_segs;
        }
    }

Which looks to me to have lost the protection against segno being 0, which afaict would break the callers, because they do _logSegNo--. That would then allow all WAL to be reused, no? I guess the argument for changing this was that we should never have such a dangerous value returned by XLogGetReplicationSlotMinimumLSN()?

It seems decidedly not great to not log at least a debug1 (but probably it should be LOG) message when KeepLogSeg() decides to limit based on max_slot_wal_keep_size.

It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end of the checkpoint record - given that, leaving max_slot_wal_keep_size aside, we only actually remove WAL older than the segment that RedoRecPtr (the logical start of the checkpoint) is in. If the checkpoint is large we'll end up invalidating replication slots even though they would potentially only have retained one additional WAL segment.

Isn't it problematic to use ConvertToXSegs() to implement max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large wal_segment_size that'd afaict lead to much more aggressively invalidating slots.

Also, why do we do something as expensive as InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no effect?

Greetings, Andres Freund
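To make the two concerns above concrete (the lost guard against segment number 0 and the round-down of the segment budget), here is a small self-contained sketch; keep_log_seg() and its parameters are invented for the example and only imitate the shape of the backend's KeepLogSeg():

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogSegNo;

static XLogSegNo
keep_log_seg(XLogSegNo currSegNo, XLogSegNo slotSegNo,
             int64_t max_slot_wal_keep_size_mb, int wal_segment_size_mb)
{
    XLogSegNo   segno = slotSegNo;

    if (max_slot_wal_keep_size_mb >= 0)
    {
        /* integer division rounds down, the rounding issue raised above */
        XLogSegNo   slot_keep_segs =
            max_slot_wal_keep_size_mb / wal_segment_size_mb;

        if (currSegNo - segno > slot_keep_segs)
            segno = currSegNo - slot_keep_segs;
    }

    /*
     * The old guard: never report segment 0 as keepable, because the caller
     * decrements the result before recycling.
     */
    if (segno == 0)
        segno = 1;

    return segno;
}

int
main(void)
{
    /* 1GB segments with a 1536MB budget: rounded down to one kept segment */
    printf("keep from segno %llu\n",
           (unsigned long long) keep_log_seg(100, 10, 1536, 1024));
    return 0;
}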
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Andres Freund
Date:
Hi, On 2022-11-15 23:59:37 +0900, Masahiko Sawada wrote: > > Is something like the following scenario possible to happen? > > > > 1. wal sender updates slot's restart_lsn and releases the spin lock > > (not saved in the disk yet) > > 2. someone updates slots' minimum restart_lsn (note that slot's > > restart_lsn in memory is already updated). You mean ReplicationSlotsComputeRequiredLSN(), or update that specific slot's restart_lsn? The latter shouldn't happen. > > 3. checkpointer removes WAL files older than the minimum restart_lsn > > calculated at step 2. For xmin we have protection against that via the split between catalog_xmin/effective_catalog_xmin. We should probably mirror that for restart_lsn as well. We should also call ReplicationSlotsComputeRequiredLSN if only update_restart is true... > > 4. wal sender restarts for some reason (or server crashed). I don't think walsender alone restarting should change anything, but crash-restart obviously would. Greetings, Andres Freund
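To illustrate the suggested split with something runnable, here is a rough standalone sketch analogous to catalog_xmin/effective_catalog_xmin; the field and function names are invented for this example and are not backend APIs:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

typedef struct
{
    XLogRecPtr  restart_lsn;            /* newest value, may not be on disk */
    XLogRecPtr  effective_restart_lsn;  /* what WAL removal may rely on */
    bool        dirty;
} SlotSketch;

static void
slot_advance_restart(SlotSketch *s, XLogRecPtr lsn)
{
    /* advance in memory only; not yet safe for WAL removal to use */
    s->restart_lsn = lsn;
    s->dirty = true;
}

static void
slot_saved_to_disk(SlotSketch *s)
{
    /* only now may checkpoints trust the new value */
    s->effective_restart_lsn = s->restart_lsn;
    s->dirty = false;
}

int
main(void)
{
    SlotSketch  s = {100, 100, false};

    slot_advance_restart(&s, 200);
    printf("WAL removal may use %llu (not %llu yet)\n",
           (unsigned long long) s.effective_restart_lsn,
           (unsigned long long) s.restart_lsn);

    slot_saved_to_disk(&s);
    printf("after flush: %llu\n", (unsigned long long) s.effective_restart_lsn);
    return 0;
}

With such a split, a crash before the slot is flushed could no longer leave the on-disk slot pointing at WAL that was removed based on the newer in-memory value.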
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Wed, Nov 16, 2022 at 11:25:12PM -0800, Andres Freund wrote: > Perhaps a stupid question - are you using max_slot_wal_keep_size? And what's > your wal_segment_size? And wal_keep_size?

Are we talking about the pg12 side?

=> select name, setting from pg_settings where name ~ 'wal';
             name             │  setting
══════════════════════════════╪═══════════
 max_wal_senders              │ 50
 max_wal_size                 │ 49152
 min_wal_size                 │ 256
 wal_block_size               │ 8192
 wal_buffers                  │ 2048
 wal_compression              │ on
 wal_consistency_checking     │
 wal_init_zero                │ on
 wal_keep_segments            │ 1000
 wal_level                    │ logical
 wal_log_hints                │ off
 wal_receiver_status_interval │ 10
 wal_receiver_timeout         │ 60000
 wal_recycle                  │ on
 wal_retrieve_retry_interval  │ 5000
 wal_segment_size             │ 16777216
 wal_sender_timeout           │ 60000
 wal_sync_method              │ fdatasync
 wal_writer_delay             │ 200
 wal_writer_flush_after       │ 128
(20 rows)

max_slot_wal_keep_size was added in pg13, afaik, and wal_keep_size was wal_keep_segments in pg12.

Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-11-15 23:59:37 +0900, Masahiko Sawada wrote: > > > Is something like the following scenario possible to happen? > > > > > > 1. wal sender updates slot's restart_lsn and releases the spin lock > > > (not saved in the disk yet) > > > 2. someone updates slots' minimum restart_lsn (note that slot's > > > restart_lsn in memory is already updated). > > You mean ReplicationSlotsComputeRequiredLSN(), or update that specific slot's > restart_lsn? The latter shouldn't happen. I meant the former. > > > > > 3. checkpointer removes WAL files older than the minimum restart_lsn > > > calculated at step 2. > > For xmin we have protection against that via the split between > catalog_xmin/effective_catalog_xmin. We should probably mirror that for > restart_lsn as well. > > We should also call ReplicationSlotsComputeRequiredLSN if only update_restart > is true... Agree. > > > > > 4. wal sender restarts for some reason (or server crashed). > > I don't think walsender alone restarting should change anything, but > crash-restart obviously would. Right. I've confirmed this scenario is possible to happen with crash-restart. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote: > You may have the LOG for "attempting to remove WAL segments older than > log file %s", if so, please share. Also on similar lines, I think we > need to add some LOGs before calling RemoveOldXlogFiles() to know > where our computation to remove files goes off. Something, like the > attached, would be helpful but note I have added this quickly on my > working branch to show you what I have in mind. You may need small > changes based on the branch you are working on.

Hi, I tried to apply it on pg12.12, both the clean sources from https://www.postgresql.org/ftp/source/v12.12/ and the ones used to build the pgdg .deb for 12.12, and in both cases it failed. The current state of xlog.c from the deb build (which is what I have to build) is available at https://depesz.com/various/xlog.c.xz

I don't trust myself with modifying patches in C - any chance I could ask you to check why it doesn't apply, and possibly make a new one that applies on pg12? I tried to see what is going on there, but the patch contains lines that I don't see in my version of xlog.c, so I'm not sure what to do with it.

The procedure I followed to get xlog.c into this form was:

dget -x https://apt-archive.postgresql.org/pub/repos/apt/pool/main/p/postgresql-12/postgresql-12_12.1-2.pgdg100%2B1.dsc
dpkg-source -x postgresql-12_12.1-2.pgdg100%2B1.dsc
cd postgresql-12-12.1/
sudo apt build-dep .
quilt push -a

and I think that was it.

Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Andres Freund
Date:
Hi, On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote: > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > 4. wal sender restarts for some reason (or server crashed). > > > > I don't think walsender alone restarting should change anything, but > > crash-restart obviously would. > > Right. I've confirmed this scenario is possible to happen with crash-restart. Depesz, were there any crashes / immediate restarts on the PG 12 side? If so, then we know what the problem likely is and can fix it. If not... Just to confirm, the problem has "newly" occurred after you've upgraded to 12.12? I couldn't quite tell from the thread. Greetings, Andres Freund
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote: > Hi, > > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote: > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > > 4. wal sender restarts for some reason (or server crashed). > > > > > > I don't think walsender alone restarting should change anything, but > > > crash-restart obviously would. > > > > Right. I've confirmed this scenario is possible to happen with crash-restart. > > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so, > then we know what the problem likely is and can fix it. If not... > > > Just to confirm, the problem has "newly" occurred after you've upgraded to > 12.12? I couldn't quite tell from the thread. No crashes that I can find any track of. And no, it's not related to 12.12 We were using 12.9, and in process of trying to figure stuff out we upgraded to 12.12, which didn't help. Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Fri, Nov 18, 2022 at 4:57 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote: > > Hi, > > > > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote: > > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > 4. wal sender restarts for some reason (or server crashed). > > > > > > > > I don't think walsender alone restarting should change anything, but > > > > crash-restart obviously would. > > > > > > Right. I've confirmed this scenario is possible to happen with crash-restart. > > > > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so, > > then we know what the problem likely is and can fix it. If not... > > > > > > Just to confirm, the problem has "newly" occurred after you've upgraded to > > 12.12? I couldn't quite tell from the thread. > > No crashes that I can find any track of. >

I think this problem could arise when the walsender exits due to some error like "terminating walsender process due to replication timeout". Here is the theory I came up with:

1. Initially the restart_lsn is updated to 1039D/83825958. This will allow all files till 000000000001039D00000082 to be removed.
2. Next the slot->candidate_restart_lsn is updated to 1039D/8B5773D8.
3. walsender restarts due to replication timeout.
4. After restart, it starts reading WAL from 1039D/83825958 as that was restart_lsn.
5. walsender gets a message to update write, flush, apply, etc. As part of that, it invokes ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and replicationSlotMinLSN will also be computed to the same value, allowing removal of all files older than 000000000001039D0000008A. This will allow removing 000000000001039D00000083, 000000010001039D00000084, etc.
7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958. Remember from step 1, we are still reading WAL from that location.
8. At the next update for write, flush, etc., as part of processing the standby reply message, we invoke ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. This updates restart_lsn to 1039D/83825958.
9. walsender restarts due to replication timeout.
10. After restart, it starts reading WAL from 1039D/83825958 as that was restart_lsn and gets an error "requested WAL segment 000000010001039D00000083 has already been removed", because that segment was removed as part of step 6.
11. After that the walsender will keep getting the same error after every restart, as the restart_lsn never progresses.

What do you think? If this diagnosis is correct, I think we need to clear candidate_restart_lsn and friends during ReplicationSlotRelease().

-- With Regards, Amit Kapila.
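For what it's worth, the sequence above can be condensed into a tiny standalone simulation; the two functions below only imitate the relevant branches of LogicalIncreaseRestartDecodingForSlot() and LogicalConfirmReceivedLocation(), with LSNs as plain 64-bit integers, so this is an illustration of the theory rather than the backend code:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct
{
    XLogRecPtr  restart_lsn;
    XLogRecPtr  confirmed_flush;
    XLogRecPtr  candidate_restart_lsn;
    XLogRecPtr  candidate_restart_valid;
} Slot;

/* Imitates the "if (candidate_restart_valid == InvalidXLogRecPtr)" branch. */
static void
increase_restart_decoding(Slot *s, XLogRecPtr current_lsn, XLogRecPtr restart_lsn)
{
    if (s->candidate_restart_valid == InvalidXLogRecPtr)
    {
        s->candidate_restart_valid = current_lsn;
        s->candidate_restart_lsn = restart_lsn;  /* may lie behind restart_lsn */
    }
}

/* Standby reply: apply the candidate once the client flushed far enough,
 * with no comparison against the current restart_lsn. */
static void
confirm_received_location(Slot *s, XLogRecPtr flush)
{
    s->confirmed_flush = flush;
    if (s->candidate_restart_valid != InvalidXLogRecPtr &&
        s->candidate_restart_valid <= flush)
    {
        s->restart_lsn = s->candidate_restart_lsn;
        s->candidate_restart_lsn = InvalidXLogRecPtr;
        s->candidate_restart_valid = InvalidXLogRecPtr;
    }
}

int
main(void)
{
    Slot        s = {.restart_lsn = 0x1039D83825958ULL};   /* step 1 */

    /* steps 2-6: candidate 1039D/8B5773D8 is applied on the client's ack */
    increase_restart_decoding(&s, 0x1039D9170B010ULL, 0x1039D8B5773D8ULL);
    confirm_received_location(&s, 0x1039D91F41B50ULL);
    printf("restart_lsn after step 6: %llX\n", (unsigned long long) s.restart_lsn);

    /* steps 7-8: after the walsender restart, decoding from the old
     * restart_lsn re-proposes the stale 1039D/83825958 and it is accepted */
    increase_restart_decoding(&s, 0x1039D8B5773D8ULL, 0x1039D83825958ULL);
    confirm_received_location(&s, 0x1039D91F41B50ULL);
    printf("restart_lsn after step 8: %llX (moved backwards)\n",
           (unsigned long long) s.restart_lsn);
    return 0;
}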
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Andres Freund
Date:
Hi, On 2022-11-21 19:56:20 +0530, Amit Kapila wrote: > I think this problem could arise when walsender exits due to some > error like "terminating walsender process due to replication timeout". > Here is the theory I came up with: > > 1. Initially the restart_lsn is updated to 1039D/83825958. This will > allow all files till 000000000001039D00000082 to be removed. > 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8. > 3. walsender restarts due to replication timeout. > 4. After restart, it starts reading WAL from 1039D/83825958 as that > was restart_lsn. > 5. walsender gets a message to update write, flush, apply, etc. As > part of that, it invokes > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and > replicationSlotMinLSN will also be computed to the same value allowing > to remove of all files older than 000000000001039D0000008A. This will > allow removing 000000000001039D00000083, 000000010001039D00000084, > etc. This would require that the client acknowledged an LSN that we haven't sent out, no? Shouldn't the MyReplicationSlot->candidate_restart_valid <= lsn from LogicalConfirmReceivedLocation() prevented this from happening unless the client acknowledges up to candidate_restart_valid? > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958. > Remember from step 1, we are still reading WAL from that location. I don't think LogicalIncreaseRestartDecodingForSlot() would do anything in that case, because of the /* don't overwrite if have a newer restart lsn */ check. > If this diagnosis is correct, I think we need to clear > candidate_restart_lsn and friends during ReplicationSlotRelease(). Possible, but I don't quite see it yet. Greetings, Andres Freund
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Amit Kapila
Date:
On Tue, Nov 22, 2022 at 1:38 AM Andres Freund <andres@anarazel.de> wrote: > > On 2022-11-21 19:56:20 +0530, Amit Kapila wrote: > > I think this problem could arise when walsender exits due to some > > error like "terminating walsender process due to replication timeout". > > Here is the theory I came up with: > > > > 1. Initially the restart_lsn is updated to 1039D/83825958. This will > > allow all files till 000000000001039D00000082 to be removed. > > 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8. > > 3. walsender restarts due to replication timeout. > > 4. After restart, it starts reading WAL from 1039D/83825958 as that > > was restart_lsn. > > 5. walsender gets a message to update write, flush, apply, etc. As > > part of that, it invokes > > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. > > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and > > replicationSlotMinLSN will also be computed to the same value allowing > > to remove of all files older than 000000000001039D0000008A. This will > > allow removing 000000000001039D00000083, 000000010001039D00000084, > > etc. > > This would require that the client acknowledged an LSN that we haven't > sent out, no? Shouldn't the > MyReplicationSlot->candidate_restart_valid <= lsn > from LogicalConfirmReceivedLocation() prevented this from happening > unless the client acknowledges up to candidate_restart_valid? > We would have sent the higher LSN prior to the restart. Let me share a snippet of LOG which I have extracted from a large LOG depesz has shared me offlist after my request: 2022-11-10 21:03:23.478+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,4,idle,2022-11-10 21:03:23+00,7/0,0,DEBUG,00000,received replication command: IDENTIFY_SYSTEM,,,,,,,,,focal14 2022-11-10 21:03:23.479+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,5,idle,2022-11-10 21:03:23+00,7/0,0,DEBUG,00000,"received replication command: START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1', publication_names '""focal14""')",,,,,,,,,focal14 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,6,idle,2022-11-10 21:03:23+00,7/0,0,LOG,00000,"starting logical decoding for slot ""focal14""","Streaming transactions committing after 1039D/911A8DB0, reading WAL from 1039D/83825958.",,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1', publication_names '""focal14""')",,,focal14 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,7,idle,2022-11-10 21:03:23+00,7/0,0,DEBUG,00000,sending replication keepalive,,,,,,,,,focal14 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,8,idle,2022-11-10 21:03:23+00,7/0,0,LOG,00000,logical decoding found consistent point at 1039D/83825958,Logical decoding will begin using saved snapshot.,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1', publication_names '""focal14""')",,,focal14 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,9,idle,2022-11-10 21:03:23+00,7/0,0,DEBUG,00000,write 1039D/91F41B50 flush 1039D/91F41B50 apply 1039D/91F41B50 reply_time 2022-11-10 21:03:23.482052+00,,,,,,,,,focal14 2022-11-10 21:03:23.495+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,10,idle,2022-11-10 21:03:23+00,7/0,0,DEBUG,00000,updated xmin: 1 restart: 1,,,,,,,,,focal14 Here, we can see that even though we have started reading WAL from 1039D/83825958, the 
confirmed_flush location is 1039D/911A8DB0. Then we get a client reply (write 1039D/91F41B50 flush 1039D/91F41B50), probably for a keepalive message. The flush LSN here is higher than candidate_restart_valid. The value of candidate_restart_valid before the walsender restart was 1039D/9170B010 and candidate_restart_lsn was 1039D/8B5773D8. > > > > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958. > > Remember from step 1, we are still reading WAL from that location. > > I don't think LogicalIncreaseRestartDecodingForSlot() would do anything > in that case, because of the > /* don't overwrite if have a newer restart lsn */ > check. >

But later in LogicalIncreaseRestartDecodingForSlot(), we have the below check which can allow overriding it:

    /*
     * Only increase if the previous values have been applied, otherwise we
     * might never end up updating if the receiver acks too slowly. A missed
     * value here will just cause some extra effort after reconnecting.
     */
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)

Do we need "else if" here instead of "if"? I think that would save us from retreating the restart_lsn value. We seem to have "else if" in the corresponding case in the function LogicalIncreaseXminForSlot():

    /*
     * Only increase if the previous values have been applied, otherwise we
     * might never end up updating if the receiver acks too slowly.
     */
    else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)

-- With Regards, Amit Kapila.
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Tue, Nov 22, 2022 at 12:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 22, 2022 at 1:38 AM Andres Freund <andres@anarazel.de> wrote: > > > > On 2022-11-21 19:56:20 +0530, Amit Kapila wrote: > > > I think this problem could arise when walsender exits due to some > > > error like "terminating walsender process due to replication timeout". > > > Here is the theory I came up with: > > > > > > 1. Initially the restart_lsn is updated to 1039D/83825958. This will > > > allow all files till 000000000001039D00000082 to be removed. > > > 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8. > > > 3. walsender restarts due to replication timeout. > > > 4. After restart, it starts reading WAL from 1039D/83825958 as that > > > was restart_lsn. > > > 5. walsender gets a message to update write, flush, apply, etc. As > > > part of that, it invokes > > > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. > > > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and > > > replicationSlotMinLSN will also be computed to the same value allowing > > > to remove of all files older than 000000000001039D0000008A. This will > > > allow removing 000000000001039D00000083, 000000010001039D00000084, > > > etc. > > > > This would require that the client acknowledged an LSN that we haven't > > sent out, no? Shouldn't the > > MyReplicationSlot->candidate_restart_valid <= lsn > > from LogicalConfirmReceivedLocation() prevented this from happening > > unless the client acknowledges up to candidate_restart_valid? > > > > We would have sent the higher LSN prior to the restart. Let me share a > snippet of LOG which I have extracted from a large LOG depesz has > shared me offlist after my request: > > 2022-11-10 21:03:23.478+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,4,idle,2022-11-10 > 21:03:23+00,7/0,0,DEBUG,00000,received replication command: > IDENTIFY_SYSTEM,,,,,,,,,focal14 > 2022-11-10 21:03:23.479+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,5,idle,2022-11-10 > 21:03:23+00,7/0,0,DEBUG,00000,"received replication command: > START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50 > (proto_version '1', publication_names '""focal14""')",,,,,,,,,focal14 > 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,6,idle,2022-11-10 > 21:03:23+00,7/0,0,LOG,00000,"starting logical decoding for slot > ""focal14""","Streaming transactions committing after 1039D/911A8DB0, > reading WAL from 1039D/83825958.",,,,,"START_REPLICATION SLOT > ""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1', > publication_names '""focal14""')",,,focal14 > 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,7,idle,2022-11-10 > 21:03:23+00,7/0,0,DEBUG,00000,sending replication > keepalive,,,,,,,,,focal14 > 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,8,idle,2022-11-10 > 21:03:23+00,7/0,0,LOG,00000,logical decoding found consistent point at > 1039D/83825958,Logical decoding will begin using saved > snapshot.,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL > 1039D/91F41B50 (proto_version '1', publication_names > '""focal14""')",,,focal14 > 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,9,idle,2022-11-10 > 21:03:23+00,7/0,0,DEBUG,00000,write 1039D/91F41B50 flush > 1039D/91F41B50 apply 1039D/91F41B50 reply_time 2022-11-10 > 21:03:23.482052+00,,,,,,,,,focal14 > 2022-11-10 
21:03:23.495+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,10,idle,2022-11-10 > 21:03:23+00,7/0,0,DEBUG,00000,updated xmin: 1 restart: > 1,,,,,,,,,focal14 > > Here, we can see that even though we have started reading WAL from > 1039D/83825958, the confiremed_flush location is 1039D/911A8DB0. Then > we get a client reply (write 1039D/91F41B50 flush 1039D/91F41B50) > probably for a keepalive message. The flush LSN here is higher than > candidate_restart_valid. The value of candidate_restart_valid before > walsender restart was 1039D/9170B010 and candidate_restart_lsn was > 1039D/8B5773D8. > > > > > > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958. > > > Remember from step 1, we are still reading WAL from that location. > > > > I don't think LogicalIncreaseRestartDecodingForSlot() would do anything > > in that case, because of the > > /* don't overwrite if have a newer restart lsn */ > > check. > > > > But later in LogicalIncreaseRestartDecodingForSlot(), we have the > below check which can allow to override it. > /* > * Only increase if the previous values have been applied, otherwise we > * might never end up updating if the receiver acks too slowly. A missed > * value here will just cause some extra effort after reconnecting. > */ > if (slot->candidate_restart_valid == InvalidXLogRecPtr) I also think we did this override-path. After updating the restart_lsn we accept the given values as candidate_restart_lsn and candidate_restart_valid, which were 1039D/83825958 and 1039D/8B5773D8. Analyzing the server logs that depesz shared also with me offlist, this scenario basically makes sense to me. But please note that 000000000001039D0000008A and older files were removed while there was no running wal sender. The wal sender who restarted decoding from 1039D/83825958 (ie step 4) exited due to wal_sender_timeout at 21:04:56, and no walsender was launched until 000000000001039D0000008A or older files were removed at 21:06:07. I guess the scenario needs to change a bit. 6. The restart_lsn was updated to 1039D/8B5773D8 and replicationSlotMinLSN was also computed to the same value. But WAL files were not removed at this point because there was no checkpoint between 6 and 7. This was around 21:03:23. 7. We got new slot->candidate_restart_lsn as 1039D/83825958 and candidate_restart_valid as 1039D/8B5773D8. This can happen for the reason that Amit mentioned above. 8. The restart_lsn was updated to 1039D/83825958, but we didn't update replicationSlotMinLSN since slot->data.catalog_xmin was not updated at the same time (see LogicalConfirmReceivedLocation()). And the walsender exited again due to wal_sender_timeout. At this point, note that slot's restart_lsn was 1039D/83825958 whereas replicationSlotMinLSN was 1039D/8B5773D8. This was around 21:04:56. 9. Until 000000000001039D0000008A or older files were removed at 21:06:07, a client had checked the pg_replication_slots view every minute but it had showed 1039D/83825958 in the restart_lsn column. On the other hand, the checkpointer removed the WAL file segment files 000000000001039D0000008A or older at 21:06:07, based on replicationSlotMinLSN ( 1039D/8B5773D8). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
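A minimal standalone sketch of the asymmetry in step 8 above (restart_lsn updated, but the global minimum used by the checkpointer not recomputed because xmin did not change); all names here are stand-ins for illustration, not the backend's:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr slot_restart_lsn;
static XLogRecPtr global_min_required_lsn;   /* what the checkpointer trusts */

static void
confirm_received_location(XLogRecPtr new_restart_lsn, bool updated_xmin)
{
    slot_restart_lsn = new_restart_lsn;

    /* mirrors the behaviour described above: the global minimum is only
     * recomputed on the xmin path */
    if (updated_xmin)
        global_min_required_lsn = slot_restart_lsn;
}

int
main(void)
{
    slot_restart_lsn = global_min_required_lsn = 0x1039D8B5773D8ULL;

    /* restart_lsn retreats to 1039D/83825958 while xmin stays unchanged */
    confirm_received_location(0x1039D83825958ULL, false);

    printf("slot needs %llX, checkpointer keeps WAL from %llX\n",
           (unsigned long long) slot_restart_lsn,
           (unsigned long long) global_min_required_lsn);
    return 0;
}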
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
Hi, just checking - has there been any progress on diagnosing/fixing the bug? Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
Hi, On Thu, Dec 8, 2022 at 8:13 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > Hi, > just checking - has there been any progress on diagnosing/fixing the > bug? Sorry for the late response. Based on the analysis we did[1][2], I've created the manual scenario to reproduce this issue with the attached patch and the script. The scenario.md explains the basic steps to reproduce this issue. It consists of 13 steps (very tricky!!). It's not sophisticated and could be improved. test.sh is the shell script I used to execute the reproduction steps from 1 to 10. In my environment, I could reproduce this issue by the following steps. 1. apply the patch and build PostgreSQL. 2. run test.sh. 3. execute the step 11 and later described in scenario.md. The test.sh is a very hacky and dirty script and is optimized in my environment (especially adding many sleeps). You might need to adjust it while checking scenario.md. I've also confirmed that this issue is fixed by the attached patch, which clears candidate_restart_lsn and friends during ReplicationSlotRelease(). [1] https://www.postgresql.org/message-id/CAA4eK1JvyWHzMwhO9jzPquctE_ha6bz3EkB3KE6qQJx63StErQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAD21AoBHMCEDcV4eBtSVvDrCN4SrMXanX5N9%2BL-E%2B4OWXYY0ew%40mail.gmail.com Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
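For readers without the attachment: based on the description above, the fix presumably amounts to something like the following in ReplicationSlotRelease(); this is a guess at its shape, not the actual attached patch:

    SpinLockAcquire(&slot->mutex);
    slot->candidate_catalog_xmin = InvalidTransactionId;
    slot->candidate_xmin_lsn = InvalidXLogRecPtr;
    slot->candidate_restart_lsn = InvalidXLogRecPtr;
    slot->candidate_restart_valid = InvalidXLogRecPtr;
    SpinLockRelease(&slot->mutex);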
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
On Mon, Feb 06, 2023 at 05:25:42PM +0900, Masahiko Sawada wrote: > Based on the analysis we did[1][2], I've created the manual scenario > to reproduce this issue with the attached patch and the script. > > The scenario.md explains the basic steps to reproduce this issue. It > consists of 13 steps (very tricky!!). It's not sophisticated and could > be improved. test.sh is the shell script I used to execute the > reproduction steps from 1 to 10. In my environment, I could reproduce > this issue by the following steps. > > 1. apply the patch and build PostgreSQL. > 2. run test.sh. > 3. execute the step 11 and later described in scenario.md. > > The test.sh is a very hacky and dirty script and is optimized in my > environment (especially adding many sleeps). You might need to adjust > it while checking scenario.md. > > I've also confirmed that this issue is fixed by the attached patch, > which clears candidate_restart_lsn and friends during > ReplicationSlotRelease(). Hi, one important question - do I patch newer Pg, or older? The thing is that we were able to replicate the problem (with some luck) only on production databases, and patching them will be hard sell. Maybe possible, but if it's enough to patch the pg14 (recipient) it would make my life much easier. Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Mon, Feb 6, 2023 at 8:15 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Mon, Feb 06, 2023 at 05:25:42PM +0900, Masahiko Sawada wrote: > > Based on the analysis we did[1][2], I've created the manual scenario > > to reproduce this issue with the attached patch and the script. > > > > The scenario.md explains the basic steps to reproduce this issue. It > > consists of 13 steps (very tricky!!). It's not sophisticated and could > > be improved. test.sh is the shell script I used to execute the > > reproduction steps from 1 to 10. In my environment, I could reproduce > > this issue by the following steps. > > > > 1. apply the patch and build PostgreSQL. > > 2. run test.sh. > > 3. execute the step 11 and later described in scenario.md. > > > > The test.sh is a very hacky and dirty script and is optimized in my > > environment (especially adding many sleeps). You might need to adjust > > it while checking scenario.md. > > > > I've also confirmed that this issue is fixed by the attached patch, > > which clears candidate_restart_lsn and friends during > > ReplicationSlotRelease(). > > Hi, > one important question - do I patch newer Pg, or older? The thing is > that we were able to replicate the problem (with some luck) only on > production databases, and patching them will be hard sell. Maybe > possible, but if it's enough to patch the pg14 (recipient) it would make > my life much easier. Unfortunately, the patch I attached is for the publisher (i.e., sender side). There might be a way to fix this issue from the receiver side but I have no idea for now. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Alvaro Herrera
Date:
On 2023-Feb-06, Masahiko Sawada wrote: > I've also confirmed that this issue is fixed by the attached patch, > which clears candidate_restart_lsn and friends during > ReplicationSlotRelease(). Hmm, interesting -- I was studying some other bug recently involving the xmin of a slot that had been invalidated and I remember wondering if these "candidate" fields were being properly ignored when the slot is marked not in use; but I didn't check. Are you sure that resetting them when the slot is released is the appropriate thing to do? I mean, shouldn't they be kept set while the slot is in use, and only reset if we destroy it? (But, actually, in that case, maybe we don't need to reset them: instead we need to make sure to ignore the values for slots that are not in_use. However, I don't know what the semantics are for these "candidate" fields, so maybe this is wrong.) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Masahiko Sawada
Date:
On Tue, Feb 7, 2023 at 12:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Feb-06, Masahiko Sawada wrote: > > > I've also confirmed that this issue is fixed by the attached patch, > > which clears candidate_restart_lsn and friends during > > ReplicationSlotRelease(). > > Hmm, interesting -- I was studying some other bug recently involving the > xmin of a slot that had been invalidated and I remember wondering if > these "candidate" fields were being properly ignored when the slot is > marked not in use; but I didn't check. Are you sure that resetting them > when the slot is released is the appropriate thing to do? I mean, > shouldn't they be kept set while the slot is in use, and only reset if > we destroy it?

Thinking about the root cause more, it seems to me that the root cause is not that the candidate_xxx values are not cleared when the slot is released. In the scenario I reproduced, after restarting the logical decoding, the walsender sets the restart_lsn to a candidate_restart_lsn left in the slot upon receiving the ack from the subscriber. Then, when decoding a RUNNING_XACTS record, the walsender updates the restart_lsn, which actually retreats it. The patch, which clears the candidate_xxx values at release time, fixes this issue by not updating the restart_lsn upon receiving the ack in this case, but I think it is legitimate for the walsender to set the restart_lsn to the candidate_restart_lsn left in the slot in this case. The root cause here seems to me to be that we allow the restart_lsn to retreat.

In LogicalIncreaseRestartDecodingForSlot(), we have the following code:

    /* don't overwrite if have a newer restart lsn */
    if (restart_lsn <= slot->data.restart_lsn)
    {
    }

    /*
     * We might have already flushed far enough to directly accept this lsn,
     * in this case there is no need to check for existing candidate LSNs
     */
    else if (current_lsn <= slot->data.confirmed_flush)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;

        /* our candidate can directly be used */
        updated_lsn = true;
    }

    /*
     * Only increase if the previous values have been applied, otherwise we
     * might never end up updating if the receiver acks too slowly. A missed
     * value here will just cause some extra effort after reconnecting.
     */
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
        SpinLockRelease(&slot->mutex);
    }

If slot->candidate_restart_valid is InvalidXLogRecPtr, the candidate_restart_lsn could be set to an LSN lower than the slot's restart_lsn, in spite of the first comment saying "don't overwrite if have a newer restart lsn". So I think the right fix here is to not allow the restart_lsn to retreat, by changing the "if" of the third block to "else if".

> > (But, actually, in that case, maybe we don't need to reset them: instead > we need to make sure to ignore the values for slots that are not in_use. > However, I don't know what the semantics are for these "candidate" > fields, so maybe this is wrong.)

IIUC in that case, the WAL segments are removed while the slot is in_use. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
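Schematically, the proposed change turns the three blocks quoted above into one if / else if chain, so a proposal rejected by the first branch ("don't overwrite if have a newer restart lsn") can no longer be picked up by the last branch; this is an untested sketch of the idea, not a committed patch:

    if (restart_lsn <= slot->data.restart_lsn)
    {
        /* don't overwrite if we already have a newer restart lsn */
    }
    else if (current_lsn <= slot->data.confirmed_flush)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;

        /* our candidate can directly be used */
        updated_lsn = true;
    }
    else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        /* only increase if the previous values have been applied */
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
    }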
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
hubert depesz lubaczewski
Date:
Hi, so, we have another bit of interesting information. Maybe related, maybe not. We noticed a weird situation on two clusters we're trying to upgrade. In both cases the situation looked the same:

1. there was another process (debezium) connected to the source (pg12) using logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR: requested WAL segment ... has already been '
3. some time afterwards (most likely a couple of hours) the process that is/was responsible for the debezium replication (pg process) stopped handling WAL, and is instead eating 100% of CPU.

When this situation happens, we can't pg_cancel_backend(pid) the "broken" wal sender, and it also can't be pg_terminate_backend()! strace of the process doesn't show anything. When I tried to get a backtrace from gdb, all I got was:

(gdb) bt
#0 0x0000aaaad270521c in hash_seq_search ()
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4 0x0000aaaad257764c in ReorderBufferCommit ()
#5 0x0000aaaad256c804 in ?? ()
#6 0x0000aaaaf303d280 in ?? ()

If I quit gdb, restart it, and redo bt, I get

#0 0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad291ae58 in ?? ()

or

#0 0x0000aaaad2705244 in hash_seq_search ()
#1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4 0x0000aaaad257764c in ReorderBufferCommit ()
#5 0x0000aaaad256c804 in ?? ()
#6 0x0000aaaaf303d280 in ?? ()

At this moment, the only thing that we can do is kill -9 the process (or restart pg). I don't know if it's relevant, but I have this case *right now*, and if it's helpful I can provide more information before we have to kill it.

Best regards, depesz
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Kyotaro Horiguchi
Date:
This seems to have a thin connection to the issue, but. At Wed, 16 Nov 2022 23:25:12 -0800, Andres Freund <andres@anarazel.de> wrote in > I'm a bit confused by the changes the changes c6550776394e made to > KeepLogSeg() etc. > > Before: > /* then check whether slots limit removal further */ > if (max_replication_slots > 0 && keep != InvalidXLogRecPtr) > { > XLogSegNo slotSegNo; > > XLByteToSeg(keep, slotSegNo, wal_segment_size); > > if (slotSegNo <= 0) > segno = 1; > else if (slotSegNo < segno) > segno = slotSegNo; > } > > Current: > keep = XLogGetReplicationSlotMinimumLSN(); > if (keep != InvalidXLogRecPtr) > { > XLByteToSeg(keep, segno, wal_segment_size); > > /* Cap by max_slot_wal_keep_size ... */ > if (max_slot_wal_keep_size_mb >= 0) > { > uint64 slot_keep_segs; > > slot_keep_segs = > ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size); > > if (currSegNo - segno > slot_keep_segs) > segno = currSegNo - slot_keep_segs; > } > } > > Which looks to me to have lost the protection against segno being 0. Which > afaict would break the callers, because they do _logSegNo--. Which would then > allow all WAL to be reused, no? I guess the argument for changing this was > that we should never have such a dangerous value returned by > XLogGetReplicationSlotMinimumLSN()? Mmm. Right. If XLogGetReplicationSlotMinimumLSN() returned an LSN within the zeroth segment, segno ends up with being 0, which is dangerous if happens. However, if it happens, slots are evidently already broken, but there's no cost to have that protection. > It seems decidedly not great to not log at least a debug1 (but probably it > should be LOG) message when KeepLogSeg() decides to limit based on > max_slot_wal_keep_size. It's easy to do that, but that log is highly accompanied by a LOG line "terminating process %d to release replication slot \"%s\"". I don't mind adding it if it is a DEBUGx. > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside, > only actually remove WAL if older than the segment that RedoRecPtr (the > logical start of the checkpoint) is in. If the checkpoint is large we'll end > up removing replication slots even though they potentially would only have > retained one additional WAL segment. I think that it is a controversial part, but that variable is defined the similar way to wal_keep_size. And I think that all max_wal_size, wal_keep_size and max_slot_wal_keep_size being defined with the same base LSN makes things simpler for users (also for developers). Regardless of checkpoint length, if slots get frequently invalidated, the setting should be considered to be too small for the system requirements. > Isn't it problematic to use ConvertToXSegs() to implement > max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large > wal_segment_size that'd afaict lead to being much more aggressive invalidating > slots. I think max_slot_wal_keep_size is, like max_wal_size for checkpoints, a safeguard for slots not to fill-up WAL directory. Thus they both are rounded down. If you have 1GB WAL directory and set wal_segment_size to 4192MB, I don't see it a sane setup. But if segment size is smaller than one hundredth of max_wal_size, that difference won't matter for anyone. But anyway, it's a pain in the a.. that the both of them (and wal_keep_size) don't work in a immediate manner, though.. 
> Also, why do we do something as expensive as > InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no > effect? Indeed. Maybe we didn't regard that process as complex at start? I think we can compare the cutoff segno against XLogGetReplicationSlotMinimumLSN() before entering the loop over slots. Thus I think there's room for the following improvements. - Prevent KeepLogSeg from returning 0. - Add DEBUG log to KeepLogSeg emitted when max_slot_wal_keep_size affects. - Check against minimum slot LSN before actually examining through the slots in Invalidateobsoletereplicationslots. I'm not sure about the second item but the others seem back-patchable. If we need to continue further discussion, will need another thread. Anyway I'll come up with the patch for the above three items. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
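For the third item, a sketch of what such a pre-check might look like before the per-slot loop in InvalidateObsoleteReplicationSlots(); the helper below is invented for illustration, only XLogGetReplicationSlotMinimumLSN() and XLogSegNoOffsetToRecPtr() are existing backend facilities:

static bool
invalidation_possibly_needed(XLogSegNo oldestSegno)
{
    XLogRecPtr  oldestLSN;
    XLogRecPtr  slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN();

    XLogSegNoOffsetToRecPtr(oldestSegno, 0, wal_segment_size, oldestLSN);

    /* only walk the slots if at least one of them can be behind the cutoff */
    return slotsMinReqLSN != InvalidXLogRecPtr && slotsMinReqLSN < oldestLSN;
}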
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From
Andres Freund
Date:
Hi, On 2023-02-13 15:45:49 +0900, Kyotaro Horiguchi wrote: > This seems to have a thin connection to the issue, but. I was worried that the changes could lead us to removing WAL without max_slot_wal_keep_size set. > > It seems decidedly not great to not log at least a debug1 (but probably it > > should be LOG) message when KeepLogSeg() decides to limit based on > > max_slot_wal_keep_size. > > It's easy to do that, but that log is highly accompanied by a LOG line > "terminating process %d to release replication slot \"%s\"". I don't > mind adding it if it is a DEBUGx. My problem with that is that we might *NOT* see those log messages for some reason, but that that's impossible to debug as-is. And even if we see them, it's not that easy to figure out by how much we were over max_slot_wal_keep_size, because we always report it in the context of a specific slot. Removing WAL that's still needed is a *very* severe operation. Emitting an additional line in case it happens isn't a problem. > > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end > > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside, > > only actually remove WAL if older than the segment that RedoRecPtr (the > > logical start of the checkpoint) is in. If the checkpoint is large we'll end > > up removing replication slots even though they potentially would only have > > retained one additional WAL segment. > > I think that it is a controversial part, but that variable is defined > the similar way to wal_keep_size. And I think that all max_wal_size, > wal_keep_size and max_slot_wal_keep_size being defined with the same > base LSN makes things simpler for users (also for developers). > Regardless of checkpoint length, if slots get frequently invalidated, > the setting should be considered to be too small for the system > requirements. I think it's bad that we define wal_keep_size, max_slot_wal_keep_size that way. I don't think bringing max_wal_size into this is useful, as it influences different things. > > Isn't it problematic to use ConvertToXSegs() to implement > > max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large > > wal_segment_size that'd afaict lead to being much more aggressive invalidating > > slots. > > I think max_slot_wal_keep_size is, like max_wal_size for checkpoints, > a safeguard for slots not to fill-up WAL directory. Thus they both are > rounded down. If you have 1GB WAL directory and set wal_segment_size > to 4192MB, I don't see it a sane setup. But if segment size is smaller > than one hundredth of max_wal_size, that difference won't matter for > anyone. But anyway, it's a pain in the a.. that the both of them (and > wal_keep_size) don't work in a immediate manner, though.. It doesn't matter a lot for 16MB segments, but with 1GB segments it's a different story. To me the way it's done now is a bug, one that can in extreme circumstances lead to data loss. > > Also, why do we do something as expensive as > > InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no > > effect? > > Indeed. Maybe we didn't regard that process as complex at start? I > think we can compare the cutoff segno against > XLogGetReplicationSlotMinimumLSN() before entering the loop over > slots. That'd be better, but I'd probably go further, and really gate it on max_slot_wal_keep_size having had an effect. > Thus I think there's room for the following improvements. > > - Prevent KeepLogSeg from returning 0. 
> > - Add DEBUG log to KeepLogSeg emitted when max_slot_wal_keep_size affects. > > - Check against minimum slot LSN before actually examining through the > slots in Invalidateobsoletereplicationslots. > > I'm not sure about the second item but the others seem back-patchable. > > If we need to continue further discussion, will need another > thread. Anyway I'll come up with the patch for the above three items. Yep, probably a good idea to start another thread. There's also https://www.postgresql.org/message-id/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de that unfortunately nobody replied to. Greetings, Andres Freund