Thread: WAL segments removed from primary despite the fact that logical replication slot needs it.

WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
Hi,
We are still in the process of preparing upgrades from Ubuntu Bionic with Pg
12 to Focal with Pg 14.

And today (well, yesterday) we hit something that looks like a bug.

Jumping to the problem: WAL files needed by logical replication were
removed from the primary, despite the slot being there.

Now, a bit more details.

Our replication is set up using the following method (a rough sketch of the commands follows the list):

1. We create all tables and primary keys on pg14.
2. A replication slot is created, with no tables.
3. We add some tables to the replication set, and wait for all of them to
   switch state to "ready".
4. Rinse and repeat until all tables are in the replication set.
5. Once all is ready, we add the rest of the schema (indexes, constraints,
   functions, fkeys, and so on).
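
For context, the method above boils down to commands along these lines (a rough
sketch only; the names here are illustrative, not our actual scripts):

#v+
-- on the pg12 side: empty publication first, then tables in batches
CREATE PUBLICATION focal14;
ALTER PUBLICATION focal14 ADD TABLE schema_name.table_name;

-- on the pg14 side: the subscription creates the slot on the primary;
-- refresh it after each batch of tables is added to the publication
CREATE SUBSCRIPTION focal14
    CONNECTION 'host=bionic-primary dbname=dbname user=upgrayedd'
    PUBLICATION focal14;
ALTER SUBSCRIPTION focal14 REFRESH PUBLICATION;

-- wait until every table reports state 'r' (ready) before the next batch
SELECT srrelid::regclass, srsubstate
  FROM pg_subscription_rel
 WHERE srsubstate <> 'r';
#v-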

This works. Worked?

Yesterday we saw something that broke the replication.

First data from bionic side:
when                    | currentLSN    | slot_name | plugin   | slot_type | datoid | database | temporary | active | active_pid |  xmin  | catalog_xmin |  restart_lsn  | confirmed_flush_lsn
2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743431068   | CA09/BE23BC70 | CA09/C82B2A58
2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743451605   | CA09/C9DA89B0 | CA09/D3FA9898
2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743479469   | CA09/DD32CE50 | CA09/E65E7F80
2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743516374   | CA09/EDD61CB8 | CA09/F3A856D8
2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743635890   | CA0A/2157B8F8 | CA0A/2A4C48F8
2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743686305   | CA0A/38BE5AB0 | CA0A/3D2B0C20
2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 29207      | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
2022-10-04 14:34:30 UTC | CA10/3B78E10  | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
 

As you can see, around 2:20 pm replication stopped fetching data.
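
Purely as an illustration of what that looks like: restart_lsn stays at CA0A/49ED91C8
while pg_current_wal_lsn() keeps advancing, so a query along these lines (a sketch,
not our actual monitoring) shows how much WAL the primary has to keep around for the slot:

#v+
SELECT slot_name,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;
#v-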

At this time, the pg14 logs don't show any problems whatsoever.
pg_restore was doing its work, creating indexes and running some queries
like:
ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;

At 2:20:48pm there was:
2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0

But these errors happened earlier too (minutes/hours earlier).

But then, 3 minutes later I got:

2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0


The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
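
For reference, that restart_lsn points into exactly the segment the error complains
about; a quick sanity check (assuming timeline 1, which is what the file names show):

#v+
SELECT pg_walfile_name('CA0A/49ED91C8');
-- expected: 000000010000CA0A00000049
#v-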

Unfortunately I don't know the content of pg_wal on bionic at the
moment, but currently I see:

$ ls -l pg_wal/| head
total 297534552
-rw------- 1 postgres postgres      349 Feb 11  2022 000000010000A1410000001B.00C81918.backup
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004D
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004E
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004F
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000050
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000051
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000052
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000053
-rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000054

So, 4 files are missing.

These were archived properly, and I tried to restore them from the archive and put
them back in pg_wal, but even then pg12 was rejecting attempts to connect to the focal14
slot with the same message about "requested WAL segment
000000010000CA0A00000049 has already been removed".

I know that at this moment we're hosed. This replication was removed and
everything cleared (after all, we had already accumulated 300 GB of WAL, and
we can't extend the FS forever, so we needed to clean up the situation).
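
For completeness, the teardown amounted to something along these lines (a sketch of
the standard commands, not the exact script we ran):

#v+
-- on pg14: detach the subscription from the now-unusable remote slot
ALTER SUBSCRIPTION focal14 DISABLE;
ALTER SUBSCRIPTION focal14 SET (slot_name = NONE);
DROP SUBSCRIPTION focal14;

-- on pg12: drop the slot so the retained WAL can finally be recycled
SELECT pg_drop_replication_slot('focal14');
#v-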

But, the way I look at it, Pg12 shouldn't have removed these 4 WAL files,
and the fact that it did means that either we did something wrong (what,
though?), or there is some kind of bug in Pg.

Does anyone have any idea how to investigate/fix that?

Best regards,

depesz




On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> Hi,
> We are still in process of preparing upgrades from Ubuntu Bionic with Pg
> 12 to Focal with Pg 14.
>
> And today (well, yesterday) we hit something that looks like a bug.
>
> Jumping to problem - wal files needed by logical replication were
> removed from primary, despite slot being there.
>
> Now, a bit more details.
>
> Our replication is setup using following method:
>
> 1. We create all tables and primary keys on pg14
> 2. replication slot is created, with no tables
> 3. we add some tables to replication slot, and wait for all tables to
>    switch state to "ready"
> 4. rinse and repeat until all tables are in replication set
> 5. once all is ready, we add rest of schema (indexes, constraints,
>    functions, fkeys, and so on).
>
> This works. Worked?
>
> Yesterday we saw something that broke the replication.
>
> First data from bionic side:
> when                    | currentLSN    | slot_name | plugin   | slot_type | datoid | database | temporary | active | active_pid |  xmin  | catalog_xmin |  restart_lsn  | confirmed_flush_lsn
> 2022-10-04 14:13:49 UTC | CA0E/273F2180 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743431068   | CA09/BE23BC70 | CA09/C82B2A58
> 2022-10-04 14:14:48 UTC | CA0E/3B5FF5C8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743451605   | CA09/C9DA89B0 | CA09/D3FA9898
> 2022-10-04 14:15:47 UTC | CA0E/4F7C06F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743479469   | CA09/DD32CE50 | CA09/E65E7F80
> 2022-10-04 14:16:46 UTC | CA0E/6CA820A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743516374   | CA09/EDD61CB8 | CA09/F3A856D8
> 2022-10-04 14:17:45 UTC | CA0E/8C7F3950 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743635890   | CA0A/2157B8F8 | CA0A/2A4C48F8
> 2022-10-04 14:18:44 UTC | CA0E/A5D874F8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743686305   | CA0A/38BE5AB0 | CA0A/3D2B0C20
> 2022-10-04 14:19:43 UTC | CA0E/C3F6C240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 6084       | \N     | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
> 2022-10-04 14:20:42 UTC | CA0E/D7DBA590 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743749741   | CA0A/49ED91C8 | CA0A/4F6D3048
> 2022-10-04 14:21:41 UTC | CA0E/F65C0F48 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | t      | 29207      | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:22:40 UTC | CA0F/103DBE98 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:23:39 UTC | CA0F/22D791A8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:24:38 UTC | CA0F/32E81380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:25:38 UTC | CA0F/41E4A380 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:26:37 UTC | CA0F/58FD1B30 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:27:36 UTC | CA0F/6F732698 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:28:35 UTC | CA0F/855E2F50 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:29:34 UTC | CA0F/94B36450 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:30:33 UTC | CA0F/A7553BF8 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:31:33 UTC | CA0F/BB10F020 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:32:32 UTC | CA0F/D1B4DF10 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:33:31 UTC | CA0F/EE332240 | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
> 2022-10-04 14:34:30 UTC | CA10/3B78E10  | focal14   | pgoutput | logical   | 16609  | dbname   | f         | f      | \N         | \N     | 3743760874   | CA0A/49ED91C8 | CA0A/55987F30
>
> As you can see around 2:20pm replication stopped fetching data.
>

Around that time, it seems the slot is also marked inactive probably
due to walsender exit.

> At this time, in pg14 logs don't show any problems what so ever.
> Pg_restore was doing it's work with create indexes, and some queries
> like:
> ALTER TABLE ONLY schema_name.table_name REPLICA IDENTITY USING INDEX index_table_name_replica_identity;
>
> At 2:20:48pm there was:
> 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
>
> But these errors happened earlier too (minutes/hours earlier).
>
> But then, 3 minutes later I got:
>
> 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.",,,,,,,,,"","logical replication worker",,0
> 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
> 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
> 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
> 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
>
>
> The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
>

This is quite strange and I am not able to see the reason why this can
happen. The only way to debug this problem that comes to mind is to
add more LOGS around the code that removes the WAL files. For example,
we can try to print the value of the minimum slot LSN (keep) and logSegNo in
KeepLogSeg().

> Unfortunately I don't know the content of pg_wal on bionic at the
> moment, but currently I see:
>
> $ ls -l pg_wal/| head
> total 297534552
> -rw------- 1 postgres postgres      349 Feb 11  2022 000000010000A1410000001B.00C81918.backup
> -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004D
> -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004E
> -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A0000004F
> -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000050
> -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000051
> -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000052
> -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000053
> -rw------- 1 postgres postgres 16777216 Oct  4 13:29 000000010000CA0A00000054
>
> So, 4 files are missing.
>
> These were archived properly, and I tried to restore them from archive, and put
> them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> slot with the same message about "requested WAL segment
> 000000010000CA0A00000049 has already been removed"
>

I think you are seeing this behavior because we update the
lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
then we use that to give the error you are seeing.

> I know that at this moment we're hosed. This replication was removed,
> everything cleared (after all, we accumulated already 300G of wal, and
> we can't extend the FS forever, so we needed to cleanup the situation).
>
> But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
> and the fact that it did means that either we did something wrong (what,
> though?), or there is some kind of bug in Pg.
>
> Anyone has any idea on how to investigate/fix that?
>

As suggested above, I feel we need to add some LOGs to see what is
going on, unless you have a reproducible test case.

--
With Regards,
Amit Kapila.



At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> > At 2:20:48pm there was:
> > 2022-10-04 14:20:48.579 UTC,,,59744,,633c3fac.e960,62,,2022-10-04 14:14:04 UTC,8/6519966,7015376,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,"","logical replication worker",,0
> >
> > But these errors happened earlier too (minutes/hours earlier).
> >
> > But then, 3 minutes later I got:
> >
> > 2022-10-04 14:23:10.618 UTC,,,63569,,633c4140.f851,38,,2022-10-04 14:20:48 UTC,8/0,0,ERROR,08006,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected
> > server closed the connection unexpectedly
> >         This probably means the server terminated abnormally
> >         before or while processing the request.",,,,,,,,,"","logical replication worker",,0
> > 2022-10-04 14:23:10.623 UTC,,,64701,,633c41ce.fcbd,1,,2022-10-04 14:23:10 UTC,8/6529701,0,LOG,00000,"logical replication apply worker for subscription ""focal14"" has started",,,,,,,,,"","logical replication worker",,0
> > 2022-10-04 14:23:10.624 UTC,,,1801,,633aa5ea.709,41,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 63569) exited with exit code 1",,,,,,,,,"","postmaster",,0
> > 2022-10-04 14:23:10.643 UTC,,,64701,,633c41ce.fcbd,2,,2022-10-04 14:23:10 UTC,8/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000CA0A00000049 has already been removed",,,,,,,,,"","logical replication worker",,0
> > 2022-10-04 14:23:10.646 UTC,,,1801,,633aa5ea.709,42,,2022-10-03 09:05:46 UTC,,0,LOG,00000,"background worker ""logical replication worker"" (PID 64701) exited with exit code 1",,,,,,,,,"","postmaster",,0
> >
> >
> > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> >
> 
> This is quite strange and I am not able to see the reason why this can
> happen. The only way to debug this problem that comes to mind is to
> add more LOGS around the code that removes the WAL files. For example,
> we can try to print the value of minimumslotLSN (keep) and logSegNo in
> KeepLogSeg().

What I can see from the information at hand is that the walsender was
killed (it did not die on its own) twice for some reason, and then some
"reserved" WAL files were lost.  I don't see the reason for the second
walsender (pid 29207) being killed, but if the "has already been removed"
log line above is the first one, the file removal probably happened at
14:23:10.  (The log is needed to know that.)

Anyway, I couldn't come up with a possibility other than someone else
having removed those files.  For example, virus-scanning software.  If
any such software kills processes that are holding files it has
identified as malicious and is about to remove, could that happen here?

> > So, 4 files are missing.
> >
> > These were archived properly, and I tried to restore them from archive, and put
> > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> > slot with the same message about "requested WAL segment
> > 000000010000CA0A00000049 has already been removed"
> >
> 
> I think you are seeing this behavior because we update the
> lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
> then we use that to give the error you are seeing.

lastRemovedSegNo is updated once per removed segment.  Four files were
lost in this case.


> > I know that at this moment we're hosed. This replication was removed,
> > everything cleared (after all, we accumulated already 300G of wal, and
> > we can't extend the FS forever, so we needed to cleanup the situation).
> >
> > But, the way I look at it, Pg12 shouldn't be removing these 4 wal files,
> > and the fact that it did means that either we did something wrong (what,
> > though?), or there is some kind of bug in Pg.
> >
> > Anyone has any idea on how to investigate/fix that?
> >
> 
> As suggested above, I feel we need to add some LOGS to see what is
> going on unless you have a reproducible testcase.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> This is quite strange and I am not able to see the reason why this can
> happen. The only way to debug this problem that comes to mind is to
> add more LOGS around the code that removes the WAL files. For example,
> we can try to print the value of minimumslotLSN (keep) and logSegNo in
> KeepLogSeg().

That would require changing pg sources, I think, recompiling, and
retrying?

And on the pg12 side? I don't think we can do it, but I'll ask boss.

> > Anyone has any idea on how to investigate/fix that?
> As suggested above, I feel we need to add some LOGS to see what is
> going on unless you have a reproducible testcase.

The problem is that we were trying to use this approach to upgrade a pg12
production db to pg14. Modifying it (pg12) will be tricky, as it would
require a restart. Unless there is a way to add extra logging without
a restart...

depesz



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Mon, Oct 17, 2022 at 03:53:07PM +0900, Kyotaro Horiguchi wrote:
> Anyway, I didn't come up with a possibility other than of someone else
> having removed those files..  For example, virus scan softwares.  If
> any such software kills processes that is grabbing the files that the
> software identified as malicious and is going to remove, could that
> happen?

We don't use virus scanning there. The pg12 server is on Ubuntu 18.04,
used for a normal production workload, and pg14 is on Ubuntu Focal, with
no load aside from physical streaming replication to 2 pg14 replicas.

Best regards,

depesz




On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
> > > So, 4 files are missing.
> > >
> > > These were archived properly, and I tried to restore them from archive, and put
> > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> > > slot with the same message about "requested WAL segment
> > > 000000010000CA0A00000049 has already been removed"
> > >
> >
> > I think you are seeing this behavior because we update the
> > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
> > then we use that to give the error you are seeing.
>
> lastRemovedSegNo is updated once per one segment of removal.  Four
> files are lost in this case.
>

I didn't understand your response. I was saying that one possible
reason why the error appears even after restoring files from the archive
is the lastRemovedSegNo-related check in the function CheckXLogRemoved(),
and we update that value while removing old xlog files. From this
behavior, it appears that somehow the server itself has removed those
files, even though the reason is not clear yet.

-- 
With Regards,
Amit Kapila.



On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> > > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> > This is quite strange and I am not able to see the reason why this can
> > happen. The only way to debug this problem that comes to mind is to
> > add more LOGS around the code that removes the WAL files. For example,
> > we can try to print the value of minimumslotLSN (keep) and logSegNo in
> > KeepLogSeg().
>
> That would require changing pg sources, I think, recompiling, and
> retrying?
>

Yes. BTW, are you on the latest release of pg12? If not, you can
check the release notes to see if there is any related bug fix in
the later releases.

Is this problem reproducible? If so, can you find out why there are
repeated connection issues between the walsender and walreceiver?

-- 
With Regards,
Amit Kapila.



At Tue, 18 Oct 2022 16:51:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
> > > > So, 4 files are missing.
> > > >
> > > > These were archived properly, and I tried to restore them from archive, and put
> > > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> > > > slot with the same message about "requested WAL segment
> > > > 000000010000CA0A00000049 has already been removed"
> > > >
> > >
> > > I think you are seeing this behavior because we update the
> > > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
> > > then we use that to give the error you are seeing.
> >
> > lastRemovedSegNo is updated once per one segment of removal.  Four
> > files are lost in this case.
> >
> 
> I didn't understand your response. I was saying the one possible
> reason why even after restoring files from the archive the error
> appears is because of the lastRemovedSegNo related check in function
> CheckXLogRemoved() and we update its value while removing old xlog
> files. From this behavior, it appears that somehow the server has only
> removed those files even though the reason is not clear yet.

I meant that if PostgreSQL did something wrong (that I don't
understand at all) there, the difference between lastRemovedSegNo and
the last actually removed segment won't be more than 1.

CheckXLogRemoved warns about a segment that has logically already been
removed but could still be physically opened.  On the other hand,
WalSndSegmentOpen emits the same error, before CheckXLogRemoved would
warn, if the segment is actually ENOENT, regardless of the value of
lastRemovedSegNo.

One point that bothers me is that it seems the walsender was
killed. If the file is removed while the walsender is working, the logical
replication worker receives the error and emits "ERROR: could not
receive data...has been removed" instead of suddenly disconnecting as
in this case.  Considering this together with the possibility that the
segments were removed by someone else, I suspected virus scanners, but
that turned out to be false.  (I still don't know of any virus
scanner that kills processes holding a suspected-malicious file.)


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Wed, Oct 19, 2022 at 7:31 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
>
> One point that bothers me is it seems that the walsender was
> killed. If the file is removed while the walsender is working, logical
> replication worker receives the error and emits "ERROR: could not
> receive data...has been removed" instead of suddenly disconnected as
> in this case.  Considering this together with the possibility that the
> segments were removed by someone else, I suspected virus scanners but
> that has found to be false..
>

Yeah, as requested yesterday, maybe checking if this can be reproduced
will help us to narrow down the cause. If this doesn't happen due to
an application mistake, it appears that, after the slot becomes inactive
while the walsender is down, the server removes the required
files; but there is nothing in the code that suggests that such a thing
can happen.

Are there any more logs on publisher/subscriber which indicate any
sort of error?


-- 
With Regards,
Amit Kapila.



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Wed, Oct 19, 2022 at 03:16:59PM +0530, Amit Kapila wrote:
> Yeah, as requested yesterday, maybe checking if this can be reproduced
> will help us to narrow down the cause. If this doesn't happen due to
> an application mistake it appears that after the slot becomes inactive
> during the time walsender was down, the server removes the required
> files but there is nothing in the code that suggests that such a thing
> can happen.
> Are there any more logs on publisher/subscriber which indicate any
> sort of error?

This specific replication setup was torn down; we couldn't keep it
running. I can set it up once more, and with some luck it will break again.
It will take a day, though.

Is there anything I can do to make it more debuggable that will not
require a restart of pg12?

Best regards,

depesz




Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
> On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> > > > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> > > This is quite strange and I am not able to see the reason why this can
> > > happen. The only way to debug this problem that comes to mind is to
> > > add more LOGS around the code that removes the WAL files. For example,
> > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
> > > KeepLogSeg().
> >
> > That would require changing pg sources, I think, recompiling, and
> > retrying?
> >
> 
> Yes. BTW, are you on the latest release of pg12, if not then you can
> once check the release notes to see if there is any related bug fix in
> the later releases?
> 
> Is this problem reproducible? If so, can you find out why there are
> multiple time connection issues between walsender and walreceiver?

I can try to redo it, but before I do - is there anything I could do to either
side of the replication to increase our chances of figuring out the
underlying problem? I can't restart pg12, though.

Best regards,

depesz




On Wed, Oct 19, 2022 at 3:50 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
> > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> > > > > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> > > > This is quite strange and I am not able to see the reason why this can
> > > > happen. The only way to debug this problem that comes to mind is to
> > > > add more LOGS around the code that removes the WAL files. For example,
> > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
> > > > KeepLogSeg().
> > >
> > > That would require changing pg sources, I think, recompiling, and
> > > retrying?
> > >
> >
> > Yes. BTW, are you on the latest release of pg12, if not then you can
> > once check the release notes to see if there is any related bug fix in
> > the later releases?
> >
> > Is this problem reproducible? If so, can you find out why there are
> > multiple time connection issues between walsender and walreceiver?
>
> I can try to redo it, but before I do - anything I could do to either
> side of replication to increase our chances at figuring out the
> underlying problem? I can't restart pg12, though.
>

One idea is to change log level to DEBUG2 so that we can print which
files are removed by the server via
RemoveOldXlogFiles()
{
...
elog(DEBUG2, "attempting to remove WAL segments older than log file %s",
lastoff);
...
}

If we can do this then at the very least we can know whether the
required files are removed by the server or by some external
application.
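
If changing the log level on the running pg12 is the concern, note that
log_min_messages can be changed with a reload, so a restart shouldn't be
needed; roughly (as a superuser, just a sketch):

ALTER SYSTEM SET log_min_messages = 'debug2';
SELECT pg_reload_conf();
-- and once enough has been captured:
ALTER SYSTEM RESET log_min_messages;
SELECT pg_reload_conf();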

-- 
With Regards,
Amit Kapila.



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Wed, Oct 19, 2022 at 04:14:28PM +0530, Amit Kapila wrote:
> One idea is to change log level to DEBUG2 so that we can print which
> files are removed by the server via
> RemoveOldXlogFiles()
> {
> ...
> elog(DEBUG2, "attempting to remove WAL segments older than log file %s",
> lastoff);
> ...
> }
> 
> If we can do this then at the very least we can know whether the
> required files are removed by the server or by some external
> application.

That should be doable. It will take some time to set up the test env and run
it. Will get back to you on this.

Best regards,

depesz




Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
> On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> > > > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> > > This is quite strange and I am not able to see the reason why this can
> > > happen. The only way to debug this problem that comes to mind is to
> > > add more LOGS around the code that removes the WAL files. For example,
> > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
> > > KeepLogSeg().
> >
> > That would require changing pg sources, I think, recompiling, and
> > retrying?
> >
> 
> Yes. BTW, are you on the latest release of pg12, if not then you can
> once check the release notes to see if there is any related bug fix in
> the later releases?

We're not. We're on 12.9, and there seem to be some WAL-related changes
since then:
https://why-upgrade.depesz.com/show?from=12.9&to=12.12&keywords=wal*

I will talk with the responsible people about upgrading it, and then redo
the test.

Best regards,

depesz




Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Wed, Oct 19, 2022 at 01:11:06PM +0200, hubert depesz lubaczewski wrote:
> On Tue, Oct 18, 2022 at 04:57:52PM +0530, Amit Kapila wrote:
> > On Mon, Oct 17, 2022 at 2:43 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > On Sun, Oct 16, 2022 at 10:35:17AM +0530, Amit Kapila wrote:
> > > > > The WAL file has been removed. Please note that the file was, as shown earlier, still within "restart_lsn" as visible on pg12/bionic.
> > > > This is quite strange and I am not able to see the reason why this can
> > > > happen. The only way to debug this problem that comes to mind is to
> > > > add more LOGS around the code that removes the WAL files. For example,
> > > > we can try to print the value of minimumslotLSN (keep) and logSegNo in
> > > > KeepLogSeg().
> > >
> > > That would require changing pg sources, I think, recompiling, and
> > > retrying?
> > >
> > 
> > Yes. BTW, are you on the latest release of pg12, if not then you can
> > once check the release notes to see if there is any related bug fix in
> > the later releases?
> 
> We're not. We're on 12.9, and there seem to be some wal-related changes
> since then:
> https://why-upgrade.depesz.com/show?from=12.9&to=12.12&keywords=wal*
> 
> will talk with responsible people about upgrading it, and then redoing
> the test.

Hi,
just an update. Due to procedures we will upgrade to 12.12 at the
beginning of November. Then, I'll try to replicate the problem again,
and we'll see how this will pan out.

For the next test I will do my best to use the debug2 log level.

Best regards,

depesz




Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
> Hi,
> just an update. Due to procedures we will upgrade to 12.12 at the
> beginning of November. Then, I'll try to replicate the problem again,
> and we'll see how this will pan out.
> For next test will do my best to use debug2 log level.

So. I have an update.

Out of the 8 servers that we prepared the upgrade for, one failed with the
same/similar problem.
Unfortunately I don't know if this is the same server as before. Sorry.

Specifically, I see now, in logs on focal/pg14 side:

#v+
2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

On source (bionic/pg12), I see:

#v+
=# select * from pg_replication_slots ;
 slot_name │  plugin  │ slot_type │ datoid │ database │ temporary │ active │ active_pid │  xmin  │ catalog_xmin │  restart_lsn   │ confirmed_flush_lsn
═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
 focal14   │ pgoutput │ logical   │  16607 │ canvas   │ f         │ f      │     [null] │ [null] │   3241443528 │ 1039D/83825958 │ 1039D/96453F38
(1 row)
#v-

but pg, like previously, lost some wal files, this time 8 of them:

#v+
=$ sudo ls -l ~postgres/12/main/pg_wal | head
total 224403644
-rw------- 1 postgres postgres      349 Feb 11  2022 000000010000BDEF000000FF.00045F68.backup
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
-rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
...
#v-

This time we had debug2 enabled on both sides.

Looks like we lost segments:
000000010001039D00000083, 000000010001039D00000084,
000000010001039D00000085, 000000010001039D00000086,
000000010001039D00000087, 000000010001039D00000088,
000000010001039D00000089, and 000000010001039D0000008A.

It looks like the first time the replica logged the error was:

#v+
2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

In Pg logs on source I see:

#v+
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10 21:06:16 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10 21:06:21 UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10 21:06:26 UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names '""focal14""')",,,"focal14"
#v-

So, the whole set of missing files was recycled at the same time.

One more thing that I have is a copy of pg_replication_slots info. This data is taken periodically, around every
1 minute. Each line, aside from the pg_replication_slots data, also has the output of pg_current_wal_lsn() and
a timestamp. Looks like this:

#v+
timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
2022-11-10 20:50:04 UTC  103A6/C49D36F8      focal14    pgoutput  logical    16607   canvas    f          t       16794       \N    3241375739    1039D/65365F68  1039D/782908E8
2022-11-10 20:51:03 UTC  103A6/CC957148      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241375739    1039D/65365F68  1039D/782908E8
2022-11-10 20:52:02 UTC  103A6/D7348A30      focal14    pgoutput  logical    16607   canvas    f          t       21140       \N    3241422387    1039D/65365F68  1039D/78C94588
2022-11-10 20:53:01 UTC  103A6/EB24F408      focal14    pgoutput  logical    16607   canvas    f          t       21140       \N    3241422387    1039D/65365F68  1039D/7C8E7D78
2022-11-10 20:54:01 UTC  103A7/27A4778       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/65365F68  1039D/7C8E7D78
2022-11-10 20:55:00 UTC  103A7/1A635CD0      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/78833568  1039D/7F8BCD50
2022-11-10 20:55:59 UTC  103A7/2D5E6EE8      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/78833568  1039D/818B6990
2022-11-10 20:56:58 UTC  103A7/44548000      focal14    pgoutput  logical    16607   canvas    f          t       27607       \N    3241422387    1039D/7E4CDF90  1039D/89E93620
2022-11-10 20:57:57 UTC  103A7/579F8000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/7E4CDF90  1039D/89E93620
2022-11-10 20:58:56 UTC  103A7/6B6C96D0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241422387    1039D/7E4CDF90  1039D/89E93620
2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/8DECBA18
2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t       10414       \N    3241434855    1039D/83825958  1039D/911A8DB0
2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241434855    1039D/83825958  1039D/911A8DB0
2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/955633D0
2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t       21748       \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:09:46 UTC  103A8/233202E0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-10 21:10:45 UTC  103A8/2FBC9720      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
#v-
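
For completeness, each of those snapshot lines comes from a periodic query along
these lines (a sketch; the actual collection script may differ):

#v+
SELECT now() AS "timestamp", pg_current_wal_lsn(), s.*
  FROM pg_replication_slots s;
#v-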

Now, I can keep this situation up for some time, if more debugging would be
helpful, but I don't know for how long (disk space issues).

Is there anything in here that helps? Any other info I can dig out?

Best regards,

depesz




Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Fri, Nov 11, 2022 at 03:50:40PM +0100, hubert depesz lubaczewski wrote:
> #v+
> 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
> #v-

Sooo... plot thickens.

Without any changes, manual rebuild or anything, yesterday, the problem
seems to have solved itself?!

In logs on focal/pg14 I see:

#v+
2022-11-12 20:55:39.190 UTC,,,1897563,,6370084b.1cf45b,2,,2022-11-12 20:55:39 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
#v-

And this is *the last* such message.

On bionic/pg12 we have in logs from pg_replication_slots:

#v+
timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active  active_pid  xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
2022-11-12 20:51:00 UTC  1041E/D3A0E540      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-12 20:51:59 UTC  1041E/D89B6000      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-12 20:52:58 UTC  1041E/E0547450      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-12 20:53:58 UTC  1041E/E59634F0      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-12 20:54:57 UTC  1041E/EBB50DE8      focal14    pgoutput  logical    16607   canvas    f          f       \N          \N    3241443528    1039D/83825958  1039D/96453F38
2022-11-12 20:55:55 UTC  1041E/FBBC3160      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241450490    1039D/9170B010  1039D/9B86EAF0
2022-11-12 20:56:55 UTC  1041F/FF34000       focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241469432    1039D/A21B4598  1039D/A928A6D0
2022-11-12 20:57:54 UTC  1041F/277FAE30      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241480448    1039D/AD4C7A00  1039D/BA3FC840
2022-11-12 20:58:53 UTC  1041F/319A0000      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241505462    1039D/C5C32398  1039D/DF899428
2022-11-12 20:59:52 UTC  1041F/3A399688      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3241527824    1039D/F3393280  1039E/AD9A740
...
2022-11-13 08:00:44 UTC  1043F/E464B738      focal14    pgoutput  logical    16607   canvas    f          t       18626       \N    3292625412    1043F/E0F38E78  1043F/E4609628
#v-

I have no idea what has changed or why.

If it helps I can provide logs, but would prefer to do it off list.

Also, while the problem with this cluster is "solved", I still have
like 300 other clusters to upgrade, and at least 1 has hit the same
problem today.

Best regards,

depesz





On Fri, Nov 11, 2022 at 8:20 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Thu, Oct 20, 2022 at 01:55:16PM +0200, hubert depesz lubaczewski wrote:
> > Hi,
> > just an update. Due to procedures we will upgrade to 12.12 at the
> > beginning of November. Then, I'll try to replicate the problem again,
> > and we'll see how this will pan out.
> > For next test will do my best to use debug2 log level.
>
> So. I have an update.
>
> Out of 8 servers that we prepared upgrade for, one failed with the
> same/similar problem.
> Unfortunately I don't know if this is the same server as before. Sorry.
>
> Specifically, I see now, in logs on focal/pg14 side:
>
> #v+
> 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
> #v-
>
> On source (bionic/pg12), I see:
>
> #v+
> =# select * from pg_replication_slots ;
>  slot_name │  plugin  │ slot_type │ datoid │ database │ temporary │ active │ active_pid │  xmin  │ catalog_xmin │  restart_lsn   │ confirmed_flush_lsn
> ═══════════╪══════════╪═══════════╪════════╪══════════╪═══════════╪════════╪════════════╪════════╪══════════════╪════════════════╪═════════════════════
>  focal14   │ pgoutput │ logical   │  16607 │ canvas   │ f         │ f      │     [null] │ [null] │   3241443528 │ 1039D/83825958 │ 1039D/96453F38
> (1 row)
> #v-
>
> but pg, like previously, lost some wal files, this time 8 of them:
>
> #v+
> =$ sudo ls -l ~postgres/12/main/pg_wal | head
> total 224403644
> -rw------- 1 postgres postgres      349 Feb 11  2022 000000010000BDEF000000FF.00045F68.backup
> -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008B
> -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008C
> -rw------- 1 postgres postgres 16777216 Nov 10 19:08 000000010001039D0000008D
> ...
> #v-
>
> This time we had debug2 enabled on both sides.
>
> Looks like we lost segments:
> 000000010001039D00000083, 000000010001039D00000084,
> 000000010001039D00000085, 000000010001039D00000086,
> 000000010001039D00000087, 000000010001039D00000088,
> 000000010001039D00000089, and 000000010001039D0000008A.
>
> It looks that first time replica logged the error was:
>
> #v+
> 2022-11-10 21:06:16.427 UTC,,,572364,,636d67c8.8bbcc,2,,2022-11-10 21:06:16 UTC,6/0,0,ERROR,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been removed",,,,,,,,,"","logical replication worker",,0
> #v-
>
> In Pg logs on source I see:
>
> #v+
> 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000082""",,,,,,,,,"" 
> 2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000083""",,,,,,,,,"" 
> 2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000084""",,,,,,,,,"" 
> 2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000085""",,,,,,,,,"" 
> 2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000086""",,,,,,,,,"" 
> 2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000087""",,,,,,,,,"" 
> 2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000088""",,,,,,,,,"" 
> 2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D00000089""",,,,,,,,,"" 
> 2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-aheadlog file ""000000010001039D0000008A""",,,,,,,,,"" 
> 2022-11-10 21:06:16.426 UTC,"upgrayedd","canvas",28608,"10.1.238.101:45328",636d67c8.6fc0,8,"idle",2022-11-10
21:06:16UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,"START_REPLICATIONSLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names
'""focal14""')",,,"focal14"
> 2022-11-10 21:06:21.456 UTC,"upgrayedd","canvas",28815,"10.1.238.101:52316",636d67cd.708f,8,"idle",2022-11-10
21:06:21UTC,147/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,"START_REPLICATIONSLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names
'""focal14""')",,,"focal14"
> 2022-11-10 21:06:26.488 UTC,"upgrayedd","canvas",28962,"10.1.238.101:52320",636d67d2.7122,8,"idle",2022-11-10
21:06:26 UTC,148/0,0,ERROR,58P01,"requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/9754F6F8 (proto_version '1', publication_names
'""focal14""')",,,"focal14"
> #v-
>
> So, the whole set of missing files was recycled at the same time.
>
> One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1
minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp.
Looks like this: 
>
> #v+
> timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active
active_pid xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn 
...
...
> 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241434855    1039D/83825958  1039D/911A8DB0 
> 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0 
...

I think from the above two it is clear that the slot with restart_lsn
1039D/83825958 is active, and yet we seem to have recycled the corresponding
segment.
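
(To spell out the arithmetic: assuming the default 16MB wal_segment_size and
timeline 1, a quick standalone check that mirrors what the XLByteToSeg() and
XLogFileName() macros do maps that restart_lsn onto exactly the file the
subscriber keeps asking for.)

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        const uint64_t seg_size = 16 * 1024 * 1024;             /* default wal_segment_size */
        const uint64_t segs_per_id = UINT64_C(0x100000000) / seg_size; /* 256 */
        uint64_t restart_lsn = (UINT64_C(0x1039D) << 32) | UINT64_C(0x83825958);
        uint64_t segno = restart_lsn / seg_size;                /* what XLByteToSeg() computes */

        /* same "%08X%08X%08X" layout as the XLogFileName() macro, timeline 1 */
        printf("%08X%08X%08X\n", 1U,
               (unsigned) (segno / segs_per_id),
               (unsigned) (segno % segs_per_id));
        /* prints 000000010001039D00000083 -- the segment the subscriber reports as removed */
        return 0;
    }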

>
> Now, I can keep this situation up for some time, if more debugging would be helpful,
> but I don't know for how long (disk space issues).
>
> Is there anything in here that helps? Any other info I can dig out?
>

You may have the LOG for "attempting to remove WAL segments older than
log file %s", if so, please share. Also on similar lines, I think we
need to add some LOGs before calling RemoveOldXlogFiles() to know
where our computation of which files to remove goes wrong. Something
like the attached would be helpful, but note I have added this quickly
on my working branch to show you what I have in mind. You may need
small changes based on the branch you are working on.
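
To make that concrete, here is a rough sketch (not the attached patch; the
variable names are recalled from the PG 12 xlog.c and may need adjusting for
your branch) of the kind of LOG that could go into CreateCheckPoint() and
CreateRestartPoint() just before the RemoveOldXlogFiles() call:

    /*
     * Hypothetical extra logging, placed right before RemoveOldXlogFiles();
     * RedoRecPtr, recptr and _logSegNo are assumed to be the names used in
     * that part of xlog.c.
     */
    elog(LOG, "pre-removal state: redo %X/%X, end %X/%X, slot minimum %X/%X, cutoff segno " UINT64_FORMAT,
         (uint32) (RedoRecPtr >> 32), (uint32) RedoRecPtr,
         (uint32) (recptr >> 32), (uint32) recptr,
         (uint32) (XLogGetReplicationSlotMinimumLSN() >> 32),
         (uint32) XLogGetReplicationSlotMinimumLSN(),
         _logSegNo);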

--
With Regards,
Amit Kapila.

Attachment

Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
> > #v+
> > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
 
...
> > #v-
> >
> > So, the whole set of missing files was recycled at the same time.
> >
> > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1
minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp.
Looks like this:
 
> >
> > #v+
> > timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active
active_pid xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
 
> ...
> ...
> > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0
 
> ...
> 
> I think from the above two it is clear that the slot with restart_lsn
> 1039D/83825958  is active and we seem to have recycled corresponding
> segment.
> 
> You may have the LOG for "attempting to remove WAL segments older than
> log file %s", if so, please share. Also on similar lines, I think we

Absolutely.

There is something weird happening:

#v+
2022-11-10 20:01:07.520 UTC,,,28925,,636378d6.70fd,4488667,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039B000000D9",,,,,,,,,""
 
2022-11-10 20:06:07.359 UTC,,,28925,,636378d6.70fd,4496840,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039B000000F8",,,,,,,,,""
 
2022-11-10 20:11:07.522 UTC,,,28925,,636378d6.70fd,4505099,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039C00000027",,,,,,,,,""
 
2022-11-10 20:16:07.466 UTC,,,28925,,636378d6.70fd,4513331,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039C00000054",,,,,,,,,""
 
2022-11-10 20:21:07.572 UTC,,,28925,,636378d6.70fd,4521466,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039C0000008F",,,,,,,,,""
 
2022-11-10 20:26:07.384 UTC,,,28925,,636378d6.70fd,4529518,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039C000000B9",,,,,,,,,""
 
2022-11-10 20:31:07.619 UTC,,,28925,,636378d6.70fd,4537868,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039C000000F8",,,,,,,,,""
 
2022-11-10 20:36:07.532 UTC,,,28925,,636378d6.70fd,4546079,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000002B",,,,,,,,,""
 
2022-11-10 20:41:07.603 UTC,,,28925,,636378d6.70fd,4554468,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D00000052",,,,,,,,,""
 
2022-11-10 20:46:07.609 UTC,,,28925,,636378d6.70fd,4562376,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000005F",,,,,,,,,""
 
2022-11-10 20:51:07.513 UTC,,,28925,,636378d6.70fd,4570523,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000005F",,,,,,,,,""
 
2022-11-10 20:56:07.727 UTC,,,28925,,636378d6.70fd,4578553,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D00000070",,,,,,,,,""
 
2022-11-10 21:01:07.509 UTC,,,28925,,636378d6.70fd,4586721,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D00000082",,,,,,,,,""
 
2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:11:07.402 UTC,,,28925,,636378d6.70fd,4602683,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:16:07.837 UTC,,,28925,,636378d6.70fd,4610577,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:21:07.571 UTC,,,28925,,636378d6.70fd,4618490,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:26:07.559 UTC,,,28925,,636378d6.70fd,4626365,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:31:07.649 UTC,,,28925,,636378d6.70fd,4634592,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:36:07.737 UTC,,,28925,,636378d6.70fd,4642829,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:41:07.278 UTC,,,28925,,636378d6.70fd,4650983,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:46:07.670 UTC,,,28925,,636378d6.70fd,4658973,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:51:07.579 UTC,,,28925,,636378d6.70fd,4666946,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:56:07.618 UTC,,,28925,,636378d6.70fd,4674922,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
#v-

If I add the "recycled" lines, we get:

#v+
...
2022-11-10 21:01:07.509 UTC,,,28925,,636378d6.70fd,4586721,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D00000082",,,,,,,,,""
 
2022-11-10 21:01:07.512 UTC,,,28925,,636378d6.70fd,4586722,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000071""",,,,,,,,,""
 
2022-11-10 21:01:07.514 UTC,,,28925,,636378d6.70fd,4586723,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000072""",,,,,,,,,""
 
2022-11-10 21:01:07.515 UTC,,,28925,,636378d6.70fd,4586724,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000073""",,,,,,,,,""
 
2022-11-10 21:01:07.517 UTC,,,28925,,636378d6.70fd,4586725,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000074""",,,,,,,,,""
 
2022-11-10 21:01:07.517 UTC,,,28925,,636378d6.70fd,4586726,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000075""",,,,,,,,,""
 
2022-11-10 21:01:07.519 UTC,,,28925,,636378d6.70fd,4586727,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000076""",,,,,,,,,""
 
2022-11-10 21:01:07.520 UTC,,,28925,,636378d6.70fd,4586728,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000077""",,,,,,,,,""
 
2022-11-10 21:01:07.521 UTC,,,28925,,636378d6.70fd,4586729,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000078""",,,,,,,,,""
 
2022-11-10 21:01:07.522 UTC,,,28925,,636378d6.70fd,4586730,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000079""",,,,,,,,,""
 
2022-11-10 21:01:07.523 UTC,,,28925,,636378d6.70fd,4586731,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D0000007A""",,,,,,,,,""
 
2022-11-10 21:01:07.524 UTC,,,28925,,636378d6.70fd,4586732,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D0000007B""",,,,,,,,,"" 
2022-11-10 21:01:07.524 UTC,,,28925,,636378d6.70fd,4586733,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D0000007C""",,,,,,,,,""
 
2022-11-10 21:01:07.525 UTC,,,28925,,636378d6.70fd,4586734,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D0000007D""",,,,,,,,,""
 
2022-11-10 21:01:07.527 UTC,,,28925,,636378d6.70fd,4586735,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D0000007E""",,,,,,,,,""
 
2022-11-10 21:01:07.528 UTC,,,28925,,636378d6.70fd,4586736,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D0000007F""",,,,,,,,,""
 
2022-11-10 21:01:07.529 UTC,,,28925,,636378d6.70fd,4586737,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000080""",,,,,,,,,""
 
2022-11-10 21:01:07.530 UTC,,,28925,,636378d6.70fd,4586738,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000081""",,,,,,,,,""
 
2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
 
2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:06:07.523 UTC,,,28925,,636378d6.70fd,4594903,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000083""",,,,,,,,,""
 
2022-11-10 21:06:07.524 UTC,,,28925,,636378d6.70fd,4594904,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000084""",,,,,,,,,""
 
2022-11-10 21:06:07.525 UTC,,,28925,,636378d6.70fd,4594905,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000085""",,,,,,,,,""
 
2022-11-10 21:06:07.526 UTC,,,28925,,636378d6.70fd,4594906,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000086""",,,,,,,,,""
 
2022-11-10 21:06:07.527 UTC,,,28925,,636378d6.70fd,4594907,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000087""",,,,,,,,,""
 
2022-11-10 21:06:07.528 UTC,,,28925,,636378d6.70fd,4594908,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000088""",,,,,,,,,""
 
2022-11-10 21:06:07.529 UTC,,,28925,,636378d6.70fd,4594909,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000089""",,,,,,,,,""
 
2022-11-10 21:06:07.530 UTC,,,28925,,636378d6.70fd,4594910,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D0000008A""",,,,,,,,,""
 
2022-11-10 21:11:07.402 UTC,,,28925,,636378d6.70fd,4602683,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:16:07.837 UTC,,,28925,,636378d6.70fd,4610577,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:21:07.571 UTC,,,28925,,636378d6.70fd,4618490,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:26:07.559 UTC,,,28925,,636378d6.70fd,4626365,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:31:07.649 UTC,,,28925,,636378d6.70fd,4634592,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:36:07.737 UTC,,,28925,,636378d6.70fd,4642829,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:41:07.278 UTC,,,28925,,636378d6.70fd,4650983,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:46:07.670 UTC,,,28925,,636378d6.70fd,4658973,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:51:07.579 UTC,,,28925,,636378d6.70fd,4666946,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
2022-11-10 21:56:07.618 UTC,,,28925,,636378d6.70fd,4674922,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
#v-

> need to add some LOGs before calling RemoveOldXlogFiles() to know
> where our computation to remove files goes off. Something, like the
> attached, would be helpful but note I have added this quickly on my
> working branch to show you what I have in mind. You may need small
> changes based on the branch you are working on.

That will be complicated.

Changing pg (recompile and rerun) requires careful preparation with our
customers, which takes time.

Additionally, we are using precompiled binaries from pgdg, which makes the
process slightly more involved.

And to top it off, as I mailed in the other message - the problem doesn't
always happen. For the 8 clusters that I tried to prepare, the problem happened
on two, and on one of them (this one, actually) it kinda solved itself on its own?!

I know I'm asking a lot, but is there any chance we could set up
a conference/shared-screen session to try to get as much data as possible
out of the situations we have now? Of course, that would be paid
consulting. If you're OK with it, please mail me off list so we can
arrange it.

Best regards,

depesz




On Mon, Nov 14, 2022 at 5:47 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
> > > #v+
> > > 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
 
> ...
> > > #v-
> > >
> > > So, the whole set of missing files was recycled at the same time.
> > >
> > > One more thing that I have is copy of pg_replication_slots info. This data is taken periodically, around every 1
minute. Each line, aside from pg_replication_slots slots data has also output of pg_current_wal_lsn() and timestamp.
Looks like this:
 
> > >
> > > #v+
> > > timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active
active_pid xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn
 
> > ...
> > ...
> > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0
 
> > ...
> >
> > I think from the above two it is clear that the slot with restart_lsn
> > 1039D/83825958  is active and we seem to have recycled corresponding
> > segment.
> >
> > You may have the LOG for "attempting to remove WAL segments older than
> > log file %s", if so, please share. Also on similar lines, I think we
>
> Absolutely.
>
> There is something weird happening:
>

What exactly do you find weird in this? To me, it appears as if the
system, for some reason, ignores an existing slot that has
restart_lsn 1039D/83825958.

>
> If I'll add "recycled" lines we will get:
>
> #v+
> ...
> 2022-11-10 21:01:07.531 UTC,,,28925,,636378d6.70fd,4586739,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"recycled
write-ahead log file ""000000010001039D00000082""",,,,,,,,,""
 
> 2022-11-10 21:06:07.521 UTC,,,28925,,636378d6.70fd,4594902,,2022-11-03 08:16:22 UTC,,0,DEBUG,00000,"attempting to
remove WAL segments older than log file 000000000001039D0000008A",,,,,,,,,""
 
>

Actually, the last LOG is what I was not expecting, given the
existence of a corresponding slot. This is what we have to find out:
how has it computed the value "000000000001039D0000008A" for the file
to be removed?
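
(For anyone trying to decode that file name: paraphrasing the PG 12
checkpoint code from memory rather than quoting it verbatim, the name is
built roughly as below, using a dummy timeline of 0 -- which is why the
message says 000000000001039D0000008A while the real files start with
00000001. And, as the logs above show, the named file itself gets recycled
along with everything before it.)

    /* condensed paraphrase of CreateCheckPoint() + RemoveOldXlogFiles() */
    char        lastoff[MAXFNAMELEN];

    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);  /* segment of the redo pointer */
    KeepLogSeg(recptr, &_logSegNo);     /* retreat for wal_keep_segments and slots */
    _logSegNo--;                        /* last segment that may be removed/recycled */

    XLogFileName(lastoff, 0, _logSegNo, wal_segment_size); /* note the dummy timeline 0 */
    elog(DEBUG2, "attempting to remove WAL segments older than log file %s", lastoff);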

> > need to add some LOGs before calling RemoveOldXlogFiles() to know
> > where our computation to remove files goes off. Something, like the
> > attached, would be helpful but note I have added this quickly on my
> > working branch to show you what I have in mind. You may need small
> > changes based on the branch you are working on.
>
> That will be complicated.
>
> Changing pg (recompile, and rerun) requires careful preparation with our
> customers. Which takes time.
>
> Additionally we are using precompiled binaries from pgdg, that makes the
> process slightly more involved.
>

I can understand but OTOH it is not easy to proceed without more information.

-- 
With Regards,
Amit Kapila.



On Sun, Nov 13, 2022 at 1:36 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Fri, Nov 11, 2022 at 03:50:40PM +0100, hubert depesz lubaczewski wrote:
> > #v+
> > 2022-11-11 12:45:26.432 UTC,,,994963,,636e43e6.f2e93,2,,2022-11-11 12:45:26 UTC,6/0,0,ERROR,08P01,"could not
receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,,,,"","logical replication worker",,0 
> > #v-
>
> Sooo... plot thickens.
>
> Without any changes, manual rebuild or anything, yesterday, the problem
> seems to have solved itself?!
>
> In logs on focal/pg14 I see:
>
> #v+
> 2022-11-12 20:55:39.190 UTC,,,1897563,,6370084b.1cf45b,2,,2022-11-12 20:55:39 UTC,6/0,0,ERROR,08P01,"could not
receive data from WAL stream: ERROR:  requested WAL segment 000000010001039D00000083 has already been
removed",,,,,,,,,"","logical replication worker",,0 
> #v-
>
> And this is *the last* such message.
>
> On bionic/pg12 we have in logs from pg_replication_slots:
>
> #v+
> timestamp                pg_current_wal_lsn  slot_name  plugin    slot_type  datoid  database  temporary  active
active_pid xmin  catalog_xmin  restart_lsn     confirmed_flush_lsn 
> 2022-11-12 20:51:00 UTC  1041E/D3A0E540      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38 
> 2022-11-12 20:51:59 UTC  1041E/D89B6000      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38 
> 2022-11-12 20:52:58 UTC  1041E/E0547450      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38 
> 2022-11-12 20:53:58 UTC  1041E/E59634F0      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38 
> 2022-11-12 20:54:57 UTC  1041E/EBB50DE8      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38 
> 2022-11-12 20:55:55 UTC  1041E/FBBC3160      focal14    pgoutput  logical    16607   canvas    f          t
18626      \N    3241450490    1039D/9170B010  1039D/9B86EAF0 
>
...
...

From the last two lines above, it is clear why it started working. The
restart_lsn has advanced from 1039D/83825958 to 1039D/9170B010 which
means the system no longer needs data from 000000010001039D00000083.

--
With Regards,
Amit Kapila.



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 06:41:59PM +0530, Amit Kapila wrote:
> From the last two lines above, it is clear why it started working. The
> restart_lsn has advanced from 1039D/83825958 to 1039D/9170B010 which
> means the system no longer needs data from 000000010001039D00000083.

Sure, but how could it advance if the WAL wasn't there?

Best regards,

depesz




Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
> > There is something weird happening:
> What exactly weird you are seeing in this? To me, it appears as if the
> system due to some reason ignores an existing slot that has
> restart_lsn as 1039D/83825958.

The weird part for me is that it is trying to remove WAL files older
than the same "x" many times.
But perhaps that just means the code doing it doesn't check whether it
already did the same thing previously.

Best regards,

depesz




On Mon, Nov 14, 2022 at 7:03 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
> > > There is something weird happening:
> > What exactly weird you are seeing in this? To me, it appears as if the
> > system due to some reason ignores an existing slot that has
> > restart_lsn as 1039D/83825958.
>
> The weird part for me is that it is trying to remove wal files older
> than the same "x" many times.
>

I think that is okay because, as per the checkpointer's computation, it
decides not to remove/replace any new WAL files. At this stage, I don't
have any other idea except to get the value of
XLogGetReplicationSlotMinimumLSN() into one of the LOG prints. If you
can't add all the LOGs I shared in the last patch, can you try to get
the value of XLogGetReplicationSlotMinimumLSN() by appending it to the
existing LOG "attempting to remove WAL segments older than log file
.."?
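
A minimal sketch of that change (on PG 12 the message is a DEBUG2 in
RemoveOldXlogFiles(); the variable name lastoff is recalled from memory and
may differ slightly on the 12.x branch):

    XLogRecPtr  slotminlsn = XLogGetReplicationSlotMinimumLSN();

    elog(DEBUG2, "attempting to remove WAL segments older than log file %s (slot minimum LSN %X/%X)",
         lastoff,
         (uint32) (slotminlsn >> 32), (uint32) slotminlsn);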

-- 
With Regards,
Amit Kapila.



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 09:52:05PM +0530, Amit Kapila wrote:
> On Mon, Nov 14, 2022 at 7:03 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > On Mon, Nov 14, 2022 at 06:30:57PM +0530, Amit Kapila wrote:
> > > > There is something weird happening:
> > > What exactly weird you are seeing in this? To me, it appears as if the
> > > system due to some reason ignores an existing slot that has
> > > restart_lsn as 1039D/83825958.
> >
> > The weird part for me is that it is trying to remove wal files older
> > than the same "x" many times.
> >
> 
> I think that is okay because as per checkpointer's computation it
> decides not to remove/replace any new WAL files. At this stage, I am
> not getting any idea except for getting the value of
> XLogGetReplicationSlotMinimumLSN() in one of the LOG prints. If you
> can't add all the LOGs, I shared in the last patch, can you try to get
> the value of XLogGetReplicationSlotMinimumLSN() by appending to the
> existing LOG "attempting to remove WAL segments older than log file
> .."?

If I get *any* patch approved, plus a restart of the client db, then it
doesn't matter how many LOGs there will be :)

Will get back to you as soon as I have more info.

Best regards,

depesz




On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/8DECBA18
 
> 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0
 
> 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/96453F38
 
> 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38
 
> 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38
 
> 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38
 
> 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f       \N
       \N    3241443528    1039D/83825958  1039D/96453F38
 

It happened twice that the walsender process was active for at least
one minute while not advancing restart_lsn and exited. Could you share
the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
can find out what the wal sender processes were doing?

Also, you may have logs for "failed to increase restart lsn: proposed
%X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
please share it as well.

I've tried to reproduce this issue with a similar setting in a loop
but could not.

Given that you have the log "attempting to remove WAL segments older
than log file 000000000001039D0000008A", the last WAL file we need to
retain, as calculated by KeepLogSeg(), was wrong.

KeepLogSeg() retreats the last segment that we need to retain while
considering the replication slots' minimum LSN (and wal_keep_segments),
but it seems not to have worked correctly here. The possibilities are
(1) we got a wrong slots' minimum LSN that is newer than the slot's LSN
(i.e., we retreated logSegNo, but it is still newer than the required
WAL files) or (2) we got an invalid slots' minimum LSN (i.e., we didn't
retreat at all). If we know what the RedoRecPtr was, we can calculate
the _logSegNo that has not been retreated by KeepLogSeg().
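
For reference, a simplified paraphrase (from memory, not the verbatim PG 12
source; the wal_keep_segments branch is omitted) of the retreat logic in
question. Possibility (1) corresponds to keep pointing at a newer LSN than
the slot really needs; possibility (2) corresponds to keep coming back as
InvalidXLogRecPtr, so the slot branch is skipped entirely:

    /* inside KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo) */
    XLogSegNo   segno;
    XLogRecPtr  keep;

    XLByteToSeg(recptr, segno, wal_segment_size);
    keep = XLogGetReplicationSlotMinimumLSN();

    /* check whether replication slots limit WAL removal further */
    if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    {
        XLogSegNo   slotSegNo;

        XLByteToSeg(keep, slotSegNo, wal_segment_size);

        if (slotSegNo < segno)
            segno = slotSegNo;      /* retreat to protect the slots */
    }

    /* never delete segments newer than the boundary computed so far */
    if (segno < *logSegNo)
        *logSegNo = segno;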

BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows:

        if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
            MyReplicationSlot->candidate_restart_valid <= lsn)
        {
            Assert(MyReplicationSlot->candidate_restart_lsn !=
InvalidXLogRecPtr);

            MyReplicationSlot->data.restart_lsn =
MyReplicationSlot->candidate_restart_lsn;
            MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
            MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
            updated_restart = true;
        }

Is it worth changing the assertion to elog(ERROR) so that we can
detect that an invalid value is set to restart_lsn?
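
Just to make that concrete, a minimal sketch of the proposed change (note
that the slot's spinlock is held at this point in the real code, so an
actual patch would have to deal with erroring out under the lock):

    /* sketch only: turn the Assert into a runtime check */
    if (MyReplicationSlot->candidate_restart_lsn == InvalidXLogRecPtr)
        elog(ERROR, "invalid candidate restart_lsn while candidate_restart_valid is %X/%X",
             (uint32) (MyReplicationSlot->candidate_restart_valid >> 32),
             (uint32) MyReplicationSlot->candidate_restart_valid);

    MyReplicationSlot->data.restart_lsn =
        MyReplicationSlot->candidate_restart_lsn;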

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Nov 15, 2022 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/8DECBA18
 
> > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0
 
> > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
>
> It happened twice that the walsender process was active for at least
> one minute while not advancing restart_lsn and exited. Could you share
> the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> can find out what the wal sender processes were doing?
>
> Also, you may have logs for "failed to increase restart lsn: proposed
> %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> please share it as well.
>
> I've tried to reproduce this issue with a similar setting in a loop
> but could not.
>
> Given that you have the log "attempting to remove WAL segments older
> than log file 000000000001039D0000008A", the last WAL file we need to
> retain calculated by KeepLogSeg() was wrong.
>
> KeepLogSeg() retreated the last segment that we need to retain while
> considering replication slots' minimum LSN (and wal_keep_segments) but
> it seemed not to work fine. The possibilities are (1) we got the wrong
> slots' minimum LSN that is newer than a slot's LSN (ie, retreated
> logSegNo but it's still newer than required WAL files) or (2) we got
> an invalid slots's minimum LSN (ie, didn't retreated at all). If we
> know what the RedoRecPtr was, we can calculate _logSegNo that has not
> been retreated by KeepLogSeg().
>

The other possibility is that the slot is not considered for minimum
LSN computation (say in_use flag is temporarily set to false).
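
For context, a paraphrase (from memory, not the verbatim PG 12 source, and
with the locking omitted) of how ReplicationSlotsComputeRequiredLSN() builds
the global minimum; a slot whose in_use flag were transiently false would
simply be skipped here, as if it did not exist:

    XLogRecPtr  min_required = InvalidXLogRecPtr;
    int         i;

    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogRecPtr  restart_lsn;

        if (!s->in_use)
            continue;           /* the case being wondered about here */

        SpinLockAcquire(&s->mutex);
        restart_lsn = s->data.restart_lsn;
        SpinLockRelease(&s->mutex);

        if (restart_lsn != InvalidXLogRecPtr &&
            (min_required == InvalidXLogRecPtr ||
             restart_lsn < min_required))
            min_required = restart_lsn;
    }

    XLogSetReplicationSlotMinimumLSN(min_required);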

> BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows:
>
>         if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
>             MyReplicationSlot->candidate_restart_valid <= lsn)
>         {
>             Assert(MyReplicationSlot->candidate_restart_lsn !=
> InvalidXLogRecPtr);
>
>             MyReplicationSlot->data.restart_lsn =
> MyReplicationSlot->candidate_restart_lsn;
>             MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
>             MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
>             updated_restart = true;
>         }
>
> Is it worth changing the assertion to elog(ERROR) so that we can
> detect that an invalid value is set to restart_lsn?
>

This is worth considering, but note that the caller of this has an
assertion for invalid values being assigned to the candidate_restart_*
variables. Also, wouldn't it be shown as an invalid restart_lsn in
pg_replication_slots?

-- 
With Regards,
Amit Kapila.



On Tue, Nov 15, 2022 at 7:50 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/8DECBA18
 
> > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0
 
> > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
>
> It happened twice that the walsender process was active for at least
> one minute while not advancing restart_lsn and exited. Could you share
> the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> can find out what the wal sender processes were doing?
>
> Also, you may have logs for "failed to increase restart lsn: proposed
> %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> please share it as well.
>
> I've tried to reproduce this issue with a similar setting in a loop
> but could not.
>
> Given that you have the log "attempting to remove WAL segments older
> than log file 000000000001039D0000008A", the last WAL file we need to
> retain calculated by KeepLogSeg() was wrong.
>
> KeepLogSeg() retreated the last segment that we need to retain while
> considering replication slots' minimum LSN (and wal_keep_segments) but
> it seemed not to work fine. The possibilities are (1) we got the wrong
> slots' minimum LSN that is newer than a slot's LSN (ie, retreated
> logSegNo but it's still newer than required WAL files) or (2) we got
> an invalid slots's minimum LSN (ie, didn't retreated at all). If we
> know what the RedoRecPtr was, we can calculate _logSegNo that has not
> been retreated by KeepLogSeg().
>

I was checking but didn't find that we have RedoRecPtr-related
information available. I think it would be valuable to have RedoRecPtr,
and probably replicationSlotMinLSN, lastRemovedSegNo, etc., available
via some view (either pg_stat_bgwriter or a similar view) to debug this
kind of issue.

-- 
With Regards,
Amit Kapila.



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/8DECBA18
 
> > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0
 
> > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0
 
> > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38
 
> 
> It happened twice that the walsender process was active for at least
> one minute while not advancing restart_lsn and exited. Could you share
> the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> can find out what the wal sender processes were doing?

That will be ~ 270MB. Not sure how that would fly, plus that's an awful
lot of logs to check to make sure I don't leak any sensitive info.

But, if I remove the lines that are generated by walsenders to pg12
replicas (streaming replication to bionic/pg12 replicas), the logfile is
only ~ 3MB. Still a bit too much, as there can be some queries with
sensitive info.

Perhaps I can grep something out?

> Also, you may have logs for "failed to increase restart lsn: proposed
> %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> please share it as well.

Found one in the time frame you mentioned:
2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23
UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate
1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
 

> I've tried to reproduce this issue with a similar setting in a loop
> but could not.

It seems to be rather elusive. Which is why I'm kinda "happy" that, with
a high level of luck, I can replicate it.

This question might be 100% stupid, sorry if it is - any chance that the fact
that we're on arm64 has something to do with it?

Best regards,

depesz




On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/8DECBA18 
> > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
10414      \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/955633D0 
> > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
21748      \N    3241443528    1039D/83825958  1039D/96453F38 
> > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38 
> > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38 
> > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38 
> > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
\N         \N    3241443528    1039D/83825958  1039D/96453F38 
> >
> > It happened twice that the walsender process was active for at least
> > one minute while not advancing restart_lsn and exited. Could you share
> > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > can find out what the wal sender processes were doing?
>
> That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> lot of logs to check if I don't leak any sensitive info.
>
> But, if I'll remove lines that are generated by walsender to pg12
> replicas (streaming replication to bionic/pg12 replicas), the logfile is
> only ~ 3MB. Still a bit too much, as there can be some queries with
> sensitive info.
>
> Perhaps I can grep something out?
>
> > Also, you may have logs for "failed to increase restart lsn: proposed
> > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > please share it as well.
>
> Found one in the time frame you mentioned:
> 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10
21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current
candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" 
>

Thanks!

LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
slot's restart_lsn to this LSN, we remove WAL files older than
000000000001039D0000008A.

In LogicalIncreaseRestartDecodingForSlot(), since
"current_lsn(1039D/9170B010) <
slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
part and called LogicalConfirmReceivedLocation():

    else if (current_lsn <= slot->data.confirmed_flush)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;

        /* our candidate can directly be used */
        updated_lsn = true;
    }

Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code:

       if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
           MyReplicationSlot->candidate_restart_valid <= lsn)
       {
           Assert(MyReplicationSlot->candidate_restart_lsn !=
InvalidXLogRecPtr);

           MyReplicationSlot->data.restart_lsn =
MyReplicationSlot->candidate_restart_lsn;
           MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
           MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
           updated_restart = true;
       }

So you should have logs for "updated xmin: %u restart: %u", could you check it?

After the above code, we do:

        SpinLockRelease(&MyReplicationSlot->mutex);

        /* first write new xmin to disk, so we know what's up after a crash */
        if (updated_xmin || updated_restart)
        {
            ReplicationSlotMarkDirty();
            ReplicationSlotSave();
            elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin,
updated_restart);
        }

Is it possible for something like the following scenario to happen?

1. wal sender updates slot's restart_lsn and releases the spin lock
(not saved in the disk yet)
2. someone updates slots' minimum restart_lsn (note that slot's
restart_lsn in memory is already updated).
3. checkpointer removes WAL files older than the minimum restart_lsn
calculated at step 2.
4. wal sender restarts for some reason (or the server crashed).
5. after the restart, the wal sender acquires the slot but the slot's
restart_lsn is old.
6. WAL files required to restart the decoding are already removed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Nov 15, 2022 at 6:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > > <depesz@depesz.com> wrote:
> > > >
> > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
 10414       \N    3241434855    1039D/83825958  1039D/8DECBA18 
> > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
 10414       \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
 21748       \N    3241443528    1039D/83825958  1039D/955633D0 
> > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
 21748       \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > >
> > > It happened twice that the walsender process was active for at least
> > > one minute while not advancing restart_lsn and exited. Could you share
> > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > > can find out what the wal sender processes were doing?
> >
> > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> > lot of logs to check if I don't leak any sensitive info.
> >
> > But, if I'll remove lines that are generated by walsender to pg12
> > replicas (streaming replication to bionic/pg12 replicas), the logfile is
> > only ~ 3MB. Still a bit too much, as there can be some queries with
> > sensitive info.
> >
> > Perhaps I can grep something out?
> >
> > > Also, you may have logs for "failed to increase restart lsn: proposed
> > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > > please share it as well.
> >
> > Found one in the time frame you mentioned:
> > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10
21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current
candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" 
> >
>
> Thanks!
>
> LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> slot's restart_lsn to this LSN, we remove WAL files older than
> 000000000001039D0000008A.
>
> In LogicalIncreaseRestartDecodingForSlot(), since
> "current_lsn(1039D/9170B010) <
> slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> part and called LogicalConfirmReceivedLocation():
>
>     else if (current_lsn <= slot->data.confirmed_flush)
>     {
>         slot->candidate_restart_valid = current_lsn;
>         slot->candidate_restart_lsn = restart_lsn;
>
>         /* our candidate can directly be used */
>         updated_lsn = true;
>     }
>
> Then in LogicalConfirmReceivedLocation(), IIUC we executed the following code:
>
>        if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
>            MyReplicationSlot->candidate_restart_valid <= lsn)
>        {
>            Assert(MyReplicationSlot->candidate_restart_lsn !=
> InvalidXLogRecPtr);
>
>            MyReplicationSlot->data.restart_lsn =
> MyReplicationSlot->candidate_restart_lsn;
>            MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
>            MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
>            updated_restart = true;
>        }
>
> So you should have logs for "updated xmin: %u restart: %u", could you check it?
>
> After the above codes, we do:
>
>         SpinLockRelease(&MyReplicationSlot->mutex);
>
>         /* first write new xmin to disk, so we know what's up after a crash */
>         if (updated_xmin || updated_restart)
>         {
>             ReplicationSlotMarkDirty();
>             ReplicationSlotSave();
>             elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin,
> updated_restart);
>         }
>
> Is something like the following scenario possible to happen?
>
> 1. wal sender updates slot's restart_lsn and releases the spin lock
> (not saved in the disk yet)
> 2. someone updates slots' minimum restart_lsn (note that slot's
> restart_lsn in memory is already updated).
> 3. checkpointer removes WAL files older than the minimum restart_lsn
> calculated at step 2.
> 4. wal sender restarts for some reason (or server crashed).
> 5. after the restarts, wal sender acquires the slot but the slot's
> restart_lsn is old.
> 6. WAL files required for restart the decoding are already removed.

I'm not sure that the reported issue happened via the above scenario,
but I've confirmed that this scenario can happen in my environment by
adding sleeps and killing the wal sender process. After the server
restarted, pg_replication_slots.restart_lsn retreated from 0/5390F78 to
0/1659FC0, and the walsender raised the error "requested WAL segment
000000010000000000000001 has already been removed".

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
> <depesz@depesz.com> wrote:
> >
> > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > > <depesz@depesz.com> wrote:
> > > >
> > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
 10414       \N    3241434855    1039D/83825958  1039D/8DECBA18 
> > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
 10414       \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
 21748       \N    3241443528    1039D/83825958  1039D/955633D0 
> > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
 21748       \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
 \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > >
> > > It happened twice that the walsender process was active for at least
> > > one minute while not advancing restart_lsn and exited. Could you share
> > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > > can find out what the wal sender processes were doing?
> >
> > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> > lot of logs to check if I don't leak any sensitive info.
> >
> > But, if I'll remove lines that are generated by walsender to pg12
> > replicas (streaming replication to bionic/pg12 replicas), the logfile is
> > only ~ 3MB. Still a bit too much, as there can be some queries with
> > sensitive info.
> >
> > Perhaps I can grep something out?
> >
> > > Also, you may have logs for "failed to increase restart lsn: proposed
> > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > > please share it as well.
> >
> > Found one in the time frame you mentioned:
> > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10
21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current
candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14" 
> >
>
> Thanks!
>
> LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> slot's restart_lsn to this LSN, we remove WAL files older than
> 000000000001039D0000008A.
>
> In LogicalIncreaseRestartDecodingForSlot(), since
> "current_lsn(1039D/9170B010) <
> slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> part and called LogicalConfirmReceivedLocation():
>
>     else if (current_lsn <= slot->data.confirmed_flush)
>     {
>         slot->candidate_restart_valid = current_lsn;
>         slot->candidate_restart_lsn = restart_lsn;
>
>         /* our candidate can directly be used */
>         updated_lsn = true;
>     }
>

If this had been executed in
LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
the above DEBUG message as "current candidate 1039D/83825958, current
after 1039D/8B5773D8" should be the same as the proposed and after values
"proposed 1039D/8B5773D8, after 1039D/9170B010". Am I missing
something?
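
(For readers decoding that DEBUG line: the message appears to be emitted from
LogicalIncreaseRestartDecodingForSlot() roughly as sketched below. This is a
paraphrase for orientation, not the verbatim source, so treat the exact
argument list as an assumption.)

    /* Sketch of the DEBUG message; for the log line above the fields map as:
     *   "proposed"          -> restart_lsn passed in         (1039D/8B5773D8)
     *   "after"             -> current_lsn passed in         (1039D/9170B010)
     *   "current candidate" -> slot->candidate_restart_lsn   (1039D/83825958)
     *   "current after"     -> slot->candidate_restart_valid (1039D/8B5773D8)
     *   "flushed up to"     -> slot->data.confirmed_flush    (1039D/91F41B50)
     */
    elog(DEBUG1, "failed to increase restart lsn: proposed %X/%X, after %X/%X, "
         "current candidate %X/%X, current after %X/%X, flushed up to %X/%X",
         (uint32) (restart_lsn >> 32), (uint32) restart_lsn,
         (uint32) (current_lsn >> 32), (uint32) current_lsn,
         (uint32) (slot->candidate_restart_lsn >> 32),
         (uint32) slot->candidate_restart_lsn,
         (uint32) (slot->candidate_restart_valid >> 32),
         (uint32) slot->candidate_restart_valid,
         (uint32) (slot->data.confirmed_flush >> 32),
         (uint32) slot->data.confirmed_flush);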

--
With Regards,
Amit Kapila.



On Wed, Nov 16, 2022 at 1:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
> > <depesz@depesz.com> wrote:
> > >
> > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > > > <depesz@depesz.com> wrote:
> > > > >
> > > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
   10414       \N    3241434855    1039D/83825958  1039D/8DECBA18 
> > > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
   10414       \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
   \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
   \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
   21748       \N    3241443528    1039D/83825958  1039D/955633D0 
> > > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
   21748       \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
   \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
   \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
   \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
   \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > >
> > > > It happened twice that the walsender process was active for at least
> > > > one minute while not advancing restart_lsn and exited. Could you share
> > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > > > can find out what the wal sender processes were doing?
> > >
> > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> > > lot of logs to check if I don't leak any sensitive info.
> > >
> > > But, if I'll remove lines that are generated by walsender to pg12
> > > replicas (streaming replication to bionic/pg12 replicas), the logfile is
> > > only ~ 3MB. Still a bit too much, as there can be some queries with
> > > sensitive info.
> > >
> > > Perhaps I can grep something out?
> > >
> > > > Also, you may have logs for "failed to increase restart lsn: proposed
> > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > > > please share it as well.
> > >
> > > Found one in the time frame you mentioned:
> > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> > >
> >
> > Thanks!
> >
> > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > slot's restart_lsn to this LSN, we remove WAL files older than
> > 000000000001039D0000008A.
> >
> > In LogicalIncreaseRestartDecodingForSlot(), since
> > "current_lsn(1039D/9170B010) <
> > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> > part and called LogicalConfirmReceivedLocation():
> >
> >     else if (current_lsn <= slot->data.confirmed_flush)
> >     {
> >         slot->candidate_restart_valid = current_lsn;
> >         slot->candidate_restart_lsn = restart_lsn;
> >
> >         /* our candidate can directly be used */
> >         updated_lsn = true;
> >     }
> >
>
> If this would have been executed in
> LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> the above DEBUG messages "current candidate 1039D/83825958, current
> after 1039D/8B5773D8" should be the same as proposed and after
> "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
> something?

Oh, you're right.

Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
to or greater than 1039D/8B5773D8 at that time but
slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 1:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Nov 15, 2022 at 2:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Nov 15, 2022 at 4:55 PM hubert depesz lubaczewski
> > > <depesz@depesz.com> wrote:
> > > >
> > > > On Tue, Nov 15, 2022 at 11:20:19AM +0900, Masahiko Sawada wrote:
> > > > > On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
> > > > > <depesz@depesz.com> wrote:
> > > > > >
> > > > > > 2022-11-10 20:59:55 UTC  103A7/7B666388      focal14    pgoutput  logical    16607   canvas    f          t
     10414       \N    3241434855    1039D/83825958  1039D/8DECBA18 
> > > > > > 2022-11-10 21:00:54 UTC  103A7/88225C28      focal14    pgoutput  logical    16607   canvas    f          t
     10414       \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > > 2022-11-10 21:01:53 UTC  103A7/9E77EF90      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > > 2022-11-10 21:02:52 UTC  103A7/B45ECEA8      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241434855    1039D/83825958  1039D/911A8DB0 
> > > > > > 2022-11-10 21:03:52 UTC  103A7/C4355F60      focal14    pgoutput  logical    16607   canvas    f          t
     21748       \N    3241443528    1039D/83825958  1039D/955633D0 
> > > > > > 2022-11-10 21:04:51 UTC  103A7/D5A6FDD8      focal14    pgoutput  logical    16607   canvas    f          t
     21748       \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > > 2022-11-10 21:05:50 UTC  103A7/E2C7F590      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > > 2022-11-10 21:06:49 UTC  103A7/F28B2000      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > > 2022-11-10 21:07:48 UTC  103A8/3828D20       focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > > > 2022-11-10 21:08:47 UTC  103A8/15BBF088      focal14    pgoutput  logical    16607   canvas    f          f
     \N          \N    3241443528    1039D/83825958  1039D/96453F38 
> > > > >
> > > > > It happened twice that the walsender process was active for at least
> > > > > one minute while not advancing restart_lsn and exited. Could you share
> > > > > the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
> > > > > can find out what the wal sender processes were doing?
> > > >
> > > > That will be ~ 270MB. Not sure how that would fly, plus that's awfully
> > > > lot of logs to check if I don't leak any sensitive info.
> > > >
> > > > But, if I'll remove lines that are generated by walsender to pg12
> > > > replicas (streaming replication to bionic/pg12 replicas), the logfile is
> > > > only ~ 3MB. Still a bit too much, as there can be some queries with
> > > > sensitive info.
> > > >

We are primarily interested in LOGs that display LSN values. Can you
try to get those, or, if possible, remove the sensitive query
information and then send the remaining LOGs?

> > > > Perhaps I can grep something out?
> > > >
> > > > > Also, you may have logs for "failed to increase restart lsn: proposed
> > > > > %X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
> > > > > flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
> > > > > please share it as well.
> > > >
> > > > Found one in the time frame you mentioned:
> > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> > > >
> > >
> > > Thanks!
> > >
> > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > > slot's restart_lsn to this LSN, we remove WAL files older than
> > > 000000000001039D0000008A.
> > >
> > > In LogicalIncreaseRestartDecodingForSlot(), since
> > > "current_lsn(1039D/9170B010) <
> > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> > > part and called LogicalConfirmReceivedLocation():
> > >
> > >     else if (current_lsn <= slot->data.confirmed_flush)
> > >     {
> > >         slot->candidate_restart_valid = current_lsn;
> > >         slot->candidate_restart_lsn = restart_lsn;
> > >
> > >         /* our candidate can directly be used */
> > >         updated_lsn = true;
> > >     }
> > >
> >
> > If this would have been executed in
> > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> > the above DEBUG messages "current candidate 1039D/83825958, current
> > after 1039D/8B5773D8" should be the same as proposed and after
> > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
> > something?
>
> Oh, you're right.
>
> Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
> to or greater than 1039D/8B5773D8 at that time but
> slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
>

Yes, that is weird, but it would have been more obvious if the same LOG
had also printed slot->data.restart_lsn. This means that the slot's
'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
yet figure out how that can happen, but if it does, the slot's
restart_lsn can retreat in LogicalConfirmReceivedLocation(), because we
don't check whether the slot's candidate_restart_lsn is less than its
restart_lsn before assigning it in the line
MyReplicationSlot->data.restart_lsn =
MyReplicationSlot->candidate_restart_lsn;. I think that should be
checked. Note that LogicalConfirmReceivedLocation() can be called from
ProcessStandbyReplyMessage(), so once a wrong candidate_restart_lsn is
set, it can be assigned to restart_lsn from other code paths as well.
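
(To make that concrete, a minimal sketch of the kind of guard meant here,
around the assignment in LogicalConfirmReceivedLocation(); hypothetical, not
an actual patch from this thread:)

    /* Hypothetical guard: only adopt the candidate if it does not move
     * restart_lsn backwards. */
    if (MyReplicationSlot->candidate_restart_lsn >
        MyReplicationSlot->data.restart_lsn)
        MyReplicationSlot->data.restart_lsn =
            MyReplicationSlot->candidate_restart_lsn;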

I am not able to see how 'candidate_restart_lsn' can be set to an LSN
value prior to 'restart_lsn'. For that, we probably need more LOGs.
I hope that depesz can share more information from the LOGs, as that
might give us some clues.

--
With Regards,
Amit Kapila.



On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > > > Found one in the time frame you mentioned:
> > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> > > > >
> > > >
> > > > Thanks!
> > > >
> > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > > > slot's restart_lsn to this LSN, we remove WAL files older than
> > > > 000000000001039D0000008A.
> > > >
> > > > In LogicalIncreaseRestartDecodingForSlot(), since
> > > > "current_lsn(1039D/9170B010) <
> > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> > > > part and called LogicalConfirmReceivedLocation():
> > > >
> > > >     else if (current_lsn <= slot->data.confirmed_flush)
> > > >     {
> > > >         slot->candidate_restart_valid = current_lsn;
> > > >         slot->candidate_restart_lsn = restart_lsn;
> > > >
> > > >         /* our candidate can directly be used */
> > > >         updated_lsn = true;
> > > >     }
> > > >
> > >
> > > If this would have been executed in
> > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> > > the above DEBUG messages "current candidate 1039D/83825958, current
> > > after 1039D/8B5773D8" should be the same as proposed and after
> > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
> > > something?
> >
> > Oh, you're right.
> >
> > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
> > to or greater than 1039D/8B5773D8 at that time but
> > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
> >
>
> Yes, that is weird but it had been a bit obvious if the same LOG would
> have printed slot->data.restart_lsn. This means that somehow slot's
> 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
> figure out yet how that can happen but if that happens then the slot's
> restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
> don't check if slot's candidate_restart_lsn is lesser than its
> restart_lsn before assigning the same in line
> MyReplicationSlot->data.restart_lsn =
> MyReplicationSlot->candidate_restart_lsn;. I think that should be
> checked. Note that we call LogicalConfirmReceivedLocation() can be
> called from ProcessStandbyReplyMessage(), so once the wrong
> candidate_restart_lsn is set, it can be assigned to restart_lsn from
> other code paths as well.
>
> I am not able to think how 'candidate_restart_lsn' can be set to an
> LSN value prior to 'restart_lsn'.
>

In the below part of the code, we use the value of
'last_serialized_snapshot' for restart_lsn:

    else if (txn == NULL &&
             builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
             builder->last_serialized_snapshot != InvalidXLogRecPtr)
        LogicalIncreaseRestartDecodingForSlot(lsn,
                                              builder->last_serialized_snapshot);

Now, say, after a restart, we start reading from the slot's restart_lsn,
which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has
last_serialized_snapshot set to 1039D/83825958. If that happens, then in
LogicalIncreaseRestartDecodingForSlot() we can set these values in the
slot's candidate_*_lsn variables. If this happens, the next time
LogicalConfirmReceivedLocation() is called, the value of the slot's
restart_lsn will be moved back. Once it is moved back, yet another
restart will lead to this problem. Does this theory make sense?


--
With Regards,
Amit Kapila.



On Wed, Nov 16, 2022 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > > > Found one in the time frame you mentioned:
> > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> > > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > > > > slot's restart_lsn to this LSN, we remove WAL files older than
> > > > > 000000000001039D0000008A.
> > > > >
> > > > > In LogicalIncreaseRestartDecodingForSlot(), since
> > > > > "current_lsn(1039D/9170B010) <
> > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> > > > > part and called LogicalConfirmReceivedLocation():
> > > > >
> > > > >     else if (current_lsn <= slot->data.confirmed_flush)
> > > > >     {
> > > > >         slot->candidate_restart_valid = current_lsn;
> > > > >         slot->candidate_restart_lsn = restart_lsn;
> > > > >
> > > > >         /* our candidate can directly be used */
> > > > >         updated_lsn = true;
> > > > >     }
> > > > >
> > > >
> > > > If this would have been executed in
> > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> > > > the above DEBUG messages "current candidate 1039D/83825958, current
> > > > after 1039D/8B5773D8" should be the same as proposed and after
> > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
> > > > something?
> > >
> > > Oh, you're right.
> > >
> > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
> > > to or greater than 1039D/8B5773D8 at that time but
> > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
> > >
> >
> > Yes, that is weird but it had been a bit obvious if the same LOG would
> > have printed slot->data.restart_lsn. This means that somehow slot's
> > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
> > figure out yet how that can happen but if that happens then the slot's
> > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
> > don't check if slot's candidate_restart_lsn is lesser than its
> > restart_lsn before assigning the same in line
> > MyReplicationSlot->data.restart_lsn =
> > MyReplicationSlot->candidate_restart_lsn;. I think that should be
> > checked. Note that we call LogicalConfirmReceivedLocation() can be
> > called from ProcessStandbyReplyMessage(), so once the wrong
> > candidate_restart_lsn is set, it can be assigned to restart_lsn from
> > other code paths as well.
> >
> > I am not able to think how 'candidate_restart_lsn' can be set to an
> > LSN value prior to 'restart_lsn'.
> >
>
> In the below part of the code, we use the value of
> 'last_serialized_snapshot' for restart_lsn.
> else if (txn == NULL &&
> builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
> builder->last_serialized_snapshot != InvalidXLogRecPtr)
> LogicalIncreaseRestartDecodingForSlot(lsn,
>   builder->last_serialized_snapshot);
>
> Now, say, after restart, we start reading from slot's restart_lsn
> which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has
> the last_seriealized_snapshot set to 1039D/83825958.

I don't get this part. You meant that the snapshot we restored at
1039D/8B5773D8 has last_serialized_snapshot set to 1039D/83825958? I
think we don't update last_serialized_snapshot when restoring the
snapshot. Also, if we restore the snapshot after restart, we don't
execute the above code since SnapBuildFindSnapshot() returns false,
no?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Nov 17, 2022 at 6:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > > > > Found one in the time frame you mentioned:
> > > > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> > > > > > >
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > > > > > slot's restart_lsn to this LSN, we remove WAL files older than
> > > > > > 000000000001039D0000008A.
> > > > > >
> > > > > > In LogicalIncreaseRestartDecodingForSlot(), since
> > > > > > "current_lsn(1039D/9170B010) <
> > > > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> > > > > > part and called LogicalConfirmReceivedLocation():
> > > > > >
> > > > > >     else if (current_lsn <= slot->data.confirmed_flush)
> > > > > >     {
> > > > > >         slot->candidate_restart_valid = current_lsn;
> > > > > >         slot->candidate_restart_lsn = restart_lsn;
> > > > > >
> > > > > >         /* our candidate can directly be used */
> > > > > >         updated_lsn = true;
> > > > > >     }
> > > > > >
> > > > >
> > > > > If this would have been executed in
> > > > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> > > > > the above DEBUG messages "current candidate 1039D/83825958, current
> > > > > after 1039D/8B5773D8" should be the same as proposed and after
> > > > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
> > > > > something?
> > > >
> > > > Oh, you're right.
> > > >
> > > > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
> > > > to or greater than 1039D/8B5773D8 at that time but
> > > > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
> > > >
> > >
> > > Yes, that is weird but it had been a bit obvious if the same LOG would
> > > have printed slot->data.restart_lsn. This means that somehow slot's
> > > 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
> > > figure out yet how that can happen but if that happens then the slot's
> > > restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
> > > don't check if slot's candidate_restart_lsn is lesser than its
> > > restart_lsn before assigning the same in line
> > > MyReplicationSlot->data.restart_lsn =
> > > MyReplicationSlot->candidate_restart_lsn;. I think that should be
> > > checked. Note that we call LogicalConfirmReceivedLocation() can be
> > > called from ProcessStandbyReplyMessage(), so once the wrong
> > > candidate_restart_lsn is set, it can be assigned to restart_lsn from
> > > other code paths as well.
> > >
> > > I am not able to think how 'candidate_restart_lsn' can be set to an
> > > LSN value prior to 'restart_lsn'.
> > >
> >
> > In the below part of the code, we use the value of
> > 'last_serialized_snapshot' for restart_lsn.
> > else if (txn == NULL &&
> > builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
> > builder->last_serialized_snapshot != InvalidXLogRecPtr)
> > LogicalIncreaseRestartDecodingForSlot(lsn,
> >   builder->last_serialized_snapshot);
> >
> > Now, say, after restart, we start reading from slot's restart_lsn
> > which is 1039D/8B5773D8. At this LSN, we restored a snapshot that has
> > the last_seriealized_snapshot set to 1039D/83825958.
>
> I don't get this part. You meant that the snapshot we restored at
> 1039D/8B5773D8 has last_serialized_snapshot set to 1039D/83825958?
>

Yes.

> I
> think we don't update last_serialized_snapshot when restoring the
> snapshot.
>

Right, I missed the point that even though it is present in the
serialized snapshot, we don't update it in the builder while restoring.
So, this theory won't work.

--
With Regards,
Amit Kapila.



Hi,

On 2022-11-11 15:50:40 +0100, hubert depesz lubaczewski wrote:
> Out of 8 servers that we prepared upgrade for, one failed with the
> same/similar problem.

Perhaps a stupid question - are you using max_slot_wal_keep_size? And what's
your wal_segment_size? And wal_keep_size?



I'm a bit confused by the changes c6550776394e made to
KeepLogSeg() etc.

Before:
    /* then check whether slots limit removal further */
    if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
    {
        XLogSegNo    slotSegNo;

        XLByteToSeg(keep, slotSegNo, wal_segment_size);

        if (slotSegNo <= 0)
            segno = 1;
        else if (slotSegNo < segno)
            segno = slotSegNo;
    }

Current:
    keep = XLogGetReplicationSlotMinimumLSN();
    if (keep != InvalidXLogRecPtr)
    {
        XLByteToSeg(keep, segno, wal_segment_size);

        /* Cap by max_slot_wal_keep_size ... */
        if (max_slot_wal_keep_size_mb >= 0)
        {
            uint64        slot_keep_segs;

            slot_keep_segs =
                ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);

            if (currSegNo - segno > slot_keep_segs)
                segno = currSegNo - slot_keep_segs;
        }
    }

This looks to me to have lost the protection against segno being 0, which
afaict would break the callers, because they do _logSegNo--, and would then
allow all WAL to be reused, no?  I guess the argument for changing this was
that we should never have such a dangerous value returned by
XLogGetReplicationSlotMinimumLSN()?
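
(For illustration only, the lost clamp could be restored roughly like this
after the slot / max_slot_wal_keep_size handling; a sketch, not a proposed
patch:)

    /* Sketch: keep the old floor so callers that do _logSegNo-- can never
     * end up at segment 0 and thereby allow all WAL to be recycled. */
    if (segno == 0)
        segno = 1;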


It seems decidedly not great to not log at least a debug1 (but probably it
should be LOG) message when KeepLogSeg() decides to limit based on
max_slot_wal_keep_size.


It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end
of the checkpoint record - given that, leaving max_slot_wal_keep_size aside, we
only actually remove WAL older than the segment that RedoRecPtr (the logical
start of the checkpoint) is in. If the checkpoint is large we'll end up
invalidating replication slots even though they potentially would only have
retained one additional WAL segment.

Isn't it problematic to use ConvertToXSegs() to implement
max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large
wal_segment_size, that'd afaict lead to being much more aggressive about
invalidating slots.
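
(A small standalone illustration of that rounding concern, assuming
ConvertToXSegs() boils down to integer division of the MB values; the numbers
are made up:)

    #include <stdio.h>

    int
    main(void)
    {
        long max_slot_wal_keep_size_mb = 1500;       /* configured budget */
        long wal_segment_size = 1024L * 1024 * 1024; /* 1GB segments */

        /* MB budget -> whole segments, rounding down */
        long slot_keep_segs = max_slot_wal_keep_size_mb /
            (wal_segment_size / (1024 * 1024));

        /* prints "1500 MB -> 1 segment(s)": almost 500MB of budget lost */
        printf("%ld MB -> %ld segment(s)\n",
               max_slot_wal_keep_size_mb, slot_keep_segs);
        return 0;
    }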

Also, why do we do something as expensive as
InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no
effect?

Greetings,

Andres Freund



Hi,

On 2022-11-15 23:59:37 +0900, Masahiko Sawada wrote:
> > Is something like the following scenario possible to happen?
> >
> > 1. wal sender updates slot's restart_lsn and releases the spin lock
> > (not saved in the disk yet)
> > 2. someone updates slots' minimum restart_lsn (note that slot's
> > restart_lsn in memory is already updated).

You mean ReplicationSlotsComputeRequiredLSN(), or update that specific slot's
restart_lsn? The latter shouldn't happen.


> > 3. checkpointer removes WAL files older than the minimum restart_lsn
> > calculated at step 2.

For xmin we have protection against that via the split between
catalog_xmin/effective_catalog_xmin. We should probably mirror that for
restart_lsn as well.

We should also call ReplicationSlotsComputeRequiredLSN if only update_restart
is true...


> > 4. wal sender restarts for some reason (or server crashed).

I don't think walsender alone restarting should change anything, but
crash-restart obviously would.

Greetings,

Andres Freund



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Wed, Nov 16, 2022 at 11:25:12PM -0800, Andres Freund wrote:
> Perhaps a stupid question - are you using max_slot_wal_keep_size? And what's
> your wal_segment_size? And wal_keep_size?

Are we talking about pg12 side?

=> select name, setting from pg_settings where name ~ 'wal';
             name             │  setting
══════════════════════════════╪═══════════
 max_wal_senders              │ 50
 max_wal_size                 │ 49152
 min_wal_size                 │ 256
 wal_block_size               │ 8192
 wal_buffers                  │ 2048
 wal_compression              │ on
 wal_consistency_checking     │
 wal_init_zero                │ on
 wal_keep_segments            │ 1000
 wal_level                    │ logical
 wal_log_hints                │ off
 wal_receiver_status_interval │ 10
 wal_receiver_timeout         │ 60000
 wal_recycle                  │ on
 wal_retrieve_retry_interval  │ 5000
 wal_segment_size             │ 16777216
 wal_sender_timeout           │ 60000
 wal_sync_method              │ fdatasync
 wal_writer_delay             │ 200
 wal_writer_flush_after       │ 128
(20 rows)

max_slot_wal_keep_size was added in pg13, afaik, and wal_keep_size was
wal_keep_segments in pg 12.

Best regards,

depesz




On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-11-15 23:59:37 +0900, Masahiko Sawada wrote:
> > > Is something like the following scenario possible to happen?
> > >
> > > 1. wal sender updates slot's restart_lsn and releases the spin lock
> > > (not saved in the disk yet)
> > > 2. someone updates slots' minimum restart_lsn (note that slot's
> > > restart_lsn in memory is already updated).
>
> You mean ReplicationSlotsComputeRequiredLSN(), or update that specific slot's
> restart_lsn? The latter shouldn't happen.

I meant the former.

>
>
> > > 3. checkpointer removes WAL files older than the minimum restart_lsn
> > > calculated at step 2.
>
> For xmin we have protection against that via the split between
> catalog_xmin/effective_catalog_xmin. We should probably mirror that for
> restart_lsn as well.
>
> We should also call ReplicationSlotsComputeRequiredLSN if only update_restart
> is true...

Agree.

>
>
> > > 4. wal sender restarts for some reason (or server crashed).
>
> I don't think walsender alone restarting should change anything, but
> crash-restart obviously would.

Right. I've confirmed that this scenario can happen with a crash-restart.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Mon, Nov 14, 2022 at 05:08:15PM +0530, Amit Kapila wrote:
> You may have the LOG for "attempting to remove WAL segments older than
> log file %s", if so, please share. Also on similar lines, I think we
> need to add some LOGs before calling RemoveOldXlogFiles() to know
> where our computation to remove files goes off. Something, like the
> attached, would be helpful but note I have added this quickly on my
> working branch to show you what I have in mind. You may need small
> changes based on the branch you are working on.

Hi,
I tried to apply it on pg12.12, both to a clean tree from
https://www.postgresql.org/ftp/source/v12.12/ and to the one that is being
built for the pgdg .deb for 12.12, and in both cases it failed.

Current state of xlog.c from the deb build (which is what I have to
build) is available at https://depesz.com/various/xlog.c.xz

I don't trust myself with modifying patches in C - any chance I could ask
you to check why it doesn't apply, and possibly make a new one that
applies on pg12?

I tried to see what is going on there, but your patch has lines that
I don't see in my version of xlog.c, so I'm not sure what to do with it.

The procedure I followed to get this xlog.c in this form was:

dget -x
https://apt-archive.postgresql.org/pub/repos/apt/pool/main/p/postgresql-12/postgresql-12_12.1-2.pgdg100%2B1.dsc
dpkg-source -x postgresql-12_12.1-2.pgdg100%2B1.dsc
cd postgresql-12-12.1/
sudo apt build-dep .
quilt push -a

and I think that was it.

Best regards,

depesz




Hi,

On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote:
> On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
> > > > 4. wal sender restarts for some reason (or server crashed).
> >
> > I don't think walsender alone restarting should change anything, but
> > crash-restart obviously would.
> 
> Right. I've confirmed this scenario is possible to happen with crash-restart.

Depesz, were there any crashes / immediate restarts on the PG 12 side? If so,
then we know what the problem likely is and can fix it. If not...


Just to confirm, the problem has "newly" occurred after you've upgraded to
12.12? I couldn't quite tell from the thread.

Greetings,

Andres Freund



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote:
> Hi,
> 
> On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote:
> > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > 4. wal sender restarts for some reason (or server crashed).
> > >
> > > I don't think walsender alone restarting should change anything, but
> > > crash-restart obviously would.
> > 
> > Right. I've confirmed this scenario is possible to happen with crash-restart.
> 
> Depesz, were there any crashes / immediate restarts on the PG 12 side? If so,
> then we know what the problem likely is and can fix it. If not...
> 
> 
> Just to confirm, the problem has "newly" occurred after you've upgraded to
> 12.12? I couldn't quite tell from the thread.

No crashes that I can find any track of.

And no, it's not related to 12.12.

We were using 12.9, and in the process of trying to figure stuff out we
upgraded to 12.12, which didn't help.

Best regards,

depesz




On Fri, Nov 18, 2022 at 4:57 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote:
> > Hi,
> >
> > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote:
> > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > 4. wal sender restarts for some reason (or server crashed).
> > > >
> > > > I don't think walsender alone restarting should change anything, but
> > > > crash-restart obviously would.
> > >
> > > Right. I've confirmed this scenario is possible to happen with crash-restart.
> >
> > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so,
> > then we know what the problem likely is and can fix it. If not...
> >
> >
> > Just to confirm, the problem has "newly" occurred after you've upgraded to
> > 12.12? I couldn't quite tell from the thread.
>
> No crashes that I can find any track of.
>

I think this problem could arise when the walsender exits due to some
error like "terminating walsender process due to replication timeout".
Here is the theory I came up with:

1. Initially the restart_lsn is updated to 1039D/83825958. This will
allow all files up to 000000000001039D00000082 to be removed.
2. Next, the slot->candidate_restart_lsn is updated to 1039D/8B5773D8.
3. The walsender restarts due to replication timeout.
4. After restart, it starts reading WAL from 1039D/83825958 as that
was the restart_lsn.
5. The walsender gets a message to update write, flush, apply, etc. As
part of that, it invokes
ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
replicationSlotMinLSN will also be computed to the same value, allowing
removal of all files older than 000000000001039D0000008A. This will
allow removing 000000000001039D00000083, 000000010001039D00000084,
etc.
7. Now, we got a new slot->candidate_restart_lsn of 1039D/83825958.
Remember from step 1 that we are still reading WAL from that location.
8. At the next update for write, flush, etc., as part of processing the
standby reply message, we will invoke
ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. This
updates restart_lsn to 1039D/83825958.
9. The walsender restarts due to replication timeout.
10. After restart, it starts reading WAL from 1039D/83825958 as that
was the restart_lsn and gets an error "requested WAL segment
000000010001039D00000083 has already been removed" because that file was
removed as part of step 6.
11. After that the walsender will keep getting the same error after each
restart as the restart_lsn never progresses.

What do you think?

If this diagnosis is correct, I think we need to clear
candidate_restart_lsn and friends during ReplicationSlotRelease().
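
(A rough sketch of what that could look like; the candidate_* field names are
recalled from memory of the ReplicationSlot struct, so treat them as an
assumption rather than the actual patch:)

    /* Sketch: forget unapplied candidate values when the slot is released,
     * so a restarted walsender cannot re-adopt a stale candidate_restart_lsn. */
    SpinLockAcquire(&slot->mutex);
    slot->candidate_catalog_xmin = InvalidTransactionId;
    slot->candidate_xmin_lsn = InvalidXLogRecPtr;
    slot->candidate_restart_lsn = InvalidXLogRecPtr;
    slot->candidate_restart_valid = InvalidXLogRecPtr;
    SpinLockRelease(&slot->mutex);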

-- 
With Regards,
Amit Kapila.



Hi,

On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
> I think this problem could arise when walsender exits due to some
> error like "terminating walsender process due to replication timeout".
> Here is the theory I came up with:
> 
> 1. Initially the restart_lsn is updated to 1039D/83825958. This will
> allow all files till 000000000001039D00000082 to be removed.
> 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
> 3. walsender restarts due to replication timeout.
> 4. After restart, it starts reading WAL from 1039D/83825958 as that
> was restart_lsn.
> 5. walsender gets a message to update write, flush, apply, etc. As
> part of that, it invokes
> ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
> 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
> replicationSlotMinLSN will also be computed to the same value allowing
> to remove of all files older than 000000000001039D0000008A. This will
> allow removing 000000000001039D00000083, 000000010001039D00000084,
> etc.

This would require that the client acknowledged an LSN that we haven't
sent out, no? Shouldn't the
  MyReplicationSlot->candidate_restart_valid <= lsn
check in LogicalConfirmReceivedLocation() have prevented this from happening,
unless the client acknowledges up to candidate_restart_valid?


> 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
> Remember from step 1, we are still reading WAL from that location.

I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
in that case, because of the
    /* don't overwrite if have a newer restart lsn */
check.


> If this diagnosis is correct, I think we need to clear
> candidate_restart_lsn and friends during ReplicationSlotRelease().

Possible, but I don't quite see it yet.

Greetings,

Andres Freund



On Tue, Nov 22, 2022 at 1:38 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
> > I think this problem could arise when walsender exits due to some
> > error like "terminating walsender process due to replication timeout".
> > Here is the theory I came up with:
> >
> > 1. Initially the restart_lsn is updated to 1039D/83825958. This will
> > allow all files till 000000000001039D00000082 to be removed.
> > 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
> > 3. walsender restarts due to replication timeout.
> > 4. After restart, it starts reading WAL from 1039D/83825958 as that
> > was restart_lsn.
> > 5. walsender gets a message to update write, flush, apply, etc. As
> > part of that, it invokes
> > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
> > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
> > replicationSlotMinLSN will also be computed to the same value allowing
> > to remove of all files older than 000000000001039D0000008A. This will
> > allow removing 000000000001039D00000083, 000000010001039D00000084,
> > etc.
>
> This would require that the client acknowledged an LSN that we haven't
> sent out, no?  Shouldn't the
>   MyReplicationSlot->candidate_restart_valid <= lsn
> from LogicalConfirmReceivedLocation() prevented this from happening
> unless the client acknowledges up to candidate_restart_valid?
>

We would have sent the higher LSN prior to the restart. Let me share a
snippet of the LOG, extracted from a large LOG that depesz shared with
me offlist at my request:

2022-11-10 21:03:23.478+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,4,idle,2022-11-10
21:03:23+00,7/0,0,DEBUG,00000,received replication command:
IDENTIFY_SYSTEM,,,,,,,,,focal14
2022-11-10 21:03:23.479+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,5,idle,2022-11-10
21:03:23+00,7/0,0,DEBUG,00000,"received replication command:
START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50
(proto_version '1', publication_names '""focal14""')",,,,,,,,,focal14
2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,6,idle,2022-11-10
21:03:23+00,7/0,0,LOG,00000,"starting logical decoding for slot
""focal14""","Streaming transactions committing after 1039D/911A8DB0,
reading WAL from 1039D/83825958.",,,,,"START_REPLICATION SLOT
""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1',
publication_names '""focal14""')",,,focal14
2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,7,idle,2022-11-10
21:03:23+00,7/0,0,DEBUG,00000,sending replication
keepalive,,,,,,,,,focal14
2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,8,idle,2022-11-10
21:03:23+00,7/0,0,LOG,00000,logical decoding found consistent point at
1039D/83825958,Logical decoding will begin using saved
snapshot.,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL
1039D/91F41B50 (proto_version '1', publication_names
'""focal14""')",,,focal14
2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,9,idle,2022-11-10
21:03:23+00,7/0,0,DEBUG,00000,write 1039D/91F41B50 flush
1039D/91F41B50 apply 1039D/91F41B50 reply_time 2022-11-10
21:03:23.482052+00,,,,,,,,,focal14
2022-11-10 21:03:23.495+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,10,idle,2022-11-10
21:03:23+00,7/0,0,DEBUG,00000,updated xmin: 1 restart:
1,,,,,,,,,focal14

Here, we can see that even though we have started reading WAL from
1039D/83825958, the confirmed_flush location is 1039D/911A8DB0. Then
we get a client reply (write 1039D/91F41B50 flush 1039D/91F41B50),
probably for a keepalive message. The flush LSN here is higher than
candidate_restart_valid. The value of candidate_restart_valid before
the walsender restart was 1039D/9170B010 and candidate_restart_lsn was
1039D/8B5773D8.

>
> > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
> > Remember from step 1, we are still reading WAL from that location.
>
> I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
> in that case, because of the
>         /* don't overwrite if have a newer restart lsn */
> check.
>

But later in LogicalIncreaseRestartDecodingForSlot(), we have the
below check, which can allow it to be overridden:

    /*
     * Only increase if the previous values have been applied, otherwise we
     * might never end up updating if the receiver acks too slowly. A missed
     * value here will just cause some extra effort after reconnecting.
     */
    if (slot->candidate_restart_valid == InvalidXLogRecPtr)

Do we need "else if" here instead of "if"? I think that would save us from
retreating the restart_lsn value. We seem to have "else if" in the
corresponding case in the function LogicalIncreaseXminForSlot():

    /*
     * Only increase if the previous values have been applied, otherwise we
     * might never end up updating if the receiver acks too slowly.
     */
    else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)

-- 
With Regards,
Amit Kapila.



On Tue, Nov 22, 2022 at 12:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 22, 2022 at 1:38 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
> > > I think this problem could arise when walsender exits due to some
> > > error like "terminating walsender process due to replication timeout".
> > > Here is the theory I came up with:
> > >
> > > 1. Initially the restart_lsn is updated to 1039D/83825958. This will
> > > allow all files till 000000000001039D00000082 to be removed.
> > > 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
> > > 3. walsender restarts due to replication timeout.
> > > 4. After restart, it starts reading WAL from 1039D/83825958 as that
> > > was restart_lsn.
> > > 5. walsender gets a message to update write, flush, apply, etc. As
> > > part of that, it invokes
> > > ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
> > > 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
> > > replicationSlotMinLSN will also be computed to the same value allowing
> > > to remove of all files older than 000000000001039D0000008A. This will
> > > allow removing 000000000001039D00000083, 000000010001039D00000084,
> > > etc.
> >
> > This would require that the client acknowledged an LSN that we haven't
> > sent out, no?  Shouldn't the
> >   MyReplicationSlot->candidate_restart_valid <= lsn
> > from LogicalConfirmReceivedLocation() prevented this from happening
> > unless the client acknowledges up to candidate_restart_valid?
> >
>
> We would have sent the higher LSN prior to the restart. Let me share a
> snippet of LOG which I have extracted from a large LOG depesz has
> shared me offlist after my request:
>
> 2022-11-10 21:03:23.478+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,4,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,received replication command:
> IDENTIFY_SYSTEM,,,,,,,,,focal14
> 2022-11-10 21:03:23.479+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,5,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,"received replication command:
> START_REPLICATION SLOT ""focal14"" LOGICAL 1039D/91F41B50
> (proto_version '1', publication_names '""focal14""')",,,,,,,,,focal14
> 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,6,idle,2022-11-10
> 21:03:23+00,7/0,0,LOG,00000,"starting logical decoding for slot
> ""focal14""","Streaming transactions committing after 1039D/911A8DB0,
> reading WAL from 1039D/83825958.",,,,,"START_REPLICATION SLOT
> ""focal14"" LOGICAL 1039D/91F41B50 (proto_version '1',
> publication_names '""focal14""')",,,focal14
> 2022-11-10 21:03:23.481+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,7,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,sending replication
> keepalive,,,,,,,,,focal14
> 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,8,idle,2022-11-10
> 21:03:23+00,7/0,0,LOG,00000,logical decoding found consistent point at
> 1039D/83825958,Logical decoding will begin using saved
> snapshot.,,,,,"START_REPLICATION SLOT ""focal14"" LOGICAL
> 1039D/91F41B50 (proto_version '1', publication_names
> '""focal14""')",,,focal14
> 2022-11-10 21:03:23.484+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,9,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,write 1039D/91F41B50 flush
> 1039D/91F41B50 apply 1039D/91F41B50 reply_time 2022-11-10
> 21:03:23.482052+00,,,,,,,,,focal14
> 2022-11-10 21:03:23.495+00,upgrayedd,canvas,21748,10.1.238.101:35640,636d671b.54f4,10,idle,2022-11-10
> 21:03:23+00,7/0,0,DEBUG,00000,updated xmin: 1 restart:
> 1,,,,,,,,,focal14
>
> Here, we can see that even though we have started reading WAL from
> 1039D/83825958, the confiremed_flush location is 1039D/911A8DB0. Then
> we get a client reply (write 1039D/91F41B50 flush 1039D/91F41B50)
> probably for a keepalive message. The flush LSN here is higher than
> candidate_restart_valid. The value of candidate_restart_valid before
> walsender restart was 1039D/9170B010 and candidate_restart_lsn was
> 1039D/8B5773D8.
>
> >
> > > 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
> > > Remember from step 1, we are still reading WAL from that location.
> >
> > I don't think LogicalIncreaseRestartDecodingForSlot() would do anything
> > in that case, because of the
> >         /* don't overwrite if have a newer restart lsn */
> > check.
> >
>
> But later in LogicalIncreaseRestartDecodingForSlot(), we have the
> below check which can allow to override it.
> /*
> * Only increase if the previous values have been applied, otherwise we
> * might never end up updating if the receiver acks too slowly. A missed
> * value here will just cause some extra effort after reconnecting.
> */
> if (slot->candidate_restart_valid == InvalidXLogRecPtr)

I also think we took this override path. After updating the restart_lsn
we accept the given values as candidate_restart_lsn and
candidate_restart_valid, which were 1039D/83825958 and 1039D/8B5773D8.

Analyzing the server logs that depesz also shared with me offlist,
this scenario basically makes sense to me. But please note that
000000000001039D0000008A and older files were removed while there was
no running wal sender. The wal sender that restarted decoding from
1039D/83825958 (i.e., step 4) exited due to wal_sender_timeout at
21:04:56, and no walsender was launched until 000000000001039D0000008A
or older files were removed at 21:06:07. I guess the scenario needs to
change a bit.

6. The restart_lsn was updated to 1039D/8B5773D8 and
replicationSlotMinLSN was also computed to the same value. But WAL
files were not removed at this point because there was no checkpoint
between 6 and 7. This was around 21:03:23.

7. We got a new slot->candidate_restart_lsn of 1039D/83825958 and
candidate_restart_valid of 1039D/8B5773D8. This can happen for the
reason that Amit mentioned above.

8. The restart_lsn was updated to 1039D/83825958, but we didn't update
replicationSlotMinLSN since slot->data.catalog_xmin was not updated at
the same time (see LogicalConfirmReceivedLocation()). And the
walsender exited again due to wal_sender_timeout. At this point, note
that the slot's restart_lsn was 1039D/83825958 whereas
replicationSlotMinLSN was 1039D/8B5773D8. This was around 21:04:56.

9. Until 000000000001039D0000008A or older files were removed at
21:06:07, a client had checked the pg_replication_slots view every
minute, but it had shown 1039D/83825958 in the restart_lsn column. On
the other hand, the checkpointer removed the WAL segment files
000000000001039D0000008A or older at 21:06:07, based on
replicationSlotMinLSN (1039D/8B5773D8).

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
Hi,
just checking - has there been any progress on diagnosing/fixing the
bug?

Best regards,

depesz




Hi,

On Thu, Dec 8, 2022 at 8:13 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> Hi,
> just checking - has there been any progress on diagnosing/fixing the
> bug?

Sorry for the late response.

Based on the analysis we did[1][2], I've created a manual scenario
to reproduce this issue with the attached patch and script.

The scenario.md explains the basic steps to reproduce this issue. It
consists of 13 steps (very tricky!!). It's not sophisticated and could
be improved. test.sh is the shell script I used to execute the
reproduction steps from 1 to 10. In my environment, I could reproduce
this issue with the following steps:

1. apply the patch and build PostgreSQL.
2. run test.sh.
3. execute step 11 and the later steps described in scenario.md.

The test.sh is a very hacky and dirty script and is tuned for my
environment (especially the many added sleeps). You might need to adjust
it while checking scenario.md.

I've also confirmed that this issue is fixed by the attached patch,
which clears candidate_restart_lsn and friends during
ReplicationSlotRelease().

[1] https://www.postgresql.org/message-id/CAA4eK1JvyWHzMwhO9jzPquctE_ha6bz3EkB3KE6qQJx63StErQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD21AoBHMCEDcV4eBtSVvDrCN4SrMXanX5N9%2BL-E%2B4OWXYY0ew%40mail.gmail.com

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
On Mon, Feb 06, 2023 at 05:25:42PM +0900, Masahiko Sawada wrote:
> Based on the analysis we did[1][2], I've created the manual scenario
> to reproduce this issue with the attached patch and the script.
> 
> The scenario.md explains the basic steps to reproduce this issue. It
> consists of 13 steps (very tricky!!). It's not sophisticated and could
> be improved. test.sh is the shell script I used to execute the
> reproduction steps from 1 to 10. In my environment, I could reproduce
> this issue by the following steps.
> 
> 1. apply the patch and build PostgreSQL.
> 2. run test.sh.
> 3. execute the step 11 and later described in scenario.md.
> 
> The test.sh is a very hacky and dirty script and is optimized in my
> environment (especially adding many sleeps). You might need to adjust
> it while checking scenario.md.
> 
> I've also confirmed that this issue is fixed by the attached patch,
> which clears candidate_restart_lsn and friends during
> ReplicationSlotRelease().

Hi,
one important question - do I patch the newer Pg, or the older one? The thing
is that we were able to replicate the problem (with some luck) only on
production databases, and patching them will be a hard sell. Maybe
possible, but if it's enough to patch the pg14 (recipient) side, it would make
my life much easier.

Best regards,

depesz




On Mon, Feb 6, 2023 at 8:15 PM hubert depesz lubaczewski
<depesz@depesz.com> wrote:
>
> On Mon, Feb 06, 2023 at 05:25:42PM +0900, Masahiko Sawada wrote:
> > Based on the analysis we did[1][2], I've created the manual scenario
> > to reproduce this issue with the attached patch and the script.
> >
> > The scenario.md explains the basic steps to reproduce this issue. It
> > consists of 13 steps (very tricky!!). It's not sophisticated and could
> > be improved. test.sh is the shell script I used to execute the
> > reproduction steps from 1 to 10. In my environment, I could reproduce
> > this issue by the following steps.
> >
> > 1. apply the patch and build PostgreSQL.
> > 2. run test.sh.
> > 3. execute the step 11 and later described in scenario.md.
> >
> > The test.sh is a very hacky and dirty script and is optimized in my
> > environment (especially adding many sleeps). You might need to adjust
> > it while checking scenario.md.
> >
> > I've also confirmed that this issue is fixed by the attached patch,
> > which clears candidate_restart_lsn and friends during
> > ReplicationSlotRelease().
>
> Hi,
> one important question - do I patch newer Pg, or older? The thing is
> that we were able to replicate the problem (with some luck) only on
> production databases, and patching them will be hard sell. Maybe
> possible, but if it's enough to patch the pg14 (recipient) it would make
> my life much easier.

Unfortunately, the patch I attached is for the publisher (i.e., sender
side). There might be a way to fix this issue from the receiver side,
but I don't have one for now.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On 2023-Feb-06, Masahiko Sawada wrote:

> I've also confirmed that this issue is fixed by the attached patch,
> which clears candidate_restart_lsn and friends during
> ReplicationSlotRelease().

Hmm, interesting -- I was studying some other bug recently involving the
xmin of a slot that had been invalidated and I remember wondering if
these "candidate" fields were being properly ignored when the slot is
marked not in use; but I didn't check.  Are you sure that resetting them
when the slot is released is the appropriate thing to do?  I mean,
shouldn't they be kept set while the slot is in use, and only reset if
we destroy it?

(But, actually, in that case, maybe we don't need to reset them: instead
we need to make sure to ignore the values for slots that are not in_use.
However, I don't know what the semantics are for these "candidate"
fields, so maybe this is wrong.)

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



On Tue, Feb 7, 2023 at 12:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2023-Feb-06, Masahiko Sawada wrote:
>
> > I've also confirmed that this issue is fixed by the attached patch,
> > which clears candidate_restart_lsn and friends during
> > ReplicationSlotRelease().
>
> Hmm, interesting -- I was studying some other bug recently involving the
> xmin of a slot that had been invalidated and I remember wondering if
> these "candidate" fields were being properly ignored when the slot is
> marked not in use; but I didn't check.  Are you sure that resetting them
> when the slot is released is the appropriate thing to do?  I mean,
> shouldn't they be kept set while the slot is in use, and only reset if
> we destroy it?

Thinking about the root cause more, it seems to me that it is not the
fact that the candidate_xxx values are not cleared when the slot is
released.

In the scenario I reproduced, after restarting the logical decoding,
the walsender sets the restart_lsn to a candidate_restart_lsn left in
the slot upon receiving the ack from the subscriber. Then, when
decoding a RUNNING_XACTS record, the walsender updates the
restart_lsn, which actually retreats it. The patch, which clears the
candidate_xxx values at release, fixes this issue by not updating
the restart_lsn upon receiving the ack in this case, but I think it's
legitimate that the walsender sets the restart_lsn to the
candidate_restart_lsn left in the slot in this case. The root cause
here seems to me to be that we allow the restart_lsn to retreat. In
LogicalIncreaseRestartDecodingForSlot(), we have the following code:

/* don't overwrite if have a newer restart lsn */
if (restart_lsn <= slot->data.restart_lsn)
{
}

/*
 * We might have already flushed far enough to directly accept this lsn,
 * in this case there is no need to check for existing candidate LSNs
 */
else if (current_lsn <= slot->data.confirmed_flush)
{
    slot->candidate_restart_valid = current_lsn;
    slot->candidate_restart_lsn = restart_lsn;

    /* our candidate can directly be used */
    updated_lsn = true;
}

/*
 * Only increase if the previous values have been applied, otherwise we
 * might never end up updating if the receiver acks too slowly. A missed
 * value here will just cause some extra effort after reconnecting.
 */
if (slot->candidate_restart_valid == InvalidXLogRecPtr)
{
    slot->candidate_restart_valid = current_lsn;
    slot->candidate_restart_lsn = restart_lsn;
    SpinLockRelease(&slot->mutex);
}

If slot->candidate_restart_valid is InvalidXLogRecPtr, the
candidate_restart_lsn could be set to an LSN lower than the slot's
restart_lsn, in spite of the first comment saying "don't overwrite if
have a newer restart lsn". So I think the right fix here is to not
allow the restart_lsn to retreat, by changing the "if" of the third
block to "else if".
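
To make that concrete, a rough sketch of the change (not a tested
patch; the spinlock release handling for the branches that no longer
run is omitted here and would need to stay correct):

    /*
     * Only increase if the previous values have been applied, otherwise we
     * might never end up updating if the receiver acks too slowly. A missed
     * value here will just cause some extra effort after reconnecting.
     *
     * As an "else if", this branch is no longer reached when restart_lsn
     * is older than slot->data.restart_lsn, so the candidate cannot
     * retreat.
     */
    else if (slot->candidate_restart_valid == InvalidXLogRecPtr)
    {
        slot->candidate_restart_valid = current_lsn;
        slot->candidate_restart_lsn = restart_lsn;
    }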

>
> (But, actually, in that case, maybe we don't need to reset them: instead
> we need to make sure to ignore the values for slots that are not in_use.
> However, I don't know what the semantics are for these "candidate"
> fields, so maybe this is wrong.)

IIUC in that case, the WAL segments are removed while the slot is in_use.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From
hubert depesz lubaczewski
Date:
Hi,
so, we have another bit of interesting information. maybe related, maybe
not.

We noticed a weird situation on two clusters we're trying to upgrade.

In both cases the situation looked the same:

1. there was another process (debezium) connected to source (pg12) using
   logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR:  requested
   WAL segment ... has already been '
3. some time afterwards (most likely a couple of hours later) the
   process that is/was responsible for the debezium replication (a pg
   walsender process) stopped handling WAL, and is instead eating 100%
   of CPU.

When this situation happens, pg_cancel_backend(pid) doesn't work on
the "broken" walsender, and neither does pg_terminate_backend()!

strace of the process doesn't show anything.

When I tried to get a backtrace with gdb, all I got was:

(gdb) bt
#0  0x0000aaaad270521c in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()

If I quit gdb, restart it, and redo bt, I get

#0  0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad291ae58 in ?? ()

or

#0  0x0000aaaad2705244 in hash_seq_search ()
#1  0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
#2  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#3  0x0000aaaad26e3644 in CallSyscacheCallbacks ()
#4  0x0000aaaad257764c in ReorderBufferCommit ()
#5  0x0000aaaad256c804 in ?? ()
#6  0x0000aaaaf303d280 in ?? ()

At this moment, the only thing that we can do is kill -9 the process (or
restart pg).

I don't know if it's relevant, but I have this case *right now*, and
if it's helpful I can provide more information before we have to kill
it.

Best regards,

depesz




This seems to have a thin connection to the issue, but.

At Wed, 16 Nov 2022 23:25:12 -0800, Andres Freund <andres@anarazel.de> wrote in 
> I'm a bit confused by the changes c6550776394e made to
> KeepLogSeg() etc.
> 
> Before:
>     /* then check whether slots limit removal further */
>     if (max_replication_slots > 0 && keep != InvalidXLogRecPtr)
>     {
>         XLogSegNo    slotSegNo;
> 
>         XLByteToSeg(keep, slotSegNo, wal_segment_size);
> 
>         if (slotSegNo <= 0)
>             segno = 1;
>         else if (slotSegNo < segno)
>             segno = slotSegNo;
>     }
> 
> Current:
>     keep = XLogGetReplicationSlotMinimumLSN();
>     if (keep != InvalidXLogRecPtr)
>     {
>         XLByteToSeg(keep, segno, wal_segment_size);
> 
>         /* Cap by max_slot_wal_keep_size ... */
>         if (max_slot_wal_keep_size_mb >= 0)
>         {
>             uint64        slot_keep_segs;
> 
>             slot_keep_segs =
>                 ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
> 
>             if (currSegNo - segno > slot_keep_segs)
>                 segno = currSegNo - slot_keep_segs;
>         }
>     }
> 
> Which looks to me to have lost the protection against segno being 0. Which
> afaict would break the callers, because they do _logSegNo--. Which would then
> allow all WAL to be reused, no?  I guess the argument for changing this was
> that we should never have such a dangerous value returned by
> XLogGetReplicationSlotMinimumLSN()?

Mmm. Right. If XLogGetReplicationSlotMinimumLSN() returned an LSN
within the zeroth segment, segno would end up being 0, which is
dangerous if it happens. However, if it does happen, the slots are
evidently already broken; still, there's no cost to having that
protection.
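
A minimal sketch of re-adding that guard inside the block quoted above
(illustration only; an actual patch may place it differently):

    keep = XLogGetReplicationSlotMinimumLSN();
    if (keep != InvalidXLogRecPtr)
    {
        XLByteToSeg(keep, segno, wal_segment_size);

        /* never let the cutoff segment number become 0 */
        if (segno == 0)
            segno = 1;

        /* Cap by max_slot_wal_keep_size ... (unchanged) */
    }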

> It seems decidedly not great to not log at least a debug1 (but probably it
> should be LOG) message when KeepLogSeg() decides to limit based on
> max_slot_wal_keep_size.

It's easy to do that, but that log message would almost always be
accompanied by the LOG line "terminating process %d to release
replication slot \"%s\"". I don't mind adding it if it is a DEBUGx.

> It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end
> of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside,
> only actually remove WAL if older than the segment that RedoRecPtr (the
> logical start of the checkpoint) is in. If the checkpoint is large we'll end
> up removing replication slots even though they potentially would only have
> retained one additional WAL segment.

I think that is a controversial part, but that variable is defined in
a similar way to wal_keep_size.  And I think that having max_wal_size,
wal_keep_size and max_slot_wal_keep_size all defined against the same
base LSN makes things simpler for users (and also for developers).
Regardless of checkpoint length, if slots get invalidated frequently,
the setting should be considered too small for the system's
requirements.

> Isn't it problematic to use ConvertToXSegs() to implement
> max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large
> wal_segment_size that'd afaict lead to being much more aggressive invalidating
> slots.

I think max_slot_wal_keep_size is, like max_wal_size for checkpoints,
a safeguard to keep slots from filling up the WAL directory. Thus they
are both rounded down. If you have a 1GB WAL directory and set
wal_segment_size to 4192MB, I don't see that as a sane setup. But if
the segment size is smaller than one hundredth of max_wal_size, the
difference won't matter to anyone.  But anyway, it's a pain in the a..
that both of them (and wal_keep_size) don't take effect in an
immediate manner, though..

> Also, why do we do something as expensive as
> InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no
> effect?

Indeed. Maybe we didn't regard that process as complex at the start?
I think we can compare the cutoff segno against
XLogGetReplicationSlotMinimumLSN() before entering the loop over the
slots.
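
As a sketch of that idea (hypothetical; the variable names follow the
checkpointer code that computes the removal cutoff, and the handling of
the function's return value, used to recompute the cutoff after an
invalidation, is left out):

    XLogRecPtr  slotminptr = XLogGetReplicationSlotMinimumLSN();

    if (slotminptr != InvalidXLogRecPtr)
    {
        XLogSegNo   slotminseg;

        XLByteToSeg(slotminptr, slotminseg, wal_segment_size);

        /* only walk the slots if the cutoff can actually invalidate one */
        if (slotminseg < _logSegNo)
            InvalidateObsoleteReplicationSlots(_logSegNo);
    }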

Thus I think there's room for the following improvements.

- Prevent KeepLogSeg() from returning 0.

- Add a DEBUG log message to KeepLogSeg(), emitted when
  max_slot_wal_keep_size takes effect.

- Check against the minimum slot LSN before actually looping through
  the slots in InvalidateObsoleteReplicationSlots().

I'm not sure about the second item, but the others seem
back-patchable.

If we need to continue this discussion further, it will need another
thread. Anyway, I'll come up with a patch for the above three items.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Hi,

On 2023-02-13 15:45:49 +0900, Kyotaro Horiguchi wrote:
> This seems to have a thin connection to the issue, but.

I was worried that the changes could lead to us removing WAL even
without max_slot_wal_keep_size set.


> > It seems decidedly not great to not log at least a debug1 (but probably it
> > should be LOG) message when KeepLogSeg() decides to limit based on
> > max_slot_wal_keep_size.
> 
> It's easy to do that, but that log message would almost always be
> accompanied by the LOG line "terminating process %d to release
> replication slot \"%s\"". I don't mind adding it if it is a DEBUGx.

My problem with that is that we might *NOT* see those log messages for
some reason, and then that's impossible to debug as-is. And even if we
see them, it's not that easy to figure out by how much we were over
max_slot_wal_keep_size, because we always report it in the context of a
specific slot.

Removing WAL that's still needed is a *very* severe operation. Emitting an
additional line in case it happens isn't a problem.
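
For instance (a sketch only; the wording, level, and exact placement
are placeholders), something like this inside the capping branch of
KeepLogSeg() quoted earlier would make the cutback visible regardless
of which slot later gets reported:

    if (currSegNo - segno > slot_keep_segs)
    {
        elog(LOG, "slot WAL retention capped by max_slot_wal_keep_size: "
             "keeping segment " UINT64_FORMAT " instead of " UINT64_FORMAT,
             currSegNo - slot_keep_segs, segno);
        segno = currSegNo - slot_keep_segs;
    }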


> > It feels wrong to subtract max_slot_wal_keep_size from recptr - that's the end
> > of the checkpoint record. Given that we, leaving max_slot_wal_keep_size aside,
> > only actually remove WAL if older than the segment that RedoRecPtr (the
> > logical start of the checkpoint) is in. If the checkpoint is large we'll end
> > up removing replication slots even though they potentially would only have
> > retained one additional WAL segment.
> 
> I think that is a controversial part, but that variable is defined in
> a similar way to wal_keep_size.  And I think that having max_wal_size,
> wal_keep_size and max_slot_wal_keep_size all defined against the same
> base LSN makes things simpler for users (and also for developers).
> Regardless of checkpoint length, if slots get invalidated frequently,
> the setting should be considered too small for the system's
> requirements.

I think it's bad that we define wal_keep_size and
max_slot_wal_keep_size that way. I don't think bringing max_wal_size
into this is useful, as it influences different things.


> > Isn't it problematic to use ConvertToXSegs() to implement
> > max_slot_wal_keep_size, given that it rounds *down*? Particularly for a large
> > wal_segment_size that'd afaict lead to being much more aggressive invalidating
> > slots.
> 
> I think max_slot_wal_keep_size is, like max_wal_size for checkpoints,
> a safeguard to keep slots from filling up the WAL directory. Thus they
> are both rounded down. If you have a 1GB WAL directory and set
> wal_segment_size to 4192MB, I don't see that as a sane setup. But if
> the segment size is smaller than one hundredth of max_wal_size, the
> difference won't matter to anyone.  But anyway, it's a pain in the a..
> that both of them (and wal_keep_size) don't take effect in an
> immediate manner, though..

It doesn't matter a lot for 16MB segments, but with 1GB segments it's a
different story.
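
To put a number on it (a hypothetical configuration for illustration):
ConvertToXSegs() does integer division, so with wal_segment_size = 1GB
and max_slot_wal_keep_size = 1900MB the allowance becomes 1900 / 1024 =
1 segment; nearly half of the configured budget is silently dropped,
whereas with 16MB segments the rounding loses at most one 16MB segment.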

To me the way it's done now is a bug, one that can in extreme circumstances
lead to data loss.


> > Also, why do we do something as expensive as
> > InvalidateObsoleteReplicationSlots() even when max_slot_wal_keep_size had no
> > effect?
> 
> Indeed. Maybe we didn't regard that process as complex at the start?
> I think we can compare the cutoff segno against
> XLogGetReplicationSlotMinimumLSN() before entering the loop over the
> slots.

That'd be better, but I'd probably go further, and really gate it on
max_slot_wal_keep_size having had an effect.


> Thus I think there's room for the following improvements.
>
> - Prevent KeepLogSeg() from returning 0.
>
> - Add a DEBUG log message to KeepLogSeg(), emitted when
>   max_slot_wal_keep_size takes effect.
>
> - Check against the minimum slot LSN before actually looping through
>   the slots in InvalidateObsoleteReplicationSlots().
>
> I'm not sure about the second item, but the others seem
> back-patchable.
>
> If we need to continue this discussion further, it will need another
> thread. Anyway, I'll come up with a patch for the above three items.

Yep, probably a good idea to start another thread.

There's also https://www.postgresql.org/message-id/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de
that unfortunately nobody replied to.

Greetings,

Andres Freund