RE: speed up a logical replica setup - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: speed up a logical replica setup
Date
Msg-id TYCPR01MB1207713BEC5C379A05D65E342F54B2@TYCPR01MB12077.jpnprd01.prod.outlook.com
In response to Re: speed up a logical replica setup  ("Euler Taveira" <euler@eulerto.com>)
Responses RE: speed up a logical replica setup
List pgsql-hackers
Dear Euler,

Further comments for v17.

01.
This program assumes that the target server has the same major version as the
program itself, because the target server will be restarted by the pg_ctl of
that version. I think this should be verified by reading PG_VERSION.
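
To illustrate the idea, here is a minimal sketch of such a check. It is not
taken from the patch: the function names are hypothetical, the expected major
version is passed in by the caller, and it assumes v10 or later, where
PG_VERSION contains a single integer.

```c
#include <stdio.h>

/*
 * Read the major version recorded in <datadir>/PG_VERSION.
 * Returns the major version (e.g. 17), or -1 if the file cannot be read
 * or parsed.  Hypothetical helper, assumes PostgreSQL 10 or later.
 */
int
read_datadir_major_version(const char *datadir)
{
    char        path[1024];
    FILE       *fp;
    int         major = -1;
    int         n;

    n = snprintf(path, sizeof(path), "%s/PG_VERSION", datadir);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;              /* path did not fit in the buffer */

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;              /* missing or unreadable PG_VERSION */

    if (fscanf(fp, "%d", &major) != 1)
        major = -1;             /* file content is not a version number */

    fclose(fp);
    return major;
}

/* Return 1 if the data directory was created by the expected major version. */
int
datadir_matches_major_version(const char *datadir, int expected_major)
{
    return read_datadir_major_version(datadir) == expected_major;
}
```

The tool could call something like this early and exit with an error before
touching the target server when the versions differ.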

02.
pg_upgrade checks the versions of the executables it uses, such as pg_ctl,
postgres, and pg_resetwal. I think this program should do the same.
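
As a rough sketch of what I mean (not the pg_upgrade code, and the helper
names are hypothetical), the major version can be parsed from one line of
"--version" output such as "pg_ctl (PostgreSQL) 17devel". Capturing that
output from the child process is left out here, and v10 or later is assumed.

```c
#include <stdio.h>

/*
 * Extract the major version from one line of "--version" output, e.g.
 * "pg_ctl (PostgreSQL) 17devel".  Returns -1 if the line cannot be parsed.
 * Hypothetical helper, assumes PostgreSQL 10 or later.
 */
int
major_version_from_output(const char *line)
{
    int         major;

    /* Skip the program name and "(PostgreSQL)", then read the leading integer. */
    if (line == NULL || sscanf(line, "%*s %*s %d", &major) != 1)
        return -1;

    return major;
}

/* Return 1 if the executable reports the same major version as ours. */
int
binary_version_matches(const char *version_line, int my_major)
{
    return major_version_from_output(version_line) == my_major;
}
```

The same comparison would be repeated for each executable the program
invokes.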

03. get_bin_directory
```
    if (find_my_exec(path, full_path) < 0)
    {
        pg_log_error("The program \"%s\" is needed by %s but was not found in the\n"
                     "same directory as \"%s\".\n",
                     "pg_ctl", progname, full_path);
```

s/"pg_ctl"/progname

04.
Missing canonicalize_path()?

05.
Suppose the target server is a cascading standby, i.e., it also acts as a
primary for another standby. In this case, I think the downstream node would
stop working, because pg_createsubscriber runs pg_resetwal and all WAL files are
discarded at that point. I have not tested this, but should the program detect
the situation and exit early?

06.
wait_for_end_recovery() waits forever even if the standby has been disconnected
from the primary, right? Should we check the status of the replication via
pg_stat_wal_receiver?
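
A possible shape for such a loop is sketched below. This is only an
illustration under my own assumptions, not code from the patch: the two probes
are abstracted as callbacks (in the real tool they would be queries such as
SELECT pg_is_in_recovery() and a lookup in pg_stat_wal_receiver), and the
attempt limit, the miss threshold, and all names are hypothetical. The
fake_standby helpers exist only to exercise the loop without a server.

```c
#include <stdbool.h>
#include <unistd.h>

typedef bool (*probe_fn) (void *arg);

typedef enum
{
    WAIT_RECOVERY_ENDED,        /* standby left recovery */
    WAIT_RECEIVER_GONE,         /* walreceiver missing for too many polls */
    WAIT_TIMED_OUT              /* attempt limit reached */
} wait_result;

/*
 * Poll until recovery ends, but give up if the walreceiver has been absent
 * for max_receiver_misses consecutive polls or after max_attempts polls.
 */
wait_result
wait_for_end_recovery_sketch(probe_fn in_recovery, probe_fn receiver_streaming,
                             void *arg, int max_attempts,
                             int max_receiver_misses,
                             unsigned int poll_interval_secs)
{
    int         misses = 0;

    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        if (!in_recovery(arg))
            return WAIT_RECOVERY_ENDED;

        if (receiver_streaming(arg))
            misses = 0;         /* connection is alive, reset the counter */
        else if (++misses >= max_receiver_misses)
            return WAIT_RECEIVER_GONE;

        sleep(poll_interval_secs);
    }
    return WAIT_TIMED_OUT;
}

/* Scripted stand-in for a standby, used only to demonstrate the loop. */
typedef struct
{
    int         polls;              /* in_recovery probes so far */
    int         recovery_ends_at;   /* last poll still in recovery, -1 = never ends */
    int         receiver_lost_at;   /* last poll with a walreceiver, -1 = never lost */
} fake_standby;

bool
fake_in_recovery(void *arg)
{
    fake_standby *s = arg;

    s->polls++;
    return s->recovery_ends_at < 0 || s->polls <= s->recovery_ends_at;
}

bool
fake_receiver_streaming(void *arg)
{
    fake_standby *s = arg;

    return s->receiver_lost_at < 0 || s->polls <= s->receiver_lost_at;
}
```

The consecutive-miss threshold is there so that a brief gap in the
walreceiver does not abort the run immediately; whether that is the right
policy is open for discussion.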

07.
The cleanup function has a couple of bugs.

* If subscriptions have been created on the database, the function also tries to
  drop a publication. This leads to an ERROR because the publication has already
  been dropped. See setup_subscriber().
* If the subscription has been created, drop_replication_slot() leads to an ERROR,
  because the subscriber already tried to drop the replication slot while
  executing DROP SUBSCRIPTION.

08.
I found that all messages (ERROR, WARNING, INFO, etc.) are written to stderr,
but I think they should go to stdout. Is there a reason? pg_dump writes messages
to stderr, but its motivation might be to avoid mixing them with the dump output.

09.
I'm not sure the cleanup on the subscriber is really needed. Suppose there
are two databases, e.g., pg1 and pg2, and we fail to create a subscription on pg2.
This can happen when a subscription with the same name has already been
created on the primary server.
In this case the subscription on pg1 would be removed. But what is the next step?
Since the timeline ID on the standby server is larger than that of the primary
(note that the standby has been promoted once), we cannot resume physical
replication as-is. IIUC the easiest way to retry is to remove the cluster and
start again from pg_basebackup. If so, there is no need to clean up the standby
because it is already corrupted.
We can just say "Please remove the cluster and recreate it".

Here is a reproducer.

1. Apply the txt patch atop the 0001 patch.
2. Run test_corruption.sh.
3. When you see the output below [1], connect to testdb from another terminal and
   run CREATE SUBSCRIPTION for the same subscription on the primary.
4. Finally, pg_createsubscriber fails to create the subscription.

I have also attached the server logs of both nodes and the output.
Note again that this is a real issue. I used a trick to force the name collision,
but it can also happen by chance.

10.
While investigating #09, I found that we do not properly report the reason why
the subscription cannot be created. The output said:

```
pg_createsubscriber: error: could not create subscription "pg_createsubscriber_16389_3884" on database "testdb": out of memory
```

But the standby server log said:

```
ERROR:  subscription "pg_createsubscriber_16389_3884" already exists
STATEMENT:  CREATE SUBSCRIPTION pg_createsubscriber_16389_3884 CONNECTION 'user=postgres port=5431 dbname=testdb' PUBLICATION pg_createsubscriber_16389 WITH (create_slot = false, copy_data = false, enabled = false)
```

[1]
```
pg_createsubscriber: creating the replication slot "pg_createsubscriber_16389_3884" on database "testdb"
pg_createsubscriber: XXX: sleep 20s
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 

