Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAJpy0uAMqBkb5zuLFbw3BQY66JFtv6Mvp2XRzZna6gOpNUZE9w@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Wed, Dec 27, 2023 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Dec 26, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > I would like to revisit the current dependency of slotsync worker on > > > dbname used in 002 patch. Currently we accept dbname in > > > primary_conninfo and thus the user has to make sure to provide one (by > > > manually altering it) even in case of a conf file auto-generated by > > > "pg_basebackup -R". > > > Thus I would like to discuss if there are better ways to do it. > > > Complete background is as follow: > > > > > > We need dbname for 2 purposes: > > > > > > 1) to connect to remote db in order to run SELECT queries to fetch the > > > info needed by slotsync worker. > > > 2) to make connection in slot-sync worker itself in order to be able > > > to use libpq APIs for 1) > > > > > > We run 3 kind of select queries in slot-sync worker currently: > > > > > > a) To fetch all failover slots (logical slots) info at once in > > > synchronize_slots(). > > > b) To fetch a particular slot info during > > > wait_for_primary_slot_catchup() logic (logical slot). > > > c) To validate primary slot (physical one) and also to distinguish > > > between standby and cascading standby by running pg_is_in_recovery(). > > > > > > 1) One approach to avoid dependency on dbname is using commands > > > instead of SELECT. This will need implementing LIST_SLOTS command for > > > a), and for b) we can use LIST_SLOTS and fetch everything (even though > > > it is not needed) or have LIST_SLOTS with a filter on slot-name or > > > extend READ_REPLICATION_SLOT, and for c) we can have some other > > > command to get pg_is_in_recovery() info. But, I feel by relying on > > > commands we will be making the extension of the slot-sync feature > > > difficult. In future, if there is some more requirement to fetch any > > > other info, > > > then there too we have to implement a command. I am not sure if it is > > > good and extensible approach. > > > > > > 2) Another way to avoid asking for a dbname in primary_conninfo is to > > > use the default dbname internally. This brings us to two questions: > > > 'How' and 'Which default db'? > > > > > > 2.1) To answer 'How': > > > Using default dbname is simpler for the purpose of slot-sync worker > > > having its own db-connection, but is a little tricky for the purpose > > > of connection to remote_db. This is because we have to inject this > > > dbname internally in our connection-info. > > > > > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname), > > > then currently it could have 2 formats: > > > > > > a) The simple "=" format for key-value pairs, example: > > > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'. > > > b) URI format, example: > > > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp > > > > > > We can distinguish between the 2 formats using 'uri_prefix_length' but > > > injecting the dbname part will be messy specially for URI format. If > > > we want to do it w/o injecting and only by changing libpq interfaces > > > to accept dbname separately apart from conninfo, then there is no > > > current simpler way available. It will need a good amount of changes > > > in libpq. > > > > > > 2.1.2) Another way is to not rely on primary_conninfo directly but > > > rely on 'WalRcv->conninfo' in order to connect to remote_db. This is > > > because the latter is never URI format, it is some parsed format and > > > appending may work. As an example, primary_conninfo = > > > 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded > > > internally is: > > > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer > > > dbname=replication host=localhost port=5433 > > > fallback_application_name=walreceiver sslmode=prefer sslcompression=0 > > > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 > > > gssencmode=disable krbsrvname=postgres gssdelegation=0 > > > target_session_attrs=any load_balance_hosts=disable", '\000' > > > > > > So we can try appending our default dbname to this. But all the > > > defaults loaded in WalRcv->conninfo need some careful analysis to > > > figure out if they work for slot-sync worker case. > > > > > > 2.2) Now coming to 'Which default db': > > > > > > 2.2.1) If we use 'template1' as default db, it may block 'create db' > > > operations on primary for the time when the slot-sync worker is > > > connected to remote using this dbname. Example: > > > > > > postgres=# create database newdb1; > > > ERROR: source database "template1" is being accessed by other users > > > DETAIL: There is 1 other session using the database. > > > > > > 2.2.2) If we use 'postgres' as default db, there are chances that it > > > can be dropped as unlike 'template1', it is allowed to be dropped by > > > user, and if slotsync worker is connected to it, user may see: > > > newdb1=# drop database postgres; > > > ERROR: database "postgres" is being accessed by other users > > > DETAIL: There is 1 other session using the database. > > > > > > But once the slot-sync worker or standby goes down, user can always > > > drop this and next time slot-sync worker may not be able to come up. > > > > > > > Other random ideas for discussion are: > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > parameter, say slot_sync_dbname, to specify the database to connect. > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > specifies it. If both don't have a dbname, raise an error. > > > > Would the users prefer to provide a value for a separate GUC instead > of changing primary_conninfo? It is possible that we can have some > users prefer to use one GUC and others prefer a separate GUC but we > should add a new GUC if we are sure that is what users would prefer. > Also, even if have to consider this option, I think we can easily > later add a new GUC to provide a dbname in addition to having the > provision of giving it in primary_conninfo. > > Also, I think having a separate GUC for dbanme has some complexity in > terms of appending the dbname to primary_conninfo as pointed out by > Shveta. > > > 4) The slotsync worker uses a new GUC parameter, say > > slot_sync_conninfo, to specify the connection string to the primary > > aside from primary_conninfo. And pg_basebackup -R generates > > slot_sync_conninfo as well if required (new option required). > > > > Yeah, this is worth considering but won't slot_sync_conninfo be mostly > a duplicate of primary_conninfo apart from dbname? I am not sure if > the benefit outweighs the disadvantage of having mostly similar > information in two GUCs. > > -- > With Regards, > Amit Kapila. PFA v55. It has fixes for 2 CFBot failures seen on v53 and 1 CFBot failure seen on v54. patch002: 1) In 32-bit env, a Datum for int64 is treated as a pointer, and thus below leads to NULL pointer access if the concerned attribute is NULL. Corrected it now. DatumGetLSN(slot_getattr(tupslot, 3, &isnull)); 2)During slot-creation on standby it is possible to get NULL confirmed_lsn from primary even for a valid slot with valid restart_lsn. This may happen when a slot is just created on primary with valid restart_lsn and slot-sync worker has fetched it before primary could set valid confirmed_lsn. And thus along with remote_slot's restart_lsn to catch up, we also need to check for non-null confirmed_lsn of remote_slot. patch003: 3) Another intermittent failure was due to an unstable test added in 050_standby_failover_slots_sync.pl. It has now been removed. The other tests already have the coverage which the problematic test was trying to achieve. Thank You Hou-san for working on this. thanks Shveta
Attachment
pgsql-hackers by date: