Re: Logical Replication WIP - Mailing list pgsql-hackers

From Steve Singer
Subject Re: Logical Replication WIP
Date
Msg-id 581687C9.2090700@ssinger.info
Whole thread Raw
In response to Re: Logical Replication WIP  (Petr Jelinek <petr@2ndquadrant.com>)
Responses Re: Logical Replication WIP  (Petr Jelinek <petr@2ndquadrant.com>)
List pgsql-hackers
On 10/24/2016 09:22 AM, Petr Jelinek wrote:
> Hi,
>
> attached is updated version of the patch.
>
> There are quite a few improvements and restructuring, I fixed all the
> bugs and basically everything that came up from the reviews and was
> agreed on. There are still couple of things missing, ie column type
> definition in protocol and some things related to existing data copy.

Here are a few things I've noticed so far.

+<programlisting>
+CREATE SUBSCRIPTION mysub WITH CONNECTION <quote>dbname=foo host=bar 
user=repuser</quote> PUBLICATION mypub;
+</programlisting>
+  </para>
+  <para>

The documentation above doesn't match the syntax, CONNECTION needs to be 
in single quotes not double quotes
I think you want
+<programlisting>
+CREATE SUBSCRIPTION mysub WITH CONNECTION 'dbname=foo host=bar 
user=repuser' PUBLICATION mypub;
+</programlisting>
+  </para>
+  <para>



I am not sure if this is a known issue covered by your comments about 
data copy but I am still having issues with error reporting on a failed 
subscription.

I created a subscription, dropped the subscription and created a second 
one.  The second subscription isn't active but shows no errors.


P: create publication mypub for table public.a;
S: create subscription mysub with connection 'dbname=test host=localhost 
port=5440' publication mypub;
P:  insert into a(b) values ('t');
S: select * FROM a; a | b
---+--- 1 | t
(1 row)

Everything is good
Then I do
S: drop subscription mysub;

S: create subscription mysub2 with connection 'dbname=test 
host=localhost port=5440' publication mypub;
P:  insert into a(b) values ('f');
S: select * FROM a; a | b
---+--- 1 | t

The data doesn't replicate


select * FROM pg_stat_subscription; subid | subname | pid | relid | received_lsn | last_msg_send_time | 
last_msg_receipt_time | latest_end_lsn | latest_end_time

-------+---------+-----+-------+--------------+--------------------+-----------------------+----------------+-----------------
16398| mysub2  |     |       |              | |                       
 
|                |
(1 row)


The only thing in my log is

2016-10-30 15:27:27.038 EDT [6028] NOTICE:  dropped replication slot 
"mysub" on publisher
2016-10-30 15:27:36.072 EDT [6028] NOTICE:  created replication slot 
"mysub2" on publisher
2016-10-30 15:27:36.082 EDT [6028] NOTICE:  synchronized table states


I'd expect an error in the log or something.
However, if I delete everything from the table on the subscriber then 
the subscription proceeds

I think there are still problems with signal handling in the initial sync

If I try to drop mysub2 (while the subscription is stuck instead of 
deleting the data) the drop hangs
If I then try to kill the postmaster for the subscriber nothing happens, 
have to send it a -9 to go away.

However once I do that and then restart the postmaster for the 
subscriber I start to see the duplicate key errors in the log

2016-10-30 16:00:54.635 EDT [7018] ERROR:  duplicate key value violates 
unique constraint "a_pkey"
2016-10-30 16:00:54.635 EDT [7018] DETAIL:  Key (a)=(1) already exists.
2016-10-30 16:00:54.635 EDT [7018] CONTEXT:  COPY a, line 1
2016-10-30 16:00:54.637 EDT [7007] LOG:  worker process: logical 
replication worker 16400 sync 16387 (PID 7018) exited with exit code 1

I'm not sure why I didn't get those until I restarted the postmaster but 
it seems to happen whenever I drop a subscription then create a new one. 
Creating the second subscription from the same psql session as I 
create/drop the first seems important  in reproducing this




I am also having issues dropping a second subscription from the same 
psql session

(table a is empty on both nodes to avoid duplicate key errors)
S: create subscription sub1 with connection  'host=localhost dbname=test 
port=5440' publication mypub;
S: create subscription sub2 with connection  'host=localhost dbname=test 
port=5440' publication mypub;
S: drop subscription sub1;
S: drop subscription sub2;

At this point the drop subscription hangs.






>
> The biggest changes are:
>
> I added one more prerequisite patch (the first one) which adds ephemeral
> slots (or well implements UI on top of the code that was mostly already
> there). The ephemeral slots are different in that they go away either on
> error or when session is closed. This means the initial data sync does
> not have to worry about cleaning up the slots after itself. I think this
> will be useful in other places as well (for example basebackup). I
> originally wanted to call them temporary slots in the UI but since the
> behavior is bit different from temp tables I decided to go with what the
> underlying code calls them in UI as well.
>
> I also split out the libpqwalreceiver rewrite to separate patch which
> does just the re-architecture and does not really add new functionality.
> And I did the re-architecture bit differently based on the review.
>
> There is now new executor module in execReplication.c, no new nodes but
> several utility commands. I moved there the tuple lookup functions from
> apply and also wrote new interfaces for doing inserts/updates/deletes to
> a table including index updates and constraints checks and trigger
> execution but without the need for the whole nodeModifyTable handling.
>
> What I also did when rewriting this is implementation of the tuple
> lookup also using sequential scan so that we can support replica
> identity full properly. This greatly simplified the dependency handling
> between pkeys and publications (by removing it completely ;) ). Also
> when there is replica identity full and the table has primary key, the
> code will use the primary key even though it's not replica identity
> index to lookup the row so that users who want to combine the logical
> replication with some kind of other system that requires replica
> identity full (ie auditing) they still get usable experience.
>
> The way copy is done was heavily reworked. For one it uses the ephemeral
> slots mentioned above. But more importantly there are now new custom
> commands anymore. Instead the walsender accepts some SQL, currently
> allowed are BEGIN, ROLLBACK, SELECT and COPY. The way that is
> implemented is probably not perfect and it could use look from somebody
> who knows bison better. How it works is that if the command sent to
> walsender starts with one of the above mentioned keywords the walsender
> parser passes the whole query back and it's passed then to
> exec_simple_query. The main reason why we need BEGIN is so that the COPY
> can use the snapshot exported by the slot creation so that there is
> synchronization point when there are concurrent writes. This probably
> needs more discussion.
>
> I also tried to keep the naming more consistent so cleaned up all
> mentions of "provider" and changed them to "publisher" and also
> publications don't mention that they "replicate", they just "publish"
> now (that has effect on DDL syntax as well).
>
>
> Some things that were discussed in the reviews that I didn't implement
> knowingly include:
>
> Removal of the Oid in the pg_publication_rel, that's mainly because it
> would need significant changes to pg_dump which assumes everything
> that's dumped has Oid and it's not something that seems worth it as part
> of this patch.
>
> Also didn't do the outfuncs, it's unclear to me what are the rules there
> as the only DDL statement there is CreateStmt atm.
>
>
> There are still few TODOs:
>
> Type info for columns. My current best idea is to write typeOid and
> typemod in the relation message and add another message (type message)
> that describes the type which will skip the built-in types (as we can't
> really remap those without breaking a lot of software so they seem safe
> to skip). I plan to do this soonish barring objections.
>
> Removal of use of replication origin in the table sync worker.
>
> Parallelization of the initial copy. And ability to resync (do new copy)
> of a table. These two mainly wait for agreement over how the current way
> of doing copy should work.
>
>
>




pgsql-hackers by date:

Previous
From: Tatsuo Ishii
Date:
Subject: Re: sources.sgml typo
Next
From: Masahiko Sawada
Date:
Subject: Re: Transactions involving multiple postgres foreign servers