Re: Errors with schema migration and logical replication — expected? - Mailing list pgsql-general

From Adrian Klaver
Subject Re: Errors with schema migration and logical replication — expected?
Date
Msg-id 42bcb35c-9173-bb9e-bda6-ea5c5950119b@aklaver.com
Whole thread Raw
In response to Errors with schema migration and logical replication — expected?  (Mike Lissner <mlissner@michaeljaylissner.com>)
Responses Re: Errors with schema migration and logical replication — expected?  (Mike Lissner <mlissner@michaeljaylissner.com>)
List pgsql-general
On 12/8/18 11:26 PM, Mike Lissner wrote:
> Hi, first time poster.
> 
> I just ran into a rather messy problem when doing a schema migration 
> with logical replication. I'm not entirely sure what went wrong, why, or 
> how to prevent it in the future. The migration I ran was pretty simple 
> (though auto-generated by Django):

> 
> And after running this migration, I started getting this error on the 
> subscriber:
> 
> |2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply 
> worker for subscription "replicasubscription" has started 2018-12-09 
> 05:59:45 UTC::@:[13373]:ERROR: null value in column 
> "recap_sequence_number" violates not-null constraint 2018-12-09 05:59:45 
> UTC::@:[13373]:DETAIL: Failing row contains (48064261, 2018-12-07 
> 04:48:40.388377+00, 2018-12-07 04:48:40.388402+00, null, 576, , 4571214, 
> null, null). 2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: 
> logical replication worker for subscription 18390 (PID 13373) exited 
> with exit code 1|
> 
> 
> So, my migration created a new column with a null constraint and somehow 
> the subscriber got data that violated that. I don't know how that's 
> possible since this was a new column and it was never nullable.


The above seems to be the crux of the problem, how did NULL get into the 
column data?

The DDL migration did what it was supposed to do.

> 
> I applied the above migration simultaneously on my publisher and 
> subscriber thinking that postgresql was smart enough to do the right 
> thing. I think the subscriber finished first (it has less traffic).

And it did as far as the schema changes where concerned.

> 
> The docs hint that postgresql might be smart enough to not worry about 
> the order you do migrations:
> 
>  > /Logical replication is robust when schema definitions change in a 
> live database:/ When the schema is changed on the publisher and 
> replicated data starts arriving at the subscriber but does not fit into 
> the table schema, replication will error until the schema is updated.
> 
> And it even hints that doing a migration on the subscriber first is a 
> good thing in some cases:
> 
>  > In many cases, intermittent errors can be avoided by applying 
> additive schema changes to the subscriber first.
> 
> But I'm now supremely skeptical that doing anything at the subscriber 
> first is a good idea. Are the docs wrong? Does the above error make 
> sense? Is the process for schema migrations documented somewhere beyond 
> the above?
> 
> I have lots of questions because I thought this would have gone smoother 
> than it did.
> 
> As for the fix: I made the column nullable on the subscriber and I'm 
> waiting for it to catch up. Once it does I'll re-sync its schema with 
> the publisher. Anybody interested in following along with all this (or 
> finding this later and having questions) can follow the issue here:
> 
> https://github.com/freelawproject/courtlistener/issues/919
> 
> Thank you for the lovely database! I hope this is helpful.
> 
> Mike


-- 
Adrian Klaver
adrian.klaver@aklaver.com


pgsql-general by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: amazon aroura config - seriously overcommited defaults? (May beOff Topic)
Next
From: Mike Lissner
Date:
Subject: Re: Errors with schema migration and logical replication — expected?