The following bug has been logged on the website:
Bug reference: 15549
Logged by: Mike Lissner
Email address: mlissner@michaeljaylissner.com
PostgreSQL version: 10.6
Operating system: Linux
Description:
As discussed on the mailing list[1] I recently ran into what feels like a
bug in logical replication. The migration I ran had the following SQL:
ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number"
varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number"
DROP DEFAULT;
The documentation says:
> *Logical replication is robust when schema definitions change in a live
database:* When the schema is changed on the publisher and replicated data
starts arriving at the subscriber but does not fit into the table schema,
replication will error until the schema is updated.
...and:
> In many cases, intermittent errors can be avoided by applying additive
schema changes to the subscriber first.
That made me think that it didn't much matter whether I applied the
migration at the subscriber or the publisher first (it said things were
robust, right?), so I did them at roughly the same time. The subscriber
finished first.
After the migration, I had the following errors:
2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply
worker for subscription "replicasubscription" has started
2018-12-09 05:59:45 UTC::@:[13373]:ERROR: null value in column
"recap_sequence_number" violates not-null constraint
2018-12-09 05:59:45 UTC::@:[13373]:DETAIL: Failing row contains
(48064261, 2018-12-07 04:48:40.388377+00, 2018-12-07
04:48:40.388402+00, null, 576, , 4571214, null, null).
2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: logical
replication worker for subscription 18390 (PID 13373) exited with exit
code 1
The problem, from what we've discussed on the mailing list seems to be
that:
> Columns of a table are also matched by name. A different order of
columns in the target table is allowed, but the column types have to
match. The target table can have additional columns not provided by the
published table. Those will be filled with their default values."
...and:
> "If there is no default for a column, then the default is null."
So what appears to have happened is that we disallowed nulls and didn't have
a default value, with the result being breakage. As I discuss in the mailing
list post, this was fixed by briefly allowing nulls on the subscriber,
UPDATE'ing the null values in that column to be empty strings (the Django
default), and then disallowing nulls again.
I guess things worked more or less how the documentation described, but:
1. I'd expect a warning in the documentation about which types of additive
changes cause breakage.
2. I'd love to see some documentation about the best practices for schema
changes. There's none that I know of.
3. I'd expect the breakage to be "robust" as the documentation promises.
I.e., when somebody adds a column at the subscriber that has these
characteristics (NOT NULL and no DEFAULT), I'd expect the *replication* to
break, but I'd expect that when the publisher's schema matched again, that
the replication would return to full functionality. Under no circumstances
do I expect the publisher to send faulty data to the subscriber that
requires manual intervention to repair.
There are more details on the mailing list, and I filed an issue in our bug
tracker here:
https://github.com/freelawproject/courtlistener/issues/919
Thank you,
Mike
[1]:
https://www.postgresql.org/message-id/CAMp9%3DEydz258bTK-7%3DupANs%2BPff4wYymQuG%3DuNjeupCCrbpKFQ%40mail.gmail.com