RE: long-standing data loss bug in initial sync of logical replication - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: long-standing data loss bug in initial sync of logical replication
Date
Msg-id TYAPR01MB569258257A869477A27F0BD0F56A2@TYAPR01MB5692.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: long-standing data loss bug in initial sync of logical replication  (Shlok Kyal <shlok.kyal.oss@gmail.com>)
List pgsql-hackers
Dear Shlok,

> Hi,
> 
> I tried to add changes to selectively invalidate the cache to reduce
> the performance degradation during the distribution of invalidations.

Thanks for improving the patch!

>...
> 
> Solution:
> 1. When we alter a publication using commands like ‘ALTER PUBLICATION
> pub_name DROP TABLE table_name’, first all tables in the publications
> are invalidated using the function ‘rel_sync_cache_relation_cb’. Then
> again ‘rel_sync_cache_publication_cb’ function is called which
> invalidates all the tables.

On my environment, rel_sync_cache_publication_cb() was called first and invalidate
all the entries, then rel_sync_cache_relation_cb() was called and the specified
entry is invalidated - hence second is NO-OP.

> This happens because of the following
> callback registered:
> CacheRegisterSyscacheCallback(PUBLICATIONRELMAP,
> rel_sync_cache_publication_cb, (Datum) 0);

But even in this case, I could understand that you want to remove the
rel_sync_cache_publication_cb() callback.

> 2. When we add/drop a schema to/from a publication using command like
> ‘ALTER PUBLICATION pub_name ADD TABLES in SCHEMA schema_name’, first
> all tables in that schema are invalidated using
> ‘rel_sync_cache_relation_cb’ and then again
> ‘rel_sync_cache_publication_cb’ function is called which invalidates
> all the tables.

Even in this case, rel_sync_cache_publication_cb() was called first and then
rel_sync_cache_relation_cb().

> 
> 3. When we alter a namespace using command like ‘ALTER SCHEMA
> schema_name RENAME to new_schema_name’ all the table in cache are
> invalidated as ‘rel_sync_cache_publication_cb’ is called due to the
> following registered callback:
> CacheRegisterSyscacheCallback(NAMESPACEOID,
> rel_sync_cache_publication_cb, (Datum) 0);
>
> So, we added a new callback function ‘rel_sync_cache_namespacerel_cb’
> will be called instead of function ‘rel_sync_cache_publication_cb’ ,
> which invalidates only the cache of the tables which are part of that
> particular namespace. For the new function the ‘namespace id’ is added
> in the Invalidation message.

Hmm, I feel this fix is too much. Unlike ALTER PUBLICATION statements, I think
ALTER SCHEMA is rarely executed at the production stage. However, this approach
requires adding a new cache callback system, which affects the entire postgres
system; this is not very beneficial compared to the outcome. It should be discussed
on another thread to involve more people, and then we can add the improvement
after being accepted.

> Performance Comparison:
> I have run the same tests as shared in [1] and observed a significant
> decrease in the degradation with the new changes.  With selective
> invalidation degradation is around ~5%. This results are an average of
> 3 runs.

IIUC, the executed workload did not contain ALTER SCHEMA command, so
third improvement did not contribute this improvement.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Documentation to upgrade logical replication cluster
Next
From: jian he
Date:
Subject: Re: not null constraints, again