Re: Proposal: Conflict log history table for Logical Replication - Mailing list pgsql-hackers
From: shveta malik
Subject: Re: Proposal: Conflict log history table for Logical Replication
Msg-id: CAJpy0uAfRZa4axLV_e4gvVdmunb8BOVx+Yr=XecECAVD0KnD=A@mail.gmail.com
In response to: Proposal: Conflict log history table for Logical Replication (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: Proposal: Conflict log history table for Logical Replication
List: pgsql-hackers
On Tue, Aug 5, 2025 at 5:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Currently we log conflicts to the server's log file. This approach has
> limitations: 1) Difficult to query and analyze; parsing plain-text log
> files for conflict details is inefficient. 2) Lack of structured data;
> key conflict attributes (table, operation, old/new data, LSN, etc.) are
> not readily available in a structured, queryable format. 3) Difficult
> for external monitoring tools or custom resolution scripts to consume
> conflict data directly.
>
> This proposal aims to address these limitations by introducing a
> conflict log history table, providing a structured and queryable record
> of all logical replication conflicts. Whether to log into the conflict
> log history table, the server logs, or both should be a configurable
> option.

+1 for the idea.

> This proposal has two main design questions:
> ===================================
>
> 1. How do we store conflicting tuples from different tables?
> Using a JSON column to store the row data seems like the most flexible
> solution, as it can accommodate different table schemas.

Yes, that is one option. I have not looked into the details myself, but
you could also explore 'anyarray', as used in pg_statistic to store
'Column data values of the appropriate kind'.

> 2. Should this be a system table or a user table?
> a) System Table: Storing this in a system catalog is simple, but
> catalogs aren't designed for ever-growing data. While pg_largeobject
> is an exception, this is not what we generally do IMHO.
> b) User Table: This offers more flexibility. We could allow a user to
> specify the table name during CREATE SUBSCRIPTION. Then we choose
> either to create the table internally or to let the user create the
> table with a predefined schema.
>
> A potential drawback is that a user might drop or alter the table.
> However, we could mitigate this risk by simply logging a WARNING if
> the table is configured but an insertion fails.

I believe it makes more sense for this to be a catalog table rather than
a user table. I wanted to check whether we already have a large catalog
table of this kind, and pg_statistic could be an example of a sizable
one. To get a rough idea of how its size scales with data, I ran a quick
experiment: I created 1000 tables, each with 2 JSON columns, 1 text
column, and 2 integer columns. Then I inserted 1000 rows into each table
and ran ANALYZE to collect statistics. Here is what I observed on a
fresh database, before and after:

Before:
  pg_statistic row count: 412
  Table size: ~256 kB

After:
  pg_statistic row count: 6,412
  Table size: ~5.3 MB

Although it isn't an exact comparison, this gives some insight into how
the statistics catalog grows with the number of rows. It doesn't seem
excessively large at 6k rows, given that pg_statistic itself is a
complex table with many 'anyarray'-type columns.

That said, irrespective of what we decide, it would be ideal to offer
users an option for automatic purging, perhaps via a retention-period
parameter like conflict_stats_retention_period (defaulting to, say,
30 days), or a manual purge API such as purge_conflict_stats('older
than date'). I wasn't able to find any such purge mechanism for
PostgreSQL's stats tables, but Oracle does provide purging options for
some of its statistics tables (not related to conflicts); see [1], [2].
And to manage it better, the table could be range partitioned on
timestamp.
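To make the purging idea concrete, here is a rough SQL sketch combining the JSON-column storage from design question 1 with range partitioning on timestamp. All names (the table, the partitions, the JSON keys) are illustrative assumptions for this sketch, not part of any agreed design:

```sql
-- Illustrative sketch only: nothing here exists in PostgreSQL today.
CREATE TABLE conflict_log_history (
    log_time      timestamptz NOT NULL,
    conflict_type text NOT NULL,
    local_tuple   json,
    remote_tuple  json
) PARTITION BY RANGE (log_time);

-- One partition per month keeps retention management cheap:
CREATE TABLE conflict_log_2025_08 PARTITION OF conflict_log_history
    FOR VALUES FROM ('2025-08-01') TO ('2025-09-01');

-- Because row data is JSON, one query can inspect conflicts across
-- tables with different schemas (assuming an 'id' key for illustration):
SELECT log_time,
       conflict_type,
       local_tuple  ->> 'id' AS local_id,
       remote_tuple ->> 'id' AS remote_id
FROM conflict_log_history
WHERE log_time > now() - interval '1 day';

-- A retention policy (or a manual purge API like the one suggested
-- above) could then simply drop whole partitions past the cutoff:
DROP TABLE conflict_log_2025_08;
```

With this layout, purging never rewrites live data; it only detaches or drops old partitions.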
> I am currently working on a POC patch for the same, but will post that
> once we have some thoughts on the design choices.
>
> Schema for the conflict log history table may look like this, although
> there is room for discussion on this.
>
> Note: I think these fields are self-explanatory, so I haven't
> explained them here.
>
> conflict_log_table (
>     logid SERIAL PRIMARY KEY,
>     subid OID,
>     schema_id OID,
>     table_id OID,
>     conflict_type TEXT NOT NULL,
>     operation_type TEXT NOT NULL,

I feel operation_type is not needed when we already have conflict_type;
the conflict_type name alone is enough to tell us the operation type.

>     replication_origin TEXT,
>     remote_commit_ts TIMESTAMPTZ,
>     local_commit_ts TIMESTAMPTZ,
>     ri_key JSON,
>     remote_tuple JSON,
>     local_tuple JSON
> )
>
> Credit: Thanks to Amit Kapila for discussing this offlist and
> providing some valuable suggestions.

[1] https://docs.oracle.com/en/database/oracle/oracle-database/21/arpls/DBMS_STATS.html#GUID-8E6413D5-F827-4F57-9FAD-7EC56362A98C
[2] https://docs.oracle.com/en/database/oracle/oracle-database/21/arpls/DBMS_STATS.html#GUID-A04AE1C0-5DE1-4AFC-91F8-D35D41DF98A2

thanks
Shveta