Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication - Mailing list pgsql-hackers
From: shveta malik
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date:
Msg-id: CAJpy0uB9M9v0rO-o=SOvRxj+thkF7zjiuU7w-XC8iK0SFDg57w@mail.gmail.com
In response to: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication (Melih Mutlu <m.melihmutlu@gmail.com>)
Responses: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
           RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication
List: pgsql-hackers
On Wed, Feb 1, 2023 at 5:42 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>
> Hi Shveta,
>
> shveta malik <shveta.malik@gmail.com> wrote on Wed, 1 Feb 2023 at 15:01:
>>
>> On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>> 2) I found a crash in the previous patch (v9), but have not tested it
>> on the latest yet. The crash happens when all the replication slots are
>> consumed and we try to create a new one. I tweaked the settings as below
>> so that it can be reproduced easily:
>> max_sync_workers_per_subscription = 3
>> max_replication_slots = 2
>> and then ran the test case shared by you. I think there is some memory
>> corruption happening. (I tested in debug mode; I have not tried release
>> mode.) I added some traces to identify the root cause. I observed that
>> worker_1 keeps moving from one table to another correctly, but at some
>> point it gets corrupted, i.e. the origin name obtained for it is wrong;
>> it tries to advance that origin, and since that origin does not exist,
>> it asserts and then something else crashes.
>> From the log: (new trace lines added by me are prefixed with "shveta";
>> I also tweaked the code per my comment 1 to have clarity on the worker id).
>>
>> From the traces below, it is clear that worker_1 was moving from one
>> relation to another, always getting the correct origin 'pg_16688_1', but
>> at the end it got 'pg_16688_49', which does not exist. The second part of
>> the trace shows who updated 'pg_16688_49': it was worker_49, which did not
>> even get a chance to create this origin because max_replication_slots was
>> reached.
>
>
> Thanks for investigating this error. I think it's the same error as the one Shi reported earlier. [1]
> I couldn't reproduce it yet but will apply your tweaks and try again.
> Looking into this.
>
> [1] https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com
>

Hi Melih,

I think I am able to identify the root cause. It is not memory corruption, but
the way origin names are stored in the system catalog, mapped to a particular
relation id, before those origins are even created.

After adding a few more logs:

[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49 constructed originname :pg_16684_49, relid:16540
[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49 updated-origin in system catalog:pg_16684_49, slot:pg_16684_sync_49_7195149685251088378, relid:16540
[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49 get-origin-id2 originid:0, originname:pg_16684_49
[4227] ERROR: could not create replication slot "pg_16684_sync_49_7195149685251088378": ERROR: all replication slots are in use
HINT: Free one or increase max_replication_slots.

[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 constructed originname :pg_16684_49, relid:16540
[4428] LOG: could not drop replication slot "pg_16684_sync_49_7195149685251088378" on publisher: ERROR: replication slot "pg_16684_sync_49_7195149685251088378" does not exist
[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 drop-origin originname:pg_16684_49
[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 updated-origin:pg_16684_49, slot:pg_16684_sync_148_7195149685251088378, relid:16540

So from the above, worker_49 came and picked up relid 16540, constructed the
origin name and slot name, and updated them in the system catalog. It then
tried to actually create that slot and origin, but since the maximum slot
count was reached, it failed and exited without cleaning up the system catalog
entries for that relid.
Then worker_148 came along and also picked up table 16540, since it had not
been completed by the previous worker. This time it found the origin and slot
entries already present in the system catalog for this relid, so it picked up
the same names and started processing, but since those objects do not actually
exist, it asserted while advancing the origin.

The assert is only reproduced when an already running worker (say worker_1)
that has 'created=true' set gets to sync a relid for which a previously failed
worker has updated the origin name without creating it. In that case worker_1
(with created=true) reuses the origin, tries to advance it, and ends up
asserting. That is why you might not always see the assertion: the 'created'
flag is set to true for that worker, so it goes on to reuse the origin
recorded by the previous failed worker.

So to fix this, I think we should either update the origin and slot entries in
the system catalog only after their creation has succeeded, or clean up the
system catalog in case of failure. What do you think?

thanks
Shveta
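To make the first option concrete, here is a minimal sketch of the ordering I
have in mind, assuming the patch keeps persisting the slot and origin names in
the catalog. create_tablesync_slot() and UpdateSubscriptionRelSlotOrigin() are
hypothetical placeholders for whatever helpers the patch uses; only
replorigin_create() is an existing PostgreSQL function here, so please read
this as an illustration of the ordering rather than actual patch code:

#include "postgres.h"

#include "replication/origin.h"

/*
 * Sketch only: create the remote slot and the local origin first, and
 * record their names in the catalog only after both have succeeded.
 */
static void
tablesync_prepare_slot_and_origin(Oid subid, Oid relid,
                                  const char *slotname,
                                  const char *originname)
{
    RepOriginId originid;

    /*
     * If either of these errors out (e.g. "all replication slots are in
     * use"), we exit before touching the catalog, so the next worker that
     * picks up this relation constructs fresh names instead of reusing
     * names that were never created.
     */
    create_tablesync_slot(slotname);            /* hypothetical helper */
    originid = replorigin_create(originname);   /* existing API */

    /* Record the names only once both objects actually exist. */
    UpdateSubscriptionRelSlotOrigin(subid, relid,   /* hypothetical helper */
                                    slotname, originname);

    /* The caller can now safely reuse and advance the origin. */
    (void) originid;
}

The alternative (cleaning up the catalog entries in the failure path before
the worker exits) would also work; the ordering above just avoids having a
window where the catalog points at a slot and origin that do not exist.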