Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
Msg-id | CAA4eK1L6OuysEW=63qY-qxAu01h-FK84tHw8+SUpX0VgM6CO4Q@mail.gmail.com |
In response to | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions (Noah Misch <noah@leadboat.com>) |
Responses | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |
List | pgsql-hackers |
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > see the buildfarm reports
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> failed the new 015_stream.pl test with the subscriber looping like this:
>
> 2020-09-08 11:22:49.848 UTC [13959252:1] LOG: logical replication apply worker for subscription "tap_sub" has started
> 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> 2020-09-08 11:22:54.055 UTC [7602182:1] LOG: logical replication apply worker for subscription "tap_sub" has started
> 2020-09-08 11:22:54.101 UTC [31785284:4] LOG: background worker "logical replication worker" (PID 13959252) exited with exit code 1
> 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR: could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> ...
>
> What happened there?
>

What is going on here is that the expected streaming file is missing. Normally, the first time we send a stream of changes (some percentage of the transaction's changes) we create the streaming file; in each subsequent stream we keep appending the changes we receive from the publisher to that file, and on commit we read the file and apply all the changes. This kind of error can happen for the following reasons: (a) the first stream created the file and the file got removed before the second stream reached the subscriber, or (b) the publisher never sent the indication that it is the first stream, so the subscriber directly tries to open the file, thinking it is already there.

Now, the publisher and subscriber logs don't directly point to either of the above problems, but I have some observations. The subscriber log shows that before the apply worker exits due to the error, a new apply worker has already started. We delete the streaming-related temporary files on proc_exit, so one possibility could have been that the new apply worker created the streaming file which the old apply worker then removed, but that is not possible because we always include the procid in the path of these temp files.

The other thing I observed in the code is that we can mark a transaction as streamed (via ReorderBufferTruncateTXN) even if it has no changes the first time we try to stream it. That would lead to symptom (b), because the second time, when there are more changes, we would stream them as if it were not the first stream. However, this shouldn't happen, because we never pick a transaction to stream that has no changes. I can try to fix the code so that we don't mark a transaction as streamed unless we have streamed at least one change, but I don't see how that is related to this particular test failure.

I am not sure why this failure has not recurred since it occurred a few months back; it is probably a timing issue. I have fixed a few timing issues related to this feature in the last month or so, but I am not able to come up with a theory as to whether any of those would have fixed this problem.

--
With Regards,
Amit Kapila.
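[Editorial note] To make the guard discussed above concrete, here is a minimal illustrative sketch. It is not PostgreSQL's reorderbuffer code: the type and function names (txn_t, stream_txn, send_change) are hypothetical, and it only models the invariant Amit proposes — flag a transaction as "streamed" only after at least one change has actually been sent, so the subscriber is always told when a stream is the first one and creates the changes file before later streams try to open it.

```c
/*
 * Illustrative sketch only -- not actual reorderbuffer.c code.
 * All names here are hypothetical.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct txn_t
{
    bool is_streamed;   /* has any stream been sent for this txn? */
    int  nchanges;      /* changes currently queued in memory */
} txn_t;

/* Pretend to send one change downstream. */
static void
send_change(int change_no)
{
    printf("streaming change %d\n", change_no);
}

static void
stream_txn(txn_t *txn)
{
    int sent = 0;

    for (int i = 0; i < txn->nchanges; i++)
    {
        send_change(i);
        sent++;
    }

    /*
     * The proposed guard: without it, a transaction picked up with zero
     * queued changes would still be flagged as streamed.  The next stream
     * would then not be announced as the "first" one, and the subscriber
     * would try to open a changes file that was never created -- i.e.
     * symptom (b) described in the mail.
     */
    if (sent > 0)
        txn->is_streamed = true;

    txn->nchanges = 0;      /* truncate the in-memory queue */
}

int
main(void)
{
    txn_t txn = { .is_streamed = false, .nchanges = 0 };

    stream_txn(&txn);   /* nothing queued: must NOT mark as streamed */
    printf("is_streamed after empty stream: %d\n", txn.is_streamed);

    txn.nchanges = 3;
    stream_txn(&txn);   /* real changes sent: now marking is safe */
    printf("is_streamed after real stream: %d\n", txn.is_streamed);
    return 0;
}
```

As the mail notes, the reorderbuffer should never pick an empty transaction for streaming in the first place, so this guard is defensive rather than a confirmed explanation of the sungazer failure.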