Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date
Msg-id CAFiTN-t1fN-fedz7mWwMmZUdPo6ZC5wnDT4=6UiZoj5kb64WEg@mail.gmail.com
In response to Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Amit Kapila <amit.kapila16@gmail.com>)
On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> >
> > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > see the buildfarm reports
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > failed the new 015_stream.pl test with the subscriber looping like this:
> >
> > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > ...
> >
> > What happened there?
> >
>
> What is going on here is that the expected streaming file is missing.
> Normally, the first time we send a stream of changes (some percentage
> of the transaction's changes) we create the streaming file; in
> subsequent streams we keep appending the changes received from the
> publisher to that file, and on commit we read the file and apply all
> the changes.
>
> The above kind of error can happen for the following reasons: (a) the
> first time, we sent the stream and created the file, but it got
> removed before the second stream reached the subscriber; (b) from the
> publisher side, we never sent the indication that it is the first
> stream, so the subscriber directly tried to open the file, thinking it
> was already there.
>
> Now, the publisher and subscriber logs don't directly indicate either
> of the above problems, but I have some observations.
>
> The subscriber log indicates that before the apply worker exits due to
> an error, the new apply worker gets started. We delete the
> streaming-related temporary files on proc_exit, so one possibility
> could have been that the new apply worker created the streaming file
> which the old apply worker then removed; but that is not possible,
> because we always include the procid in the temp-file path.

Yeah, and I have tried to test along these lines: after the streaming
had started, I set binary=on.  Then, using gdb, I made the old worker
wait before it deleted its temp file; meanwhile the new worker started,
and everything worked properly, as expected.

> The other thing I observed in the code is that we can mark a
> transaction as streamed (via ReorderBufferTruncateTxn) even if it has
> no changes the first time we try to stream it. This would lead to
> symptom (b), because the second time, when there are more changes, we
> would stream them as if it were not the first stream. However, this
> shouldn't happen, because we never pick up a transaction to stream
> that has no changes. I can try to fix the code so that we don't mark
> the transaction as streamed unless we have streamed at least one
> change, but I don't see how it is related to this particular test
> failure.

Yeah, this can be improved, but as you mentioned, we never select an
empty transaction for streaming, so this case should not occur.  I
will do some testing/review around this and report back.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


