Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Date
Msg-id CAA4eK1KPEQz4AZ1EhKsHRSeywsKSq9DOS7aCW60A8BZAbei8Xg@mail.gmail.com
Whole thread Raw
In response to Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > What is going on here is that the expected streaming file is missing.
> > Normally, the first time we send a stream of changes (some percentage
> > of transaction changes) we create the streaming file, and then in
> > respective streams we just keep on writing in that file the changes we
> > receive from the publisher, and on commit, we read that file and apply
> > all the changes.
> >
> > The above kind of error can happen due to the following reasons: (a)
> > the first time we sent the stream and created the file and that got
> > removed before the second stream reached the subscriber. (b) from the
> > publisher-side, we never sent the indication that it is the first
> > stream and the subscriber directly tries to open the file thinking it
> > is already there.
> >
> > Now, the publisher and subscriber log doesn't directly indicate any of
> > the above problems but I have some observations.
> >
> > The subscriber log indicates that before the apply worker exits due to
> > an error the new apply worker gets started. We delete the
> > streaming-related temporary files on proc_exit, so one possibility
> > could have been that the new apply worker has created the streaming
> > file which the old apply worker has removed but that is not possible
> > because we always create these temp-files by having procid in the
> > path.
>
> Yeah, and I have tried to test on this line, basically, after the
> streaming has started I have set the binary=on.  Now using gdb I have
> made the worker wait before it deletes the temp file and meanwhile the
> new worker started and it worked properly as expected.
>
> > The other thing I observed in the code is that we can mark the
> > transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> > stream a transaction that has no changes the first time we try to
> > stream the transaction. This would lead to symptom (b) because the
> > second-time when there are more changes we would stream the changes as
> > it is not the first time. However, this shouldn't happen because we
> > never pick-up a transaction to stream which has no changes. I can try
> > to fix the code here such that we don't mark the transaction as
> > streamed unless we have streamed at least one change but I don't see
> > how it is related to this particular test failure.
>
> Yeah, this can be improved but as you mentioned that we never select
> an empty transaction for streaming so this case should not occur.  I
> will perform some testing/review around this and report.
>

On further thinking about this point, I think the message seen on
subscriber [1] won't occur if missed the first stream. This is because
we always check the value of fileset from the stream hash table
(xidhash) and it won't be there if we directly send the second stream
and that would have lead to a different kind of problem (probably
crash). This symptom seems to be due to the reason (a) mentioned above
unless we are missing something else. Now, I am not sure how the file
can be removed without the corresponding entry in hash table (xidhash)
is still present. The only reasons that come to mind are that some
other process cleaned pgsql_tmp directory thinking these temporary
file are not required or one manually removes it, none of those seems
plausible reasons.

[1] - ERROR: could not open temporary file "16393-510.changes.0" from
BufFile "16393-510.changes": No such file or directory

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Amit Khandekar
Date:
Subject: Re: Improving spin-lock implementation on ARM.
Next
From: "Hou, Zhijie"
Date:
Subject: Avoid using lcons and list_delete_first in plan_union_children()