On Wed, Jun 21, 2023 at 8:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jun 15, 2023 at 7:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 13, 2023 at 2:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Sun, Jun 11, 2023 at 5:31 AM Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > A separate issue is that TransactionIdDidAbort() can end up being very slow if
> > > > a lot of transactions are in progress concurrently. As soon as the clog
> > > > buffers are extended, all the time is spent copying pages from the kernel
> > > > pagecache. I'd not be at all surprised if this change causes a substantial
> > > > slowdown in workloads with lots of small transactions, where most transactions
> > > > commit.
> > > >
> > >
> > > Indeed. So it should check the transaction status less frequently. We
> > > don't benefit much even if we can skip collecting the decoded changes
> > > of small transactions. Another idea is to check the status of only
> > > large transactions: when the size of the decoded changes of an
> > > aborted transaction exceeds logical_decoding_work_mem, we mark it as
> > > aborted, free the changes decoded so far, and skip further
> > > collection.
> > >
> >
> > Your idea might work for large transactions, but I have not come
> > across reports of this being a problem. Do you see any such reports,
> > and can we measure how much benefit there is for large transactions?
> > We do already have the handling of concurrent aborts during system
> > table scans, and that might sometimes help for large transactions.
>
> I've heard of a case where a user had 29 million deletes in a single
> transaction, each one wrapped in a savepoint, and then rolled the
> transaction back, which led to 11TB of spill files. If decoding such a
> large transaction fails for some reason (e.g. a full disk), we would
> try decoding the same transaction again and again.
>
I was wondering why the existing handling of concurrent aborts doesn't
cover such a case, and it seems that we perform that check only on
catalog access. However, in your case, the user is probably accessing
the same relation without any concurrent DDL on that table, so catalog
access would amount to just a cache lookup and the check would never
detect the abort. Your idea of checking for aborts once a transaction's
decoded changes exceed logical_decoding_work_mem should work for such
cases.
--
With Regards,
Amit Kapila.