Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher - Mailing list pgsql-hackers

From Önder Kalacı
Subject Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date
Msg-id CACawEhUDwAqFx4T6XhxBuKJKqWTFfy_iUhf22JVcnK9m6mHCVA@mail.gmail.com
In response to Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi Amit, Shi Yu


>
> b. Executed SQL.
> I executed TRUNCATE and INSERT before each UPDATE. I am not sure if you did the
> same, or just executed 50 consecutive UPDATEs. If the latter one, there would be
> lots of old tuples and this might have a bigger impact on sequential scan. I
> tried this case (which executes 50 consecutive UPDATEs) and also saw that the
> overhead is smaller than before.

Alright, I'll do the same and execute TRUNCATE/INSERT before each UPDATE.
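
Concretely, the per-iteration workload I have in mind is roughly the
sketch below; the schema, table name and row count are placeholders for
whatever the actual test script uses:

  -- one-time setup on the publisher (illustrative schema)
  CREATE TABLE test_tbl (id int, value int);
  ALTER TABLE test_tbl REPLICA IDENTITY FULL;

  -- repeated before every UPDATE, so each iteration starts with no dead tuples
  TRUNCATE test_tbl;
  INSERT INTO test_tbl SELECT i, i % 10 FROM generate_series(1, 10000) i;
  UPDATE test_tbl SET value = value + 1 WHERE value = 1;

The interesting part is then how the subscriber applies that last UPDATE,
either with a sequential scan or through an index on its copy of the table.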
 
> In the above profile number of calls to index_fetch_heap(),
> heapam_index_fetch_tuple() explains the reason for the regression you
> are seeing with the index scan. Because the update will generate dead
> tuples in the same transaction and those dead tuples won't be removed,
> we get those from the index and then need to perform
> index_fetch_heap() to find out whether the tuple is dead or not. Now,
> for sequence scan also we need to scan those dead tuples but there we
> don't need to do back-and-forth between index and heap.

Thanks for the insights, I think what you describe makes a lot of sense.
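
To spell out how I read it: the UPDATE is applied row by row on the
subscriber, each applied row leaves a dead version behind within the same
transaction, and later per-row lookups that go through an index still have
to resolve each candidate with a heap fetch before it can be discarded.
A rough standalone analogue of that shape, with made-up names rather than
the actual test script:

  CREATE TABLE tbl (id int, value int);
  CREATE INDEX tbl_value_idx ON tbl (value);
  INSERT INTO tbl SELECT i, i % 10 FROM generate_series(1, 50000) i;

  -- consecutive UPDATEs with no VACUUM in between leave the old row
  -- versions behind as dead tuples
  UPDATE tbl SET id = id + 1 WHERE value = 1;
  UPDATE tbl SET id = id + 1 WHERE value = 1;

  -- an index scan on "value = 1" now has to go through index_fetch_heap()
  -- for each candidate to work out which row versions are still visible,
  -- while a sequential scan reads the same dead tuples without the
  -- index-to-heap round trips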

 
> I think we can
> once check with more number of tuples (say with 20000, 50000, etc.)
> for case-1.


As we'd expect, this test made the performance regression more visible.

I quickly ran case-1 50 times with 50000 tuples, as Shi Yu did, and got
the following results. I'm measuring the end-to-end time of running the
whole set of commands:

seq_scan:    00 hr 24 min 42 sec
index_scan:  01 hr 04 min 54 sec


But I'm still not sure whether we should focus on this regression too
much. In the end, what we are talking about is a case (e.g., all or many
rows are duplicated) where using an index is not a good idea anyway, so
I doubt users would have such indexes.
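
To make "such indexes" concrete, the problematic shape is an index whose
key has only a handful of distinct values, so any lookup through it still
matches a large fraction of the table. Something like the sketch below
(made-up names), which is not an index one would normally create:

  CREATE TABLE tbl (id int, status int);
  INSERT INTO tbl SELECT i, i % 2 FROM generate_series(1, 50000) i;

  -- "status" has only two distinct values, so the index is nearly useless
  CREATE INDEX tbl_status_idx ON tbl (status);

  -- a lookup such as "status = 1" matches about 25000 rows, so walking
  -- them through the index gains nothing over a sequential scan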


> The quadratic apply performance the sequential scans cause is a much
> bigger hazard for users than some apply performance regression.

Quoting Andres' note, I personally think that the regression for this case
is not a big concern. 

> I'd prefer not having an option until we figure out the cause of the
> performance regression (reducing it to be small enough to not care). After
> that, an option defaulting to using indexes. I don't think an option
> defaulting to false makes sense.

I think we have figured out the cause of the performance regression, though
it is not small enough to ignore for some scenarios like the one above.
Those scenarios look like synthetic test cases with little impact on real
users, but you are better placed to comment on that.

If you consider this a significant issue, we could also consider the second
patch, so that users could disable index scans for this unlikely scenario.
  
Thanks,
Onder
