Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher - Mailing list pgsql-hackers

From Önder Kalacı
Subject Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Msg-id CACawEhUDwAqFx4T6XhxBuKJKqWTFfy_iUhf22JVcnK9m6mHCVA@mail.gmail.com
In response to Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi Amit, Shi Yu


>
> b. Executed SQL.
> I executed TRUNCATE and INSERT before each UPDATE. I am not sure if you did the
> same, or just executed 50 consecutive UPDATEs. If the latter one, there would be
> lots of old tuples and this might have a bigger impact on sequential scan. I
> tried this case (which executes 50 consecutive UPDATEs) and also saw that the
> overhead is smaller than before.

Alright, I'll do the same: execute TRUNCATE and INSERT before each UPDATE.

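For reference, the per-iteration setup I have in mind looks roughly like this (the table name, columns, and row count are illustrative, not Shi Yu's exact script):

```sql
-- Illustrative harness only; assumes a table created once beforehand:
--   CREATE TABLE tbl (id int, value text);

-- Reset before each timed UPDATE, so no dead tuples carry over
-- from the previous iteration.
TRUNCATE tbl;
INSERT INTO tbl SELECT i, 'payload' FROM generate_series(1, 50000) i;

-- The timed statement.
UPDATE tbl SET value = 'updated';
```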
> In the above profile number of calls to index_fetch_heap(),
> heapam_index_fetch_tuple() explains the reason for the regression you
> are seeing with the index scan. Because the update will generate dead
> tuples in the same transaction and those dead tuples won't be removed,
> we get those from the index and then need to perform
> index_fetch_heap() to find out whether the tuple is dead or not. Now,
> for sequence scan also we need to scan those dead tuples but there we
> don't need to do back-and-forth between index and heap.

Thanks for the insights, I think what you describe makes a lot of sense.
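The dead-tuple accumulation that drives this can be observed directly; a quick sketch (table and index names are illustrative):

```sql
-- Illustrative only: watch dead-tuple growth from consecutive UPDATEs.
CREATE TABLE t (key int, payload text);
CREATE INDEX t_key_idx ON t (key);
INSERT INTO t SELECT i, 'x' FROM generate_series(1, 50000) i;

-- Each full-table UPDATE leaves the previous row versions behind as
-- dead tuples. A later index scan still finds their index entries and
-- must visit the heap (index_fetch_heap()) just to discard them,
-- whereas a sequential scan reads them in passing.
UPDATE t SET payload = 'y';
UPDATE t SET payload = 'z';

SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 't';
```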

> I think we can
> once check with more number of tuples (say with 20000, 50000, etc.)
> for case-1.


As we'd expect, this test made the performance regression more visible.

I quickly ran case-1 50 times with 50000 tuples, as Shi Yu did, and got
the following results. I'm measuring the end-to-end time for running the
whole set of commands:

seq_scan:    00 hr 24 min 42 sec
index_scan:  01 hr 04 min 54 sec


But I'm still not sure whether we should focus on this regression too
much. In the end, we are talking about a case (e.g., all or many rows
are duplicated) where using an index is not a good idea anyway, so I
doubt users would have such indexes.
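To make the pathological case concrete, here is an illustrative schema (not from the benchmark above) where an index cannot narrow anything:

```sql
-- Illustrative: an index whose key has a single distinct value.
CREATE TABLE events (id int, status text);
CREATE INDEX events_status_idx ON events (status);

-- Every row is 'active', so a lookup through events_status_idx still
-- matches all 50000 rows and pays an extra heap fetch per match
-- compared to a plain sequential scan.
INSERT INTO events SELECT i, 'active' FROM generate_series(1, 50000) i;
```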


> The quadratic apply performance the sequential scans cause is a much
> bigger hazard for users than some apply performance regression.

Quoting Andres' note, I personally think that the regression for this case
is not a big concern. 

> I'd prefer not having an option, because we figure out the cause of the
> performance regression (reducing it to be small enough to not care). After
> that an option defaulting to using indexes. I don't think an option defaulting
> to false makes sense.

I think we figured out the cause of the performance regression. It is not
small enough to ignore in some scenarios like the above, but those scenarios
look like synthetic test cases with little real user impact. Still, you are
better suited to comment on this.

If you consider this a significant issue, we could also pick up the second
patch, so that users could disable index scans for this unlikely scenario.
Thanks,
Onder
