Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date
Msg-id CAA4eK1K9e1zVhNtkuRV128JqJ28pj2_vfBMiyDLie2BO7OKU4w@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher  (Önder Kalacı <onderkalaci@gmail.com>)
Responses Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
List pgsql-hackers
On Wed, Jul 20, 2022 at 8:19 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
>
>>
>> > - Add a new class of invalidation callbacks: Today, if we do ALTER TABLE or CREATE INDEX on a table, the
CacheRegisterRelcacheCallbackhelps us to re-create the cache entries. In this case, as far as I can see, we need a
callbackthat is called when table "ANALYZE"d, because that is when the statistics change. That is the time picking a
newindex makes sense. 
>> > However, that seems like adding another dimension to this patch, which I can try but also see that committing
becomeseven harder. 
>> >
>>
>> This idea sounds worth investigating. I see that this will require
>> more work but OTOH, we can't allow the existing system to regress
>> especially because depending on workload it might regress badly.
>
>
> Just to note if that is not clear: This patch avoids (or at least aims to avoid assuming no bugs) changing the
behaviorof the existing systems with PRIMARY KEY or UNIQUE index. In that case, we still use the relevant indexes. 
>
>>
>> We
>> can create a patch for this atop the base patch for easier review/test
>> but I feel we need some way to address this point.
>>
>
> One another idea could be to re-calculate the index, say after N updates/deletes for the table. We may consider using
subscription_parameterfor getting N -- with a good default, or even hard-code into the code. I think the cost of
re-calculatingshould really be pretty small compared to the other things happening during logical replication. So, a
sanedefault might work? 
>

One difficulty in deciding the value of N for the user or choosing a
default would be that we need to probably consider the local DML
operations on the table as well.

> If you think the above doesn't work, I can try to work on a separate patch which adds something like "analyze
invalidationcallback". 
>

I suggest we should give this a try and if this turns out to be
problematic or complex then we can think of using some heuristic as
you are suggesting above.

>
>>
>>
>> Now, your point related to planner code in the patch bothers me as
>> well but I haven't studied the patch in detail to provide any
>> alternatives at this stage. Do you have any other ideas to make it
>> simpler or solve this problem in some other way?
>>
>
> One idea I tried earlier was to go over the existing indexes and on the table, then get the IndexInfo via
BuildIndexInfo().And then, try to find a good heuristic to pick an index. In the end, I felt like that is doing a
sub-optimaljob, requiring a similar amount of code of the current patch, and still using the similar infrastructure. 
>
> My conclusion for that route was I should either use a very simple heuristic (like pick the index with the most
columns)and have a suboptimal index pick, 
>

Not only that but say all index have same number of columns then we
need to probably either pick the first such index or use some other
heuristic.

>
> OR use a complex heuristic with a reasonable index pick. And, the latter approach converged to the planner code in
thepatch. Do you think the former approach is acceptable? 
>

In this regard, I was thinking in which cases a sequence scan can be
better than the index scan (considering one is available). I think if
a certain column has a lot of duplicates (for example, a column has a
boolean value) then probably doing a sequence scan is better. Now,
considering this even though your other approach sounds simpler but
could lead to unpredictable results. So, I think the latter approach
is preferable.

BTW, do we want to consider partial indexes for the scan in this
context? I mean it may not have data of all rows so how that would be
usable?

Few comments:
===============
1.
static List *
+CreateReplicaIdentityFullPaths(Relation localrel)
{
...
+ /*
+ * Rather than doing all the pushups that would be needed to use
+ * set_baserel_size_estimates, just do a quick hack for rows and width.
+ */
+ rel->rows = rel->tuples;

Is it a good idea to set rows without any selectivity estimation?
Won't this always set the entire rows in a relation? Also, if we don't
want to use set_baserel_size_estimates(), how will we compute
baserestrictcost which will later be used in the costing of paths (for
example, costing of seqscan path (cost_seqscan) uses it)?

In general, I think it will be better to consider calling some
top-level planner functions even for paths. Can we consider using
make_one_rel() instead of building individual paths? On similar lines,
in function PickCheapestIndexPathIfExists(), can we use
set_cheapest()?

2.
@@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel,
Relation idxrel,
  int2vector *indkey = &idxrel->rd_index->indkey;
  bool hasnulls = false;

- Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
-    RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));

You have removed this assertion but there is a comment ("This is not
generic routine, it expects the idxrel to be replication identity of a
rel and meet all limitations associated with that.") atop this
function which either needs to be changed/removed and probably we
should think if the function needs some change after removing that
restriction.

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: "wangw.fnst@fujitsu.com"
Date:
Subject: RE: Data is copied twice when specifying both child and parent table in publication
Next
From: Amit Kapila
Date:
Subject: Re: Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.