Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date:
Msg-id: CAA4eK1K9e1zVhNtkuRV128JqJ28pj2_vfBMiyDLie2BO7OKU4w@mail.gmail.com
In response to: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher (Önder Kalacı <onderkalaci@gmail.com>)
Responses: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
List: pgsql-hackers
On Wed, Jul 20, 2022 at 8:19 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
>
>>
>> > - Add a new class of invalidation callbacks: Today, if we do ALTER TABLE or CREATE INDEX on a table, CacheRegisterRelcacheCallback helps us to re-create the cache entries. In this case, as far as I can see, we need a callback that is called when the table is "ANALYZE"d, because that is when the statistics change. That is the time picking a new index makes sense.
>> > However, that seems like adding another dimension to this patch, which I can try, but I also see that committing becomes even harder.
>> >
>>
>> This idea sounds worth investigating. I see that this will require
>> more work but OTOH, we can't allow the existing system to regress
>> especially because depending on workload it might regress badly.
>
> Just to note if that is not clear: this patch avoids (or at least aims to avoid, assuming no bugs) changing the behavior of existing systems with a PRIMARY KEY or UNIQUE index. In those cases, we still use the relevant indexes.
>
>> We
>> can create a patch for this atop the base patch for easier review/test
>> but I feel we need some way to address this point.
>>
>
> Another idea could be to re-calculate the index choice, say, after N updates/deletes on the table. We may consider using a subscription parameter for getting N -- with a good default, or even hard-code it. I think the cost of re-calculating should really be pretty small compared to the other things happening during logical replication. So, a sane default might work?
>

One difficulty in deciding the value of N for the user, or choosing a default, would be that we probably need to consider the local DML operations on the table as well.

> If you think the above doesn't work, I can try to work on a separate patch which adds something like an "analyze invalidation callback".
>

I suggest we should give this a try, and if it turns out to be problematic or complex then we can think of using some heuristic as you are suggesting above. (A minimal sketch of the callback shape appears below, before the review comments.)

>
>> Now, your point related to planner code in the patch bothers me as
>> well but I haven't studied the patch in detail to provide any
>> alternatives at this stage. Do you have any other ideas to make it
>> simpler or solve this problem in some other way?
>>
>
> One idea I tried earlier was to go over the existing indexes on the table and get the IndexInfo via BuildIndexInfo(), and then try to find a good heuristic to pick an index. In the end, I felt like that does a sub-optimal job, requires a similar amount of code to the current patch, and still uses similar infrastructure.
>
> My conclusion for that route was that I should either use a very simple heuristic (like picking the index with the most columns) and accept a suboptimal index pick,
>

Not only that, but say all the indexes have the same number of columns; then we probably need to either pick the first such index or use some other heuristic.

> OR use a complex heuristic with a reasonable index pick. And the latter approach converged to the planner code in the patch. Do you think the former approach is acceptable?
>

In this regard, I was thinking about the cases in which a sequential scan can be better than an index scan (assuming one is available). I think if a certain column has a lot of duplicates (for example, a boolean column), then a sequential scan is probably better. Considering this, even though your other approach sounds simpler, it could lead to unpredictable results. So, I think the latter approach is preferable.
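To make the invalidation-callback idea above concrete, here is a minimal sketch, not taken from the patch. The relcache registration uses the existing CacheRegisterRelcacheCallback() API from utils/inval.h; the function names subscriber_relcache_cb and register_index_pick_callbacks are made up for illustration, and the ANALYZE-time registration is hypothetical -- no such callback class exists in core today, it is exactly what would need to be added:

#include "postgres.h"
#include "utils/inval.h"

/*
 * Invoked whenever the relcache entry for "relid" is invalidated
 * (e.g. after ALTER TABLE or CREATE INDEX on that relation).
 */
static void
subscriber_relcache_cb(Datum arg, Oid relid)
{
	/* mark any cached index choice for relid as stale here */
}

void
register_index_pick_callbacks(void)
{
	/* Existing mechanism, declared in utils/inval.h. */
	CacheRegisterRelcacheCallback(subscriber_relcache_cb, (Datum) 0);

	/*
	 * Hypothetical: a callback class fired when a table is ANALYZEd,
	 * i.e. when its statistics change and re-picking an index makes
	 * sense.  No such registration function exists in core; this is
	 * the "new class of invalidation callbacks" being discussed.
	 */
	/* CacheRegisterAnalyzeCallback(subscriber_analyze_cb, (Datum) 0); */
}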
BTW, do we want to consider partial indexes for the scan in this context? I mean, a partial index may not have data for all rows, so how would it be usable?

Few comments:
===============
1.
static List *
+CreateReplicaIdentityFullPaths(Relation localrel)
{
...
+ /*
+  * Rather than doing all the pushups that would be needed to use
+  * set_baserel_size_estimates, just do a quick hack for rows and width.
+  */
+ rel->rows = rel->tuples;

Is it a good idea to set rows without any selectivity estimation? Won't this always set rows to the entire row count of the relation? Also, if we don't want to use set_baserel_size_estimates(), how will we compute baserestrictcost, which is later used in the costing of paths (for example, the costing of the seqscan path (cost_seqscan) uses it)?

In general, I think it will be better to consider calling some top-level planner functions even for paths. Can we consider using make_one_rel() instead of building individual paths? On similar lines, in the function PickCheapestIndexPathIfExists(), can we use set_cheapest()? (A rough sketch of this is at the end of this mail.)

2.
@@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
  int2vector *indkey = &idxrel->rd_index->indkey;
  bool hasnulls = false;

- Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
-        RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));

You have removed this assertion, but there is a comment ("This is not generic routine, it expects the idxrel to be replication identity of a rel and meet all limitations associated with that.") atop this function which needs to be changed or removed, and we should probably also think about whether the function itself needs some change after removing that restriction.

--
With Regards,
Amit Kapila.
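A rough sketch of the set_cheapest() suggestion from comment 1, assuming the RelOptInfo's pathlist has already been populated (either with hand-built paths, as the patch does today, or via make_one_rel()). The helper name pick_cheapest_path is made up for illustration; set_cheapest() itself is the existing planner function declared in optimizer/pathnode.h:

#include "postgres.h"
#include "optimizer/pathnode.h"

/*
 * Assumes rel->pathlist is non-empty; set_cheapest() errors out
 * otherwise, so build at least one path (e.g. a seqscan path)
 * before calling this.
 */
Path *
pick_cheapest_path(RelOptInfo *rel)
{
	/*
	 * set_cheapest() scans rel->pathlist and fills in
	 * rel->cheapest_total_path (among other fields), replacing any
	 * hand-rolled comparison of path costs.
	 */
	set_cheapest(rel);

	return rel->cheapest_total_path;
}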