Home > mailing lists

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher - Mailing list pgsql-hackers

From	Önder Kalacı
Subject	Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date	August 8, 2022 18:58:52
Msg-id	CACawEhVZUJ0jDe3W-_U51Xt28p_s03sHGo82QmUm5t3X7aVQCg@mail.gmail.com Whole thread Raw
In response to	Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher (Amit Kapila <amit.kapila16@gmail.com>)
Responses	RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
List	pgsql-hackers

Tree view

Hi,

Thanks for the feedback, see my reply below.

>
> It turns out that we already invalidate the relevant entries in LogicalRepRelMap/LogicalRepPartMap when "ANALYZE" (or VACUUM) updates any of the statistics in pg_class.
>
> The call-stack for analyze is roughly:
> do_analyze_rel()
> -> vac_update_relstats()
> -> heap_inplace_update()
> -> if needs to apply any statistical change
> -> CacheInvalidateHeapTuple()
>

Yeah, it appears that this will work but I see that we don't update
here for inherited stats, how does it work for such cases?

There, the expansion of the relation list to partitions happens one level above on the call stack. So, the call stack looks like the following:

autovacuum_do_vac_analyze() (or ExecVacuum)

-> vacuum()

-> expand_vacuum_rel()

-> rel_list=parent+children partitions

-> for rel in rel_list

->analyze_rel()

->do_analyze_rel

... (and the same call stack as above)

I also added one variation of a similar test for partitioned tables, which I earlier added for non-partitioned tables as well:

Added a test which triggers this behavior. The test is as follows:
- Create two indexes on the target, on column_a and column_b
- Initially load data such that the column_a has a high cardinality
- Show that we use the index on column_a on a child table
- Load more data such that the column_b has higher cardinality
- ANALYZE on the parent table
- Show that we use the index on column_b afterwards on the child table

My answer for the above assumes that your question is regarding what happens if you ANALYZE on a partitioned table. If your question is something different, please let me know.

>> BTW, do we want to consider partial indexes for the scan in this
>> context? I mean it may not have data of all rows so how that would be
>> usable?
>>
>
> As far as I can see, check_index_predicates() never picks a partial index for the baserestrictinfos we create in CreateReplicaIdentityFullPaths(). The reason is that we have roughly the following call stack:
>
> -check_index_predicates
> --predicate_implied_by
> ---predicate_implied_by_recurse
> ----predicate_implied_by_simple_clause
> -----operator_predicate_proof
>
> And, inside operator_predicate_proof(), there is never going to be an equality. Because, we push `Param`s to the baserestrictinfos whereas the index predicates are always `Const`.
>

I agree that the way currently baserestrictinfos are formed by patch,
it won't select the partial path, and chances are that that will be
true in future as well but I think it is better to be explicit in this
case to avoid creating a dependency between two code paths.

Yes, it makes sense. So, I changed Assert into a function where we filter partial indexes and indexes on only expressions, so that we do not create such dependencies between the planner and here.

If one day planner supports using column values on index with expressions, this code would only not be able to use the optimization until we do some improvements in this code-path. I think that seems like a fair trade-off for now.

Few other comments:
==================
1. Why is it a good idea to choose the index selected even for the
bitmap path (T_BitmapHeapScan or T_BitmapIndexScan)? We use index scan
during update/delete, so not sure how we can conclude to use index for
bitmap paths.

In our case, during update/delete we are searching for a single tuple on the target. And, it seems like using an index is probably going to be cheaper for finding the single tuple. In general, I thought we should use an index if the planner ever decides to use it with the given restrictions.

Also, for the majority of the use-cases, I think we'd probably expect an index on a column with high cardinality -- hence use index scan. So, bitmap index scans are probably not going to be that much common.

Still, I don't see a problem with using such indexes. Of course, it is possible that I might be missing something. Do you have any specific concerns in this area?

2. The index info is built even on insert, so workload, where there
are no updates/deletes or those are not published then this index
selection work will go waste. Will it be better to do it at first
update/delete? One can say that it is not worth the hassle as anyway
it will be built the first time we perform an operation on the
relation or after the relation gets invalidated.

With the current approach, the index (re)-calculation is coupled with (in)validation of the relevant cache entries. So, I'd argue for the simplicity of the code, we could afford to waste this small overhead? According to my local measurements, especially for large tables, the index oid calculation is mostly insignificant compared to the rest of the steps. Does that sound OK to you?

If we think so, then
probably adding a comment could be useful.

Yes, that is useful if you are OK with the above, added.

3.
+my $synced_query =
+ "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
IN ('r', 's');";
...
...
+# wait for initial table synchronization to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";

You can avoid such instances in the test by using the new
infrastructure added in commit 0c20dd33db.

Cool, applied changes.

4.
LogicalRepRelation *remoterel = &root->remoterel;
+
Oid partOid = RelationGetRelid(partrel);

Spurious line addition.

Fixed, went over the code and couldn't find other.

Attaching v5 of the patch which reflects the review on this email, also few minor test improvements.

Thanks,

Onder

Attachment

v5_0001_use_index_on_subs_when_pub_rep_ident_full.patch

pgsql-hackers by date:

From: Andres Freund
Date: 08 August 2022, 18:56:18
Subject: Re: failing to build preproc.c on solaris with sun studio

From: Andres Freund
Date: 08 August 2022, 19:02:04
Subject: Re: bug on log generation ?

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher - Mailing list pgsql-hackers

Attachment

Previous

Next