Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher - Mailing list pgsql-hackers

From Önder Kalacı
Subject Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date
Msg-id CACawEhXbw==K02v3=nHFEAFJqegx0b4r2J+FtXtKFkJeE6R95Q@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher  (Peter Smith <smithpb2250@gmail.com>)
Responses Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
List pgsql-hackers
Hi Peter, all

Thanks for the detailed review!


1. Commit message

1a.
Majority of the logic on the subscriber side has already existed in the code.

1b.
Second, when REPLICA IDENTITY IS FULL on the publisher and an index is
used on the subscriber...


1c.
Still, below I try to show case the potential improvements using an
index on the subscriber
`pgbench_accounts(bid)`. With the index, the replication catches up
around ~5 seconds.
When the index is dropped, the replication takes around ~300 seconds.

"show case" -> "showcase"

Applied your suggestions to 1a/1b/1c/

~

1d.
In above text, what was meant by "catches up around ~5 seconds"?
e.g. Did it mean *improves* by ~5 seconds, or *takes* ~5 seconds?


It "takes" 5 seconds to replicate all the changes. To be specific, I execute 'SELECT sum(abalance) FROM pgbench_accounts' on the subscriber, and try to measure the time until when all the changes are replicated. I do use the same query on the publisher to check what the query result should be when replication is done.

I updated the relevant text, does that look better?
 
~

1e.
// create one indxe, even on a low cardinality column

typo "indxe"

======

fixed.

Also, I realized that some of the comments on the commit message are stale, updated those as well.

 

2. GENERAL

2a.
There are lots of single-line comments that start lowercase, but by
convention, I think they should start uppercase.

e.g. + /* we should always use at least one attribute for the index scan */
e.g. + /* we might not need this if the index is unique */
e.g. + /* avoid expensive equality check if index is unique */
e.g. + /* unrelated Path, skip */
e.g. + /* simple case, we already have an identity or pkey */
e.g. + /* indexscans are disabled, use seq. scan */
e.g. + /* target is a regular table */

~~

Thanks for noting this, I didn't realize that there is a strict requirement on this. Updated all of your suggestions, and realized one more such case.

Is there documentation where such conventions are listed? I couldn't find any.
 

2b.
There are some excess blank lines between the function. By convention,
I think 1 blank line is normal, but here there are sometimes 2.

~~

Updated as well.
 

2c.
There are some new function comments which include their function name
in the comment. It seemed unnecessary.

e.g. GetCheapestReplicaIdentityFullPath
e.g. FindUsableIndexForReplicaIdentityFull
e.g. LogicalRepUsableIndex

======

Fixed this as well.
 

3. src/backend/executor/execReplication.c - build_replindex_scan_key

- int attoff;
+ int index_attoff;
+ int scankey_attoff;
  bool isnull;
  Datum indclassDatum;
  oidvector  *opclass;
  int2vector *indkey = &idxrel->rd_index->indkey;
- bool hasnulls = false;
-
- Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
-    RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));

  indclassDatum = SysCacheGetAttr(INDEXRELID, idxrel->rd_indextuple,
  Anum_pg_index_indclass, &isnull);
  Assert(!isnull);
  opclass = (oidvector *) DatumGetPointer(indclassDatum);
+ scankey_attoff = 0;

Maybe just assign scankey_attoff = 0 at the declaration?


Again, lack of coding convention knowledge :/ My observation is that it is often not assigned during the declaration. But, changed this one.
 
~~~

4.

+ /*
+ * There are two cases to consider. First, if the index is a primary or
+ * unique key, we cannot have any indexes with expressions. So, at this
+ * point we are sure that the index we deal is not these.
+ */

"we deal" -> "we are dealing with" ?

makes sense
 
~~~

5.

+ /*
+ * For a non-primary/unique index with an additional expression, do
+ * not have to continue at this point. However, the below code
+ * assumes the index scan is only done for simple column references.
+ */
+ continue;

Is this one of those comments that ought to have a "XXX" prefix as a
note for the future?

Makes sense
 

~~~

6.

- int pkattno = attoff + 1;
...
  /* Initialize the scankey. */
- ScanKeyInit(&skey[attoff],
- pkattno,
+ ScanKeyInit(&skey[scankey_attoff],
+ index_attoff + 1,
  BTEqualStrategyNumber,
Wondering if it would have been simpler if you just did:
int pkattno = index_attoff + 1;


The index is not necessarily the primary key at this point, that's why I removed it. 

There are already 3 variables in the same function index_attoff, scankey_attoff and table_attno, which are hard to avoid. But, this one seemed ok to avoid, mostly to simplify the readability. Do you think it is better with the additional variable? Still, I think we need a better name as "pk" is not relevant anymore.
 

~~~

7.

- skey[attoff].sk_flags |= SK_ISNULL;
+ skey[scankey_attoff].sk_flags |= SK_ISNULL;
+ skey[scankey_attoff].sk_flags |= SK_SEARCHNULL;

SUGGESTION
skey[scankey_attoff].sk_flags |= (SK_ISNULL | SK_SEARCHNULL)


looks good, changed
 
~~~

8. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex

@@ -128,28 +171,44 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
  TransactionId xwait;
  Relation idxrel;
  bool found;
+ TypeCacheEntry **eq;
+ bool indisunique;
+ int scankey_attoff;

  /* Open the index. */
  idxrel = index_open(idxoid, RowExclusiveLock);
+ indisunique = idxrel->rd_index->indisunique;
+
+ /* we might not need this if the index is unique */
+ eq = NULL;

Maybe just default assign eq = NULL in the declaration?


Again, I wasn't sure if it is OK regarding the coding convention to assign during the declaration. Changed now.
 
~~~

9.

+ scan = index_beginscan(rel, idxrel, &snap,
+    scankey_attoff, 0);

Unnecessary wrapping?


Seems so, changed
 
~~~

10.

+ /* we only need to allocate once */
+ if (eq == NULL)
+ eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);

But shouldn't you also free this 'eq' before the function returns, to
prevent leaking memory?


Two notes here. First, this is allocated inside ApplyMessageContext, which seems to be reset per tuple change. So, that seems like a good boundary to keep this allocation in memory.

Second, RelationFindReplTupleSeq() doesn't free the same allocation roughly at a very similar call stack. That's why I decided not to pfree. Do you see strong reason to pfree at this point? Then we should probably change that for RelationFindReplTupleSeq() as well.

 
======

11. src/backend/replication/logical/relation.c - logicalrep_rel_open

+ /*
+ * Finding a usable index is an infrequent operation, it is performed
+ * only when first time an operation is performed on the relation or
+ * after invalidation of the relation cache entry (e.g., such as ANALYZE).
+ */

SUGGESTION (minor rewording)
Finding a usable index is an infrequent task. It is performed only
when an operation is first performed on the relation, or after
invalidation of the relation cache entry (e.g., such as ANALYZE).

~~~

makes sense, applied
 
12. src/backend/replication/logical/relation.c - logicalrep_partition_open

Same as comment #11 above.

 
done

 
~~~

13. src/backend/replication/logical/relation.c - GetIndexOidFromPath

+static
+Oid
+GetIndexOidFromPath(Path *path)

Typically I think 'static Oid' should be on one line.

done 


~~~

14.

+ switch (path->pathtype)
+ {
+ case T_IndexScan:
+ case T_IndexOnlyScan:
+ {
+ IndexPath  *index_sc = (IndexPath *) path;
+ indexOid = index_sc->indexinfo->indexoid;
+
+ break;
+ }
+
+ default:
+ indexOid = InvalidOid;
+ }

Is there any point in using a switch statement when there is only one
functional code block?

Why not just do:

if (path->pathtype == T_IndexScan || path->pathtype == T_IndexOnlyScan)
{
...
}

return InvalidOid;

~~~

Good point, in the first iterations of the patch, we also had Bitmap scans here. Now, the switch is redundant, applied your suggestion.
 

15. src/backend/replication/logical/relation.c - IndexOnlyOnExpression

+ * Returns true if the given index consist only of expressions such as:
+ * CREATE INDEX idx ON table(foo(col));

"consist" -> "consists"

~~~

fixed
 

16.

+IndexOnlyOnExpression(IndexInfo *indexInfo)
+{
+ int i=0;
+ for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++)

Don't initialise 'i' twice.

~~~

fixed 
 

17.

+ AttrNumber attnum = indexInfo->ii_IndexAttrNumbers[i];
+ if (AttributeNumberIsValid(attnum))
+ return false;
+
+ }

Spurious blank line

~~~

fixed
 

18. src/backend/replication/logical/relation.c -
GetCheapestReplicaIdentityFullPath

+/*
+ * Iterates over the input path list, and returns another path list
+ * where paths with non-btree indexes, partial indexes or
+ * indexes on only expressions are eliminated from the list.
+ */

"path list, and" -> "path list and"

~~~

fixed
 

19.

+ if (!OidIsValid(indexOid))
+ {
+ /* unrelated Path, skip */
+ suitableIndexList = lappend(suitableIndexList, path);
+ continue;
+ }
+
+ indexRelation = index_open(indexOid, AccessShareLock);
+ indexInfo = BuildIndexInfo(indexRelation);
+ is_btree_index = (indexInfo->ii_Am == BTREE_AM_OID);
+ is_partial_index = (indexInfo->ii_Predicate != NIL);
+ is_index_only_on_expression = IndexOnlyOnExpression(indexInfo);
+ index_close(indexRelation, NoLock);
+
+ if (!is_btree_index || is_partial_index || is_index_only_on_expression)
+ continue;

Maybe better to change this logic using if/else and changing the last
condition so them you can avoid having any of those 'continue' in this
loop.

Yes, it makes sense. It is good to avoid `continue` in the loop.
 

~~~

20. src/backend/replication/logical/relation.c -
GetCheapestReplicaIdentityFullPath

+/*
+ * GetCheapestReplicaIdentityFullPath generates all the possible paths
+ * for the given subscriber relation, assuming that the source relation
+ * is replicated via REPLICA IDENTITY FULL.
+ *
+ * The function assumes that all the columns will be provided during
+ * the execution phase, given that REPLICA IDENTITY FULL gurantees
+ * that.
+ */

20a.
typo "gurantees"

~
 
Fixed, for future patches I'll do a more thorough review on these myself. Sorry for all these typos & convention errors!
 
20b.
The function comment neglects to say that after getting all these
paths the final function return is the cheapest one that it found.

~~~

Improved the comment a bit
 

21.

+ for (attno = 0; attno < RelationGetNumberOfAttributes(localrel); attno++)
+ {
+ Form_pg_attribute attr = TupleDescAttr(localrel->rd_att, attno);
+
+ if (attr->attisdropped)
+ {
+ continue;
+ }
+ else
+ {
+ Expr    *eq_op;

Maybe simplify by just removing the 'else' or instead just reverse the
condition of the 'if'.

~~~

I like the second suggestion more, as the `!attr->attisdropped` code block has local declarations, so keeping them local to that block seems easier to follow.
 

22.

+ /*
+ * A sequential scan has could have been dominated by
+ * by an index scan during make_one_rel(). We should always
+ * have a sequential scan before set_cheapest().
+ */

"has could have been" -> "could have been"

~~~

An interesting grammar I had :) Fixed 
 

23. src/backend/replication/logical/relation.c - LogicalRepUsableIndex

+static Oid
+LogicalRepUsableIndex(Relation localrel, LogicalRepRelation *remoterel)
+{
+ Oid idxoid;
+
+ /*
+ * We never need index oid for partitioned tables, always rely on leaf
+ * partition's index.
+ */
+ if (localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ return InvalidOid;
+
+ /* simple case, we already have an identity or pkey */
+ idxoid = GetRelationIdentityOrPK(localrel);
+ if (OidIsValid(idxoid))
+ return idxoid;
+
+ /* indexscans are disabled, use seq. scan */
+ if (!enable_indexscan)
+ return InvalidOid;

I thought the (!enable_indexscan) fast exit perhaps should be done
first, or at least before calling GetRelationIdentityOrPK.

This is actually a point where I need some more feedback. On HEAD, even if the index scan is disabled, we use the index. For this one, (a) I didn't want to change the behavior for existing users (b) want to have a way to disable this feature, and enable_indexscan seems like a good one.

Do you think I should dare to move it above GetRelationIdentityOrPK()? Or, maybe I just need more comments? I improved the comment, and it would be nice to hear your thoughts on this.


======

24. src/backend/replication/logical/worker.c - apply_handle_delete_internal

@@ -2034,12 +2021,14 @@ apply_handle_delete_internal(ApplyExecutionData *edata,
  EPQState epqstate;
  TupleTableSlot *localslot;
  bool found;
+ Oid usableIndexOid = usable_indexoid_internal(edata, relinfo);

  EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1);
  ExecOpenIndices(relinfo, false);

- found = FindReplTupleInLocalRel(estate, localrel, remoterel,
- remoteslot, &localslot);
+
+ found = FindReplTupleInLocalRel(estate, localrel, usableIndexOid,
+ remoterel, remoteslot, &localslot);

24a.
Excess blank line above FindReplTupleInLocalRel call.

Fixed 
~

24b.
This code is almost same in function handle_update_internal(), except
the wrapping of the params is different. Better to keep everything
consistent looking.


Hmm, I have not changed how they look because they have one variable difference (&relmapentry->remoterel vs remoterel), which requires the indentation to be slightly difference. So, I either need a new variable or keep them as-is?

 
~~~

25. src/backend/replication/logical/worker.c - usable_indexoid_internal

+/*
+ * Decide whether we can pick an index for the relinfo (e.g., the relation)
+ * we're actually deleting/updating from. If it is a child partition of
+ * edata->targetRelInfo, find the index on the partition.
+ */
+static Oid
+usable_indexoid_internal(ApplyExecutionData *edata, ResultRelInfo *relinfo)

I'm not sure is this can maybe return InvalidOid? The function comment
should clarify it.


Improved the comment
 
~~~

26.

I might be mistaken, but somehow I feel this function can be
simplified. e.g. If you have a var 'relmapentry' and let the normal
table use the initial value of that. Then I think you only need to
test for the partitioned table and reassign that var as appropriate.
It also removes the need for having 'usableIndexOid' var.

FOR EXAMPLE,

static Oid
usable_indexoid_internal(ApplyExecutionData *edata, ResultRelInfo *relinfo)
{
ResultRelInfo *targetResultRelInfo = edata->targetRelInfo;
LogicalRepRelMapEntry *relmapentry = edata->targetRel;
Oid targetrelid = targetResultRelInfo->ri_RelationDesc->rd_rel->oid;
Oid localrelid = relinfo->ri_RelationDesc->rd_id;

if (targetrelid != localrelid)
{
/*
* Target is a partitioned table, so find relmapentry of the partition.
*/
TupleConversionMap *map = relinfo->ri_RootToPartitionMap;
AttrMap    *attrmap = map ? map->attrMap : NULL;
LogicalRepPartMapEntry *part_entry =
logicalrep_partition_open(relmapentry, relinfo->ri_RelationDesc,
attrmap);

Assert(targetResultRelInfo->ri_RelationDesc->rd_rel->relkind ==
   RELKIND_PARTITIONED_TABLE);

relmapentry = part_entry->relmapentry;
}
return relmapentry->usableIndexOid;
}

~~~

True, that simplifies the function, applied.
 

27.

+ /*
+ * Target is a partitioned table, get the index oid the partition.
+ */

SUGGESTION
Target is a partitioned table, so get the index oid of the partition.

or (see the example of comment @26)


Applied
 
~~~

28. src/backend/replication/logical/worker.c - FindReplTupleInLocalRel

@@ -2093,12 +2125,11 @@ FindReplTupleInLocalRel(EState *estate,
Relation localrel,

  *localslot = table_slot_create(localrel, &estate->es_tupleTable);

I think this might have been existing functionality...

The comment says "* Local tuple, if found, is returned in
'*localslot'." But the code is unconditionally doing
table_slot_create() before it even knows if a tuple was found or not.
So what about when it is NOT found - in that case shouldn't there be
some cleaning up that (unused?) table slot that got unconditionally
created?


This sounds accurate. But I guess it may not have been considered critical as we are operating in the ApplyMessageContext? Tha is going to be freed once a single tuple is dispatched.

I have a slight preference not to do it in this patch, but if you think otherwise let me know.
 
~~~

29. src/backend/replication/logical/worker.c - apply_handle_tuple_routing

@@ -2202,13 +2233,17 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
  * suitable partition.
  */
  {
+ LogicalRepRelMapEntry *entry;
  TupleTableSlot *localslot;
  ResultRelInfo *partrelinfo_new;
  bool found;

+ entry = &part_entry->relmapentry;

Maybe just do this assignment at the entry declaration time?


done
 
~~~

30.

  /* Get the matching local tuple from the partition. */
  found = FindReplTupleInLocalRel(estate, partrel,
- &part_entry->remoterel,
+ part_entry->relmapentry.usableIndexOid,
+ &entry->remoterel,
  remoteslot_part, &localslot);
Why not use the new 'entry' var just assigned instead of repeating
part_entry->relmapentry?

SUGGESTION
found = FindReplTupleInLocalRel(estate, partrel,
entry->usableIndexOid,
&entry->remoterel,
remoteslot_part, &localslot);

~~~

Yes, looks better, changed
 
31.

+ slot_modify_data(remoteslot_part, localslot, entry,
  newtup);

Unnecessary wrapping.

======

I think I have not changed this, but fixed anyway
 

32. src/include/replication/logicalrelation.h

+typedef struct LogicalRepPartMapEntry
+{
+ Oid partoid; /* LogicalRepPartMap's key */
+ LogicalRepRelMapEntry relmapentry;
+} LogicalRepPartMapEntry;

IIUC this struct has been moved from relation.c to here. But I think
there was a large comment about this struct which maybe needs to be
moved with it (see the original relation.c).

/*
 * Partition map (LogicalRepPartMap)
 *
 * When a partitioned table is used as replication target, replicated
 * operations are actually performed on its leaf partitions, which requires
 * the partitions to also be mapped to the remote relation.  Parent's entry
 * (LogicalRepRelMapEntry) cannot be used as-is for all partitions, because
 * individual partitions may have different attribute numbers, which means
 * attribute mappings to remote relation's attributes must be maintained
 * separately for each partition.
 */

======
Oh, seems so, moved.
 

33. .../subscription/t/032_subscribe_use_index.pl

Typo "MULTIPILE"

This typo occurs several times...

e.g. # Testcase start: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS
e.g. # Testcase end: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS
e.g. # Testcase start: SUBSCRIPTION USES INDEX WITH MULTIPILE COLUMNS
e.g. # Testcase end: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS

~~~

 
Yep :/ Fixed now
 
34.

# Basic test where the subscriber uses index
# and only touches multiple rows

What does "only ... multiple" mean?

This occurs several times also.


Ah, in the earlier iterations, the tests were updating/deleting 1 row. Lately, I changed it to multiple rows, just to have more coverage. I guess the discrepancy is because of that. Updated now. 
 
~~~

35.

+# wait for initial table synchronization to finish
+$node_subscriber->wait_for_subscription_sync;
+$node_subscriber->wait_for_subscription_sync;
+$node_subscriber->wait_for_subscription_sync;

That triple wait looks unusual. Is it deliberate?

Ah, not really. Removed.

Thanks,
Onder
 
Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Letter case of "admin option"
Next
From: Daniel Gustafsson
Date:
Subject: Re: pg_receivewal and SIGTERM