On Sun, Aug 09, 2020 at 06:44:23PM -0500, Justin Pryzby wrote:
> Thanks for looking.
The REINDEX patch is progressing its way, so I have looked a bit at
the part for CIC.
Visibly, the case of multiple partition layers is not handled
correctly. Here is a sequence that gets broken:
CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id);
CREATE TABLE child_0_10 PARTITION OF parent_tab
FOR VALUES FROM (0) TO (10);
CREATE TABLE child_10_20 PARTITION OF parent_tab
FOR VALUES FROM (10) TO (20);
CREATE TABLE child_20_30 PARTITION OF parent_tab
FOR VALUES FROM (20) TO (30);
INSERT INTO parent_tab VALUES (generate_series(0,29));
CREATE TABLE child_30_40 PARTITION OF parent_tab
FOR VALUES FROM (30) TO (40)
PARTITION BY RANGE(id);
CREATE TABLE child_30_35 PARTITION OF child_30_40
FOR VALUES FROM (30) TO (35);
CREATE TABLE child_35_40 PARTITION OF child_30_40
FOR VALUES FROM (35) TO (40);
INSERT INTO parent_tab VALUES (generate_series(30,39));
CREATE INDEX CONCURRENTLY parent_index ON parent_tab (id);
This fails as follows:
ERROR: XX000: unrecognized node type: 2139062143
LOCATION: copyObjectImpl, copyfuncs.c:5718
And the problem visibly comes from some handling with the second level
of partitions, child_30_40 in my example above. Even with that, this
outlines a rather severe problem in the patch: index_set_state_flags()
flips indisvalid on/off using a non-transactional update (see the use
of heap_inplace_update), meaning that if we fail in the middle of the
operation we may finish with a partition index tree where some of the
indexes are valid and some of them are invalid. In order to make this
logic consistent, I am afraid that we will need to do two things:
- Change index_set_state_flags() so as it uses a transactional
update. That's something I would like to change for other reasons,
like making sure that the REPLICA IDENTITY of a parent relation is
correctly reset when dropping a replica index.
- Make all the indexes of the partition tree valid in the *same*
sub-transaction.
You can note that this case is different than a concurrent REINDEX,
because in this case we just do an in-place change between the old and
new index, meaning that even if there is a failure happening while
processing, we may have some invalid indexes, but there are still
valid indexes attached to the partition tree, at any time.
+ MemoryContext oldcontext = MemoryContextSwitchTo(ind_context);
PartitionDesc partdesc = RelationGetPartitionDesc(rel);
int nparts = partdesc->nparts;
+ char *relname = pstrdup(RelationGetRelationName(rel));
Er, no. We should not play with the relation cache calls in a private
memory context. I think that this needs much more thoughts.
--
Michael