Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour. - Mailing list pgsql-hackers

From Rushabh Lathia
Subject Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.
Date
Msg-id CAGPqQf2anMcWCxWtnpa3RDCeZHgBctSW9ZBPTrg2CUfBLMGUwg@mail.gmail.com
Whole thread Raw
In response to Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
List pgsql-hackers

On Tue, Apr 3, 2018 at 2:52 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2018/04/03 14:45, Kyotaro HORIGUCHI wrote:
> Hello.
>
> At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Why do we need AccessExclusiveLock on all children of a relation that we
>> want to scan to search for rows not satisfying the constraint?  I think
>> it should be enough to get ShareLock, which prevents INSERT, no?  I have
>> a feeling I'm missing something here, but I don't know what, and all
>> tests pass with that change.

Thinking on this a bit, I see no problem with locking the children with
just ShareLock.  It was just a paranoia that if we're going to lock the
table itself being attached with AEL, we must its children (if any) with
AEL, too.

> Mmm. I'm not sure if there's a lock-upgrade case but the
> following sentense is left at the last of one of the modified
> comments. ATREwriteTables is called with AEL after that (that has
> finally no effect in this case).
>
> |                                   But we cannot risk a deadlock by taking
> | * a weaker lock now and the stronger one only when needed.
>
> I don't actually find places where the children's lock can be
> raised but this seems just following the lock parent first
> principle.

No lock upgrade happen as of now.  The comment was added by the commit
972b6ec20bf [1], which removed the code that could cause such a deadlock.
The comment fragment is simply trying to explain why we don't postpone the
locking of children to a later time, say, to the point where we actually
know that they need to be scanned.  Previously the code next to the
comment used to lock the children using AccessShareLock, because at that
point we just needed to check if the table being attached is one of those
children and then later locked with AEL if it turned out that they need to
be scanned to check the partition constraint.

> By the way check_default_allows_bound() (CREATE TABLE case)
> contains a similar usage of find_all_inheritors(default_rel,
> AEL).

Good catch.  Its job is more or less same as
ValidatePartitionConstraints(), except all the children (if any) are
scanned right away instead of queuing it like in the AT case.

>> Also: the change proposed to remove the get_default_oid_from_partdesc()
>> call actually fixes the bug, but to me it was not at all obvious why.
>
> CommandCounterIncrement updates the content of a relcache entry
> via invalidation. It can be surprising for callers of a function
> like StorePartitionBound.
>
> CommandCounterIncrement
>  <skip inval mechanism>
>    RelationCacheInvalidateEntry
>      RelationFlushRelation
>        RelationClearRelation

Because of the CCI() after storing the default partition OID into the
system catalog, RelationClearRelation() would changes what
rel->rd_partdesc points to where 'rel' is the ATExecAttachPartition()'s
reference to the relcache entry of the parent that it passed to
StorePartitionBound.

So, whereas the rel->rd_partdesc wouldn't contain the default partition
before StorePartitionBound() was called, it would after.

>> To figure out why, I first had to realize that
>> ValidatePartitionConstraints was lying to me, both in name and in
>> comments: it doesn't actually validate anything, it merely queues
>> entries so that alter table's phase 3 would do the validation.  I found
>> this extremely confusing, so I fixed the comments to match reality, and
>> later decided to rename the function also.
>
> It is reasonable. Removing exccessive extension of lower-level
> partitions is also reasonable. The previous code extended
> inheritances in different manner for levels at odd and even
> depth.

I like the new code including the naming, but I notice this changes the
order in which we do the locking now.  There are still sites in the code
where the locking order is breadth-first, that is, as determined by
find_all_inheritors(), as this function would too previously.

Also note that beside the breadth-first -> depth-first change, this also
changes the locking order of partitions for a given partitioned table.
The OIDs in partdesc->oids[] are canonically ordered (that is order of
their partition bounds), whereas find_inheritance_children() that's called
by find_all_inheritors() would lock them in the order in which the
individual OIDs were found in the system catalog.

Not sure if there is anything to be alarmed of here, but in all previous
discussions, this has been a thing to avoid.

>> At that point I was able to understand what the problem was: when
>> attaching the default partition, we were setting the constraint to be
>> validated for that partition to the correctly computed partition
>> constraint; and later in the same function we would set the constraint
>> to "the immature constraint" because we were now seeing that the default
>> partition OID was not invalid.  So it was rather a bug in
>> ValidatePartitionConstraints, in that it was accepting to set the
>> expression to validate when another expression had already been set!  I
>> added an assert to protect against this.  And then changed the decision
>> of whether or not to run this block based on the attached partition
>> being the default one or not; because as the "if" test was, it was just
>> a recipe for confusion.  (Now, if you look carefully you realize that
>> the new test for bound->is_default I added is kinda redundant: it can
>> only be set if there was a default partition OID at start of the
>> function, and therefore we know we're not adding a default partition.
>> And for the case where we *are* adding the default partition, it is
>> still Invalid because we don't re-read its own OID.  But relying on that
>> seems brittle because it breaks if we happen to update the default
>> partition OID in the middle of that function, which is what we were
>> doing.  Hence the new is_default test.)
>
> Seems reasonable.

+1

> But even if we assume so, rereading
> defaultPartOid is still breaking the assumption that
> defaultPartOid is that at the time of entering to this function
> and the added condition just conceals the fact.

Afaics, defaultPartOid is only set at the beginning of
ATExecAttachPartition, so it seems fine.

> So I think it should be an assertion.
>
> | if (OidIsValid(defaultPartOid))
> | {
> |      /*
> |       *  The command cannot be adding default partition if the
> |       *  defaultPartOid is valid.
> |       */
> |      Assert(!cmd->bound->is_default);

I guess that makes sense, because when trying to attach a default
partition to the table that already has one, check_new_partition_bound
that's called before this errors out before we could get here.

>> I looked at the implementation of ValidatePartitionConstraints and
>> didn't like it, so I rewrote it also.
>>
>> This comment is mistaken, too:
>> -       /*
>> -        * Skip if the partition is itself a partitioned table.  We can only
>> -        * ever scan RELKIND_RELATION relations.
>> -        */
>> ... because it ignores the possibility of a partition being a foreign
>> table.  The code seems to work, but I think there is no test to cover
>> the case that a foreign table containing data that doesn't satisfy the
>> constraint is attached, because when I set that case to return doing
>> nothing (ie. ATGetQueueEntry is not called for a foreign partition), no
>> test failed.
>
> Foreign tables are intentionally not verified on attaching (in my
> understanding). ATRewriteTables ignores foreign tables so it
> contradicts to what ATRewriteTables thinks. As my understanding
> foreign tables are assumed to be unrestrictable by local
> constraint after attaching so the users are responsible to the
> consistency.

That and ATRewriteTable() in phase 3 cannot really cope with being handed
a foreign table as it can only work with RELKIND_RELATION tables.
Actually, the following in ATRewriteTables() also prevents passing foreign
tables:

static void
ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
{
    ListCell   *ltab;

    /* Go through each table that needs to be checked or rewritten */
    foreach(ltab, *wqueue)
    {
        AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);

        /*
         * Foreign tables have no storage, nor do partitioned tables and
         * indexes.
         */
        if (tab->relkind == RELKIND_FOREIGN_TABLE ||
            tab->relkind == RELKIND_PARTITIONED_TABLE ||
            tab->relkind == RELKIND_PARTITIONED_INDEX)
            continue;

>> Generally speaking, I didn't like ATExecAttachPartition; it's doing too
>> much that should have been done in ATPrepAttachPartition.  The only
>> reason that's not broken is because we don't allow ATTACH PARTITION to
>> appear together with other commands in alter table, which seems
>> disappointing ... for example, when attaching multiple tables and a
>> default partition exist, you have to scan the default one multiple
>> times, which is lame.
>
> Indeed, currently only partition commands are isolated from other
> alter table subcommands (except all in tablespace cases). We can
> improve that in the next step?

I think we can improve this.

>> (Speaking of lame: I noticed that RelationGetPartitionQual obtains lock
>> on the parent relation ... I wonder if this can be used to cause a
>> deadlock during InitResultRelationInfo.)

It does seem that there is a possibility of deadlock, because when
InitResultRelInfo(), that's initializing a partition, calls
RelationGetPartitionQual() to get its partition constraint, the partition
would already have been locked.  So this reverses the generally
established order of locking the parent first; another session which tries
to add a column to the parent, for example, will lock the parent and then
partitions.

I think we need for inserts directly into a partition to lock all of its
ancestors from the root to the direct parent (in that order) before
locking the partition itself, or maybe at least the parent if not all
ancestors.

> Mmm. That seems strange since the RealtionGetPartitionQual
> doesn't return anything (specifically NIL) there, because we
> don't allow a partition to be attached to partitioned tables. It
> seems totally useless.
>
> Addition to that the code tries to add the partition qual (which
> is always NIL) destructively and assign to partConstraint..

It is not always NIL.  Imagine attaching a partition at a lower level.

create table foo (a int, b char) partition by list (a);
create table foo1 partition of foo for values in (1) partition by list (b);
create table foo1a (a, b) as values (2, 'b');

-- note that we're attaching to foo1, not foo
alter table foo1 attach partition foo1a for values in ('a');

If we didn't include foo1's (the parent) constraint (that is, a = 1), the
above command will wrongly succeed.  It must include a = 1 in the
constraint to be be checked when scanning foo1a.

Although, I noticed there is no test covering this.

>> BTW, I think this is already broken for the case where the default
>> partition is partitioned and you attach a partition colliding with a row
>> that's being concurrently inserted in a partition of the default
>> partition, though I didn't bother to write a test for that.
>
> How is it broken? Every attaching partitions are checked for the
> specified partition bound and every partitions of the default
> partition are also checked against the new default part bound. We
> already hold required locks on all the participants.

Yes, concurrent insertions to either the default partition or any of its
partitions couldn't be occurring as we'd have locked them.

>> Anyway, I'm just an onlooker fixing a CommandCounterIncrement change.
>
> It's reassuring. Thanks.

Yes, thank you for taking the time out to clean things up.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=972b6ec20bf




--
Rushabh Lathia

pgsql-hackers by date:

Previous
From: Teodor Sigaev
Date:
Subject: Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)
Next
From: Teodor Sigaev
Date:
Subject: Re: [HACKERS] GSoC 2017: weekly progress reports (week 6)