Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY - Mailing list pgsql-hackers

From Amit Langote
Subject Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Date
Msg-id CA+HiwqGaaE8bQc42goHAzr1TKD1Va1YNWWj0bYaww7GEWW3_bA@mail.gmail.com
Whole thread Raw
In response to Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
On Fri, Apr 23, 2021 at 4:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Apr-22, Amit Langote wrote:
> > -        * The reason for this check is that we want to avoid seeing the
> > +        * The reason for this hack is that we want to avoid seeing the
> >          * partition as alive in RI queries during REPEATABLE READ or
> > <snip>
> > +        * SERIALIZABLE transactions.
> >
> > The comment doesn't quite make it clear why it is the RI query case
> > that necessitates this hack in the first case.
>
> I added "such queries use a different snapshot than the one used by
> regular (user) queries."  I hope that's sufficient.

Yeah, that makes sense.

> > Maybe the relation to what's going on with the partdesc

(I had to leave my desk while in the middle of typing this, but I
forget what I was going to add :()

> > BTW, I do feel a bit alarmed by the potential performance impact of
> > this.  If detached_exist of a cached partdesc is true, then RI queries
> > invoked during a bulk DML operation would have to rebuild one for
> > every tuple to be checked, right?  I haven't tried an actual example
> > yet though.
>
> Yeah, I was scared about that too (which is why I insisted on trying to
> add a cached copy of the partdesc omitting detached partitions).  But
> AFAICS what happens is that the plan for the RI query gets cached after
> a few tries; so we do build the partdesc a few times at first, but later
> we use the cached plan and so we no longer use that one.  So at least in
> the normal cases this isn't a serious problem that I can see.

Actually, ri_trigger.c (or really plancache.c) is not very good at
caching the plan when querying partitioned tables; it always chooses
to replan because a generic plan, even with runtime pruning built into
it, looks very expensive compared to a custom one.  Now that's a
problem we will have to fix sooner than later, but until then we have
to work around it.

Here is an example that shows the problem:

create unlogged table pk_parted (a int primary key) partition by range (a);
select 'create unlogged table pk_parted_' || i || ' partition of
pk_parted for values from (' || (i-1) * 1000 + 1 || ') to (' ||  i *
1000 + 1 || ');' from generate_series(1, 1000) i;
\gexec
create unlogged table fk (a int references pk_parted);
insert into pk_parted select generate_series(1, 10000);
begin;
select * from fk_parted where a = 1;

In another session:

alter table pk_parted detach partition pk_parted_1000 concurrently;
<blocks; cancel using ctrl-c>

Back in the 1st session:

end;
insert into fk select generate_series(1, 10000);
INSERT 0 10000
Time: 47400.792 ms (00:47.401)

The insert took unusually long, because the PartitionDesc for
pk_parted had to be built exactly 10000 times, because there's a
detach-pending partition lying around.  There is also a danger of an
OOM with such an insert because of leaking into PortalContext the
memory of every PartitionDesc thus built, especially with larger
counts of PK partitions and rows inserted into the FK table.

Also, I noticed that all queries on pk_parted, not just the RI
queries, have to build the PartitionDesc every time, so take that much
longer:

-- note the planning time
explain analyze select * from pk_parted where a = 1;
                                                                 QUERY
PLAN


-------------------------------------------------------------------------------------------------------------------------------
--------------
 Index Only Scan using pk_parted_1_pkey on pk_parted_1 pk_parted
(cost=0.28..8.29 rows=1 width=4) (actual time=0.016..0.017 ro
ws=1 loops=1)
   Index Cond: (a = 1)
   Heap Fetches: 1
 Planning Time: 7.543 ms
 Execution Time: 0.044 ms
(5 rows)

Finalizing the detach makes the insert and the query finish in normal
time, because the PartitionDesc can be cached again:

alter table pk_parted detach partition pk_parted_1000 finalize;
insert into fk select generate_series(1, 10000);
INSERT 0 10000
Time: 855.336 ms

explain analyze select * from pk_parted where a = 1;
                                                                 QUERY
PLAN


-------------------------------------------------------------------------------------------------------------------------------
--------------
 Index Only Scan using pk_parted_1_pkey on pk_parted_1 pk_parted
(cost=0.28..8.29 rows=1 width=4) (actual time=0.033..0.036 ro
ws=1 loops=1)
   Index Cond: (a = 1)
   Heap Fetches: 1
 Planning Time: 0.202 ms
 Execution Time: 0.075 ms
(5 rows)

I am afraid we may have to fix this in the code after all, because
there does not seem a good way to explain this away in the
documentation.  If I read correctly, you did try an approach of
caching the PartitionDesc that we currently don't, no?

-- 
Amit Langote
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Aleksander Alekseev
Date:
Subject: Re: Disallow cancellation of waiting for synchronous replication
Next
From: Magnus Hagander
Date:
Subject: Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?