Thread: Delay locking partitions during query execution

Delay locking partitions during query execution

From
David Rowley
Date:
Over on [1] I'm proposing to delay locking partitions of a partitioned
table that's the target of an INSERT or UPDATE command until we first
route a tuple to the partition.  Currently, we go and lock all
partitions, even if we just insert a single tuple into a single
partition. The patch in [1] improves the situation when there are many
partitions and only a few tuples to route to just a few of them.

Here, along similar lines to the above, I'd like to take this even
further and change things so that we don't lock *any* partitions during
AcquireExecutorLocks() and instead just lock them when we first access
them with ExecGetRangeTableRelation().  This
improves the situation when many partitions get run-time pruned, as
we'll never bother locking those at all since we'll never call
ExecGetRangeTableRelation() on them. We'll of course still lock the
partitioned table so that plan invalidation works correctly.

This does make the locking order less well defined, but I'm already
proposing something similar in [1], and over there I've mentioned that
I can't quite see any huge issues with doing that.  We already don't lock all
partitions inside AcquireExecutorLocks() during INSERT with VALUES
anyway.

The attached patch implements this delayed locking.  A quick benchmark
shows a pretty good performance improvement when there are a large
number of partitions and run-time pruning prunes almost all of them.

Setup:
create table hashp (a int) partition by hash(a);
select 'create table hashp' || x::text || ' partition of hashp for
values with (modulus 10000, remainder ' || x::text || ');' from
generate_series(0,9999) x;
\gexec
insert into hashp select generate_series(1,1000000);

bench.sql:
\set p_a 13315
select * from hashp where a = :p_a;

Master: 10000 parts

$ pgbench -n -f bench.sql -M prepared -T 60 postgres
tps = 108.882749 (excluding connections establishing)
tps = 108.245437 (excluding connections establishing)

delaylock: 10000 parts

$ pgbench -n -f bench.sql -M prepared -T 60 postgres
tps = 1068.289505 (excluding connections establishing)
tps = 1092.797015 (excluding connections establishing)

More could be done to make this quite a bit faster again, but that
mostly requires the range table coming directly from the planner as an
array and marking which array elements require locking with a
Bitmapset. This'll save having to loop over the entire large array
that mostly does not need anything locked. Similar can be done for
ExecCheckRTPerms(), but that's also for another day. With those
changes and some further tweaks done on the Append/MergeAppend code,
tps is about 22k on my machine, which is just slightly slower than
with an equivalent non-partitioned table.

I'll add the attached patch to the January commitfest.

[1] https://www.postgresql.org/message-id/CAKJS1f-%3DFnMqmQP6qitkD%2BxEddxw22ySLP-0xFk3JAqUX2yfMw%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Delay locking partitions during query execution

From
Tomas Vondra
Date:
On 12/3/18 12:42 PM, David Rowley wrote:
> ...
>
> Master: 10000 parts
> 
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
> tps = 108.882749 (excluding connections establishing)
> tps = 108.245437 (excluding connections establishing)
> 
> delaylock: 10000 parts
> 
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
> tps = 1068.289505 (excluding connections establishing)
> tps = 1092.797015 (excluding connections establishing)
> 

I'm a bit confused, because I can't reproduce any such speedup. I've
used the attached script that varies the number of partitions (which
worked quite nicely in the INSERT thread), but I'm getting results like
this:

    partitions      0       100     1000   10000
    --------------------------------------------
    master         49      1214      186      11
    patched        53      1225      187      11

So I don't see any significant speedup, for some reason :-(

Before I start digging into this, is there something that needs to be
done to enable it?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Fri, 4 Jan 2019 at 02:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I'm a bit confused, because I can't reproduce any such speedup. I've
> used the attached script that varies the number of partitions (which
> worked quite nicely in the INSERT thread), but I'm getting results like
> this:
>
>     partitions      0       100     1000   10000
>     --------------------------------------------
>     master         49      1214      186      11
>     patched        53      1225      187      11
>
> So I don't see any significant speedup, for some reason :-(
>
> Before I start digging into this, is there something that needs to be
> done to enable it?

Thanks for looking at this.

One thing I seem to quite often forget to mention is that I was running with:

plan_cache_mode = force_generic_plan
max_parallel_workers_per_gather = 0

Without changing plan_cache_mode, the planner would likely never
favour a generic plan, since it will not appear to be very efficient
due to run-time partition pruning not being taken into account in the
costing.

Also, with a generic plan, the planner will likely want to build a
parallel plan since it sees up to 10k partitions that need to be
scanned. max_parallel_workers_per_gather = 0 prevents that.
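
In case it's useful for reproducing this, one way to apply those two
settings just for the benchmark database (a sketch, assuming the
database is "postgres" as in the pgbench commands upthread) is:

alter database postgres set plan_cache_mode = force_generic_plan;
alter database postgres set max_parallel_workers_per_gather = 0;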

(Ideally, the planner would cost run-time pruning, but it's not quite
so simple for RANGE partitions with non-equality operators. Likely
we'll want to fix that one day, but that's a topic for another thread.)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Tomas Vondra
Date:
On 1/3/19 10:50 PM, David Rowley wrote:
> On Fri, 4 Jan 2019 at 02:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> I'm a bit confused, because I can't reproduce any such speedup. I've
>> used the attached script that varies the number of partitions (which
>> worked quite nicely in the INSERT thread), but I'm getting results like
>> this:
>>
>>     partitions      0       100     1000   10000
>>     --------------------------------------------
>>     master         49      1214      186      11
>>     patched        53      1225      187      11
>>
>> So I don't see any significant speedup, for some reason :-(
>>
>> Before I start digging into this, is there something that needs to be
>> done to enable it?
> 
> Thanks for looking at this.
> 
> One thing I seem to quite often forget to mention is that I was running with:
> 
> plan_cache_mode = force_generic_plan
> max_parallel_workers_per_gather = 0;
> 
> Without changing plan_cache_mode then the planner would likely never
> favour a generic plan since it will not appear to be very efficient
> due to the lack of consideration to the costing of run-time partition
> pruning.
> 
> Also, then with a generic plan, the planner will likely want to build
> a parallel plan since it sees up to 10k partitions that need to be
> scanned. max_parallel_workers_per_gather = 0 puts it right.
> 
> (Ideally, the planner would cost run-time pruning, but it's not quite
> so simple for RANGE partitions with non-equality operators. Likely
> we'll want to fix that one day, but that's not for here)
> 

Nope, that doesn't seem to make any difference :-( In all cases the
resulting plan (with 10k partitions) looks like this:

test=# explain analyze select * from hashp where a = 13442;

                              QUERY PLAN
-----------------------------------------------------------------------
 Append  (cost=0.00..41.94 rows=13 width=4)
         (actual time=0.018..0.018 rows=0 loops=1)
   ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
                             (actual time=0.017..0.018 rows=0 loops=1)
         Filter: (a = 13442)
 Planning Time: 75.870 ms
 Execution Time: 0.471 ms
(5 rows)

and it doesn't change (neither the timings nor the plan shape) no matter how I set any of
the GUCs.

Furthermore, I've repeatedly run into this issue:

test=# \d hashp
ERROR:  unrecognized token: "false"
LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
                                                             ^
I have no idea why it breaks like this, and it's somewhat random (i.e.
not readily reproducible). But I've only ever seen it with this patch
applied.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Fri, 4 Jan 2019 at 11:48, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Nope, that doesn't seem to make any difference :-( In all cases the
> resulting plan (with 10k partitions) looks like this:
>
> test=# explain analyze select * from hashp where a = 13442;
>
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..41.94 rows=13 width=4)
>          (actual time=0.018..0.018 rows=0 loops=1)
>    ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
>                              (actual time=0.017..0.018 rows=0 loops=1)
>          Filter: (a = 13442)
>  Planning Time: 75.870 ms
>  Execution Time: 0.471 ms
> (5 rows)
>
> and it doesn't change (the timings on shape) no matter how I set any of
> the GUCs.

For this to work, run-time pruning needs to take place, so it must be
a PREPAREd statement.

With my test I used:

bench.sql:
\set p_a 13315
select * from hashp where a = :p_a;

$ pgbench -n -f bench.sql -M prepared -T 60 postgres

You'll know you're getting a generic plan when you see "Filter (a =
$1)" and see "Subplans Removed: 9999" below the Append.

> Furthermore, I've repeatedly ran into this issue:
>
> test=# \d hashp
> ERROR:  unrecognized token: "false"
> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
>                                                              ^
> I have no idea why it breaks like this, and it's somewhat random (i.e.
> not readily reproducible). But I've only ever seen it with this patch
> applied.

You'll probably need to initdb with the patch applied as there's a new
field in RangeTblEntry. If there's a serialised one of these stored in
the catalogue somewhere then the new read function will have issues
reading the old serialised format.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Tomas Vondra
Date:

On 1/3/19 11:57 PM, David Rowley wrote:
> On Fri, 4 Jan 2019 at 11:48, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Nope, that doesn't seem to make any difference :-( In all cases the
>> resulting plan (with 10k partitions) looks like this:
>>
>> test=# explain analyze select * from hashp where a = 13442;
>>
>>                               QUERY PLAN
>> -----------------------------------------------------------------------
>>  Append  (cost=0.00..41.94 rows=13 width=4)
>>          (actual time=0.018..0.018 rows=0 loops=1)
>>    ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
>>                              (actual time=0.017..0.018 rows=0 loops=1)
>>          Filter: (a = 13442)
>>  Planning Time: 75.870 ms
>>  Execution Time: 0.471 ms
>> (5 rows)
>>
>> and it doesn't change (the timings on shape) no matter how I set any of
>> the GUCs.
> 
> For this to work, run-time pruning needs to take place, so it must be
> a PREPAREd statement.
> 
> With my test I used:
> 
> bench.sql:
> \set p_a 13315
> select * from hashp where a = :p_a;
> 
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
> 
> You'll know you're getting a generic plan when you see "Filter (a =
> $1)" and see "Subplans Removed: 9999" below the Append.
> 

Indeed, with prepared statements I now see some improvements:

    partitions    0      100     1000    10000
    --------------------------------------------
    master       19     1590     2090      128
    patched      18     1780     6820     1130

So, that's nice. I wonder why the throughput drops so fast between 1k
and 10k partitions, but I'll look into that later.

Does this mean this optimization can only ever work with prepared
statements, or can it be made to work with regular plans too?

>> Furthermore, I've repeatedly ran into this issue:
>>
>> test=# \d hashp
>> ERROR:  unrecognized token: "false"
>> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
>>                                                              ^
>> I have no idea why it breaks like this, and it's somewhat random (i.e.
>> not readily reproducible). But I've only ever seen it with this patch
>> applied.
> 
> You'll probably need to initdb with the patch applied as there's a new
> field in RangeTblEntry. If there's a serialised one of these stored in
> the in the catalogue somewhere then the new read function will have
> issues reading the old serialised format.
> 

D'oh! That explains it, because switching from/to patched binaries might
have easily been triggering the error. I've checked that there are no
changes to catalogs, but it did not occur to me that adding a new RTE
field could have such consequences ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Fri, 4 Jan 2019 at 13:01, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 1/3/19 11:57 PM, David Rowley wrote:
> > You'll know you're getting a generic plan when you see "Filter (a =
> > $1)" and see "Subplans Removed: 9999" below the Append.
> >
>
> Indeed, with prepared statements I now see some improvements:
>
>     partitions    0      100     1000    10000
>     --------------------------------------------
>     master       19     1590     2090      128
>     patched      18     1780     6820     1130
>
> So, that's nice. I wonder why the throughput drops so fast between 1k
> and 10k partitions, but I'll look into that later.

Those look strange. Why is it so slow with the non-partitioned case?
I'd have expected that to be the fastest result.

> Does this mean this optimization can only ever work with prepared
> statements, or can it be made to work with regular plans too?

That's a good question.  I can confirm it's only of any use if
run-time pruning occurs during init plan. For this patch to be of any
use, you'll
need to see a "Subplans Removed: <N>" in the query's EXPLAIN ANALYZE
output.  If you don't see this then all Append/MergeAppend subplans
were initialised and the relation lock would have been obtained
regardless of whether delaylock is set for the relation. The effect of the
patch here would just have been to obtain the lock during the first
call to ExecGetRangeTableRelation() for that relation instead of
during AcquireExecutorLocks().  There may actually be a tiny overhead
in this case since AcquireExecutorLocks() must skip the delaylock
relations, but they'll get locked later anyway. I doubt you could
measure that though.

When run-time pruning is able to prune partitions before execution
starts, the optimisation is useful since AcquireExecutorLocks() won't
obtain the lock and ExecGetRangeTableRelation() won't be called for the
pruned partitions' rels, as we don't bother to init the
Append/MergeAppend subplans for those.

I'm a little unsure if there are any cases where this type of run-time
pruning can occur when PREPAREd statements are not in use.  Initplan
parameters can't prune before executor run since we need to run the
executor to obtain the values of those. Likewise for evaluation of
volatile functions. So I think run-time pruning before initplan is
only ever going to happen for PARAM_EXTERN type parameters, i.e. with
PREPAREd statements (REF: analyze_partkey_exprs() partprune.c).
Without PREPAREd statements, if the planner itself was unable to prune
the partitions it would already have obtained the lock during
planning, so AcquireExecutorLocks(), in this case, would bump into the
local lock hash table entry and forego trying to obtain the lock
itself.  That's not free, but it's significantly faster than obtaining
a lock.

Or in short... it's only good for prepared statements where the
statement's parameters allow for run-time pruning. However, that's a
pretty large case since the planner is still very slow at planning for
large numbers of partitions, meaning it's common (or at least it will
be) for people to use PREPAREd statements and plan_cache_mode =
force_generic_plan.
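
To make that concrete with the hashp table from my first email (just a
sketch; I've not re-run these exact statements, and the prepared
statement name is arbitrary):

set plan_cache_mode = force_generic_plan;

-- external parameter: prunable before executor startup, so with the
-- patch the 9999 pruned partitions are never locked
prepare q (int) as select * from hashp where a = $1;
explain (analyze, costs off, timing off) execute q (13315);
-- expect "Subplans Removed: 9999" under the Append

-- exec-time parameter from an initplan: its value is only known once
-- the executor is running, so every subplan is initialised (and its
-- relation locked), and the pruned ones should just show "(never executed)"
explain (analyze, costs off, timing off)
select * from hashp where a = (select 13315);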

> >> Furthermore, I've repeatedly ran into this issue:
> >>
> >> test=# \d hashp
> >> ERROR:  unrecognized token: "false"
> >> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
> >>                                                              ^
> >> I have no idea why it breaks like this, and it's somewhat random (i.e.
> >> not readily reproducible). But I've only ever seen it with this patch
> >> applied.
> >
> > You'll probably need to initdb with the patch applied as there's a new
> > field in RangeTblEntry. If there's a serialised one of these stored in
> > the in the catalogue somewhere then the new read function will have
> > issues reading the old serialised format.
> >
>
> D'oh! That explains it, because switching from/to patched binaries might
> have easily been triggering the error. I've checked that there are no
> changes to catalogs, but it did not occur to me adding a new RTE field
> could have such consequences ...

schema-wise, no changes, but data-wise, there are changes.

$ pg_dump --schema=pg_catalog --data-only postgres | grep ":rellockmode" | wc -l
    121

All of which are inside the pg_rewrite table:

$ pg_dump --schema=pg_catalog --data-only --table=pg_rewrite postgres
| grep ":rellockmode" | wc -l
    121

I just used ":rellockmode" here as it's a field that exists in RangeTblEntry.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Tomas Vondra
Date:

On 1/4/19 1:53 AM, David Rowley wrote:
> On Fri, 4 Jan 2019 at 13:01, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On 1/3/19 11:57 PM, David Rowley wrote:
>>> You'll know you're getting a generic plan when you see "Filter (a =
>>> $1)" and see "Subplans Removed: 9999" below the Append.
>>>
>>
>> Indeed, with prepared statements I now see some improvements:
>>
>>     partitions    0      100     1000    10000
>>     --------------------------------------------
>>     master       19     1590     2090      128
>>     patched      18     1780     6820     1130
>>
>> So, that's nice. I wonder why the throughput drops so fast between 1k
>> and 10k partitions, but I'll look into that later.
> 
> Those look strange. Why is it so slow with the non-partitioned case?
> I'd have expected that to be the fastest result.
> 

Because there are 1M rows in the table, and it's doing a seqscan.

That also makes the other cases difficult to compare, because with few
partitions there will be multiple pages per partition, scanned
sequentially. And with many partitions it's likely only a single page
with a couple of rows on it. I'll think about constructing a better
benchmark, to make it easier to compare - perhaps by using a single row
per table and/or adding indexes. Or something ...

>> Does this mean this optimization can only ever work with prepared
>> statements, or can it be made to work with regular plans too?
> 
> That's a good question.  I confirm it's only any use of run-time
> pruning occurs during init plan. For this patch to be any use, you'll
> need to see a "Subplans Removed: <N>" in the query's EXPLAIN ANALYZE
> output.  If you don't see this then all Append/MergeAppend subplans
> were initialised and the relation lock would have been obtained
> regardless of if delaylock is set for the relation. The effect of the
> patch here would just have been to obtain the lock during the first
> call to ExecGetRangeTableRelation() for that relation instead of
> during AcquireExecutorLocks().  There may actually be a tiny overhead
> in this case since AcquireExecutorLocks() must skip the delaylock
> relations, but they'll get locked later anyway. I doubt you could
> measure that though.
> 
> When run-time pruning is able to prune partitions before execution
> starts then the optimisation is useful since AcquireExecutorLocks()
> won't obtain the lock and ExecGetRangeTableRelation() won't be called
> for all pruned partition's rels as we don't bother to init the
> Append/MergeAppend subplan for those.
> 
> I'm a little unsure if there are any cases where this type of run-time
> pruning can occur when PREPAREd statements are not in use.  Initplan
> parameters can't prune before executor run since we need to run the
> executor to obtain the values of those. Likewise for evaluation of
> volatile functions. So I think run-time pruning before initplan is
> only ever going to happen for PARAM_EXTERN type parameters, i.e. with
> PREPAREd statements (REF: analyze_partkey_exprs() partprune.c).
> Without PREPAREd statements, if the planner itself was unable to prune
> the partitions it would already have obtained the lock during
> planning, so AcquireExecutorLocks(), in this case, would bump into the
> local lock hash table entry and forego trying to obtain the lock
> itself.  That's not free, but it's significantly faster than obtaining
> a lock.
> 
> Or in short... it only good for prepared statements where the
> statement's parameters allow for run-time pruning. However, that's a
> pretty large case since the planner is still very slow at planning for
> large numbers of partitions, meaning it's common (or at least it will
> be) for people to use PREPAREd statement and plan_cache_mode =
> force_generic_plan;
> 

OK, thanks for the explanation. One more reason to use prepared
statements in such cases ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Sat, 5 Jan 2019 at 03:12, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >>
> >>     partitions    0      100     1000    10000
> >>     --------------------------------------------
> >>     master       19     1590     2090      128
> >>     patched      18     1780     6820     1130
> >>
> >> So, that's nice. I wonder why the throughput drops so fast between 1k
> >> and 10k partitions, but I'll look into that later.
> >
> > Those look strange. Why is it so slow with the non-partitioned case?
> > I'd have expected that to be the fastest result.
> >
>
> Because there are 1M rows in the table, and it's doing a seqscan.

Of course. My test did the same, but I didn't consider that because I
had so few rows per partition.  Likely just adding an index would make
the comparison more sensible.
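
e.g. something like the following (an untested sketch); for the
non-partitioned case in particular it avoids seq scanning all 1M rows
on every lookup, which should make the numbers easier to compare:

create index on hashp (a);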


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Tue, 4 Dec 2018 at 00:42, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Over here and along similar lines to the above, but this time I'd like
> to take this even further and change things so we don't lock *any*
> partitions during AcquireExecutorLocks() and instead just lock them
> when we first access them with ExecGetRangeTableRelation().

I've attached a rebase version of this. The previous version
conflicted with some changes made in b60c397599.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Delay locking partitions during query execution

From
Amit Langote
Date:
On 2019/01/04 9:53, David Rowley wrote:
> Without PREPAREd statements, if the planner itself was unable to prune
> the partitions it would already have obtained the lock during
> planning, so AcquireExecutorLocks(), in this case, would bump into the
> local lock hash table entry and forego trying to obtain the lock
> itself.  That's not free, but it's significantly faster than obtaining
> a lock.

Hmm, AcquireExecutorLocks is only called if prepared statements are used,
and then only if a generic cached plan is reused.

GetCachedPlan -> CheckCachedPlan -> AcquireExecutorLocks

In GetCachedPlan:

    if (!customplan)
    {
        if (CheckCachedPlan(plansource))


Thanks,
Amit



Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Thu, 17 Jan 2019 at 17:18, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2019/01/04 9:53, David Rowley wrote:
> > Without PREPAREd statements, if the planner itself was unable to prune
> > the partitions it would already have obtained the lock during
> > planning, so AcquireExecutorLocks(), in this case, would bump into the
> > local lock hash table entry and forego trying to obtain the lock
> > itself.  That's not free, but it's significantly faster than obtaining
> > a lock.
>
> Hmm, AcquireExecutorLocks is only called if prepared statements are used
> and that too if a generic cached plan is reused.
>
> GetCachedPlan -> CheckCachedPlan -> AcquireExecutorLocks

ah okay. Thanks for the correction, but either way, it's really only
useful when run-time pruning prunes partitions before initialising the
Append/MergeAppend node.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Sat, 12 Jan 2019 at 23:42, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached a rebase version of this. The previous version
> conflicted with some changes made in b60c397599.

I've attached another rebased version. This one fixes up the conflict
with e0c4ec07284.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Tue, 4 Dec 2018 at 00:42, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Over here and along similar lines to the above, but this time I'd like
> to take this even further and change things so we don't lock *any*
> partitions during AcquireExecutorLocks() and instead just lock them
> when we first access them with ExecGetRangeTableRelation().  This
> improves the situation when many partitions get run-time pruned, as
> we'll never bother locking those at all since we'll never call
> ExecGetRangeTableRelation() on them. We'll of course still lock the
> partitioned table so that plan invalidation works correctly.

I started looking at this patch again and I do see a problem with it.
Let me explain:

Currently during LockRelationOid() when we obtain a lock on a relation
we'll check for rel cache invalidation messages. This means that
during AcquireExecutorLocks(), as called from CheckCachedPlan(), we
could miss invalidation messages since we're no longer necessarily
locking the partitions.

I think this only creates problems for a partition's reloptions being
changed and cached plans possibly not being properly invalidated when
that happens. I might be wrong about that, but even if I'm not, there
might still be more important reasons to ensure we invalidate the plan
properly in the future.

The following shows the problem:

create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);
insert into listp select x from generate_series(1,2)x,
generate_series(1,100000);
analyze listp;

-- session 1;
begin;
prepare q1 as select count(*) from listp;
explain (costs off,analyze, timing off, summary off) execute q1;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=66667 loops=3)
                     ->  Parallel Seq Scan on listp2 (actual rows=33333 loops=3)
                      ->  Parallel Seq Scan on listp1 (actual rows=100000 loops=1)
(8 rows)
-- session 2
alter table listp1 set (parallel_workers=0);

-- session 1

explain (costs off, analyze, timing off, summary off) execute q1; --
same as previous plan, i.e. still does a parallel scan on listp1.

One way around this would be to always perform an invalidation on the
partition's parent when performing a relcache invalidation on the
partition.  We could perhaps invalidate all the way up to the top
level partitioned table, that way we could just obtain a lock on the
target partitioned table during AcquireExecutorLocks(). I'm currently
setting the delaylock flag to false for leaf partitions only.

It's not particularly great to think of invalidating the partitioned
table's relcache entry for this, but it's also not so great that we
lock all partitions when runtime pruning might prune away the
partition anyway.  I don't see a way that we can have both, but I'm
happy to hear people's thoughts about this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Amit Langote
Date:
On 2019/01/28 10:26, David Rowley wrote:
> On Tue, 4 Dec 2018 at 00:42, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Over here and along similar lines to the above, but this time I'd like
>> to take this even further and change things so we don't lock *any*
>> partitions during AcquireExecutorLocks() and instead just lock them
>> when we first access them with ExecGetRangeTableRelation().  This
>> improves the situation when many partitions get run-time pruned, as
>> we'll never bother locking those at all since we'll never call
>> ExecGetRangeTableRelation() on them. We'll of course still lock the
>> partitioned table so that plan invalidation works correctly.
> 
> I started looking at this patch again and I do see a problem with it.
> Let me explain:
> 
> Currently during LockRelationOid() when we obtain a lock on a relation
> we'll check for rel cache invalidation messages. This means that
> during AcquireExecutorLocks(), as called from CheckCachedPlan(), we
> could miss invalidation messages since we're no longer necessarily
> locking the partitions.
> 
> I think this only creates problems for partition's reloptions being
> changed and cached plans possibly being not properly invalidated when
> that happens, but I might be wrong about that, but if I'm not, there
> still also might be more important reasons to ensure we invalidate the
> plan properly in the future.

It seems to me that plancache.c doesn't really need to perform
AcquireExecutorLocks()/LockRelationOid() to learn that a partition's
reloptions property has changed in order to discard a generic plan and
build a new one.  AFAICT, PlanCacheRelCallback() takes care of resetting
a cached plan by observing that an invalidation message it received,
either from the same session or from another session, belongs to one of
the relations in PlannedStmt.relationOids.  That list must already
contain all partitions' OIDs.

Session 1:
prepare q as select count(*) from p;
explain (costs off) execute q(1);
                     QUERY PLAN
────────────────────────────────────────────────────
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Parallel Seq Scan on p4
                     ->  Parallel Seq Scan on p1
                     ->  Parallel Seq Scan on p2
                     ->  Parallel Seq Scan on p3
                     ->  Parallel Seq Scan on p_def
(10 rows)


-- same session (p1 can no longer use parallel scan)
alter table p1 set (parallel_workers=0);
explain (costs off) execute q(1);
                     QUERY PLAN
────────────────────────────────────────────────────
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Seq Scan on p1
                     ->  Parallel Seq Scan on p4
                     ->  Parallel Seq Scan on p2
                     ->  Parallel Seq Scan on p3
                     ->  Parallel Seq Scan on p_def
(10 rows)

-- another session
alter table p1 reset (parallel_workers);

-- back to session 1 (p1 can now use parallel scan)
explain (costs off) execute q(1);
                     QUERY PLAN
────────────────────────────────────────────────────
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Parallel Seq Scan on p4
                     ->  Parallel Seq Scan on p1
                     ->  Parallel Seq Scan on p2
                     ->  Parallel Seq Scan on p3
                     ->  Parallel Seq Scan on p_def
(10 rows)

Thanks,
Amit



Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Mon, 28 Jan 2019 at 20:45, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It seems to me that plancache.c doesn't really need to perform
> AcquireExecutorLocks()/LockRelationOid() to learn that a partition's
> reloptions property has changed to discard a generic plan and build a new
> one.  AFAICT, PlanCacheRelCallback() takes care of resetting a cached plan
> by observing that an invalidation message that it received  either from
> the same session or from another session belongs to one of the relations
> in PlannedStmt.relationOids.  That list must already contain all
> partitions' OIDs.

Really? So when you tried my case you properly got a plan with a
non-parallel Seq Scan on listp1?

I imagine you didn't with yours since we check for relcache
invalidations at the start of a transaction.  I performed both my
EXECUTEs in the same transaction.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Amit Langote
Date:
On 2019/01/28 20:27, David Rowley wrote:
> On Mon, 28 Jan 2019 at 20:45, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It seems to me that plancache.c doesn't really need to perform
>> AcquireExecutorLocks()/LockRelationOid() to learn that a partition's
>> reloptions property has changed to discard a generic plan and build a new
>> one.  AFAICT, PlanCacheRelCallback() takes care of resetting a cached plan
>> by observing that an invalidation message that it received  either from
>> the same session or from another session belongs to one of the relations
>> in PlannedStmt.relationOids.  That list must already contain all
>> partitions' OIDs.
> 
> Really? So when you tried my case you properly got a plan with a
> non-parallel Seq Scan on listp1?
> 
> I imagine you didn't with yours since we check for relcache
> invalidations at the start of a transaction.  I performed both my
> EXECUTEs in the same transaction.

Yeah, I performed each EXECUTE in a new transaction, so not the same case
as yours.  Sorry about the noise.

However, I tried the example as you described and the plan *doesn't*
change after a concurrent update of reloptions on master (without the
patch) either.

session 1:
begin;
prepare q1 as select count(*) from listp;
explain (costs off, analyze, timing off, summary off) execute q1;
                                   QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=66667 loops=3)
                     ->  Parallel Seq Scan on listp1 (actual rows=33333 loops=3)
                     ->  Parallel Seq Scan on listp2 (actual rows=50000 loops=2)
(8 rows)

session 2:
alter table listp1 set (parallel_workers=0);

session 1:
explain (costs off, analyze, timing off, summary off) execute q1;
                                   QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=66667 loops=3)
                     ->  Parallel Seq Scan on listp1 (actual rows=33333 loops=3)
                     ->  Parallel Seq Scan on listp2 (actual rows=50000 loops=2)
(8 rows)

I then built master with -DCATCACHE_FORCE_RELEASE and the plan does
change, but because of syscache misses here and there resulting in opening
the erring system catalog which then does AcceptInvalidationMessages().

In the normal build, invalidation messages don't seem to be processed even
by calling AcquireExecutorLocks(), which is perhaps not normal.  It seems
that the following condition in LockRelationOid is never true when called
from AcquireExecutorLocks:

    if (res != LOCKACQUIRE_ALREADY_CLEAR)
    {
        AcceptInvalidationMessages();
        MarkLockClear(locallock);
    }

Bug, maybe?

Thanks,
Amit



Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Tue, 29 Jan 2019 at 19:42, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> However, I tried the example as you described and the plan *doesn't*
> change due to concurrent update of reloptions with master (without the
> patch) either.

Well, I didn't think to try that.  I just assumed I had broken it.
Could well be related to lowering the lock levels on setting these
reloptions.  Once upon a time, they required an AEL but that was
changed in e5550d5fec6. So previous to that commit you'd have been
blocked from making the concurrent change while the other session was
in a transaction.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Robert Haas
Date:
On Sun, Jan 27, 2019 at 8:26 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> One way around this would be to always perform an invalidation on the
> partition's parent when performing a relcache invalidation on the
> partition.  We could perhaps invalidate all the way up to the top
> level partitioned table, that way we could just obtain a lock on the
> target partitioned table during AcquireExecutorLocks(). I'm currently
> only setting the delaylock flag to false for leaf partitions only.

Would this problem go away if we adopted the proposal discussed in
http://postgr.es/m/24823.1544220272@sss.pgh.pa.us and, if so, is that
a good fix?

I don't quite understand why this is happening.  It seems like as long
as you take at least one new lock, you'll process *every* pending
invalidation message, and that should trigger replanning as long as
the dependencies are correct.  But maybe the issue is that you hold
all the other locks you need already, and the lock on the partition at
issue is only acquired during execution, at which point it's too late
to replan.  If so, then I think something along the lines of the above
might make a lot of sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Delay locking partitions during query execution

From
Tomas Vondra
Date:

On 1/31/19 10:31 PM, Robert Haas wrote:
> On Sun, Jan 27, 2019 at 8:26 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> One way around this would be to always perform an invalidation on the
>> partition's parent when performing a relcache invalidation on the
>> partition.  We could perhaps invalidate all the way up to the top
>> level partitioned table, that way we could just obtain a lock on the
>> target partitioned table during AcquireExecutorLocks(). I'm currently
>> only setting the delaylock flag to false for leaf partitions only.
> 
> Would this problem go away if we adopted the proposal discussed in
> http://postgr.es/m/24823.1544220272@sss.pgh.pa.us and, if so, is that
> a good fix?
> 
> I don't quite understand why this is happening.  It seems like as long
> as you take at least one new lock, you'll process *every* pending
> invalidation message, and that should trigger replanning as long as
> the dependencies are correct.  But maybe the issue is that you hold
> all the other locks you need already, and the lock on the partition at
> issue is only acquired during execution, at which point it's too late
> to replan.  If so, then I think something along the lines of the above
> might make a lot of sense.
> 

It happens because ConditionalLockRelation does this:

    if (res != LOCKACQUIRE_ALREADY_CLEAR)
    {
        AcceptInvalidationMessages();
        MarkLockClear(locallock);
    }

and with the prepared statement already planned, we end up skipping that
because res == LOCKACQUIRE_ALREADY_CLEAR, which happens because
GrantLockLocal (called from LockAcquireExtended) finds the relation
already locked locally.

I don't know if this is correct or not, though.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Fri, 1 Feb 2019 at 10:32, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Jan 27, 2019 at 8:26 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > One way around this would be to always perform an invalidation on the
> > partition's parent when performing a relcache invalidation on the
> > partition.  We could perhaps invalidate all the way up to the top
> > level partitioned table, that way we could just obtain a lock on the
> > target partitioned table during AcquireExecutorLocks(). I'm currently
> > only setting the delaylock flag to false for leaf partitions only.
>
> Would this problem go away if we adopted the proposal discussed in
> http://postgr.es/m/24823.1544220272@sss.pgh.pa.us and, if so, is that
> a good fix?
>
> I don't quite understand why this is happening.  It seems like as long
> as you take at least one new lock, you'll process *every* pending
> invalidation message, and that should trigger replanning as long as
> the dependencies are correct.  But maybe the issue is that you hold
> all the other locks you need already, and the lock on the partition at
> issue is only acquired during execution, at which point it's too late
> to replan.  If so, then I think something along the lines of the above
> might make a lot of sense.

I think perhaps accepting invalidations at the start of the statement
might appear to fix the problem in master, but I think there's still a
race condition around CheckCachedPlan() since we'll ignore
invalidation messages when we perform LockRelationOid() in
AcquireExecutorLocks() due to the lock already being held.  So there's
likely still a window where the invalidation message could come in and
we miss it.  That said, if the lock is already taken in one session,
and some other session alters some relation we have a lock on, then
that alteration can't have been done with an AEL, as that would have
blocked on our lock, so it must be some ShareUpdateExclusiveLock
change, and if so, that change must not be important enough to block
concurrently executing queries, so it likely shouldn't actually break
anything.

So in summary, I think that checking for invalidation messages at the
start of the statement allows us to replan within a transaction, but
does not guarantee we get the exact correct plan for the current
settings.

I also don't have a full grasp on why we must ignore invalidation
messages when we already have a lock on the relation.  I assume it's
not a huge issue since obtaining a new lock on any other relation will
process the invalidation queue anyway. I know that f868a8143a9
recently made some changes to fix some recursive issues around these
invalidations.

(I admit to not having the best grasp on how all this works, so feel
free to shoot this down)



--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Robert Haas
Date:
On Thu, Jan 31, 2019 at 8:29 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I think perhaps accepting invalidations at the start of the statement
> might appear to fix the problem in master, but I think there's still a
> race condition around CheckCachedPlan() since we'll ignore
> invalidation messages when we perform LockRelationOid() in
> AcquireExecutorLocks() due to the lock already being held.  So there's
> likely still a window where the invalidation message could come in and
> we miss it.    That said, if lock is already taken in one session, and
> some other session alters some relation we have a lock on, then that
> alternation can't have been done with an AEL, as that would have
> blocked on our lock, so it must be some ShareUpdateExclusiveLock
> change, and if so that change must not be important enough to block
> concurrently executing queries, so likely shouldn't actually break
> anything.
>
> So in summary, I think that checking for invalidation messages at the
> start of the statement allows us to replan within a transaction, but
> does not guarantee we can the exact correct plan for the current
> settings.

Right, but that's also an inevitable consequence of postponing
locking.  If you start running the statement without locking every
object involved in the query, then somebody could be holding an
AccessExclusiveLock on one of those objects and changing it in any way
they're allowed to, without holding a lock that conflicts with one
you're holding.  Nothing we do with invalidation messages can
prevent that from happening, though it can make the race narrower. We
*must* be able to cope with having the out-of-date plan or this whole
idea is sunk.

> I also don't have a full grasp on why we must ignore invalidation
> messages when we already have a lock on the relation.  I assume it's
> not a huge issue since obtaining a new lock on any other relation will
> process the invalidation queue anyway. I know that f868a8143a9
> recently made some changes to fix some recursive issues around these
> invalidations.

It's not necessary.  It's a performance optimization.  The idea is: we
need our caches to be up to date.  Anything that changes an object
should take AccessExclusiveLock, send invalidation messages, and only
then release the AccessExclusiveLock.  Anything that looks at an
object should acquire at least AccessShareLock and then process
invalidation messages.  This protocol guarantees that any changes that
are made under AEL will be seen in a race-free way, since the
invalidation messages are sent before releasing the lock and processed
after taking one.  But it doesn't require processing those messages
when we haven't taken a lock, so we don't.

> (I admit to not having the best grasp on how all this works, so feel
> free to shoot this down)

I think the key question here is whether or not you can cope with
someone having done arbitrary AEL-requiring modifications to the
delaylocked partitions.  If you can, the fact that the plan might
sometimes be out-of-date is an inevitable consequence of doing this at
all, but I think we can argue that it's an acceptable consequence as
long as the resulting behavior is not too bletcherous.  If you can't,
then this patch is dead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Delay locking partitions during query execution

From
Tomas Vondra
Date:
On 2/1/19 2:51 PM, Robert Haas wrote:
>> (I admit to not having the best grasp on how all this works, so feel
>> free to shoot this down)
>>
> I think the key question here is whether or not you can cope with
> someone having done arbitrary AEL-requiring modifications to the
> delaylocked partitions.  If you can, the fact that the plan might
> sometimes be out-of-date is an inevitable consequence of doing this at
> all, but I think we can argue that it's an acceptable consequence as
> long as the resulting behavior is not too bletcherous.  If you can't,
> then this patch is dead.

I'm a bit confused - does the patch actually change anything here? As
demonstrated earlier in this thread, it actually behaves just like
master. No?

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Sat, 2 Feb 2019 at 13:43, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On 2/1/19 2:51 PM, Robert Haas wrote:
> >> (I admit to not having the best grasp on how all this works, so feel
> >> free to shoot this down)
> >>
> > I think the key question here is whether or not you can cope with
> > someone having done arbitrary AEL-requiring modifications to the
> > delaylocked partitions.  If you can, the fact that the plan might
> > sometimes be out-of-date is an inevitable consequence of doing this at
> > all, but I think we can argue that it's an acceptable consequence as
> > long as the resulting behavior is not too bletcherous.  If you can't,
> > then this patch is dead.
>
> I'm a bit confused - does the patch actually change anything here? As
> demonstrated earlier in this thread, it actually behaves just like
> master. No?

I guess Robert meant if it, say, failed to execute a cached plan with
an "unable to open relation ..." error after someone concurrently did
something like ALTER TABLE ... SET TABLESPACE on one of the
partitions.

This particular case can't happen with the patch since we always
accept invalidation messages after we obtain a new lock, but I'm
working through various scenarios just in case there is something that
could invalidate the plan, rather than just the relcache entry.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Tom Lane
Date:
[ reposting some questions asked in the wrong thread ]

What I'd like to understand about this patch is how it relates
to Amit L.'s work on making the planner faster for partitioned
UPDATE/DELETE cases (https://commitfest.postgresql.org/22/1778/).
I think that that might render this moot?  Amit's already
getting rid of unnecessary partition locking.

In any case, I'd counsel against putting planner outputs into
RangeTblEntry.  Someday we ought to make parse trees read-only
to the planner, as plan trees are to the executor; defining them
to carry planner results kneecaps any such effort before it starts.
Not to mention the unpleasant consequences for the footprint
of this patch.

            regards, tom lane


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Tue, 19 Feb 2019 at 12:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> [ reposting some questions asked in the wrong thread ]
>
> What I'd like to understand about this patch is how it relates
> to Amit L.'s work on making the planner faster for partitioned
> UPDATE/DELETE cases (https://commitfest.postgresql.org/22/1778/).
> I think that that might render this moot?  Amit's already
> getting rid of unnecessary partition locking.

(I'm still not certain of this patch myself and it does need me to do
some more analysis to check on what was mentioned recently upthread,
but...)

As for the overlap with Amit's patch, it is true he's reducing the
locking down to just non-pruned partitions, so for this patch, that
only leaves us able to skip locking partitions that are about to be
run-time pruned.   That happens to be quite useful when using generic
plans as it's quite possible that the planner is unable to prune any
partitions and the executor would otherwise have to lock all of them,
but perhaps only end up scanning 1 of them.  It's true that when
plan_cache_mode = auto, a generic plan doing this is unlikely ever to
be chosen, since the planner does not cost in the possibility of
run-time pruning, but there's also plan_cache_mode =
force_generic_plan.

> In any case, I'd counsel against putting planner outputs into
> RangeTblEntry.  Someday we ought to make parse trees read-only
> to the planner, as plan trees are to the executor; defining them
> to carry planner results kneecaps any such effort before it starts.
> Not to mention the unpleasant consequences for the footprint
> of this patch.

I agree that one day it would be nice to move towards having the
planner never touch the parse tree, but I didn't really see it as anything
much beyond what rellockmode is today.  If we one day have a solution
for that, then likely the same solution will support fixing the
delaylock flag too.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Sat, 2 Feb 2019 at 02:52, Robert Haas <robertmhaas@gmail.com> wrote:
> I think the key question here is whether or not you can cope with
> someone having done arbitrary AEL-requiring modifications to the
> delaylocked partitions.  If you can, the fact that the plan might
> sometimes be out-of-date is an inevitable consequence of doing this at
> all, but I think we can argue that it's an acceptable consequence as
> long as the resulting behavior is not too bletcherous.  If you can't,
> then this patch is dead.

I spent some time looking at this patch again and thinking about the
locking. I ended up looking through each alter table subcommand by way
of AlterTableGetLockLevel() and going through scenarios with each of
them in my head to try to understand:

a) whether the subcommand can even be applied to a leaf partition; and
b) whether the subcommand can cause a cached plan to become invalid.

I did all the easy ones first then started on the harder ones. I came
to AT_DropConstraint and imagined the following scenario:

-- ** session 1
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);

create index listp1_a_idx on listp1 (a);
create index listp2_a_idx on listp2 (a);

set enable_seqscan = off;
set plan_cache_mode = force_generic_plan;
prepare q1 (int) as select * from listp where a = $1;
execute q1 (1);
begin;
execute q1 (1);


-- ** session 2
drop index listp2_a_idx;

-- ** session 1
execute q1 (2);
ERROR:  could not open relation with OID 16401

The only way I can think of to fix this is to just never lock
partitions at all, and if a lock is to be obtained on a partition,
obtain it on the top-level partitioned table instead.  That's a rather
large change that could have large consequences, and I'm not even sure
it's possible, since we'd need to find the top-level parent before
obtaining the lock, and by then the hierarchy might have changed and
we'd need to recheck, which seems like quite a lot of effort just to
obtain a lock... In any case, that would be a different patch, so it
looks like I'll need to withdraw this one :-(

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Tomas Vondra
Date:

On 3/5/19 6:55 AM, David Rowley wrote:
> On Sat, 2 Feb 2019 at 02:52, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think the key question here is whether or not you can cope with
>> someone having done arbitrary AEL-requiring modifications to the
>> delaylocked partitions.  If you can, the fact that the plan might
>> sometimes be out-of-date is an inevitable consequence of doing this at
>> all, but I think we can argue that it's an acceptable consequence as
>> long as the resulting behavior is not too bletcherous.  If you can't,
>> then this patch is dead.
> 
> I spent some time looking at this patch again and thinking about the
> locking. I ended up looking through each alter table subcommand by way
> of AlterTableGetLockLevel() and going through scenarios with each of
> them in my head to try to understand:
> 
> a) If the subcommand can even be applied to a leaf partition; and
> b) Can the subcommand cause a cached plan to become invalid?
> 
> I did all the easy ones first then started on the harder ones. I came
> to AT_DropConstraint and imagined the following scenario:
> 
> -- ** session 1
> create table listp (a int) partition by list(a);
> create table listp1 partition of listp for values in(1);
> create table listp2 partition of listp for values in(2);
> 
> create index listp1_a_idx on listp1 (a);
> create index listp2_a_idx on listp2 (a);
> 
> set enable_seqscan = off;
> set plan_cache_mode = force_generic_plan;
> prepare q1 (int) as select * from listp where a = $1;
> execute q1 (1);
> begin;
> execute q1 (1);
> 
> 
> -- ** session 2
> drop index listp2_a_idx;
> 
> -- ** session 1
> execute q1 (2);
> ERROR:  could not open relation with OID 16401
> 
> The only way I can think of to fix this is to just never lock
> partitions at all, and if a lock is to be obtained on a partition, it
> must instead be obtained on the top-level partitioned table.  That's a
> rather large change that could have far-reaching consequences, and I'm
> not even sure it's possible, since we'd need to find the top-level
> parent before obtaining the lock, and by then the hierarchy might have
> changed and we'd need to recheck, which seems like quite a lot of
> effort just to obtain a lock... Apart from that, it's not what this
> patch does, so it looks like I'll need to withdraw this one :-(
> 

So you're saying we could

1) look up the parent and lock it
2) repeat the lookup to verify it did not change

I think that could still be a win, assuming that most hierarchies will
be rather shallow (I'd say 2-3 levels will cover something like 95% of
cases, and 4 levels would cover practically 100%). And the second
lookup should be fairly cheap thanks to the syscache and the fact that
hierarchies do not change very often.
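
Roughly, a sketch of that retry loop (untested, and find_topmost_parent()
is just an imagined helper that walks the hierarchy without taking any
locks):

/* needs storage/lmgr.h for LockRelationOid()/UnlockRelationOid() */
static void
lock_partition_via_root(Oid partrelid, LOCKMODE lockmode)
{
    for (;;)
    {
        Oid     root = find_topmost_parent(partrelid);

        LockRelationOid(root, lockmode);

        /* repeat the lookup; the hierarchy may have changed meanwhile */
        if (find_topmost_parent(partrelid) == root)
            return;     /* still the same root, so the lock is right */

        /* raced with an attach/detach somewhere above; retry */
        UnlockRelationOid(root, lockmode);
    }
}

The retry should only ever fire when someone concurrently changed that
part of the hierarchy, which ought to be rare.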

I can't judge how invasive this patch would be, but I agree it's more
complex than the originally proposed patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Delay locking partitions during query execution

From
David Rowley
Date:
On Wed, 6 Mar 2019 at 04:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On 3/5/19 6:55 AM, David Rowley wrote:
> > The only way I can think of to fix this is to just never lock
> > partitions at all, and if a lock is to be obtained on a partition, it
> > must instead be obtained on the top-level partitioned table.  That's a
> > rather large change that could have far-reaching consequences, and I'm
> > not even sure it's possible, since we'd need to find the top-level
> > parent before obtaining the lock, and by then the hierarchy might have
> > changed and we'd need to recheck, which seems like quite a lot of
> > effort just to obtain a lock... Apart from that, it's not what this
> > patch does, so it looks like I'll need to withdraw this one :-(
> >
>
> So you're saying we could
>
> 1) look up the parent and lock it
> 2) repeat the lookup to verify it did not change
>
> I think that could still be a win, assuming that most hierarchies will
> be rather shallow (I'd say 2-3 levels will cover something like 95% of
> cases, and 4 levels would cover practically 100%). And the second
> lookup should be fairly cheap thanks to the syscache and the fact that
> hierarchies do not change very often.

Actually, I'm not sure it could work at all.  It does not seem very
safe to look up a partition's parent without actually holding a lock
on the partition, and we can't lock the partition and then lock each
parent in turn, as that's the exact opposite of the locking order we
use when querying a partitioned table, so it could result in deadlocks.

Many moons ago in the 0002 patch in [1] I proposed to change the way
we store partition hierarchies which involved ditching bool
pg_class.relispartition in favour of Oid pg_class.relpartitionparent.
That would make parent lookups pretty fast, but... given the above it
seems like we couldn't do this at all.
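
Just to illustrate, with that (hypothetical, never-committed) column the
no-lock root lookup would be little more than a syscache walk, though
the locking order problem above would remain:

/* sketch only; pg_class.relpartitionparent does not exist */
static Oid
find_topmost_parent(Oid relid)
{
    for (;;)
    {
        HeapTuple       tuple;
        Form_pg_class   classForm;
        Oid             parent;

        tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
        if (!HeapTupleIsValid(tuple))
            elog(ERROR, "cache lookup failed for relation %u", relid);
        classForm = (Form_pg_class) GETSTRUCT(tuple);
        parent = classForm->relpartitionparent;    /* hypothetical */
        ReleaseSysCache(tuple);

        if (!OidIsValid(parent))
            return relid;       /* not a partition, so it's the root */
        relid = parent;
    }
}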

[1] https://www.postgresql.org/message-id/CAKJS1f9QjUwQrio20Pi%3DyCHmnouf4z3SfN8sqXaAcwREG6k0zQ%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Delay locking partitions during query execution

From
Robert Haas
Date:
On Tue, Mar 5, 2019 at 8:04 PM David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Actually, I'm not sure it could work at all.  It does not seem very
> safe to look up a partition's parent without actually holding a lock
> on the partition, and we can't lock the partition and then lock each
> parent in turn, as that's the exact opposite of the locking order we
> use when querying a partitioned table, so it could result in deadlocks.
>
> Many moons ago in the 0002 patch in [1] I proposed to change the way
> we store partition hierarchies which involved ditching bool
> pg_class.relispartition in favour of Oid pg_class.relpartitionparent.
> That would make parent lookups pretty fast, but... given the above it
> seems like we couldn't do this at all.

One thing that is both kinda nice and kinda strange about our
heavyweight lock manager is that it has no idea what it is locking.
If you say "please lock OID 12345 for me", it does, but it doesn't
know anything about the relationship between that OID and any other
thing you might want to lock.  Compare that to what Gray and Reuter
describe in "Transaction Processing: Concepts and Techniques", Section
7.8, Granular Locking.  There, the set of possible locks forms a tree,
and locking a node of the tree is tantamount to locking all of its
descendants.  Such a concept would be useful here: you could take e.g.
AccessShareLock on the root of the partitioning hierarchy and that
would in effect give you that lock mode on every partition, but
without needing to make a separate entry for each partition lock.
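
Just to make that concrete, here's a toy standalone sketch (not
PostgreSQL code) of the usual multigranularity compatibility rules: a
reader holding S on the partitioned root blocks a DDL session that
needs IX on the root before it can take X on one partition.

#include <stdbool.h>
#include <stdio.h>

typedef enum { L_IS, L_IX, L_S, L_X, NMODES } GranularMode;

static const char *names[NMODES] = {"IS", "IX", "S", "X"};

/* compat[held][requested]: can the requested mode be granted? */
static const bool compat[NMODES][NMODES] = {
    /*           IS     IX     S      X   */
    /* IS */  { true,  true,  true,  false },
    /* IX */  { true,  true,  false, false },
    /* S  */  { true,  false, true,  false },
    /* X  */  { false, false, false, false },
};

int
main(void)
{
    /*
     * Reader takes S on the root, implicitly covering every partition.
     * DDL wanting X on one partition must first take IX on the root;
     * S vs IX conflict, so the DDL waits, with no per-partition entries.
     */
    GranularMode reader = L_S;
    GranularMode ddl = L_IX;

    printf("%s vs %s on the root: %s\n", names[reader], names[ddl],
           compat[reader][ddl] ? "compatible" : "conflict");
    return 0;
}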

Sadly, implementing such a thing for PostgreSQL seems extremely
challenging.  We'd somehow have to build up in shared memory an idea
of what the locking hierarchy was, and then update it as DDL happens,
and drop entries that aren't interesting any more so we don't run out
of shared memory.  Performance and concurrency would be really hard
problems, and assumptions about the current locking model are baked
into code in MANY parts of the system, so finding everything that
needed to be (or could be) changed would probably be extremely
challenging.  But if you could make it all work we might well end up
in a better state than we are today.  However, I'm leaving this
problem for a time when I have six months or a year to do nothing
else, because I'm pretty sure that's what it would take.
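
Purely as a thought experiment, the shared-memory piece might start out
as little more than a map from relation to the node whose lock covers
it, consulted before actually acquiring anything (hand-waving away all
of the hard invalidation, sizing and concurrency questions; none of
this exists):

typedef struct LockCoverageEntry
{
    Oid     relid;          /* hash key: any relation in a hierarchy */
    Oid     rootrelid;      /* ancestor whose lock is taken instead */
} LockCoverageEntry;

/* hypothetical shared hash, e.g. built with ShmemInitHash() */
static HTAB *LockCoverageMap;

static Oid
relation_lock_target(Oid relid)
{
    LockCoverageEntry *entry;
    bool        found;

    entry = hash_search(LockCoverageMap, &relid, HASH_FIND, &found);
    return found ? entry->rootrelid : relid;
}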

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company