Thread: Querying one partition in a function takes locks on all partitions

Querying one partition in a function takes locks on all partitions

From
Evgeny Morozov
Date:
I have a list-partitioned table. When I query the base table but filter
by the partition column in a regular SQL query this takes a lock only on
the one partition being queried, as I expect. However, when the exact
same SQL query is run fom a DB function, with the partition ID passed in
as argument, it takes (shared) locks on ALL partitions - which blocks
any other process that wants an exclusive lock on another partition (and
vice-versa).

Originally found on PG 15.12, but happens on 17.4 as well. Easily
reproducible:

-- One-time setup

create table entity
(
    part_id integer not null
) partition by list (part_id);

create table entity_1 partition of entity for values in (1);
create table entity_2 partition of entity for values in (2);

create function read_partition(which_part int) returns bigint as
'select count(*) from entity where part_id = which_part;'
language sql stable;

-- Then try this, keeping the connection open (so the transaction is
pending):

begin;
select read_partition(1); -- This takes shared locks on entity_1 AND
entity_2

-- select count(*) from entity where part_id = 1; -- but this would only
take a shared lock only on entity_1

If another session tries something that takes an exclusive lock on
another partition, like

alter table entity_2 add column new_column text;

I would expect that to be able to run concurrently, but it blocks due to
the shared lock on entity_2. (The way I originally found the problem was
the opposite: once one client took an exclusive lock on a partition many
others were blocked from reading from ANY partition.)

This seems like quite the "gotcha", especially when the query plan for
the function call (logged via autoexplain) shows it only accessing one
partition (entity_1). Is this expected behavior? If so, is it documented
somewhere?




On 3/21/25 09:27, Evgeny Morozov wrote:
> I have a list-partitioned table. When I query the base table but filter
> by the partition column in a regular SQL query this takes a lock only on
> the one partition being queried, as I expect. However, when the exact
> same SQL query is run fom a DB function, with the partition ID passed in
> as argument, it takes (shared) locks on ALL partitions - which blocks
> any other process that wants an exclusive lock on another partition (and
> vice-versa).
> 
> Originally found on PG 15.12, but happens on 17.4 as well. Easily
> reproducible:
> 
> -- One-time setup
> 
> create table entity
> (
>      part_id integer not null
> ) partition by list (part_id);
> 
> create table entity_1 partition of entity for values in (1);
> create table entity_2 partition of entity for values in (2);
> 
> create function read_partition(which_part int) returns bigint as
> 'select count(*) from entity where part_id = which_part;'
> language sql stable;
> 
> -- Then try this, keeping the connection open (so the transaction is
> pending):
> 
> begin;
> select read_partition(1); -- This takes shared locks on entity_1 AND
> entity_2
> 
> -- select count(*) from entity where part_id = 1; -- but this would only
> take a shared lock only on entity_1
> 
> If another session tries something that takes an exclusive lock on
> another partition, like
> 
> alter table entity_2 add column new_column text;
> 
> I would expect that to be able to run concurrently, but it blocks due to
> the shared lock on entity_2. (The way I originally found the problem was
> the opposite: once one client took an exclusive lock on a partition many
> others were blocked from reading from ANY partition.)
> 
> This seems like quite the "gotcha", especially when the query plan for
> the function call (logged via autoexplain) shows it only accessing one
> partition (entity_1). Is this expected behavior? If so, is it documented
> somewhere?

Hmm, seems to be a sql function issue:

CREATE OR REPLACE FUNCTION public.read_partition(which_part integer)
  RETURNS bigint
  LANGUAGE plpgsql
  STABLE
AS $$
DECLARE
     id_ct bigint;
BEGIN
     select count(*) into id_ct from entity where part_id = $1;
RETURN id_ct;
END;
$$;

BEGIN;

select read_partition(1);
  read_partition
----------------
               0

select relation::regclass, mode from pg_locks ;
  relation |      mode
----------+-----------------
  pg_locks | AccessShareLock
  entity_1 | AccessShareLock
  entity   | AccessShareLock
           | ExclusiveLock


> 
> 
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




On Sat, 22 Mar 2025 at 05:27, Evgeny Morozov
<postgresql4@realityexists.net> wrote:
> select read_partition(1); -- This takes shared locks on entity_1 AND
> entity_2
>
> -- select count(*) from entity where part_id = 1; -- but this would only
> take a shared lock only on entity_1
>
> If another session tries something that takes an exclusive lock on
> another partition, like
>
> alter table entity_2 add column new_column text;

Is this just an example command?  You can't add a column to a
partition directly.

> I would expect that to be able to run concurrently, but it blocks due to
> the shared lock on entity_2. (The way I originally found the problem was
> the opposite: once one client took an exclusive lock on a partition many
> others were blocked from reading from ANY partition.)
>
> This seems like quite the "gotcha", especially when the query plan for
> the function call (logged via autoexplain) shows it only accessing one
> partition (entity_1). Is this expected behavior? If so, is it documented
> somewhere?

It is expected behaviour and, unfortunately, not really documented
anywhere outside of the source code.

What's going on is that PostgreSQL is creating a generic plan for your
query, that is, a plan that will work with any parameter value that
you give to your function. When the generic plan is executed and the
locks are taken for the relations mentioned in the query, and since
the plan is generic, it includes all partitions that could match any
possible parameter value you could pass. When the locks are taken,
it's not yet known which partition will be needed as the partition
pruning that occurs only does so after the locks are taken.

There has been work done for PostgreSQL 18 which swaps the order of
these operations and makes it so that the executor only obtains the
locks on the partitions which will actually be scanned. Hopefully,
we'll see that feature released with PostgreSQL 18 toward the end of
2025.

As for the documentation, it might be worth adding a mention of this
at the end of the following paragraph in [1]:

"During initialization of the query plan. Partition pruning can be
performed here for parameter values which are known during the
initialization phase of execution. Partitions which are pruned during
this stage will not show up in the query's EXPLAIN or EXPLAIN ANALYZE.
It is possible to determine the number of partitions which were
removed during this phase by observing the “Subplans Removed” property
in the EXPLAIN output."

Perhaps something like. "It's important to note that any partitions
removed by the partition pruning done at this time are still locked at
the beginning of execution".

David

[1] https://www.postgresql.org/docs/17/ddl-partitioning.html



Re: Querying one partition in a function takes locks on all partitions

From
Evgeny Morozov
Date:
On 23/03/2025 2:35 pm, David Rowley wrote:
>> alter table entity_2 add column new_column text;
> Is this just an example command?  You can't add a column to a
> partition directly.

Yes, it was just the simplest way I could think of to take an exclusive
lock. But on this note: I guess it takes a lock before checking that the
table is partition because it might otherwise get attached/detached as a
partition between doing the check and taking a lock, but it would be
nice if it took a shared lock to do all checks, then upgraded to an
exclusive lock to actually do the work. Then the user would not have to
wait for an exclusive lock only for the command to fail.


> What's going on is that PostgreSQL is creating a generic plan for your
> query, that is, a plan that will work with any parameter value that
> you give to your function. When the generic plan is executed and the
> locks are taken for the relations mentioned in the query, and since
> the plan is generic, it includes all partitions that could match any
> possible parameter value you could pass. When the locks are taken,
> it's not yet known which partition will be needed as the partition
> pruning that occurs only does so after the locks are taken.

I see, thank you for the explanation. This seems like a bad plan,
though, because even at query preparation time it can be determined that
only one partition will need to be scanned, since the query filters on
the partition key, so it may need to read *any one* partition, but never
all partitions. So in this case, isn't it better to avoid caching a
generic plan at all? Even if the locking issue is fixed in PG 18, isn't
such a plan likely to be sub-optimal in other ways (for a more complex
query)? I don't know anything about the internals of the query planner,
but I have run into other performance issues with SQL functions querying
partitioned tables. One function we have *sometimes* uses very slow
sequential scans instead of index scans, and I've had to work around
that with "set enable_seqscan = off".

In this case, even "set plan_cache_mode = force_custom_plan" doesn't
help somehow. Isn't that supposed to... make PG use a custom a plan?

Here is another weird thing I forgot to menton in the original post: for
a set-returning function whether one partition table is locked or all of
them seems to depend on whether you "select func" or "select from func"!


create function read_partition_rows(which_part int) returns
table(part_id int) as
'select part_id from entity where part_id = which_part;'
language sql stable;

select * from read_partition_rows(1); -- This takes a lock only on entity_1
select read_partition_rows(1); -- but this takes locks on entity_1 and
entity_2!


> There has been work done for PostgreSQL 18 which swaps the order of
> these operations and makes it so that the executor only obtains the
> locks on the partitions which will actually be scanned. Hopefully,
> we'll see that feature released with PostgreSQL 18 toward the end of
> 2025.

Happy to hear that! I hope that makes it in, but in the meanwhile, yes,
it would be nice to at least document this gotcha and any workarounds
for it. The only one I've found is to use a pl/PgSQL function.




Re: Querying one partition in a function takes locks on all partitions

From
Renan Alves Fonseca
Date:
I've investigated further and found out that the code that processes
SQL functions is completely different from the code that processes SQL
statements. The latter is more efficient, and there is ongoing work to
merge both.

Currently, in the SQL function path the plan is always generic. The
planner ignores the function arguments. The plan_cache_mode setting
has no effect in this path.

I agree that the docs should be more explicit about this. There is a
high penalty for using generic plans in complex functions.

If you can use prepared statements
(https://www.postgresql.org/docs/current/sql-prepare.html), they run
on the optimized path and respect plan_cache_mode.

Regards,
Renan

On Fri, Mar 28, 2025 at 5:38 PM Evgeny Morozov
<postgresql4@realityexists.net> wrote:
>
> Thank you for doing the extra investigation! I realised only now you didn't send this to the mailing list, only to
me.As you say, force_custom_plan doesn't seem to help with a SQL function - just tested that. 
>
Thanks for noting. I've seen just now.

> Regards,
> Evgeny Morozov
>
> On 23/03/2025 12:08 am, Renan Alves Fonseca wrote:
>
> It seems that when we create a function using pure sql, the query planner uses a generic plan.
> We can mimic this behavior using prepared statements and plan_cache_mode:
>
> # prepare read1(int) as select count(*) from entity where part_id=$1;
> # set plan_cache_mode = force_generic_plan ;
> # explain (costs off) execute read1(1);
>               QUERY PLAN
> --------------------------------------
>  Aggregate
>    ->  Append
>          Subplans Removed: 1
>          ->  Seq Scan on entity_1
>                Filter: (part_id = $1)
> Note "Subplans Removed". This plan causes lock.
>
> # set plan_cache_mode = force_custom_plan ;
> # explain (costs off) execute read1(1);
>             QUERY PLAN
> -----------------------------------
>  Aggregate
>    ->  Seq Scan on entity_1 entity
>          Filter: (part_id = 1)
>
> No lock in this case.
>
> However, I didn't find a solution to force a custom plan in the stored procedure (written in pure sql). I don't know
ifit is not supported or if I'm missing some parameter. Anyway, it would be nice to have custom plans in sql stored
procedures.I've run into other troubles in the past due to the generic plan. 
>
> Regards,
> Renan Fonseca
>
> On Fri, Mar 21, 2025 at 5:27 PM Evgeny Morozov <postgresql4@realityexists.net> wrote:
>>
>> I have a list-partitioned table. When I query the base table but filter
>> by the partition column in a regular SQL query this takes a lock only on
>> the one partition being queried, as I expect. However, when the exact
>> same SQL query is run fom a DB function, with the partition ID passed in
>> as argument, it takes (shared) locks on ALL partitions - which blocks
>> any other process that wants an exclusive lock on another partition (and
>> vice-versa).
>>
>> Originally found on PG 15.12, but happens on 17.4 as well. Easily
>> reproducible:
>>
>> -- One-time setup
>>
>> create table entity
>> (
>>     part_id integer not null
>> ) partition by list (part_id);
>>
>> create table entity_1 partition of entity for values in (1);
>> create table entity_2 partition of entity for values in (2);
>>
>> create function read_partition(which_part int) returns bigint as
>> 'select count(*) from entity where part_id = which_part;'
>> language sql stable;
>>
>> -- Then try this, keeping the connection open (so the transaction is
>> pending):
>>
>> begin;
>> select read_partition(1); -- This takes shared locks on entity_1 AND
>> entity_2
>>
>> -- select count(*) from entity where part_id = 1; -- but this would only
>> take a shared lock only on entity_1
>>
>> If another session tries something that takes an exclusive lock on
>> another partition, like
>>
>> alter table entity_2 add column new_column text;
>>
>> I would expect that to be able to run concurrently, but it blocks due to
>> the shared lock on entity_2. (The way I originally found the problem was
>> the opposite: once one client took an exclusive lock on a partition many
>> others were blocked from reading from ANY partition.)
>>
>> This seems like quite the "gotcha", especially when the query plan for
>> the function call (logged via autoexplain) shows it only accessing one
>> partition (entity_1). Is this expected behavior? If so, is it documented
>> somewhere?
>>
>>
>>
>



On Sat, 29 Mar 2025 at 10:30, Renan Alves Fonseca
<renanfonseca@gmail.com> wrote:
> Currently, in the SQL function path the plan is always generic. The
> planner ignores the function arguments. The plan_cache_mode setting
> has no effect in this path.
>
> I agree that the docs should be more explicit about this. There is a
> high penalty for using generic plans in complex functions.

If you have any suggestions about where you think those should be
added or wording for that, please feel free to suggest.

For the part that's specific to partitioning, I've suggested some
additional wording in the relevant spot in [1].

David

[1] https://postgr.es/m/CAApHDvogvzANoTOCyXUWgEuPFx1nT6S63aAN0bDRSJ=TaGBWew@mail.gmail.com



David Rowley <dgrowleyml@gmail.com> writes:
> On Sat, 29 Mar 2025 at 10:30, Renan Alves Fonseca
> <renanfonseca@gmail.com> wrote:
>> Currently, in the SQL function path the plan is always generic. The
>> planner ignores the function arguments. The plan_cache_mode setting
>> has no effect in this path.
>> I agree that the docs should be more explicit about this. There is a
>> high penalty for using generic plans in complex functions.

> If you have any suggestions about where you think those should be
> added or wording for that, please feel free to suggest.

The issue's likely to be moot very shortly [1].  Maybe somebody
will be excited enough to consider changing the docs in the back
branches.  But since it was like this for a couple of decades,
I bet not.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/8216639.NyiUUSuA9g%40aivenlaptop



Re: Querying one partition in a function takes locks on all partitions

From
Renan Alves Fonseca
Date:
On Mon, Mar 31, 2025 at 5:10 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 29 Mar 2025 at 10:30, Renan Alves Fonseca
> <renanfonseca@gmail.com> wrote:
> > Currently, in the SQL function path the plan is always generic. The
> > planner ignores the function arguments. The plan_cache_mode setting
> > has no effect in this path.
> >
> > I agree that the docs should be more explicit about this. There is a
> > high penalty for using generic plans in complex functions.
>
> If you have any suggestions about where you think those should be
> added or wording for that, please feel free to suggest.
>

There is a specific chapter about functions written in SQL: [1]. It is
in an advanced section of the docs, so I think it is a suitable place
to address this level of detail.

There is a Note that says: "The entire body of an SQL function is
parsed before any of it is executed. While an SQL function can contain
commands that alter ..."
I would add another Note below like:
"Except when inlined, an SQL function is always executed with a
generic plan. This behavior may not be desired in some situations, and
it will be fixed in future versions."

I'm not sure if we should mention the fix or if we should mention a
workaround...

If I understood well [2], then both notes may be discarded together in
the next version.

Renan

[1] https://www.postgresql.org/docs/17/xfunc-sql.html
[2] https://www.postgresql.org/message-id/db42573039cc66815e80a48589eebea8%40postgrespro.ru



On Mon, 2025-03-31 at 18:41 +0200, Renan Alves Fonseca wrote:
> I would add another Note below like:
> "Except when inlined, an SQL function is always executed with a
> generic plan. This behavior may not be desired in some situations, and
> it will be fixed in future versions."

But that is not true, as far as I can tell: just like any other prepared
statement, it may keep using custom plans.

Yours,
Laurenz Albe



Re: Querying one partition in a function takes locks on all partitions

From
"David G. Johnston"
Date:
On Mon, Mar 31, 2025 at 9:42 AM Renan Alves Fonseca <renanfonseca@gmail.com> wrote:
I'm not sure if we should mention the fix or if we should mention a
workaround...

Workarounds are ok but my observation is that "this may change in the future" comments are pointless and should be stricken from the manual because people are bad at predicting the future.  Just document what is and don't string people along.  We never go back to old releases and say "This was fixed in v17."

David J.

Renan Alves Fonseca <renanfonseca@gmail.com> writes:
> There is a Note that says: "The entire body of an SQL function is
> parsed before any of it is executed. While an SQL function can contain
> commands that alter ..."

Yup.

> If I understood well [2], then both notes may be discarded together in
> the next version.

No, that aspect is not changed in the currently-proposed patch:
we still do parse analysis of all the queries at the beginning.
I'd like to make it work differently eventually, but with feature
freeze so close I do not think that'll happen for v18.

Thinking a bit more about that, if the initial parse analysis succeeds
then the patched code will in fact re-parse before execution if any
DDL changes have occurred.  So you get at least some of the desired
behavioral change.  But the example given in the docs would still
fail, so I don't feel a need to change the docs yet.

            regards, tom lane



Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Mon, 2025-03-31 at 18:41 +0200, Renan Alves Fonseca wrote:
>> I would add another Note below like:
>> "Except when inlined, an SQL function is always executed with a
>> generic plan. This behavior may not be desired in some situations, and
>> it will be fixed in future versions."

> But that is not true, as far as I can tell: just like any other prepared
> statement, it may keep using custom plans.

No, it is a generic plan in the current code.  See functions.c's
init_execution_state, which calls pg_plan_query without passing
any boundParams.  That's kind of necessary in the current scheme
of things, because the SQL function holds onto the plan for the
duration of the current query and re-uses the plan during subsequent
calls with (possibly) different parameter values.  So it's sort of
a poor man's plan cache, but implemented in about the stupidest
way possible.  Jacking that up and rolling the plancache.c
infrastructure underneath has been on my to-do list for years.

            regards, tom lane



On Sat, 29 Mar 2025 at 06:00, Evgeny Morozov
<postgresql4@realityexists.net> wrote:
>
> On 23/03/2025 2:35 pm, David Rowley wrote:
> >> alter table entity_2 add column new_column text;
> > Is this just an example command?  You can't add a column to a
> > partition directly.
>
> Yes, it was just the simplest way I could think of to take an exclusive
> lock. But on this note: I guess it takes a lock before checking that the
> table is partition because it might otherwise get attached/detached as a
> partition between doing the check and taking a lock, but it would be
> nice if it took a shared lock to do all checks, then upgraded to an
> exclusive lock to actually do the work. Then the user would not have to
> wait for an exclusive lock only for the command to fail.

Unfortunately, that's probably just swapping one problem for another.
Once you have 2 sessions following such a pattern of locking, you're
prone to unnecessary deadlocking.

For example:

-- session1
begin;
lock table t in access share mode; -- gets lock

-- session2
begin;
lock table t in access share mode; -- gets lock
lock table t in access exclusive mode; -- waits

-- session1
lock table t in access exclusive mode; -- deadlock

If you don't bother with the access share lock, there's no deadlock.

David