Thread: Querying one partition in a function takes locks on all partitions
I have a list-partitioned table. When I query the base table but filter by the partition column in a regular SQL query this takes a lock only on the one partition being queried, as I expect. However, when the exact same SQL query is run fom a DB function, with the partition ID passed in as argument, it takes (shared) locks on ALL partitions - which blocks any other process that wants an exclusive lock on another partition (and vice-versa). Originally found on PG 15.12, but happens on 17.4 as well. Easily reproducible: -- One-time setup create table entity ( part_id integer not null ) partition by list (part_id); create table entity_1 partition of entity for values in (1); create table entity_2 partition of entity for values in (2); create function read_partition(which_part int) returns bigint as 'select count(*) from entity where part_id = which_part;' language sql stable; -- Then try this, keeping the connection open (so the transaction is pending): begin; select read_partition(1); -- This takes shared locks on entity_1 AND entity_2 -- select count(*) from entity where part_id = 1; -- but this would only take a shared lock only on entity_1 If another session tries something that takes an exclusive lock on another partition, like alter table entity_2 add column new_column text; I would expect that to be able to run concurrently, but it blocks due to the shared lock on entity_2. (The way I originally found the problem was the opposite: once one client took an exclusive lock on a partition many others were blocked from reading from ANY partition.) This seems like quite the "gotcha", especially when the query plan for the function call (logged via autoexplain) shows it only accessing one partition (entity_1). Is this expected behavior? If so, is it documented somewhere?
On 3/21/25 09:27, Evgeny Morozov wrote: > I have a list-partitioned table. When I query the base table but filter > by the partition column in a regular SQL query this takes a lock only on > the one partition being queried, as I expect. However, when the exact > same SQL query is run fom a DB function, with the partition ID passed in > as argument, it takes (shared) locks on ALL partitions - which blocks > any other process that wants an exclusive lock on another partition (and > vice-versa). > > Originally found on PG 15.12, but happens on 17.4 as well. Easily > reproducible: > > -- One-time setup > > create table entity > ( > part_id integer not null > ) partition by list (part_id); > > create table entity_1 partition of entity for values in (1); > create table entity_2 partition of entity for values in (2); > > create function read_partition(which_part int) returns bigint as > 'select count(*) from entity where part_id = which_part;' > language sql stable; > > -- Then try this, keeping the connection open (so the transaction is > pending): > > begin; > select read_partition(1); -- This takes shared locks on entity_1 AND > entity_2 > > -- select count(*) from entity where part_id = 1; -- but this would only > take a shared lock only on entity_1 > > If another session tries something that takes an exclusive lock on > another partition, like > > alter table entity_2 add column new_column text; > > I would expect that to be able to run concurrently, but it blocks due to > the shared lock on entity_2. (The way I originally found the problem was > the opposite: once one client took an exclusive lock on a partition many > others were blocked from reading from ANY partition.) > > This seems like quite the "gotcha", especially when the query plan for > the function call (logged via autoexplain) shows it only accessing one > partition (entity_1). Is this expected behavior? If so, is it documented > somewhere? Hmm, seems to be a sql function issue: CREATE OR REPLACE FUNCTION public.read_partition(which_part integer) RETURNS bigint LANGUAGE plpgsql STABLE AS $$ DECLARE id_ct bigint; BEGIN select count(*) into id_ct from entity where part_id = $1; RETURN id_ct; END; $$; BEGIN; select read_partition(1); read_partition ---------------- 0 select relation::regclass, mode from pg_locks ; relation | mode ----------+----------------- pg_locks | AccessShareLock entity_1 | AccessShareLock entity | AccessShareLock | ExclusiveLock > > > -- Adrian Klaver adrian.klaver@aklaver.com
On Sat, 22 Mar 2025 at 05:27, Evgeny Morozov <postgresql4@realityexists.net> wrote: > select read_partition(1); -- This takes shared locks on entity_1 AND > entity_2 > > -- select count(*) from entity where part_id = 1; -- but this would only > take a shared lock only on entity_1 > > If another session tries something that takes an exclusive lock on > another partition, like > > alter table entity_2 add column new_column text; Is this just an example command? You can't add a column to a partition directly. > I would expect that to be able to run concurrently, but it blocks due to > the shared lock on entity_2. (The way I originally found the problem was > the opposite: once one client took an exclusive lock on a partition many > others were blocked from reading from ANY partition.) > > This seems like quite the "gotcha", especially when the query plan for > the function call (logged via autoexplain) shows it only accessing one > partition (entity_1). Is this expected behavior? If so, is it documented > somewhere? It is expected behaviour and, unfortunately, not really documented anywhere outside of the source code. What's going on is that PostgreSQL is creating a generic plan for your query, that is, a plan that will work with any parameter value that you give to your function. When the generic plan is executed and the locks are taken for the relations mentioned in the query, and since the plan is generic, it includes all partitions that could match any possible parameter value you could pass. When the locks are taken, it's not yet known which partition will be needed as the partition pruning that occurs only does so after the locks are taken. There has been work done for PostgreSQL 18 which swaps the order of these operations and makes it so that the executor only obtains the locks on the partitions which will actually be scanned. Hopefully, we'll see that feature released with PostgreSQL 18 toward the end of 2025. As for the documentation, it might be worth adding a mention of this at the end of the following paragraph in [1]: "During initialization of the query plan. Partition pruning can be performed here for parameter values which are known during the initialization phase of execution. Partitions which are pruned during this stage will not show up in the query's EXPLAIN or EXPLAIN ANALYZE. It is possible to determine the number of partitions which were removed during this phase by observing the “Subplans Removed” property in the EXPLAIN output." Perhaps something like. "It's important to note that any partitions removed by the partition pruning done at this time are still locked at the beginning of execution". David [1] https://www.postgresql.org/docs/17/ddl-partitioning.html
On 23/03/2025 2:35 pm, David Rowley wrote: >> alter table entity_2 add column new_column text; > Is this just an example command? You can't add a column to a > partition directly. Yes, it was just the simplest way I could think of to take an exclusive lock. But on this note: I guess it takes a lock before checking that the table is partition because it might otherwise get attached/detached as a partition between doing the check and taking a lock, but it would be nice if it took a shared lock to do all checks, then upgraded to an exclusive lock to actually do the work. Then the user would not have to wait for an exclusive lock only for the command to fail. > What's going on is that PostgreSQL is creating a generic plan for your > query, that is, a plan that will work with any parameter value that > you give to your function. When the generic plan is executed and the > locks are taken for the relations mentioned in the query, and since > the plan is generic, it includes all partitions that could match any > possible parameter value you could pass. When the locks are taken, > it's not yet known which partition will be needed as the partition > pruning that occurs only does so after the locks are taken. I see, thank you for the explanation. This seems like a bad plan, though, because even at query preparation time it can be determined that only one partition will need to be scanned, since the query filters on the partition key, so it may need to read *any one* partition, but never all partitions. So in this case, isn't it better to avoid caching a generic plan at all? Even if the locking issue is fixed in PG 18, isn't such a plan likely to be sub-optimal in other ways (for a more complex query)? I don't know anything about the internals of the query planner, but I have run into other performance issues with SQL functions querying partitioned tables. One function we have *sometimes* uses very slow sequential scans instead of index scans, and I've had to work around that with "set enable_seqscan = off". In this case, even "set plan_cache_mode = force_custom_plan" doesn't help somehow. Isn't that supposed to... make PG use a custom a plan? Here is another weird thing I forgot to menton in the original post: for a set-returning function whether one partition table is locked or all of them seems to depend on whether you "select func" or "select from func"! create function read_partition_rows(which_part int) returns table(part_id int) as 'select part_id from entity where part_id = which_part;' language sql stable; select * from read_partition_rows(1); -- This takes a lock only on entity_1 select read_partition_rows(1); -- but this takes locks on entity_1 and entity_2! > There has been work done for PostgreSQL 18 which swaps the order of > these operations and makes it so that the executor only obtains the > locks on the partitions which will actually be scanned. Hopefully, > we'll see that feature released with PostgreSQL 18 toward the end of > 2025. Happy to hear that! I hope that makes it in, but in the meanwhile, yes, it would be nice to at least document this gotcha and any workarounds for it. The only one I've found is to use a pl/PgSQL function.
Re: Querying one partition in a function takes locks on all partitions
From
Renan Alves Fonseca
Date:
I've investigated further and found out that the code that processes SQL functions is completely different from the code that processes SQL statements. The latter is more efficient, and there is ongoing work to merge both. Currently, in the SQL function path the plan is always generic. The planner ignores the function arguments. The plan_cache_mode setting has no effect in this path. I agree that the docs should be more explicit about this. There is a high penalty for using generic plans in complex functions. If you can use prepared statements (https://www.postgresql.org/docs/current/sql-prepare.html), they run on the optimized path and respect plan_cache_mode. Regards, Renan On Fri, Mar 28, 2025 at 5:38 PM Evgeny Morozov <postgresql4@realityexists.net> wrote: > > Thank you for doing the extra investigation! I realised only now you didn't send this to the mailing list, only to me.As you say, force_custom_plan doesn't seem to help with a SQL function - just tested that. > Thanks for noting. I've seen just now. > Regards, > Evgeny Morozov > > On 23/03/2025 12:08 am, Renan Alves Fonseca wrote: > > It seems that when we create a function using pure sql, the query planner uses a generic plan. > We can mimic this behavior using prepared statements and plan_cache_mode: > > # prepare read1(int) as select count(*) from entity where part_id=$1; > # set plan_cache_mode = force_generic_plan ; > # explain (costs off) execute read1(1); > QUERY PLAN > -------------------------------------- > Aggregate > -> Append > Subplans Removed: 1 > -> Seq Scan on entity_1 > Filter: (part_id = $1) > Note "Subplans Removed". This plan causes lock. > > # set plan_cache_mode = force_custom_plan ; > # explain (costs off) execute read1(1); > QUERY PLAN > ----------------------------------- > Aggregate > -> Seq Scan on entity_1 entity > Filter: (part_id = 1) > > No lock in this case. > > However, I didn't find a solution to force a custom plan in the stored procedure (written in pure sql). I don't know ifit is not supported or if I'm missing some parameter. Anyway, it would be nice to have custom plans in sql stored procedures.I've run into other troubles in the past due to the generic plan. > > Regards, > Renan Fonseca > > On Fri, Mar 21, 2025 at 5:27 PM Evgeny Morozov <postgresql4@realityexists.net> wrote: >> >> I have a list-partitioned table. When I query the base table but filter >> by the partition column in a regular SQL query this takes a lock only on >> the one partition being queried, as I expect. However, when the exact >> same SQL query is run fom a DB function, with the partition ID passed in >> as argument, it takes (shared) locks on ALL partitions - which blocks >> any other process that wants an exclusive lock on another partition (and >> vice-versa). >> >> Originally found on PG 15.12, but happens on 17.4 as well. Easily >> reproducible: >> >> -- One-time setup >> >> create table entity >> ( >> part_id integer not null >> ) partition by list (part_id); >> >> create table entity_1 partition of entity for values in (1); >> create table entity_2 partition of entity for values in (2); >> >> create function read_partition(which_part int) returns bigint as >> 'select count(*) from entity where part_id = which_part;' >> language sql stable; >> >> -- Then try this, keeping the connection open (so the transaction is >> pending): >> >> begin; >> select read_partition(1); -- This takes shared locks on entity_1 AND >> entity_2 >> >> -- select count(*) from entity where part_id = 1; -- but this would only >> take a shared lock only on entity_1 >> >> If another session tries something that takes an exclusive lock on >> another partition, like >> >> alter table entity_2 add column new_column text; >> >> I would expect that to be able to run concurrently, but it blocks due to >> the shared lock on entity_2. (The way I originally found the problem was >> the opposite: once one client took an exclusive lock on a partition many >> others were blocked from reading from ANY partition.) >> >> This seems like quite the "gotcha", especially when the query plan for >> the function call (logged via autoexplain) shows it only accessing one >> partition (entity_1). Is this expected behavior? If so, is it documented >> somewhere? >> >> >> >
On Sat, 29 Mar 2025 at 10:30, Renan Alves Fonseca <renanfonseca@gmail.com> wrote: > Currently, in the SQL function path the plan is always generic. The > planner ignores the function arguments. The plan_cache_mode setting > has no effect in this path. > > I agree that the docs should be more explicit about this. There is a > high penalty for using generic plans in complex functions. If you have any suggestions about where you think those should be added or wording for that, please feel free to suggest. For the part that's specific to partitioning, I've suggested some additional wording in the relevant spot in [1]. David [1] https://postgr.es/m/CAApHDvogvzANoTOCyXUWgEuPFx1nT6S63aAN0bDRSJ=TaGBWew@mail.gmail.com
David Rowley <dgrowleyml@gmail.com> writes: > On Sat, 29 Mar 2025 at 10:30, Renan Alves Fonseca > <renanfonseca@gmail.com> wrote: >> Currently, in the SQL function path the plan is always generic. The >> planner ignores the function arguments. The plan_cache_mode setting >> has no effect in this path. >> I agree that the docs should be more explicit about this. There is a >> high penalty for using generic plans in complex functions. > If you have any suggestions about where you think those should be > added or wording for that, please feel free to suggest. The issue's likely to be moot very shortly [1]. Maybe somebody will be excited enough to consider changing the docs in the back branches. But since it was like this for a couple of decades, I bet not. regards, tom lane [1] https://www.postgresql.org/message-id/flat/8216639.NyiUUSuA9g%40aivenlaptop
Re: Querying one partition in a function takes locks on all partitions
From
Renan Alves Fonseca
Date:
On Mon, Mar 31, 2025 at 5:10 AM David Rowley <dgrowleyml@gmail.com> wrote: > > On Sat, 29 Mar 2025 at 10:30, Renan Alves Fonseca > <renanfonseca@gmail.com> wrote: > > Currently, in the SQL function path the plan is always generic. The > > planner ignores the function arguments. The plan_cache_mode setting > > has no effect in this path. > > > > I agree that the docs should be more explicit about this. There is a > > high penalty for using generic plans in complex functions. > > If you have any suggestions about where you think those should be > added or wording for that, please feel free to suggest. > There is a specific chapter about functions written in SQL: [1]. It is in an advanced section of the docs, so I think it is a suitable place to address this level of detail. There is a Note that says: "The entire body of an SQL function is parsed before any of it is executed. While an SQL function can contain commands that alter ..." I would add another Note below like: "Except when inlined, an SQL function is always executed with a generic plan. This behavior may not be desired in some situations, and it will be fixed in future versions." I'm not sure if we should mention the fix or if we should mention a workaround... If I understood well [2], then both notes may be discarded together in the next version. Renan [1] https://www.postgresql.org/docs/17/xfunc-sql.html [2] https://www.postgresql.org/message-id/db42573039cc66815e80a48589eebea8%40postgrespro.ru
On Mon, 2025-03-31 at 18:41 +0200, Renan Alves Fonseca wrote: > I would add another Note below like: > "Except when inlined, an SQL function is always executed with a > generic plan. This behavior may not be desired in some situations, and > it will be fixed in future versions." But that is not true, as far as I can tell: just like any other prepared statement, it may keep using custom plans. Yours, Laurenz Albe
Re: Querying one partition in a function takes locks on all partitions
From
"David G. Johnston"
Date:
On Mon, Mar 31, 2025 at 9:42 AM Renan Alves Fonseca <renanfonseca@gmail.com> wrote:
I'm not sure if we should mention the fix or if we should mention a
workaround...
Workarounds are ok but my observation is that "this may change in the future" comments are pointless and should be stricken from the manual because people are bad at predicting the future. Just document what is and don't string people along. We never go back to old releases and say "This was fixed in v17."
David J.
Renan Alves Fonseca <renanfonseca@gmail.com> writes: > There is a Note that says: "The entire body of an SQL function is > parsed before any of it is executed. While an SQL function can contain > commands that alter ..." Yup. > If I understood well [2], then both notes may be discarded together in > the next version. No, that aspect is not changed in the currently-proposed patch: we still do parse analysis of all the queries at the beginning. I'd like to make it work differently eventually, but with feature freeze so close I do not think that'll happen for v18. Thinking a bit more about that, if the initial parse analysis succeeds then the patched code will in fact re-parse before execution if any DDL changes have occurred. So you get at least some of the desired behavioral change. But the example given in the docs would still fail, so I don't feel a need to change the docs yet. regards, tom lane
Laurenz Albe <laurenz.albe@cybertec.at> writes: > On Mon, 2025-03-31 at 18:41 +0200, Renan Alves Fonseca wrote: >> I would add another Note below like: >> "Except when inlined, an SQL function is always executed with a >> generic plan. This behavior may not be desired in some situations, and >> it will be fixed in future versions." > But that is not true, as far as I can tell: just like any other prepared > statement, it may keep using custom plans. No, it is a generic plan in the current code. See functions.c's init_execution_state, which calls pg_plan_query without passing any boundParams. That's kind of necessary in the current scheme of things, because the SQL function holds onto the plan for the duration of the current query and re-uses the plan during subsequent calls with (possibly) different parameter values. So it's sort of a poor man's plan cache, but implemented in about the stupidest way possible. Jacking that up and rolling the plancache.c infrastructure underneath has been on my to-do list for years. regards, tom lane
On Sat, 29 Mar 2025 at 06:00, Evgeny Morozov <postgresql4@realityexists.net> wrote: > > On 23/03/2025 2:35 pm, David Rowley wrote: > >> alter table entity_2 add column new_column text; > > Is this just an example command? You can't add a column to a > > partition directly. > > Yes, it was just the simplest way I could think of to take an exclusive > lock. But on this note: I guess it takes a lock before checking that the > table is partition because it might otherwise get attached/detached as a > partition between doing the check and taking a lock, but it would be > nice if it took a shared lock to do all checks, then upgraded to an > exclusive lock to actually do the work. Then the user would not have to > wait for an exclusive lock only for the command to fail. Unfortunately, that's probably just swapping one problem for another. Once you have 2 sessions following such a pattern of locking, you're prone to unnecessary deadlocking. For example: -- session1 begin; lock table t in access share mode; -- gets lock -- session2 begin; lock table t in access share mode; -- gets lock lock table t in access exclusive mode; -- waits -- session1 lock table t in access exclusive mode; -- deadlock If you don't bother with the access share lock, there's no deadlock. David