Thread: Postgres 16.1 - Bug: cache entry already complete
Hi,
Upgrading from Postgres 14 to Postgres 16.1 introduced the following error when running a query:
nativeError: error: select "date", coalesce(sum(hours), 0) as "hours" from "allocation_date" where "allocation_date"."project_id" = $1 and "allocation_date"."date" > $2 and "allocation_date"."hours" > $3 group by "date" order by "date" asc - cache entry already complete
Note that `allocation_date` is a fairly complex view. I'm not quite sure what causes the "cache entry already complete" error, but it consistently failed on both the macOS and Windows installations.
Maybe it's a regression of:
https://www.postgresql.org/message-id/CAApHDvrkXT%2Bzhrusz7xdRBS3jZR%3DkB32AgBc4cnFCtxTaxQHvA%40mail.gmail.com
There's also a comment here:
Which points to:
Anything I can do to help you reproduce this?
Amadeo Gallardo <amadeo@ruddr.io> writes:
> Upgrading from Postgres 14 to Postgres 16.1 introduced the following error
> when running a query:
> nativeError: error: select "date", coalesce(sum(hours), 0) as "hours" from
> "allocation_date" where "allocation_date"."project_id" = $1 and
> "allocation_date"."date" > $2 and "allocation_date"."hours" > $3 group by
> "date" order by "date" asc - cache entry already complete
> Note that `allocation_date` is a fairly complex view. Not quite sure what
> causes the "cache entry already complete", but it consistently failed in
> both the MacOS and Windows installations.

The only occurrence of that error string is in nodeMemoize.c, so
I suppose you've found a bug in Memoize.  A bit odd that it doesn't
manifest in v14 where Memoize came in.  But anyway, can you reduce
this to a self-contained test case?

			regards, tom lane
Hi Tom,
Thanks for reaching out. So far, I was able to narrow it down to a problem when doing a union between two tables, one of them having a varchar column and the other one having a text column. Adding an explicit cast seems to have fixed it, but not really sure why yet. I will try to create a simple repro scenario.
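For anyone following along, the shape I mean is roughly this (table, column, and view names here are made up for illustration; this is not the real schema):

```sql
-- Hypothetical sketch, not the actual schema: one branch of the
-- union yields varchar, the other text, so the set operation has
-- to resolve a common type for the output column.
CREATE TABLE a (id uuid PRIMARY KEY, label varchar(255));
CREATE TABLE b (id uuid PRIMARY KEY, label text);

CREATE VIEW ab AS
SELECT id, label FROM a        -- varchar column
UNION ALL
SELECT id, label FROM b;       -- text column

-- The workaround was an explicit cast on the varchar side:
-- SELECT id, label::text FROM a UNION ALL SELECT id, label FROM b;
```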
Thanks,
On Tue, Jan 2, 2024 at 2:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amadeo Gallardo <amadeo@ruddr.io> writes:
> Upgrading from Postgres 14 to Postgres 16.1 introduced the following error
> when running a query:
> nativeError: error: select "date", coalesce(sum(hours), 0) as "hours" from
> "allocation_date" where "allocation_date"."project_id" = $1 and
> "allocation_date"."date" > $2 and "allocation_date"."hours" > $3 group by
> "date" order by "date" asc - cache entry already complete
> Note that `allocation_date` is a fairly complex view. Not quite sure what
> causes the "cache entry already complete", but it consistently failed in
> both the MacOS and Windows installations.
The only occurrence of that error string is in nodeMemoize.c, so
I suppose you've found a bug in Memoize. A bit odd that it doesn't
manifest in v14 where Memoize came in. But anyway, can you reduce
this to a self-contained test case?
regards, tom lane
Hi,
Attached is a sample script that I used to reproduce the error. I had to add some specific sample data and force-disable certain planner settings, as otherwise the problem did not occur.
Additionally, here's a trace of the error:
org.postgresql.util.PSQLException: ERROR: cache entry already complete
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:356)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:496)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:413)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:333)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:319)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:295)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:290)
Hopefully this helps to narrow down the issue.
Thanks
On Tue, Jan 2, 2024 at 4:51 PM Amadeo Gallardo <amadeo@ruddr.io> wrote:
Hi Tom,
Thanks for reaching out. So far, I was able to narrow it down to a problem when doing a union between two tables, one of them having a varchar column and the other one having a text column. Adding an explicit cast seems to have fixed it, but not really sure why yet. I will try to create a simple repro scenario.
Thanks,
On Tue, Jan 2, 2024 at 2:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amadeo Gallardo <amadeo@ruddr.io> writes:
> Upgrading from Postgres 14 to Postgres 16.1 introduced the following error
> when running a query:
> nativeError: error: select "date", coalesce(sum(hours), 0) as "hours" from
> "allocation_date" where "allocation_date"."project_id" = $1 and
> "allocation_date"."date" > $2 and "allocation_date"."hours" > $3 group by
> "date" order by "date" asc - cache entry already complete
> Note that `allocation_date` is a fairly complex view. Not quite sure what
> causes the "cache entry already complete", but it consistently failed in
> both the MacOS and Windows installations.
The only occurrence of that error string is in nodeMemoize.c, so
I suppose you've found a bug in Memoize. A bit odd that it doesn't
manifest in v14 where Memoize came in. But anyway, can you reduce
this to a self-contained test case?
regards, tom lane
--
Attachment
On Wed, 3 Jan 2024 at 12:49, Amadeo Gallardo <amadeo@ruddr.io> wrote:
> Attached is a sample script that I used to reproduce the error. I had to add some specific sample data and force-disable certain planner settings, as otherwise the problem did not occur.
> Additionally, here's a trace of the error:

Thanks for the reproducer.  I can recreate this using your script
after adding an ANALYZE after the INSERTs.  Without ndistinct
statistics, the planner won't risk using Memoize.

I'll investigate the cache complete error.

David
Amadeo Gallardo <amadeo@ruddr.io> writes:
> Attached is a sample script that I used to reproduce the error. I had to
> add some specific sample data and force-disable certain planner settings,
> as otherwise the problem did not occur.

Thanks!  EXPLAIN shows

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.46..82.87 rows=315 width=1)
   Join Filter: (t3.t2_id = t2.id)
   ->  Nested Loop Left Join  (cost=0.30..32.29 rows=315 width=32)
         ->  Nested Loop Left Join  (cost=0.15..17.29 rows=315 width=32)
               ->  Seq Scan on t4  (cost=0.00..6.15 rows=315 width=32)
               ->  Memoize  (cost=0.15..0.19 rows=1 width=16)
                     Cache Key: t4.t2_id
                     Cache Mode: logical
                     ->  Index Only Scan using t2_pkey on t2  (cost=0.14..0.18 rows=1 width=16)
                           Index Cond: (id = t4.t2_id)
         ->  Memoize  (cost=0.15..0.20 rows=1 width=16)
               Cache Key: t4.t1_id
               Cache Mode: logical
               ->  Index Only Scan using t1_pkey on t1  (cost=0.14..0.19 rows=1 width=16)
                     Index Cond: (id = t4.t1_id)
   ->  Memoize  (cost=0.16..0.55 rows=1 width=33)
         Cache Key: t1.id, t1.id
         Cache Mode: logical
         ->  Index Scan using t3_t1_id_index on t3  (cost=0.15..0.54 rows=1 width=33)
               Index Cond: (t1_id = t1.id)
(20 rows)

It's that last Memoize node with the duplicate cache keys that is
failing.  (It's unclear to me whether having duplicate cache keys
is supposed to be a valid situation, and even less clear whether
that has any direct connection to the failure.)  What I guess
is happening is that the single-row flag is getting set because
t1.id is a primary key, despite the fact that it's coming from
inside a join that could result in duplicates.

			regards, tom lane
On Wed, 3 Jan 2024 at 13:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>                                            QUERY PLAN
> ------------------------------------------------------------------------------------------------
>  Nested Loop Left Join  (cost=0.46..82.87 rows=315 width=1)
>    Join Filter: (t3.t2_id = t2.id)
>    ->  Nested Loop Left Join  (cost=0.30..32.29 rows=315 width=32)
>          ->  Nested Loop Left Join  (cost=0.15..17.29 rows=315 width=32)
>                ->  Seq Scan on t4  (cost=0.00..6.15 rows=315 width=32)
>                ->  Memoize  (cost=0.15..0.19 rows=1 width=16)
>                      Cache Key: t4.t2_id
>                      Cache Mode: logical
>                      ->  Index Only Scan using t2_pkey on t2  (cost=0.14..0.18 rows=1 width=16)
>                            Index Cond: (id = t4.t2_id)
>          ->  Memoize  (cost=0.15..0.20 rows=1 width=16)
>                Cache Key: t4.t1_id
>                Cache Mode: logical
>                ->  Index Only Scan using t1_pkey on t1  (cost=0.14..0.19 rows=1 width=16)
>                      Index Cond: (id = t4.t1_id)
>    ->  Memoize  (cost=0.16..0.55 rows=1 width=33)
>          Cache Key: t1.id, t1.id
>          Cache Mode: logical
>          ->  Index Scan using t3_t1_id_index on t3  (cost=0.15..0.54 rows=1 width=33)
>                Index Cond: (t1_id = t1.id)
> (20 rows)
>
> It's that last Memoize node with the duplicate cache keys that is
> failing.  (It's unclear to me whether having duplicate cache keys
> is supposed to be a valid situation, and even less clear whether
> that has any direct connection to the failure.)  What I guess
> is happening is that the single-row flag is getting set because
> t1.id is a primary key, despite the fact that it's coming from
> inside a join that could result in duplicates.

The problem is that singlerow is set entirely based on innerunique.
That's wrong because in this case, the unique properties include the
Join Filter, not just the parameterized qual.

I think the fix should be in get_memoize_path().  What we pass to
create_memoize_path() for the "singlerow" parameter needs to be more
than just extra->inner_unique.  The Join Filter must also be empty.

David
On Wed, 3 Jan 2024 at 13:40, David Rowley <dgrowleyml@gmail.com> wrote:
> The problem is that singlerow is set entirely based on innerunique.
> That's wrong because in this case, the unique properties include the
> Join Filter, not just the parameterized qual.
>
> I think the fix should be in get_memoize_path().  What we pass to
> create_memoize_path() for the "singlerow" parameter needs to be more
> than just extra->inner_unique.  The Join Filter must also be empty.

It looks like we only figure out what will become the "Join Filter"
in create_nestloop_path(), which is slightly too late as we create
the MemoizePath before creating the NestLoopPath.

In the attached, I drafted up some code that duplicates the logic in
create_nestloop_path() to determine if there's going to be a
joinrestrictinfo (Join Filter) and to set MemoizePath.singlerow to
false if there is going to be one.

My concern with that is that the Unique Join optimisation will cause
execution to skip to the next outer row and that will leave us no
means of marking the Memoize cache entry as complete.  In
singlerow==false Memoize nodes, we only mark the cache as complete
when we read the inner node to completion.  The unique join
optimisation means we often don't do that due to skipping to the next
outer row on finding the first inner match.

Basically, what I'm saying is that Memoize is going to result in many
more cache misses due to incomplete cache entries.  Maybe we should
have get_memoize_path() return NULL for this case so that we don't
Memoize when there's a Join Filter and extra->inner_unique is set to
true.
With the attached, I see:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop Left Join (actual rows=315 loops=1)
   Join Filter: (t3.t2_id = t2.id)
   Rows Removed by Join Filter: 1007
   ->  Nested Loop Left Join (actual rows=315 loops=1)
         ->  Nested Loop Left Join (actual rows=315 loops=1)
               ->  Seq Scan on t4 (actual rows=315 loops=1)
               ->  Memoize (actual rows=1 loops=315)
                     Cache Key: t4.t2_id
                     Cache Mode: logical
                     SingleRow: true
                     Hits: 296  Misses: 19  Evictions: 0  Overflows: 0  Memory Usage: 3kB
                     ->  Index Only Scan using t2_pkey on t2 (actual rows=1 loops=19)
                           Index Cond: (id = t4.t2_id)
                           Heap Fetches: 18
         ->  Memoize (actual rows=1 loops=315)
               Cache Key: t4.t1_id
               Cache Mode: logical
               SingleRow: true
               Hits: 276  Misses: 39  Evictions: 0  Overflows: 0  Memory Usage: 5kB
               ->  Index Only Scan using t1_pkey on t1 (actual rows=1 loops=39)
                     Index Cond: (id = t4.t1_id)
                     Heap Fetches: 38
   ->  Memoize (actual rows=4 loops=315)
         Cache Key: t1.id, t1.id
         Cache Mode: logical
         SingleRow: false
         Hits: 199  Misses: 116  Evictions: 0  Overflows: 0  Memory Usage: 15kB
         ->  Index Scan using t3_t1_id_index on t3 (actual rows=3 loops=116)
               Index Cond: (t1_id = t1.id)
 Planning Time: 0.322 ms
 Execution Time: 3654.894 ms
(31 rows)

(I just added SingleRow to explain to make it easier to see what's
going on)

Notice the Cache Misses is 116 for the problem Memoize node.  There
are only 77 distinct values for t1_id in t3.  So we're certainly
classing cache entries as invalid due to them being complete==false.

David
Attachment
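To make the completeness rules above concrete, here is a toy Python model of them. This is not the actual nodeMemoize.c logic, just a sketch of the invariant: with singlerow set, an entry is marked complete as soon as its first row is stored, so a second matching row for the same cache key is exactly the "cache entry already complete" state.

```python
# Toy model of Memoize cache-entry completeness (not the real
# nodeMemoize.c code).  singlerow=True means the planner promised
# the inner side returns at most one row per cache key.

class MemoizeCache:
    def __init__(self, singlerow):
        self.singlerow = singlerow
        self.entries = {}   # cache key -> {"rows": [...], "complete": bool}

    def add_row(self, key, row):
        entry = self.entries.setdefault(key, {"rows": [], "complete": False})
        if entry["complete"]:
            # Storing into an already-complete entry is the bug's symptom.
            raise RuntimeError("cache entry already complete")
        entry["rows"].append(row)
        if self.singlerow:
            # A supposedly-unique inner side can never produce a second
            # row, so one stored row means the entry is done.
            entry["complete"] = True

    def finish_scan(self, key):
        # When singlerow is False, only reading the inner node to
        # exhaustion proves the entry complete; skipping to the next
        # outer row (the unique-join optimisation) never gets here.
        self.entries[key]["complete"] = True

# With singlerow wrongly set on a non-unique inner side, the second
# matching row for the same key reproduces the error:
cache = MemoizeCache(singlerow=True)
cache.add_row("t1.id=1", "row A")
try:
    cache.add_row("t1.id=1", "row B")   # duplicate match
except RuntimeError as e:
    print(e)   # prints: cache entry already complete
```

The last paragraph of the message above falls out of this model too: with singlerow=False, entries whose scans are cut short never get finish_scan(), so they stay incomplete and count as misses on the next lookup.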
On Wed, Jan 3, 2024 at 7:59 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 3 Jan 2024 at 12:49, Amadeo Gallardo <amadeo@ruddr.io> wrote:
> Attached is a sample script that I used to reproduce the error. I had to add some specific sample data and force-disable certain planner settings, as otherwise the problem did not occur.
> Additionally, here's a trace of the error:
Thanks for the reproducer. I can recreate this using your script
after adding an ANALYZE after the INSERTs. Without ndistinct
statistics, the planner won't risk using Memoize.
It seems that the repro script was not sent to the mailing list. I
cannot see it.
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes:
> It seems that the repro script was not sent to the mailing list.  I
> cannot see it.

Awaiting moderator approval, I suppose.

			regards, tom lane
On Wed, Jan 3, 2024 at 9:48 AM David Rowley <dgrowleyml@gmail.com> wrote:
My concern with that is that the Unique Join optimisation will cause
execution to skip to the next outer row and that will leave us no
means of marking the Memoize cache entry as complete. In
singlerow==false Memoize nodes, we only mark the cache as complete
when we read the inner node to completion. The unique join
optimisation means we often don't do that due to skipping to the next
outer row on finding the first inner match.
Basically, what I'm saying is that Memoize is going to result in many
more cache misses due to incomplete cache entries. Maybe we should
have get_memoize_path() return NULL for this case so that we don't
Memoize when there's a Join Filter and extra->inner_unique is set to
true.
I agree that we should just avoid making a memoize path in this case.
And I think we already have this logic in get_memoize_path():
    if (extra->inner_unique &&
        (inner_path->param_info == NULL ||
         list_length(inner_path->param_info->ppi_clauses) <
         list_length(extra->restrictlist)))
        return NULL;
But why does the code above fail to capture the problematic query? It
turns out that inner_path->param_info->ppi_clauses contains cloned
clauses, i.e. multiple versions of the same clause in order to cope with
outer join identity 3. And this causes the above code to give incorrect
conclusions about the length of 'ppi_clauses'. And this also explains
why this issue only occurs in v16.
As a clause's serial number is unique within the current PlannerInfo
context, and multiple clones of the same clause have the same serial
number, it seems to me that it's more correct to calculate the length of
ppi_clauses by:
bms_num_members(inner_path->param_info->ppi_serials)
So I think maybe we can fix this issue with the attached patch.
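To illustrate the miscount, here is a toy sketch in Python (not the planner's C code; the clause text, serial number, and restrict list are made up). Clones of one clause share a serial number, so counting distinct serials, as bms_num_members(ppi_serials) does, counts logically-distinct clauses rather than list entries:

```python
# Toy illustration of why list_length(ppi_clauses) over-counts when
# clone clauses are present.

ppi_clauses = [
    {"clause": "t3.t1_id = t1.id", "serial": 7},   # original clause
    {"clause": "t3.t1_id = t1.id", "serial": 7},   # clone for identity 3
]
restrictlist = ["t3.t1_id = t1.id", "t3.t2_id = t2.id"]

# list_length()-style comparison: 2 < 2 is false, so the guard fails
# to return NULL and a singlerow Memoize gets built anyway.
assert not (len(ppi_clauses) < len(restrictlist))

# Counting distinct serials instead: 1 < 2 is true, so
# get_memoize_path() correctly refuses to Memoize here.
distinct_serials = len({c["serial"] for c in ppi_clauses})
assert distinct_serials < len(restrictlist)
```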
BTW, the SingleRow in explain seems quite handy. I wonder if we can
keep it.
Thanks
Richard
Attachment
On Thu, 4 Jan 2024 at 00:01, Richard Guo <guofenglinux@gmail.com> wrote:
> if (extra->inner_unique &&
>     (inner_path->param_info == NULL ||
>      list_length(inner_path->param_info->ppi_clauses) <
>      list_length(extra->restrictlist)))
>     return NULL;

I see.  So it worked fine until Tom's "Make Vars be outer-join-aware"
stuff.  9e215378d fixed this already and looks like I came to the same
conclusion about the cache completion issue back then too.

> But why does the code above fail to capture the problematic query? It
> turns out that inner_path->param_info->ppi_clauses contains cloned
> clauses, i.e. multiple versions of the same clause in order to cope with
> outer join identity 3. And this causes the above code to give incorrect
> conclusions about the length of 'ppi_clauses'. And this also explains
> why this issue only occurs in v16.
>
> As a clause's serial number is unique within the current PlannerInfo
> context, and multiple clones of the same clause have the same serial
> number, it seems to me that it's more correct to calculate the length of
> ppi_clauses by:
>
> bms_num_members(inner_path->param_info->ppi_serials)
>
> So I think maybe we can fix this issue with the attached patch.

hmm, I'd need to study what Tom's been changing around this.  I don't
have a good handle on which qual lists can have duplicated quals.

In [1], you can see I took the inspiration for that fix from
generate_mergejoin_paths().  I don't yet know if it's just ppi_clauses
that can have duplicates or if there's a similar hazard in the
list_length() checks in generate_mergejoin_paths()

> BTW, the SingleRow in explain seems quite handy. I wonder if we can
> keep it.

If the fix is to just disable Memoize when the ppi_clauses don't cover
the entire join condition, then the Inner Unique from the parent
nested loop tells us this.  Isn't that enough?  The reason I added
SingleRow was because in the patch I posted, I'd made it so SingleRow
could be false in cases where Inner Unique was true.
David

[1] https://postgr.es/m/CAApHDvrWsfc3naVQZxS0efU%3DvJOA7dG3NV7fGhkgo2%3DJ38OEpg%40mail.gmail.com
On Thu, Jan 4, 2024 at 8:04 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 4 Jan 2024 at 00:01, Richard Guo <guofenglinux@gmail.com> wrote:
> if (extra->inner_unique &&
> (inner_path->param_info == NULL ||
> list_length(inner_path->param_info->ppi_clauses) <
> list_length(extra->restrictlist)))
> return NULL;
I see.  So it worked fine until Tom's "Make Vars be outer-join-aware" stuff.
9e215378d fixed this already and looks like I came to the same
conclusion about the cache completion issue back then too.
Exactly.
> But why does the code above fail to capture the problematic query? It
> turns out that inner_path->param_info->ppi_clauses contains cloned
> clauses, i.e. multiple versions of the same clause in order to cope with
> outer join identity 3. And this causes the above code to give incorrect
> conclusions about the length of 'ppi_clauses'. And this also explains
> why this issue only occurs in v16.
>
> As a clause's serial number is unique within the current PlannerInfo
> context, and multiple clones of the same clause have the same serial
> number, it seems to me that it's more correct to calculate the length of
> ppi_clauses by:
>
> bms_num_members(inner_path->param_info->ppi_serials)
>
> So I think maybe we can fix this issue with the attached patch.
hmm, I'd need to study what Tom's been changing around this. I don't
have a good handle on which qual lists can have duplicated quals.
In [1], you can see I took the inspiration for that fix from
generate_mergejoin_paths(). I don't yet know if it's just ppi_clauses
that can have duplicates or if there's a similar hazard in the
list_length() checks in generate_mergejoin_paths()
AFAIU we may have clone clauses in RelOptInfo.joininfo to cope with
commuted-left-join cases per identity 3. When we build ParamPathInfo
for parameterized paths we may move join clauses to ppi_clauses when
possible, so ParamPathInfo.ppi_clauses may also have clone clauses.
When we construct restrict list for a join relation from the joininfo
lists of the relations it joins, only one of the clone clauses would be
chosen to be put in the restrict list if they are restriction clauses,
thanks to the check against required_relids and incompatible_relids.
    if (rinfo->has_clone || rinfo->is_clone)
    {
        Assert(!RINFO_IS_PUSHED_DOWN(rinfo, joinrel->relids));
        if (!bms_is_subset(rinfo->required_relids, both_input_relids))
            continue;
        if (bms_overlap(rinfo->incompatible_relids, both_input_relids))
            continue;
    }
The mergeclause_list examined in generate_mergejoin_paths() is extracted
from joinrel's restrictlist, so it will not have multiple versions for
one clause. Therefore, I think we're good with the list_length() checks
in generate_mergejoin_paths().
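The clone-selection behaviour described above can be sketched as a toy Python model (not the real C code; relid numbers and clause fields are made up, with frozensets standing in for Bitmapsets). A clone survives only if its required_relids fit inside the join's input relids and its incompatible_relids don't overlap them, so exactly one version of the clause reaches the restrict list:

```python
# Toy model of how joinrel restrict-list construction picks exactly
# one clone of a clause, mirroring the bms_is_subset/bms_overlap
# checks quoted above.

def choose_clones(rinfos, both_input_relids):
    chosen = []
    for rinfo in rinfos:
        if rinfo["has_clone"] or rinfo["is_clone"]:
            # Clone must be evaluable at this join level...
            if not rinfo["required_relids"] <= both_input_relids:
                continue
            # ...and must not assume an outer join this level includes.
            if rinfo["incompatible_relids"] & both_input_relids:
                continue
        chosen.append(rinfo)
    return chosen

# Two clones of one clause, differing in which outer join they assume
# has already been applied (relid numbers are hypothetical):
clones = [
    {"has_clone": True, "is_clone": False,
     "required_relids": {1, 2}, "incompatible_relids": {3}},
    {"has_clone": False, "is_clone": True,
     "required_relids": {1, 2, 3}, "incompatible_relids": set()},
]

# Joining {1, 2}: only the first clone qualifies; joining {1, 2, 3}:
# only the second.  Either way the restrict list gets one copy.
assert len(choose_clones(clones, {1, 2})) == 1
assert len(choose_clones(clones, {1, 2, 3})) == 1
```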
> BTW, the SingleRow in explain seems quite handy. I wonder if we can
> keep it.
If the fix is to just disable Memoize when the ppi_clauses don't cover
the entire join condition, then the Inner Unique from the parent
nested loop tells us this. Isn't that enough? The reason I added
SingleRow was because in the patch I posted, I'd made it so SingleRow
could be false in cases where Inner Unique was true.
Ah yes, you're right.
Thanks
Richard
On Thu, 4 Jan 2024 at 00:01, Richard Guo <guofenglinux@gmail.com> wrote:
> As a clause's serial number is unique within the current PlannerInfo
> context, and multiple clones of the same clause have the same serial
> number, it seems to me that it's more correct to calculate the length of
> ppi_clauses by:
>
> bms_num_members(inner_path->param_info->ppi_serials)
>
> So I think maybe we can fix this issue with the attached patch.

Thanks for the patch.  I looked at this again and believe the
bms_num_members() check is a good fix.

I considered if Amadeo's case is worth including as a test and I
concluded that having that as a test won't necessarily guard us
against getting this error again if there are still bugs in this
area.

David