Thread: PHJ file leak.
Hello. While looking at a patch, I found that PHJ sometimes complains about file leaks when accompanied by LIMIT. The repro is very simple:

create table t as (select a, a as b from generate_series(0, 999999) a);
analyze t;
select t.a from t join t t2 on (t.a = t2.a) limit 1;

Once in every several (or dozen) executions, the last query complains as follows.

WARNING: temporary file leak: File 15 still referenced
WARNING: temporary file leak: File 17 still referenced

This is using PHJ, and the leaked file was a shared tuplestore for outer tuples, which was opened by sts_parallel_scan_next() called from ExecParallelHashJoinOuterGetTuple(). It seems to me that ExecHashTableDestroy is forgetting to release the shared tuplestore accessors. Please find the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 224cbb32ba..8399683569 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -871,6 +871,9 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 		}
 	}
 
+	/* Also, free up accessors to shared tuplestores if any */
+	ExecParallelHashCloseBatchAccessors(hashtable);
+
 	/* Release working memory (batchCxt is a child, so it goes away too) */
 	MemoryContextDelete(hashtable->hashCxt);
 
@@ -2991,6 +2994,7 @@ ExecParallelHashCloseBatchAccessors(HashJoinTable hashtable)
 		sts_end_parallel_scan(hashtable->batches[i].outer_tuples);
 	}
 	pfree(hashtable->batches);
+	hashtable->nbatch = 0;
 	hashtable->batches = NULL;
 }
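For context on where the leaked file lives: sts_parallel_scan_next() opens a read file inside the worker's SharedTuplestoreAccessor, and the cleanup the patch above relies on, sts_end_parallel_scan(), amounts to closing that file again if it is still open. Below is a minimal sketch of that cleanup, assuming the accessor keeps its open read file in a field called read_file; the field name and body are an illustrative paraphrase, not the exact sharedtuplestore.c source.

/* Sketch only: illustrates what ending a parallel scan must do. */
void
sts_end_parallel_scan(SharedTuplestoreAccessor *accessor)
{
	/* Close the read file that a parallel scan may have left open. */
	if (accessor->read_file != NULL)	/* read_file: assumed field name */
	{
		BufFileClose(accessor->read_file);
		accessor->read_file = NULL;
	}
}

If no such call happens before end-of-query resource cleanup, the temporary file is still referenced at that point and the backend emits the "temporary file leak" warnings shown above.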
On Tue, Nov 12, 2019 at 1:24 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> Hello. While looking at a patch, I found that PHJ sometimes complains about
> file leaks when accompanied by LIMIT.

Oops.

> The repro is very simple:
>
> create table t as (select a, a as b from generate_series(0, 999999) a);
> analyze t;
> select t.a from t join t t2 on (t.a = t2.a) limit 1;
>
> Once in every several (or dozen) executions, the last query
> complains as follows.
>
> WARNING: temporary file leak: File 15 still referenced
> WARNING: temporary file leak: File 17 still referenced

Ack. Reproduced here.

> This is using PHJ, and the leaked file was a shared tuplestore for
> outer tuples, which was opened by sts_parallel_scan_next() called from
> ExecParallelHashJoinOuterGetTuple(). It seems to me that
> ExecHashTableDestroy is forgetting to release the shared tuplestore
> accessors. Please find the attached.

Thanks for the patch! Yeah, this seems correct, but I'd like to think
about it some more before committing. I'm going to be a bit tied up
with travel so that might be next week.
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Nov 12, 2019 at 1:24 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>> Hello. While looking at a patch, I found that PHJ sometimes complains about
>> file leaks when accompanied by LIMIT.

> Thanks for the patch! Yeah, this seems correct, but I'd like to think
> about it some more before committing. I'm going to be a bit tied up
> with travel so that might be next week.

At this point we've missed the window for this week's releases, so
there's no great hurry (and it'd be best not to push any noncritical
patches into the back branches anyway, for the next 24 hours).

Although the patch seems unobjectionable as far as it goes, I'd like
to understand why we didn't see the need for it long since. Is there
another call to ExecParallelHashCloseBatchAccessors somewhere, and
if so, is that one wrongly placed?

			regards, tom lane
At Mon, 11 Nov 2019 17:24:45 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Tue, Nov 12, 2019 at 1:24 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >> Hello. While looking at a patch, I found that PHJ sometimes complains about
> >> file leaks when accompanied by LIMIT.
>
> > Thanks for the patch! Yeah, this seems correct, but I'd like to think
> > about it some more before committing. I'm going to be a bit tied up
> > with travel so that might be next week.
>
> At this point we've missed the window for this week's releases, so
> there's no great hurry (and it'd be best not to push any noncritical
> patches into the back branches anyway, for the next 24 hours).
>
> Although the patch seems unobjectionable as far as it goes, I'd like
> to understand why we didn't see the need for it long since. Is there
> another call to ExecParallelHashCloseBatchAccessors somewhere, and
> if so, is that one wrongly placed?

It's a simple race condition between the leader and the workers. If a
worker's scan has reached the end, no batch file is open; if it hasn't
reached the end, a batch file is still open. If a worker notices that the
channel to the leader is already closed before reaching the limit, it
calls ExecEndNode but doesn't call ExecShutdownNode. Otherwise, if the
worker reaches the limit first, it *shuts down* itself and then ends, so
everything is clean.

On second thought, this seems to be an issue in ExecutePlan rather than
in PHJ itself. ExecutePlan should shut down the nodes when the output
channel is broken.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ea4b586984..038bafe777 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1683,7 +1683,15 @@ ExecutePlan(EState *estate,
 		 * end the loop.
 		 */
 		if (!dest->receiveSlot(slot, dest))
+		{
+			/*
+			 * If we know we won't need to back up, we can release
+			 * resources at this point.
+			 */
+			if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
+				(void) ExecShutdownNode(planstate);
 			break;
+		}
 	}
 
 	/*
At Mon, 11 Nov 2019 17:24:45 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> Although the patch seems unobjectionable as far as it goes, I'd like
> to understand why we didn't see the need for it long since. Is there
> another call to ExecParallelHashCloseBatchAccessors somewhere, and
> if so, is that one wrongly placed?

The previous patch would be wrong. The root cause is an open batch, so
the right thing to do at scan end is ExecHashTableDetachBatch. And the
real issue here seems to be in ExecutePlan, not in PHJ.

regards

--
Kyotaro Horiguchi
NTT Open Source Software Center
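For reference, the early-termination path being pointed at here is the hash join's shutdown callback: it detaches from the current batch (which ends any parallel scan and closes the batch's shared tuplestore files) and then from the shared hash table. A rough sketch of that shape, based on the descriptions in this thread rather than the exact nodeHashjoin.c source:

#include "postgres.h"
#include "executor/nodeHash.h"
#include "executor/nodeHashjoin.h"

/* Sketch only: the shutdown callback that makes the batch files go away. */
void
ExecShutdownHashJoin(HashJoinState *node)
{
	if (node->hj_HashTable)
	{
		/*
		 * Detach from shared state while the DSM segment is still mapped.
		 * Detaching from the batch ends parallel scans and closes the
		 * shared tuplestore files for the current batch.
		 */
		ExecHashTableDetachBatch(node->hj_HashTable);
		ExecHashTableDetach(node->hj_HashTable);
	}
}

The bug, then, is not that this cleanup is wrong, but that one exit path out of the executor never calls it.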
On Tue, Nov 12, 2019 at 4:20 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Mon, 11 Nov 2019 17:24:45 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> > Although the patch seems unobjectionable as far as it goes, I'd like
> > to understand why we didn't see the need for it long since. Is there
> > another call to ExecParallelHashCloseBatchAccessors somewhere, and
> > if so, is that one wrongly placed?
>
> The previous patch would be wrong. The root cause is an open batch, so
> the right thing to do at scan end is ExecHashTableDetachBatch. And the
> real issue here seems to be in ExecutePlan, not in PHJ.

You are right. Here is the email I just wrote that says the same
thing, but with less efficiency:

On Tue, Nov 12, 2019 at 11:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Tue, Nov 12, 2019 at 1:24 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >> Hello. While looking at a patch, I found that PHJ sometimes complains about
> >> file leaks when accompanied by LIMIT.
>
> > Thanks for the patch! Yeah, this seems correct, but I'd like to think
> > about it some more before committing. I'm going to be a bit tied up
> > with travel so that might be next week.
>
> At this point we've missed the window for this week's releases, so
> there's no great hurry (and it'd be best not to push any noncritical
> patches into the back branches anyway, for the next 24 hours).
>
> Although the patch seems unobjectionable as far as it goes, I'd like
> to understand why we didn't see the need for it long since. Is there
> another call to ExecParallelHashCloseBatchAccessors somewhere, and
> if so, is that one wrongly placed?

I'll need to look into this some more in a few days, but here's a
partial analysis:

The usual way that these files get closed is by sts_end_parallel_scan().
For the particular file in question here -- an outer batch file that is
open for reading while we probe -- that usually happens at the end of
the per-batch probe phase, before we try to move to another batch or
reach end of data. In case of an early end, it also happens in
ExecShutdownHashJoin(), which detaches from and cleans up shared
resources, which includes closing these files.

There are a few ways that ExecShutdownNode() can be reached:

1. ExecLimit() (which we now suspect to be bogus -- see nearby
   unfinished business[1]), though that only happens in the leader in
   this case
2. ExecutePlan() on end-of-data.
3. ExecutePlan() on reaching the requested tuple count.

Unfortunately those aren't the only ways out of ExecutePlan()'s loop,
and that may be a problem. I think what's happening in this case is that
there is a race where dest->receiveSlot(slot, dest) returns false
because the leader has stopped receiving tuples (having received enough
tuples to satisfy the LIMIT), so we exit early, but that path out of the
loop doesn't run ExecShutdownNode(). EXEC_FLAG_BACKWARD would also
inhibit it, but that shouldn't be set in a parallel plan.

So, after thinking about it, I'm not sure the proposed patch is
conceptually sound (even if it seems to work), because
ExecHashTableDestroy() runs at 'end' time and that's after shared memory
has disappeared, so it shouldn't be doing shared resource-related
cleanup work, whereas ExecParallelHashCloseBatchAccessors() relates to
shared resources (for example, it calls sts_end_write(), which should
never actually do anything at this time, but it is potentially a
shm-updating routine, which seems wrong to me).

Recommending a change is going to require more brainpower than I have
spare today due to other commitments. ExecShutdownNode() is certainly a
bit tricky.

[1] https://www.postgresql.org/message-id/flat/87ims2amh6.fsf%40jsievers.enova.com
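To make those exit paths concrete, the pre-fix main loop of ExecutePlan() behaves roughly like the sketch below. The function name ExecutePlan_sketch and its parameter list are illustrative simplifications, paraphrased from the analysis above rather than copied from execMain.c; the numbered comments refer to the list of ExecShutdownNode() call sites.

#include "postgres.h"
#include "executor/executor.h"
#include "tcop/dest.h"

/* Sketch only: simplified shape of ExecutePlan()'s loop before the fix. */
static void
ExecutePlan_sketch(EState *estate, PlanState *planstate, bool sendTuples,
				   uint64 numberTuples, DestReceiver *dest)
{
	uint64		current_tuple_count = 0;
	TupleTableSlot *slot;

	for (;;)
	{
		slot = ExecProcNode(planstate);

		if (TupIsNull(slot))
		{
			/* 2. End of data: nodes get a chance to shut down. */
			if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
				(void) ExecShutdownNode(planstate);
			break;
		}

		if (sendTuples && !dest->receiveSlot(slot, dest))
			break;				/* receiver gone: no ExecShutdownNode() here */

		current_tuple_count++;
		if (numberTuples && numberTuples == current_tuple_count)
		{
			/* 3. Requested count reached: nodes get a chance to shut down. */
			if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
				(void) ExecShutdownNode(planstate);
			break;
		}
	}
}

The middle break is the path a worker takes when the leader has already satisfied the LIMIT and stopped reading its tuple queue, which is why the PHJ batch files are still open when the worker exits.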
On Tue, Nov 12, 2019 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Nov 12, 2019 at 4:20 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Mon, 11 Nov 2019 17:24:45 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> > > Although the patch seems unobjectionable as far as it goes, I'd like
> > > to understand why we didn't see the need for it long since. Is there
> > > another call to ExecParallelHashCloseBatchAccessors somewhere, and
> > > if so, is that one wrongly placed?
> >
> > The previous patch would be wrong. The root cause is an open batch, so
> > the right thing to do at scan end is ExecHashTableDetachBatch. And the
> > real issue here seems to be in ExecutePlan, not in PHJ.
>
> You are right. Here is the email I just wrote that says the same
> thing, but with less efficiency:

And yeah, your Make_parallel_shutdown_on_broken_channel.patch seems
like the real fix here. It's not optional to run that at end-of-query,
though you might get that impression from various comments, and it's
also not OK to call it before the end of the query, though you might
get that impression from what the code actually does.

CC'ing Robert, who designed ExecShutdownNode(), though admittedly the
matter might have been theoretical until PHJ came along (since the only
other executor node that implements ExecShutdownNode() was Gather, and
you can't have a Gather under a Gather, and if this happens to you in
the leader process then the user is gone and no one will hear the
screams).
On Tue, Nov 12, 2019 at 5:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Nov 12, 2019 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Tue, Nov 12, 2019 at 4:20 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > The previous patch would be wrong. The root cause is an open batch, so
> > > the right thing to do at scan end is ExecHashTableDetachBatch. And the
> > > real issue here seems to be in ExecutePlan, not in PHJ.
> >
> > You are right. Here is the email I just wrote that says the same
> > thing, but with less efficiency:
>
> And yeah, your Make_parallel_shutdown_on_broken_channel.patch seems
> like the real fix here. It's not optional to run that at end-of-query,
> though you might get that impression from various comments, and it's
> also not OK to call it before the end of the query, though you might
> get that impression from what the code actually does.

Here's the version I'd like to commit in a day or two, once the dust
has settled on the minor release. Instead of adding yet another copy
of that code, I just moved it out of the loop; this way there is no
way to miss it. I think the comment could also be better, but I'll
wait for the concurrent discussions about the meaning of
ExecShutdownNode() to fix that in master.
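Against the earlier sketch, moving the call out of the loop means the per-exit-path calls collapse into a single call that every exit path reaches. The loop body of the same sketch function then becomes roughly the following; again this is only a sketch of the shape described, not the committed hunk itself.

	for (;;)
	{
		slot = ExecProcNode(planstate);

		if (TupIsNull(slot))
			break;

		if (sendTuples && !dest->receiveSlot(slot, dest))
			break;

		current_tuple_count++;
		if (numberTuples && numberTuples == current_tuple_count)
			break;
	}

	/*
	 * Every exit from the loop arrives here, so parallel nodes always get
	 * the chance to shut down and release shared resources such as PHJ's
	 * shared tuplestore files.
	 */
	if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
		(void) ExecShutdownNode(planstate);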
At Wed, 13 Nov 2019 09:48:19 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> On Tue, Nov 12, 2019 at 5:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Tue, Nov 12, 2019 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > On Tue, Nov 12, 2019 at 4:20 PM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > > > The previous patch would be wrong. The root cause is an open batch, so
> > > > the right thing to do at scan end is ExecHashTableDetachBatch. And the
> > > > real issue here seems to be in ExecutePlan, not in PHJ.
> > >
> > > You are right. Here is the email I just wrote that says the same
> > > thing, but with less efficiency:
> >
> > And yeah, your Make_parallel_shutdown_on_broken_channel.patch seems
> > like the real fix here. It's not optional to run that at end-of-query,
> > though you might get that impression from various comments, and it's
> > also not OK to call it before the end of the query, though you might
> > get that impression from what the code actually does.
>
> Here's the version I'd like to commit in a day or two, once the dust
> has settled on the minor release. Instead of adding yet another copy
> of that code, I just moved it out of the loop; this way there is no
> way to miss it. I think the comment could also be better, but I'll
> wait for the concurrent discussions about the meaning of
> ExecShutdownNode() to fix that in master.

The patch's shape looks better. Thanks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Nov 13, 2019 at 6:13 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Wed, 13 Nov 2019 09:48:19 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> >
> > Here's the version I'd like to commit in a day or two, once the dust
> > has settled on the minor release. Instead of adding yet another copy
> > of that code, I just moved it out of the loop; this way there is no
> > way to miss it. I think the comment could also be better, but I'll
> > wait for the concurrent discussions about the meaning of
> > ExecShutdownNode() to fix that in master.
>
> The patch's shape looks better. Thanks.
>

+1. LGTM as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 13, 2019 at 9:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Nov 13, 2019 at 6:13 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > The patch's shape looks better. Thanks.
>
> +1. LGTM as well.

Thanks. Pushed.
On Fri, Nov 15, 2019 at 3:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Nov 13, 2019 at 9:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Nov 13, 2019 at 6:13 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > > The patch's shape looks better. Thanks.
> >
> > +1. LGTM as well.
>
> Thanks. Pushed.
We are hitting this leak in production on an 11.6 system for a query that is using a parallel hash join. Was this fix pushed in 11.7? I can't tell clearly from the release notes for 11.7 or this thread.
Thanks!
Jeremy
Jeremy Finzel <finzelj@gmail.com> writes:
> We are hitting this leak in production on an 11.6 system for a query that
> is using a parallel hash join. Was this fix pushed in 11.7? I can't tell
> clearly from the release notes for 11.7 or this thread.

It looks like you're asking about this commit:

Author: Thomas Munro <tmunro@postgresql.org>
Branch: master [76cbfcdf3] 2019-11-16 10:11:30 +1300
Branch: REL_12_STABLE Release: REL_12_2 [24897e1a1] 2019-11-16 10:18:45 +1300
Branch: REL_11_STABLE Release: REL_11_7 [bc049d0d4] 2019-11-16 10:19:16 +1300

    Always call ExecShutdownNode() if appropriate.

which is documented thus in the 11.7 release notes:

     <listitem>
<!--
Author: Thomas Munro <tmunro@postgresql.org>
Branch: master [76cbfcdf3] 2019-11-16 10:11:30 +1300
Branch: REL_12_STABLE [24897e1a1] 2019-11-16 10:18:45 +1300
Branch: REL_11_STABLE [bc049d0d4] 2019-11-16 10:19:16 +1300
-->
      <para>
       Ensure parallel plans are always shut down at the correct time
       (Kyotaro Horiguchi)
      </para>
      <para>
       This oversight is known to result in <quote>temporary file
       leak</quote> warnings from multi-batch parallel hash joins.
      </para>
     </listitem>

			regards, tom lane
On Fri, Mar 6, 2020 at 9:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeremy Finzel <finzelj@gmail.com> writes:
> > We are hitting this leak in production on an 11.6 system for a query that
> > is using a parallel hash join. Was this fix pushed in 11.7? I can't tell
> > clearly from the release notes for 11.7 or this thread.
>
> It looks like you're asking about this commit:
>
> Author: Thomas Munro <tmunro@postgresql.org>
> Branch: master [76cbfcdf3] 2019-11-16 10:11:30 +1300
> Branch: REL_12_STABLE Release: REL_12_2 [24897e1a1] 2019-11-16 10:18:45 +1300
> Branch: REL_11_STABLE Release: REL_11_7 [bc049d0d4] 2019-11-16 10:19:16 +1300
>
>     Always call ExecShutdownNode() if appropriate.
>
> which is documented thus in the 11.7 release notes:
>
>      <listitem>
> <!--
> Author: Thomas Munro <tmunro@postgresql.org>
> Branch: master [76cbfcdf3] 2019-11-16 10:11:30 +1300
> Branch: REL_12_STABLE [24897e1a1] 2019-11-16 10:18:45 +1300
> Branch: REL_11_STABLE [bc049d0d4] 2019-11-16 10:19:16 +1300
> -->
>       <para>
>        Ensure parallel plans are always shut down at the correct time
>        (Kyotaro Horiguchi)
>       </para>
>       <para>
>        This oversight is known to result in <quote>temporary file
>        leak</quote> warnings from multi-batch parallel hash joins.
>       </para>
>      </listitem>
>
>			regards, tom lane
Thank you! Yep, pretty clear :).