Thread: A potential memory leak on Merge Join when Sort node is not below Materialize node


From: Önder Kalacı
Hi hackers,

With PG 15 (rc1 or beta4), I'm observing an interesting memory pattern. I have not seen a similar discussion on the mailing list. If I missed that, please refer me there. The problem that I'm going to explain does not happen on PG 13/14.

It seems like there is a memory leak(?) in the case described in the subject. I'm still not sure what exactly is going on, but I thought it'd be useful to share at least my initial investigation.

After running the query and waiting a few minutes (see steps to repro below), use pg_log_backend_memory_contexts() to dump the memory contexts of the backend executing the command. Note that ExecutorState grows beyond 100GB, and depending on vm.overcommit_memory, you eventually get an OOM error or an OOM crash.

```
2022-09-28 17:33:38.155 CEST [32224] LOG:  level: 2; PortalContext: 1024 total in 1 blocks; 592 free (0 chunks); 432 used: <unnamed>
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 3; ExecutorState: 114923929600 total in 13710 blocks; 7783264 free (3 chunks); 114916146336 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 4; TupleSort main: 8192 total in 1 blocks; 3928 free (0 chunks); 4264 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 5; TupleSort sort: 295096 total in 8 blocks; 256952 free (67 chunks); 38144 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 6; Caller tuples: 8192 total in 1 blocks; 7992 free (0 chunks); 200 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 4; TupleSort main: 8192 total in 1 blocks; 3928 free (0 chunks); 4264 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 5; TupleSort sort: 4309736 total in 18 blocks; 263864 free (59 chunks); 4045872 used
2022-09-28 17:33:38.159 CEST [32224] LOG:  level: 6; Caller tuples: 8192 total in 1 blocks; 7992 free (0 chunks); 200 used
...
2022-09-28 17:33:38.160 CEST [32224] LOG:  Grand total: 114930446784 bytes in 13972 blocks; 8802248 free (275 chunks); 114921644536 used
```

I observed this with a merge join involving a table and a set-returning function. To simulate the problem with two tables, I use the following steps:

```
CREATE TABLE t1 (a text);
CREATE TABLE t2 (a text);

-- make the text a little large by adding 100000000000
INSERT INTO t1 SELECT (100000000000+i%1000)::text FROM generate_series(0,10000000) i;

-- make the text a little large by adding 100000000000
INSERT INTO t2 SELECT (100000000000+i%10000)::text FROM generate_series(0,10000000) i;

-- to simplify the explain plan, not strictly necessary
SET max_parallel_workers_per_gather TO 0;

-- these two settings are necessary to trigger the problem:
-- they make the planner use a Merge Join and avoid materialization
SET enable_hashjoin TO false;
SET enable_material TO false;

-- the join is on a TEXT column
-- when the join is on an INT column with a similar setup, I do not observe this problem
SELECT count(*) FROM t1 JOIN t2 USING (a);
```


The EXPLAIN output for the query looks like the following:
```
explain SELECT count(*) FROM t1 JOIN t2 USING (a);
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                   QUERY PLAN                                    │
├─────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=177735283.36..177735283.37 rows=1 width=8)                     │
│   ->  Merge Join  (cost=2556923.81..152703372.24 rows=10012764448 width=0)      │
│         Merge Cond: (t1.a = t2.a)                                               │
│         ->  Sort  (cost=1658556.19..1683556.63 rows=10000175 width=13)          │
│               Sort Key: t1.a                                                    │
│               ->  Seq Scan on t1  (cost=0.00..154056.75 rows=10000175 width=13) │
│         ->  Sort  (cost=1658507.28..1683506.93 rows=9999861 width=13)           │
│               Sort Key: t2.a                                                    │
│               ->  Seq Scan on t2  (cost=0.00..154053.61 rows=9999861 width=13)  │
└─────────────────────────────────────────────────────────────────────────────────┘
(9 rows)
```

In the end, my investigation mostly led me to the following palloc(), where we seem to allocate memory over and over without ever freeing it as the query runs:
```
(gdb) bt
#0  __GI___libc_malloc (bytes=bytes@entry=8388608) at malloc.c:3038
#1  0x00005589f3c55444 in AllocSetAlloc (context=0x5589f4896300, size=14) at aset.c:920
#2  0x00005589f3c5d763 in palloc (size=size@entry=14) at mcxt.c:1082
#3  0x00005589f3b1f553 in datumCopy (value=94051002161216, typByVal=typByVal@entry=false,
    typLen=<optimized out>) at datum.c:162
#4  0x00005589f3c6ed0b in tuplesort_getdatum (state=state@entry=0x5589f49274e0,
    forward=forward@entry=true, val=0x5589f48d7860, isNull=0x5589f48d7868, abbrev=abbrev@entry=0x0)
    at tuplesort.c:2675
#5  0x00005589f3947925 in ExecSort (pstate=0x5589f48d0a38) at nodeSort.c:200
#6  0x00005589f393d74c in ExecProcNode (node=0x5589f48d0a38)
    at ../../../src/include/executor/executor.h:259
#7  ExecMergeJoin (pstate=0x5589f4896cc8) at nodeMergejoin.c:871
#8  0x00005589f391fbc8 in ExecProcNode (node=0x5589f4896cc8)
    at ../../../src/include/executor/executor.h:259
#9  fetch_input_tuple (aggstate=aggstate@entry=0x5589f4896670) at nodeAgg.c:563
#10 0x00005589f3923742 in agg_retrieve_direct (aggstate=aggstate@entry=0x5589f4896670)
    at nodeAgg.c:2441
....
```

Could this be a bug, or am I missing anything?

Thanks,
Onder KALACI


Önder Kalacı <onderkalaci@gmail.com> writes:
> With PG 15 (rc1 or beta4), I'm observing an interesting memory pattern.

Yup, that is a leak.  valgrind'ing it blames this call chain:

==00:00:16:12.228 4011013== 790,404,056 bytes in 60,800,312 blocks are definitely lost in loss record 1,108 of 1,108
==00:00:16:12.228 4011013==    at 0x9A5104: palloc (mcxt.c:1170)
==00:00:16:12.228 4011013==    by 0x89F8D9: datumCopy (datum.c:175)
==00:00:16:12.228 4011013==    by 0x9B5BEE: tuplesort_getdatum (tuplesortvariants.c:882)
==00:00:16:12.228 4011013==    by 0x6FA8B3: ExecSort (nodeSort.c:200)
==00:00:16:12.228 4011013==    by 0x6F1E87: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013==    by 0x6F1E87: ExecMergeJoin (nodeMergejoin.c:871)
==00:00:16:12.228 4011013==    by 0x6D7800: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013==    by 0x6D7800: fetch_input_tuple (nodeAgg.c:562)
==00:00:16:12.228 4011013==    by 0x6DAE2E: agg_retrieve_direct (nodeAgg.c:2454)
==00:00:16:12.228 4011013==    by 0x6DAE2E: ExecAgg (nodeAgg.c:2174)
==00:00:16:12.228 4011013==    by 0x6C6122: ExecProcNode (executor.h:259)
==00:00:16:12.228 4011013==    by 0x6C6122: ExecutePlan (execMain.c:1636)

and bisecting fingers this commit as the guilty party:

commit 91e9e89dccdfdf4216953d3d8f5515dcdef177fb
Author: David Rowley <drowley@postgresql.org>
Date:   Thu Jul 22 14:03:19 2021 +1200

    Make nodeSort.c use Datum sorts for single column sorts

Looks like that forgot that tuplesort_getdatum()'s result has to
be freed by the caller.

            regards, tom lane



I wrote:
> and bisecting fingers this commit as the guilty party:

> commit 91e9e89dccdfdf4216953d3d8f5515dcdef177fb
> Author: David Rowley <drowley@postgresql.org>
> Date:   Thu Jul 22 14:03:19 2021 +1200

>     Make nodeSort.c use Datum sorts for single column sorts

After looking at that for a little while, I wonder if we shouldn't
fix this by restricting the Datum-sort path to be used only with
pass-by-value data types.  That'd require only a minor addition
to the new logic in ExecInitSort.

The alternative of inserting a pfree of the old value would complicate
the code nontrivially, I think, and really it would necessitate a
complete performance re-test.  I'm wondering if the claimed speedup
for pass-by-ref types wasn't fictional and based on skipping the
required pfrees.  Besides, if you think this code is hot enough that
you don't want to add a test-and-branch per tuple (a claim I also
doubt, BTW) then you probably don't want to add such overhead into
the pass-by-value case where the speedup is clear.

            regards, tom lane



Thanks for investigating this and finding the guilty commit.

On Thu, 29 Sept 2022 at 07:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After looking at that for a little while, I wonder if we shouldn't
> fix this by restricting the Datum-sort path to be used only with
> pass-by-value data types.  That'd require only a minor addition
> to the new logic in ExecInitSort.

I'm also wondering if that's the best fix given the timing of this discovery.

> The alternative of inserting a pfree of the old value would complicate
> the code nontrivially, I think, and really it would necessitate a
> complete performance re-test.  I'm wondering if the claimed speedup
> for pass-by-ref types wasn't fictional and based on skipping the
> required pfrees.  Besides, if you think this code is hot enough that
> you don't want to add a test-and-branch per tuple (a claim I also
> doubt, BTW) then you probably don't want to add such overhead into
> the pass-by-value case where the speedup is clear.

I'm wondering if the best way to fix it if doing it that way would be
to invent tuplesort_getdatum_nocopy() which would be the same as
tuplesort_getdatum() except it wouldn't do the datumCopy for byref
types.  It looks like tuplesort_gettupleslot() when copy==false just
directly stores the MinimalTuple that's in stup.tuple and shouldFree
is set to false.

Going by [1], it looks like I saw gains in test 6, which was a byref
Datum. Skipping the datumCopy() I imagine could only make the gains
slightly higher on that.  That puts me a bit more on the fence about
the best fix for PG15.

I've attached a patch to restrict the optimisation to byval types in
the meantime.

David

[1] https://www.postgresql.org/message-id/CAApHDvrWV%3Dv0qKsC9_BHqhCn9TusrNvCaZDz77StCO--fmgbKA%40mail.gmail.com

David Rowley <dgrowleyml@gmail.com> writes:
> I'm wondering if the best way to fix it if doing it that way would be
> to invent tuplesort_getdatum_nocopy() which would be the same as
> tuplesort_getdatum() except it wouldn't do the datumCopy for byref
> types.

Yeah, perhaps.  We'd need a clear spec on how long the Datum could
be presumed good --- probably till the next tuplesort_getdatum_nocopy
call, but that'd need to be checked --- and then check if that is
satisfactory for nodeSort's purposes.

If we had such a thing, I wonder if any of the other existing
tuplesort_getdatum callers would be happier with that.  nodeAgg for
one is tediously freeing the result, but could we drop that logic?
(I hasten to add that I'm not proposing we touch that for v15.)

            regards, tom lane



On Thu, 29 Sept 2022 at 08:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I'm wondering if the best way to fix it if doing it that way would be
> > to invent tuplesort_getdatum_nocopy() which would be the same as
> > tuplesort_getdatum() except it wouldn't do the datumCopy for byref
> > types.
>
> Yeah, perhaps.  We'd need a clear spec on how long the Datum could
> be presumed good --- probably till the next tuplesort_getdatum_nocopy
> call, but that'd need to be checked --- and then check if that is
> satisfactory for nodeSort's purposes.

Yeah, I think the same rules around scope apply as
tuplesort_gettupleslot() with copy==false.  We could do it by adding a
copy flag to the existing function, but I'd rather not add the
branching to that function. It's probably just better to duplicate it
and adjust.

> If we had such a thing, I wonder if any of the other existing
> tuplesort_getdatum callers would be happier with that.  nodeAgg for
> one is tediously freeing the result, but could we drop that logic?

Looking at process_ordered_aggregate_single(), it's likely more
efficient to use the nocopy version and just perform a datumCopy()
when we need to store the oldVal.  At least, that would be more
efficient when many values are being skipped due to being the same as
the last one.

I've just pushed the disable byref Datums patch I posted earlier. I
only made a small adjustment to make use of the TupleDescAttr() macro.
Önder, thank you for the report.

David



Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

From: Peter Geoghegan
On Wed, Sep 28, 2022 at 12:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
> > I'm wondering if the best way to fix it if doing it that way would be
> > to invent tuplesort_getdatum_nocopy() which would be the same as
> > tuplesort_getdatum() except it wouldn't do the datumCopy for byref
> > types.
>
> Yeah, perhaps.  We'd need a clear spec on how long the Datum could
> be presumed good --- probably till the next tuplesort_getdatum_nocopy
> call, but that'd need to be checked --- and then check if that is
> satisfactory for nodeSort's purposes.

I am reminded of the discussion that led to bugfix commit c2d4eb1b
some years back.

As the commit message of that old bugfix notes, tuplesort_getdatum()
and tuplesort_gettupleslot() are "the odd ones out" among "get tuple"
routines (i.e. routines that get a tuple from a tuplesort by calling
tuplesort_gettuple_common()). We used to sometimes do that with
tuplesort_getindextuple() and possible other such routines, but the
need for that capability was eliminated on the caller side around the
same time as the bugfix went in.

-- 
Peter Geoghegan



Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

From: Peter Geoghegan
On Wed, Sep 28, 2022 at 4:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I am reminded of the discussion that led to bugfix commit c2d4eb1b
> some years back.

Also potentially relevant: the 2017 commit fa117ee4 anticipated adding
a "copy" argument to tuplesort_getdatum() (the same commit added such
a "copy" argument to tuplesort_gettupleslot()). I see that that still
hasn't happened to tuplesort_getdatum() all these years later. Might
be a good idea to do it in the next year or two, though.

If David is interested in pursuing this now then I certainly won't object.

-- 
Peter Geoghegan



Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

From: Michael Paquier
On Thu, Sep 29, 2022 at 11:58:17AM +1300, David Rowley wrote:
> I've just pushed the disable byref Datums patch I posted earlier. I
> only made a small adjustment to make use of the TupleDescAttr() macro.
> Önder, thank you for the report.

Wouldn't it be better to have 3a58176 reflect the non-optimization
path in the EXPLAIN output of a new regression test if none of the
existing tests are able to show any difference?
--
Michael

On Thu, 29 Sept 2022 at 12:30, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 29, 2022 at 11:58:17AM +1300, David Rowley wrote:
> > I've just pushed the disable byref Datums patch I posted earlier. I
> > only made a small adjustment to make use of the TupleDescAttr() macro.
> > Önder, thank you for the report.
>
> Wouldn't it be better to have 3a58176 reflect the non-optimization
> path in the EXPLAIN output of a new regression test if none of the
> existing tests are able to show any difference?

There's nothing in EXPLAIN that shows that this optimization occurs.
Or are you proposing that there should be something? And for 15??

David



Michael Paquier <michael@paquier.xyz> writes:
> Wouldn't it be better to have 3a58176 reflect the non-optimization
> path in the EXPLAIN output of a new regression test if none of the
> existing tests are able to show any difference?

This decision is not visible in EXPLAIN in any case.

            regards, tom lane



Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

From: Michael Paquier
On Wed, Sep 28, 2022 at 07:35:07PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> Wouldn't it be better to have 3a58176 reflect the non-optimization
>> path in the EXPLAIN output of a new regression test if none of the
>> existing tests are able to show any difference?
>
> This decision is not visible in EXPLAIN in any case.

Okay, thanks!
--
Michael

On Thu, 29 Sept 2022 at 12:07, Peter Geoghegan <pg@bowt.ie> wrote:
> Also potentially relevant: the 2017 commit fa117ee4 anticipated adding
> a "copy" argument to tuplesort_getdatum() (the same commit added such
> a "copy" argument to tuplesort_gettupleslot()). I see that that still
> hasn't happened to tuplesort_getdatum() all these years later. Might
> be a good idea to do it in the next year or two, though.
>
> If David is interested in pursuing this now then I certainly won't object.

Just while this is fresh in my head, I wrote some code to make this
happen.  My preference would be not to add the "copy" param to the
existing function and instead just add a new function to prevent
additional branching.

The attached puts back the datum sort in nodeSort.c for byref types
and adjusts process_ordered_aggregate_single() to make use of this
function.

I did a quick benchmark to see if this helps DISTINCT aggregates any:

create table t1 (a varchar(32) not null, b varchar(32) not null);
insert into t1 select md5((x%10)::text),md5((x%10)::text) from
generate_Series(1,1000000)x;
vacuum freeze t1;
create index on t1(a);

With a work_mem of 256MBs I get:

query = select max(distinct a), max(distinct b) from t1;

Master:
latency average = 313.197 ms

Patched:
latency average = 304.335 ms

So not a very impressive speedup there (about 3%)

Some excerpts from perf top show:

Master:
   1.40%  postgres          [.] palloc
   1.13%  postgres          [.] tuplesort_getdatum
   0.77%  postgres          [.] datumCopy

Patched:
   0.91%  postgres          [.] tuplesort_getdatum_nocopy
   0.65%  postgres          [.] palloc

I stared for a while at the mode_final() function and thought maybe we
could use the nocopy variant there. I just didn't quite pluck up the
motivation to write any code to see if it could be made faster.

David


Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

From: Peter Geoghegan
On Wed, Sep 28, 2022 at 6:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Master:
> latency average = 313.197 ms
>
> Patched:
> latency average = 304.335 ms
>
> So not a very impressive speedup there (about 3%)

Worth a try, at least. Having a more consistent interface is valuable
in itself too.

-- 
Peter Geoghegan



On Thu, 29 Sept 2022 at 14:32, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Sep 28, 2022 at 6:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > Master:
> > latency average = 313.197 ms
> >
> > Patched:
> > latency average = 304.335 ms
> >
> > So not a very impressive speedup there (about 3%)
>
> Worth a try, at least. Having a more consistent interface is valuable
> in itself too.

Just testing the datum sort in nodeSort.c with the same table as
before but using the query:

select b from t1 order by b offset 1000000;

Master:
latency average = 344.763 ms

Patched:
latency average = 268.374 ms

about 28% faster.

I'll take this to another thread and put it in the next CF

David



Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

From: Peter Geoghegan
On Wed, Sep 28, 2022 at 9:59 PM David Rowley <dgrowleyml@gmail.com> wrote:
> select b from t1 order by b offset 1000000;
>
> Master:
> latency average = 344.763 ms
>
> Patched:
> latency average = 268.374 ms
>
> about 28% faster.

That's more like it!

-- 
Peter Geoghegan



> I've just pushed the disable byref Datums patch I posted earlier. I
> only made a small adjustment to make use of the TupleDescAttr() macro.
> Önder, thank you for the report.

Thank you David for taking care of this.

> Yeah, I think the same rules around scope apply as
> tuplesort_gettupleslot() with copy==false.  We could do it by adding a
> copy flag to the existing function, but I'd rather not add the
> branching to that function. It's probably just better to duplicate it
> and adjust.
>

For the record, I tried to see if gcc would optimize the function by
generating two different versions when copy is true or false, thus getting rid
of the branching while still having only one function to deal with. Using the
-fipa-cp-clone (or even the whole set of additional flags coming with -O3), it
does generate a special-case version of the function, but it seems to then
only be used by heapam_index_validate_scan and
percentile_cont_multi_final_common. I found this by looking for
references to the specialized version in the DWARF debug information.

Regards,

--
Ronan Dunklau





Ronan Dunklau <ronan.dunklau@aiven.io> writes:
>> Yeah, I think the same rules around scope apply as
>> tuplesort_gettupleslot() with copy==false.  We could do it by adding a
>> copy flag to the existing function, but I'd rather not add the
>> branching to that function. It's probably just better to duplicate it
>> and adjust.

> For the record, I tried to see if gcc would optimize the function by
> generating two different versions when copy is true or false, thus getting rid
> of the branching while still having only one function to deal with.

TBH, I think this is completely ridiculous over-optimization.
There's exactly zero evidence that a second copy of the function
would improve performance, or do anything but contribute to code
bloat (which does have a distributed performance cost).

            regards, tom lane



On Thursday, September 29, 2022 at 16:10:03 CEST, Tom Lane wrote:
> Ronan Dunklau <ronan.dunklau@aiven.io> writes:
> >> Yeah, I think the same rules around scope apply as
> >> tuplesort_gettupleslot() with copy==false.  We could do it by adding a
> >> copy flag to the existing function, but I'd rather not add the
> >> branching to that function. It's probably just better to duplicate it
> >> and adjust.
> >
> > For the record, I tried to see if gcc would optimize the function by
> > generating two different versions when copy is true or false, thus
> > getting rid of the branching while still having only one function to
> > deal with.
>
> TBH, I think this is completely ridiculous over-optimization.
> There's exactly zero evidence that a second copy of the function
> would improve performance, or do anything but contribute to code
> bloat (which does have a distributed performance cost).

I wasn't commenting on the merit of the optimization, but just that I tried to
get gcc to apply it itself, which it doesn't.

Regards,

--
Ronan Dunklau





Re: A potential memory leak on Merge Join when Sort node is not below Materialize node

From: Peter Geoghegan
On Thu, Sep 29, 2022 at 7:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> TBH, I think this is completely ridiculous over-optimization.
> There's exactly zero evidence that a second copy of the function
> would improve performance, or do anything but contribute to code
> bloat (which does have a distributed performance cost).

I thought that that was unjustified myself.

-- 
Peter Geoghegan



Hi David, Tom, all,


> I've just pushed the disable byref Datums patch I posted earlier. I
> only made a small adjustment to make use of the TupleDescAttr() macro.
> Önder, thank you for the report.


With this commit, I re-ran the query patterns where we observed the problem, and all looks good now. Sharing this as an FYI.

Thanks for the quick turnaround!

Onder KALACI