Thread: Sort memory not being released

From
"Jim C. Nasby"
Date:
Wasn't sure if this should go to hackers or not, so here it is...

It appears that after a pgsql backend uses memory for sort, it doesn't
get released again, at least on solaris. IE:

17660 jnasby     1   0    0  430M  423M cpu3   82:49  0.00% postgres
17662 jnasby     1  58    0   94M   81M sleep   6:29  0.00% postgres
17678 jnasby     1  58    0   80M   72M sleep   1:06  0.00% postgres
17650 jnasby     1  58    0   46M 2112K sleep   0:00  0.00% postgres

In this case, I have shared buffers set low and sort mem set really high
(5000 and 350,000) to try and avoid sorts going to disk.

Even if my settings are bad for whatever reason, shouldn't sort memory
be released after it's been used? Before I had sort set to 20,000 and I
still saw pgsql processes using more than 50M even when idle.

Of course this isn't an issue if you disconnect frequently, but it
really would hurt if you were using connection pooling.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Sort memory not being released

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> It appears that after a pgsql backend uses memory for sort, it doesn't
> get released again, at least on solaris. IE:

That's true on many Unixen --- malloc'd space is never released back to
the OS until the process dies.  Not much we can do about it.  If you're
concerned, start a fresh session, or use another Unix (or at least a
better malloc package).

            regards, tom lane

Re: Sort memory not being released

From
"Jim C. Nasby"
Date:
On Mon, Jun 16, 2003 at 10:13:02AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > It appears that after a pgsql backend uses memory for sort, it doesn't
> > get released again, at least on solaris. IE:
>
> That's true on many Unixen --- malloc'd space is never released back to
> the OS until the process dies.  Not much we can do about it.  If you're
> concerned, start a fresh session, or use another Unix (or at least a
> better malloc package).

Holy ugly batman...

This is on solaris; is there a different/better malloc I could use?

Also, would doing the sort in a separate process solve the problem? IE:
doing a 'blocking fork()' on entry to the sort routine? (Sorry, I'm not
a C coder...)
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Sort memory not being released

From
Sean Chittenden
Date:
> > > It appears that after a pgsql backend uses memory for sort, it doesn't
> > > get released again, at least on solaris. IE:
> >
> > That's true on many Unixen --- malloc'd space is never released back to
> > the OS until the process dies.  Not much we can do about it.  If you're
> > concerned, start a fresh session, or use another Unix (or at least a
> > better malloc package).
>
> Holy ugly batman...
>
> This is on solaris; is there a different/better malloc I could use?

See if there's an madvise(2) call on Slowaris, specifically, look for
something akin to (taken from FreeBSD):

     MADV_FREE        Gives the VM system the freedom to free pages, and tells
                      the system that information in the specified page range
                      is no longer important.  This is an efficient way of
                      allowing malloc(3) to free pages anywhere in the address
                      space, while keeping the address space valid.  The next
                      time that the page is referenced, the page might be
                      demand zeroed, or might contain the data that was there
                      before the MADV_FREE call.  References made to that
                      address space range will not make the VM system page the
                      information back in from backing store until the page is
                      modified again.

That'll allow data that's malloc'ed to be reused by the OS.  If there
is such a call, it might be prudent to stick one in the sort code just
before or after the relevant free() call.  -sc

--
Sean Chittenden

Re: Sort memory not being released

From
Martijn van Oosterhout
Date:
On Mon, Jun 16, 2003 at 02:58:55PM -0700, Sean Chittenden wrote:
> See if there's an madvise(2) call on Slowaris, specifically, look for
> something akin to (taken from FreeBSD):
>
>      MADV_FREE        Gives the VM system the freedom to free pages, and tells
>                       the system that information in the specified page range
>                       is no longer important.  This is an efficient way of
>                       allowing malloc(3) to free pages anywhere in the address
>                       space, while keeping the address space valid.  The next
>                       time that the page is referenced, the page might be
>                       demand zeroed, or might contain the data that was there
>                       before the MADV_FREE call.  References made to that
>                       address space range will not make the VM system page the
>                       information back in from backing store until the page is
>                       modified again.
>
> That'll allow data that's malloc'ed to be reused by the OS.  If there
> is such a call, it might be prudent to stick one in the sort code just
> before or after the relevant free() call.  -sc

Seems not worth it. That just means you have to remember to re-madvise() it
when you allocate it again. The memory will be used by the OS again (after
swapping it out I guess).

For large allocations glibc tends to mmap() which does get unmapped. There's
a threshold of 4KB I think. Of course, thousands of allocations for a few
bytes will never trigger it.

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> "the West won the world not by the superiority of its ideas or values or
> religion but rather by its superiority in applying organized violence.
> Westerners often forget this fact, non-Westerners never do."
>   - Samuel P. Huntington

Re: Sort memory not being released

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> For large allocations glibc tends to mmap() which does get unmapped. There's
> a threshold of 4KB I think. Of course, thousands of allocations for a few
> bytes will never trigger it.

But essentially all our allocation traffic goes through palloc, which
bunches small allocations together.  In typical scenarios malloc will
only see requests of 8K or more, so we should be in good shape on this
front.

(Not that this is very relevant to Jim's problem, since he's not using
glibc...)

            regards, tom lane

Re: Sort memory not being released

From
Martijn van Oosterhout
Date:
On Tue, Jun 17, 2003 at 10:45:39AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > For large allocations glibc tends to mmap() which does get unmapped. There's
> > a threshold of 4KB I think. Of course, thousands of allocations for a few
> > bytes will never trigger it.
>
> But essentially all our allocation traffic goes through palloc, which
> bunches small allocations together.  In typical scenarios malloc will
> only see requests of 8K or more, so we should be in good shape on this
> front.

Ah, bad news. The threshold appears to be closer to 64-128KB, so for small
allocations normal brk() calls will be made until the third or fourth
expansion. This can be tuned (mallopt()) but that's probably not too good an
idea.

On the other hand, there is a function malloc_trim().

/* Release all but __pad bytes of freed top-most memory back to the
   system. Return 1 if successful, else 0. */
extern int malloc_trim __MALLOC_P ((size_t __pad));

Not entirely sure if it will help at all. Obviously memory fragmentation is
your enemy here.

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> "the West won the world not by the superiority of its ideas or values or
> religion but rather by its superiority in applying organized violence.
> Westerners often forget this fact, non-Westerners never do."
>   - Samuel P. Huntington

Re: Sort memory not being released

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Jun 17, 2003 at 10:45:39AM -0400, Tom Lane wrote:
>> But essentially all our allocation traffic goes through palloc, which
>> bunches small allocations together.  In typical scenarios malloc will
>> only see requests of 8K or more, so we should be in good shape on this
>> front.

> Ah, bad news. The threshold appears to be closer to 64-128KB, so for small
> allocations normal brk() calls will be made until the third or fourth
> expansion.

That's probably good, actually.  I'd imagine that mmap'ing for every 8K
would be a bad idea ... until a context gets up to a few hundred K you
shouldn't get too worried about whether you can eventually give it back
to the OS.

> Obviously memory fragmentation is
> your enemy here.

True.  I think the memory-context structure helps on that, but it cannot
solve it completely.  (AFAIK, no one has yet done any studies to see
what sorts of memory fragmentation issues may exist in a backend that's
been running for a long while.  It'd be an interesting little project
if anyone wants to take it up.)

            regards, tom lane

Re: Sort memory not being released

From
Sean Chittenden
Date:
> > > For large allocations glibc tends to mmap() which does get
> > > unmapped. There's a threshold of 4KB I think. Of course,
> > > thousands of allocations for a few bytes will never trigger it.
> >
> > But essentially all our allocation traffic goes through palloc,
> > which bunches small allocations together.  In typical scenarios
> > malloc will only see requests of 8K or more, so we should be in
> > good shape on this front.
>
> Ah, bad news. The threshold appears to be closer to 64-128KB, so for
> small allocations normal brk() calls will be made until the third or
> fourth expansion. This can be tuned (mallopt()) but that's probably
> not too good an idea.
[snip]
> Not entirely sure if it will help at all. Obviously memory
> fragmentation is your enemy here.

Depending on data use constraints and the malloc() routine in use
(this works with phk malloc() on FreeBSD, don't know about glibc or
Slowaris' routines) there's a cute trick that you can do to help with
this scenario so that a large malloc()'ed region is at the end of the
data segment and therefore a process can be sbrk()'ed and shrink when
free() is called on the large allocated region.

*) malloc() the memory used in normal operations

*) malloc() the memory needed for sorting

*) free() the memory used in normal operations

*) Do whatever needs to be done with the region of memory allocated
 for sorting

*) free() the memory used for sorting

Because phk malloc() works through chains and regions, if the 1st
malloc is big enough to handle all malloc() requests during the sort
operations, the process's memory will remain reasonably unfragmented:
the chain originally malloc()'ed for normal operations will be split
up and used to handle the requests during the sort operations.  Once
the sort ops are done and the sort mem is free()'ed, phk malloc will
sbrk(-1 * sort_mem) the process (shrinking the process space and
releasing the top end of the data segment back to the OS).

If the malloc order happens the other way around (malloc() sort mem,
then malloc() normal ops), you're hosed: the normal ops mem region is
at the top of the address space and potentially persists longer than
the sort region (likely what's happening now), so the proc can't be
sbrk()'ed and the process remains huge until it dies, or until the
regions at the top of the data segment are free()'ed, collapsed into
free contiguous regions at the top of the heap, and then sbrk()'ed.

For long-running servers and processes that grow quite large when
processing something, but that you'd like to have a small footprint
when not processing data, this is what I have to do as a chump defrag
routine.  Works well for platforms that have a halfway decent
malloc().  Another option is to mmap() private anonymous regions,
though I haven't done this for anything huge yet, as someone reported
being able to mmap() less than they were able to malloc()... something
I need to test.

Anyway, food for thought.  -sc

--
Sean Chittenden

Re: Sort memory not being released

From
Antti Haapala
Date:
On Tue, 17 Jun 2003, Sean Chittenden wrote:

> > > > For large allocations glibc tends to mmap() which does get
> > > > unmapped. There's a threshold of 4KB I think. Of course,
> > > > thousands of allocations for a few bytes will never trigger it.
> > >
> > > But essentially all our allocation traffic goes through palloc,
> > > which bunches small allocations together.  In typical scenarios
> > > malloc will only see requests of 8K or more, so we should be in
> > > good shape on this front.
> >
> > Ah, bad news. The threshold appears to be closer to 64-128KB, so for
> > small allocations normal brk() calls will be made until the third or
> > fourth expansion. This can be tuned (mallopt()) but that's probably
> > not too good an idea.
> [snip]
> > Not entirely sure if it will help at all. Obviously memory
> > fragmentation is your enemy here.
>
> Depending on data use constraints and the malloc() routine in use
> (this works with phk malloc() on FreeBSD, don't know about glibc or
> Slowaris' routines) there's a cute trick that you can do to help with
> this scenario so that a large malloc()'ed region is at the end of the
> data segment and therefore a process can be sbrk()'ed and shrink when
> free() is called on the large allocated region.

Glibc allows malloc parameters to be tuned through environment variables.

Linux Journal had an article about tuning malloc in May's issue. The
article is available online at http://www.linuxjournal.com/article.php?sid=6390

--
Antti Haapala

Re: Sort memory not being released

From
"Jim C. Nasby"
Date:
On Tue, Jun 17, 2003 at 10:45:39AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > For large allocations glibc tends to mmap() which does get unmapped. There's
> > a threshold of 4KB I think. Of course, thousands of allocations for a few
> > bytes will never trigger it.
>
> But essentially all our allocation traffic goes through palloc, which
> bunches small allocations together.  In typical scenarios malloc will
> only see requests of 8K or more, so we should be in good shape on this
> front.
>
> (Not that this is very relevant to Jim's problem, since he's not using
> glibc...)

Maybe it would be helpful to describe why I noticed this...

I've been doing some things that require very large sorts. I generally
have very few connections though, so I thought I'd set sort_mem to about
1/3 of my memory. My thought was that it's better to suck down a ton of
memory and blow out the disk cache if it means we can avoid hitting the
disk for a sort at all.

Of course I wasn't planning on sucking down a bunch of memory and
holding on to it. :)

I've read through the sort code and it seems that the pre-buffering once
you go to disk will probably hurt with a huge sort_mem setting, since
the data could be double or even triple buffered (in memtuples[], in
pgsql's shared buffers, and by the OS).

I think that a more ideal scenario (which I've been meaning to email
hackers about) would be something like this:
If the OS is running low on free physical memory, a sort will use less
than sort_mem, as an attempt to avoid swapping.

sort_mem is the maximum amount of sort memory a single sort (or maybe a
single connection) can take.

If sort_mem is over X size, then use only Y for pre-buffering (How much
does a large sort_mem help if you have to spill to disk?)

If it's pretty clear that the sort won't fit in memory (due to sort_mem
or system free memory being low), I think it might help if tuplesort
just went to disk right away, instead of waiting until all the memory
was used up, but again, I'm not sure how the sort algorithm works when
it goes to tape.

This should mean that you can set the system up to allow very large
sorts before spilling to disk... if there's not a lot of sorts sucking
down memory, a large sort will be able to avoid overflowing to disk,
which is obviously a huge performance gain. If the system is busy/memory
bound though, sorts will overflow to disk, rather than using swap space
which I'm sure would be a lot worse.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Sort memory not being released

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> Of course I wasn't planning on sucking down a bunch of memory and
> holding on to it. :)

Sure.  But when you're done with the big sort, just start a fresh
session.  I don't see that this is worth agonizing over.

> If sort_mem is over X size, then use only Y for pre-buffering (How much
> does a large sort_mem help if you have to spill to disk?)

It still helps quite a lot, because the average initial run length is
(if I recall Knuth correctly) twice the working buffer size.  I can't
see a reason for cutting back usage once you've been forced to start
spilling.

The bigger problem with your discussion is the assumption that we can
find out "if the OS is running low on free physical memory".  That seems
(a) unportable and (b) a moving target.

            regards, tom lane

Re: Sort memory not being released

From
"Jim C. Nasby"
Date:
On Tue, Jun 17, 2003 at 05:38:36PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > Of course I wasn't planning on sucking down a bunch of memory and
> > holding on to it. :)
>
> Sure.  But when you're done with the big sort, just start a fresh
> session.  I don't see that this is worth agonizing over.

In this case I could do that, but that's not always possible. It would
certainly wreak havoc with connection pooling, for example.

> > If sort_mem is over X size, then use only Y for pre-buffering (How much
> > does a large sort_mem help if you have to spill to disk?)
>
> It still helps quite a lot, because the average initial run length is
> (if I recall Knuth correctly) twice the working buffer size.  I can't
> see a reason for cutting back usage once you've been forced to start
> spilling.

Only because of double/triple buffering. If having the memory around
helps the algorithm then it should be used, at least up to the point of
diminishing returns.

> The bigger problem with your discussion is the assumption that we can
> find out "if the OS is running low on free physical memory".  That seems
> (a) unportable and (b) a moving target.

Well, there's other ways to do what I'm thinking of that don't rely on
getting a free memory number from the OS. For example, there could be a
'total_sort_mem' parameter that specifies the total amount of memory
that can be used for all sorts on the entire machine.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Sort memory not being released

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> Well, there's other ways to do what I'm thinking of that don't rely on
> getting a free memory number from the OS. For example, there could be a
> 'total_sort_mem' parameter that specifies the total amount of memory
> that can be used for all sorts on the entire machine.

How would you find out how many other sorts are going on (and how much
memory they're actually using)?  And probably more to the point, what do
you do if you want to sort and the parameter's already exhausted?

            regards, tom lane

Re: Sort memory not being released

From
Martijn van Oosterhout
Date:
On Tue, Jun 17, 2003 at 10:02:07AM -0700, Sean Chittenden wrote:
> For long running servers and processes that grow quite large when
> processing something, but you'd like to have a small foot print when
> not processing data, this is what I have to do as a chump defrag
> routine.  Works well for platforms that have a halfway decent
> malloc().  Another option is to mmap() private anonymous regions,
> though I haven't don this for anything huge yet as someone reported
> being able to mmap() less than they were able to malloc()... something
> I need to test.

Look at the process memory layout.  On Linux, stack+heap are limited
to 2GB; to access the rest you need to mmap().  This would vary
depending on the OS.  IMHO glibc's approach (use mmap() for large
allocations) is a good one, since the sort memory will generally be
mmap()ed (at least it was in my quick test last night).

As Tom pointed out, some study into memory fragmentation would be useful.

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> "the West won the world not by the superiority of its ideas or values or
> religion but rather by its superiority in applying organized violence.
> Westerners often forget this fact, non-Westerners never do."
>   - Samuel P. Huntington

Re: Sort memory not being released

From
"Jim C. Nasby"
Date:
On Tue, Jun 17, 2003 at 08:08:29PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > Well, there's other ways to do what I'm thinking of that don't rely on
> > getting a free memory number from the OS. For example, there could be a
> > 'total_sort_mem' parameter that specifies the total amount of memory
> > that can be used for all sorts on the entire machine.
>
> How would you find out how many other sorts are going on (and how much
> memory they're actually using)?  And probably more to the point, what do
> you do if you want to sort and the parameter's already exhausted?

There'd need to be a list of sorts kept in the backend, presumably in
shared memory. When a sort started, it would add an entry to the list
specifying how much memory it intended to use. Once it was done, it
could update with how much was actually used. Actually, I guess a simple
counter would suffice instead of a list.

As for when memory runs out, there's two things you can do. Obviously,
you can just sleep until more memory becomes available (presumably the
lock on the shared list/counter would prevent more than one backend from
starting a sort at a time). A more elegant solution would be to start
decreasing how much memory a sort will use as the limit is approached. A
possible algorithm would be:

IF total_sort_mem - active_sort_mem < desired_sort_mem THEN
    desired_sort_mem = (total_sort_mem - active_sort_mem) / 2

So if you can't get all the memory you'd like, take half of whatever's
available. Obviously there would have to be a limit to this... you can't
sort on 100 bytes. If (total_sort_mem - active_sort_mem) drops below a
certain threshold, you would either ignore it and use some small amount
of memory to do the sort, or you'd sleep until memory became available.

I know this might sound like a lot of added complexity, but if it means
you have a much better chance of being able to perform large sorts
in-memory instead of on-disk, I think it's well worth it.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Sort memory not being released

From
dalgoda@ix.netcom.com (Mike Castle)
Date:
In article <20030617212500.GO40542@flake.decibel.org>,
Jim C. Nasby <jim@nasby.net> wrote:
>Of course I wasn't planning on sucking down a bunch of memory and
>holding on to it. :)

What are you worried about?  The unused portions will eventually be paged
out to disk.  On the next sort, you'll spend a little less time allocating
the memory (saving time) and a little more time paging the disk in (taking
time).  Probably, all in all, you'll end up breaking even.

Just because your process has access to a lot of memory, doesn't mean that
it's all in physical memory at once.

Unless your system ran out of physical memory and/or swap, there shouldn't
be an issue.

It may well be that when you up the sort memory, you may also have to up
swap space.  No big deal.

mrc

--
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc