Thread: Releasing memory during External sorting?

Releasing memory during External sorting?

From: Simon Riggs
I have concerns about whether we are overallocating memory for use in
external sorts. (All code relating to this is in tuplesort.c)

When we begin a sort we allocate (work_mem | maintenance_work_mem) and
attempt to do the sort in memory. If the sort set is too big to fit in
memory we then write to disk and begin an external sort. The same memory
allocation is used for both types of sort, AFAICS.

The external sort algorithm benefits from some memory but not much.
Knuth says that the amount of memory required is very low, with a value
typically less than 1 kB. I/O overheads mean that there is benefit from
having longer sequential writes, so the optimum is much larger than
that. I've not seen any data that indicates that a setting higher than
16 MB adds any value at all to a large external sort. I have some
indications from private tests that very high memory settings may
actually hinder performance of the sorts, though I cannot explain that
and wonder whether it is the performance tests themselves that have
issues.

Does anyone have any clear data that shows the value of large settings
of work_mem when the data to be sorted is much larger than memory? (I am
well aware of the value of setting work_mem higher for smaller sorts, so
any performance data needs to reflect only very large sorts).

If not, I would propose that when we move from qsort to tapesort mode we
free the larger work_mem setting (if one exists) and allocate only a
lower, though still optimal setting for the tapesort. That way the
memory can be freed for use by other users or the OS while the tapesort
proceeds (which is usually quite a while...).
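
For concreteness, a rough sketch of the kind of change I mean (plain C
with made-up names and a made-up 16 MB lower setting, not the actual
tuplesort.c code):

    /*
     * Hypothetical sketch: once the sort spills to disk, shrink the sort
     * buffer from the full work_mem allocation down to a smaller "tape
     * sort" size rather than holding onto all of it for the merge phase.
     */
    #include <stdlib.h>

    #define TAPE_SORT_MEM   (16 * 1024 * 1024)  /* assumed lower, still-optimal setting */

    typedef struct SortState
    {
        char   *buffer;         /* sort working memory */
        size_t  buffer_size;
        int     spilled;        /* switched to external (tape) sort? */
    } SortState;

    static void
    switch_to_tapesort(SortState *state)
    {
        /* ... the current in-memory run has already been written to tape ... */
        free(state->buffer);
        state->buffer = malloc(TAPE_SORT_MEM);  /* error handling omitted in this sketch */
        state->buffer_size = TAPE_SORT_MEM;
        state->spilled = 1;
    }

Whether the freed memory actually makes it back to the OS is of course
platform-dependent.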

Feedback, please.

Best Regards, Simon Riggs


Re: Releasing memory during External sorting?

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> If not, I would propose that when we move from qsort to tapesort mode we
> free the larger work_mem setting (if one exists) and allocate only a
> lower, though still optimal setting for the tapesort. That way the
> memory can be freed for use by other users or the OS while the tapesort
> proceeds (which is usually quite a while...).

On most platforms it's quite unlikely that any memory would actually get
released back to the OS before transaction end, because the memory
blocks belonging to the tuplesort context will be intermixed with blocks
belonging to other contexts.  So I think this is pretty pointless.
(If you can't afford to have the sort using all of sort_mem, you've set
sort_mem too large, anyway.)

            regards, tom lane

Re: Releasing memory during External sorting?

From: Simon Riggs
On Fri, 2005-09-23 at 10:09 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > If not, I would propose that when we move from qsort to tapesort mode we
> > free the larger work_mem setting (if one exists) and allocate only a
> > lower, though still optimal setting for the tapesort. That way the
> > memory can be freed for use by other users or the OS while the tapesort
> > proceeds (which is usually quite a while...).
>
> On most platforms it's quite unlikely that any memory would actually get
> released back to the OS before transaction end, because the memory
> blocks belonging to the tuplesort context will be intermixed with blocks
> belonging to other contexts.  So I think this is pretty pointless.

I take it you mean pointless because of the way the memory allocation
works, rather than because giving memory back isn't worthwhile?

Surely the sort memory would be allocated in contiguous chunks? In some
cases we might be talking about more than a GB of memory, so it'd be
good to get that back ASAP. I'm speculating....

> (If you can't afford to have the sort using all of sort_mem, you've set
> sort_mem too large, anyway.)

Sort takes care to allocate only what it needs as it starts up. All I'm
suggesting is to take the same care when the sort mode changes. If the
above argument held water, then we would just allocate all the memory in
one lump at startup, "because we can afford to", so I don't buy that.

Since we know the predicted size of the sort set prior to starting the
sort node, could we not use that information to allocate memory
appropriately? i.e. if sort size is predicted to be more than twice the
size of work_mem, then just move straight to the external sort algorithm
and set the work_mem down at the lower limit?
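
To illustrate the idea (purely hypothetical sketch; the function name,
the 2x threshold and the 16 MB floor are my assumptions, not existing
code):

    /*
     * Hypothetical: pick the sort memory budget up front from the
     * planner's size estimate, instead of always grabbing work_mem.
     */
    #include <stddef.h>

    #define EXTERNAL_SORT_MEM  (16 * 1024 * 1024)   /* assumed lower limit */

    static size_t
    choose_sort_mem(double est_input_bytes, size_t work_mem_bytes)
    {
        if (est_input_bytes > 2.0 * (double) work_mem_bytes)
            return EXTERNAL_SORT_MEM;   /* clearly won't fit: go straight to tape sort */
        return work_mem_bytes;          /* otherwise try to sort in memory first */
    }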

That is, unless somebody has evidence that having a very large amount of
memory has any performance benefit for external sorting?

Best Regards, Simon Riggs





Re: Releasing memory during External sorting?

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> Since we know the predicted size of the sort set prior to starting the
> sort node, could we not use that information to allocate memory
> appropriately? i.e. if sort size is predicted to be more than twice the
> size of work_mem, then just move straight to the external sort algorithm
> and set the work_mem down at the lower limit?

Have you actually read the sort code?

During the run-forming phase it's definitely useful to eat all the
memory you can: that translates directly to longer initial runs and
hence fewer merge passes.  During the run-merging phase it's possible
that using less memory would not hurt performance any, but as already
stated, I don't think it will actually end up cutting the backend's
memory footprint --- the sbrk point will be established during the run
forming phase and it's unlikely to move back much until transaction end.
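
As a rough back-of-the-envelope illustration of that effect (assuming
replacement selection produces initial runs of about twice the sort
memory and a 6-way merge; all of the numbers here are made up for
illustration):

    /* Estimated merge passes for an N-byte input with M bytes of sort memory. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double N = 10.0 * 1024 * 1024 * 1024;     /* 10 GB input */
        const double F = 6.0;                           /* assumed merge order */
        const double mems_mb[] = { 16.0, 256.0, 1024.0 };

        for (int i = 0; i < 3; i++)
        {
            double M = mems_mb[i] * 1024 * 1024;
            double runs = ceil(N / (2.0 * M));          /* replacement selection: ~2*M per run */
            double passes = ceil(log(runs) / log(F));
            printf("%4.0f MB -> ~%.0f initial runs, ~%.0f merge pass(es)\n",
                   mems_mb[i], runs, passes);           /* 16 MB: 4 passes; 1 GB: 1 pass */
        }
        return 0;
    }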

Also, if I recall the development of that code correctly, the reason for
using more than minimum memory during the merge phase is that writing or
reading lots of tuples at once improves sequentiality of access to the
temp files.  So I'm not sure that cutting down the memory wouldn't hurt
performance.

            regards, tom lane

Re: Releasing memory during External sorting?

From: Pailloncy Jean-Gerard
> On most platforms it's quite unlikely that any memory would actually get
> released back to the OS before transaction end, because the memory
> blocks belonging to the tuplesort context will be intermixed with blocks
> belonging to other contexts.  So I think this is pretty pointless.
> (If you can't afford to have the sort using all of sort_mem, you've set
> sort_mem too large, anyway.)
On OpenBSD 3.8, malloc uses mmap(2) and no longer sbrk.
So, as soon as a block is freed, it is returned to the OS.
Accessing the freed pointer crashes immediately.
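
A minimal standalone illustration of the difference (plain C, not
PostgreSQL code; the 256 MB figure is just an example):

    /*
     * When a large block is backed directly by mmap(2), the pages can be
     * handed back to the OS the moment they are unmapped -- unlike an
     * sbrk()-based heap, where freeing a block in the middle of the arena
     * cannot lower the break.
     */
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 256 * 1024 * 1024;     /* e.g. a 256 MB sort buffer */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANON, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        memset(buf, 'x', len);              /* resident set grows */
        munmap(buf, len);                   /* pages go back to the OS at once */

        /* buf[0] = 'x';  -- would now fault immediately, as described above */
        return 0;
    }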

Regards,
Jean-Gérard Pailloncy



Re: Releasing memory during External sorting?

From: Martijn van Oosterhout
On Fri, Sep 23, 2005 at 06:39:35PM +0200, Pailloncy Jean-Gerard wrote:
> >On most platforms it's quite unlikely that any memory would actually
> >get released back to the OS before transaction end, because the
> >memory blocks belonging to the tuplesort context will be intermixed
> >with blocks belonging to other contexts.  So I think this is pretty
> >pointless. (If you can't afford to have the sort using all of
> >sort_mem, you've set sort_mem too large, anyway.)

> On OpenBSD 3.8, malloc uses mmap(2) and no longer sbrk.
> So, as soon as a block is freed, it is returned to the OS.
> Accessing the freed pointer crashes immediately.

Interesting point. Glibc also uses mmap(), but only for allocations above
a certain threshold; for smaller requests it would be a waste of space.

I guess you would have to look into the PostgreSQL allocator to see
whether it divides the mmap()ed space up between multiple contexts.
Large allocations certainly appear to be passed off to malloc(), but I
don't think execSort allocates all its space in one go, it just counts
the space allocated by palloc().

So, unless someone changes the tuplesort code to allocate big blocks and
use them only for tuples, I think you're going to run into issues with
interleaved data, meaning there's not much to give back to the OS...
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Releasing memory during External sorting?

From: Meir Maor
Calculating the optimal memory for a disk-based sort comes down to minimizing I/O.
A previous post stated that we can merge as many subfiles as we want in a
single pass. That is not accurate, because we also want to eliminate disk
seeks in the merge phase, and the merging should be done by reading blocks
of data from each subfile. If we have data of size N and memory M, then
after sorting each run we will have K = N/M subfiles to merge.
In the merge operation, if we want to merge all subfiles in one pass, we
read M/K of data from each subfile into memory and begin merging, reading
another M/K block whenever a subfile's buffer runs empty. We would like
disk seek time to be negligible compared with sequential I/O time. Note
that we are performing I/O in blocks of M/K = M/(N/M) = M^2/N.
Assume sequential I/O runs at 100 MB/s and a random seek takes ~15 ms. If
we want seek time to be smaller than sequential I/O time by one order of
magnitude, then, since we can read 1.5 MB of data in the time of one random
seek, we get optimal performance if we perform I/O in blocks of 15 MB.
Since the merge algorithm above performs I/O in blocks of M/K, we want
M > K * 15 MB, which results in a very large memory requirement:
  M^2 > N * 15 MB
  M > sqrt(N * 15 MB)
For example, to sort 10 GB of data we would want M > ~380 MB
for optimal performance.

Alternatively, we can choose a different algorithm in which we merge only a
constant number of subfiles together at a time, but then we need multiple
passes to merge the entire file: log(K) passes over the entire data. This
approach obviously improves as memory increases.

The first approach requires 2 passes over the entire data and K^2 + K random
seeks; the second approach (merging l subfiles at a time) requires log_l(K)
passes over the data and K*l + K random seeks.
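
For what it is worth, a small standalone program that just evaluates the
bound above; the 100 MB/s, 15 ms and 10x figures are the assumptions
stated above, not measurements:

    /* Evaluate M > sqrt(N * B), where B is the block size that keeps
     * seek time an order of magnitude below sequential read time. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double seq_mb_per_s = 100.0;          /* assumed sequential rate */
        const double seek_s       = 0.015;          /* assumed random seek time */
        const double factor       = 10.0;           /* "one order of magnitude" */
        const double N_mb         = 10.0 * 1024;    /* 10 GB of data to sort */

        double block_mb = factor * seek_s * seq_mb_per_s;   /* = 15 MB */
        double M_mb     = sqrt(N_mb * block_mb);            /* minimum memory */

        printf("block size: %.0f MB, minimum memory: ~%.0f MB\n",
               block_mb, M_mb);   /* ~15 MB and ~390 MB, in line with the figure above */
        return 0;
    }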



Re: Releasing memory during External sorting?

From: "Dann Corbit"

Generally, when you read from a set of subfiles, the OS will cache the reads to some degree, so the disk-seek jitter is not always that bad. On a highly fragmented disk drive, you might also jump all over the place when reading serially from a single subfile. Of course, every situation is different. At any rate, I would recommend benchmarking many different approaches. It is also rather important how much memory is available to perform the sorts and reads. But with memory becoming larger and cheaper, that becomes less and less of a worry.

 



Re: Releasing memory during External sorting?

From: Simon Riggs
On Fri, 2005-09-23 at 11:31 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Since we know the predicted size of the sort set prior to starting the
> > sort node, could we not use that information to allocate memory
> > appropriately? i.e. if sort size is predicted to be more than twice the
> > size of work_mem, then just move straight to the external sort algorithm
> > and set the work_mem down at the lower limit?
>
> Have you actually read the sort code?

Yes, and Knuth too. Your research and code are incredible, almost
untouchable. Yet sort performance is important, and empirical evidence
suggests that it can be improved upon significantly, so I am, and will be,
spending time trying to improve it. Another time...

This thread was aimed at addressing a problem I saw with 8.1's ability to
use very large work_mem settings. I felt that either my performance numbers
were wrong or we needed to do something; nobody has yet shown me
performance numbers that prove mine doubtful.

> During the run-forming phase it's definitely useful to eat all the
> memory you can: that translates directly to longer initial runs and
> hence fewer merge passes.

Sounds good, but maybe that is not the dominant effect. I'll retest, on
the assumption that there is a benefit, but there's something wrong with
my earlier tests.

> During the run-merging phase it's possible
> that using less memory would not hurt performance any, but as already
> stated, I don't think it will actually end up cutting the backend's
> memory footprint --- the sbrk point will be established during the run
> forming phase and it's unlikely to move back much until transaction end.

> Also, if I recall the development of that code correctly, the reason for
> using more than minimum memory during the merge phase is that writing or
> reading lots of tuples at once improves sequentiality of access to the
> temp files.  So I'm not sure that cutting down the memory wouldn't hurt
> performance.

Cutting memory below about 16 MB definitely does hurt external sort
performance; I attribute that to the effect on sequential access. I
haven't tried to nail down the breakpoint exactly, since it seemed more
important simply to say that there appears to be one. It's just that
raising the setting above that mark doesn't help much, according to my
current results.

I'll get some more test results and repost them, next week. I will be
very happy if the results show that more memory helps.

Best Regards, Simon Riggs