Thread: Memory Accounting
Previous discussion: https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop This patch introduces a way to ask a memory context how much memory it currently has allocated. Each time a new block (not an individual chunk, but a new malloc'd block) is allocated, it increments a struct member; and each time a block is free'd, it decrements the struct member. So it tracks blocks allocated by malloc, not what is actually used for chunks allocated by palloc. The purpose is for Memory Bounded Hash Aggregation, but it may be useful in more cases as we try to better adapt to and control memory usage at execution time. I ran the same tests as Robert did before[1] on my laptop[2]. The only difference is that I also set max_parallel_workers[_per_gather]=0 to be sure. I did 5 runs, alternating between memory-accounting and master, and I got the following results for "elapsed" (as reported by trace_sort):

regression=# select version, min(s), max(s), percentile_disc(0.5) within group (order by s) as median, avg(s)::numeric(10,2) from tmp group by version;
      version      |  min  |  max  | median |  avg
-------------------+-------+-------+--------+-------
 master            | 13.92 | 14.40 | 14.06  | 14.12
 memory accounting | 13.43 | 14.46 | 14.11  | 14.09
(2 rows)

So, I don't see any significant performance impact for this patch in this test. That may be because:
* It was never really significant except on PPC64.
* I changed it to only update mem_allocated for the current context, not recursively for all parent contexts. It's now up to the function that reports memory usage to recurse or not (as needed).
* A lot of changes to sort have happened since that time, so perhaps it's not a good test of memory contexts any more.
pgbench didn't show any slowdown either. I also did another test with hash aggregation that uses significant memory (t10m is a table with 10M distinct values and work_mem is 1GB):

postgres=# select (select (i, count(*)) from t10m group by i having count(*) > n) from (values(1),(2),(3),(4),(5)) as s(n);

I didn't see any noticeable difference there, either. Regards, Jeff Davis [1] https://postgr.es/m/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com [2] Linux jdavis 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
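[In code terms, the bookkeeping described above amounts to roughly the following. This is a minimal standalone sketch with illustrative names, not the actual aset.c code; the real counter lives in MemoryContextData (memnodes.h) and is updated wherever the allocator implementations malloc or free blocks.]

#include <stdlib.h>
#include <stddef.h>

/* minimal stand-in for the real MemoryContextData, which has many more fields */
typedef struct MemoryContextData
{
    size_t      mem_allocated;  /* bytes currently malloc'd for this context */
} MemoryContextData;

/* every new malloc'd block bumps the counter by the block size ... */
static void *
context_block_alloc(MemoryContextData *context, size_t blksize)
{
    void       *block = malloc(blksize);

    if (block != NULL)
        context->mem_allocated += blksize;
    return block;
}

/* ... and every free'd block decrements it by the same amount */
static void
context_block_free(MemoryContextData *context, void *block, size_t blksize)
{
    context->mem_allocated -= blksize;
    free(block);
}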
On 18/07/2019 21:24, Jeff Davis wrote: > Previous discussion: > https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop > > This patch introduces a way to ask a memory context how much memory it > currently has allocated. Each time a new block (not an individual > chunk, but a new malloc'd block) is allocated, it increments a struct > member; and each time a block is free'd, it decrements the struct > member. So it tracks blocks allocated by malloc, not what is actually > used for chunks allocated by palloc. > > The purpose is for Memory Bounded Hash Aggregation, but it may be > useful in more cases as we try to better adapt to and control memory > usage at execution time. Seems handy. > * I changed it to only update mem_allocated for the current context, > not recursively for all parent contexts. It's now up to the function > that reports memory usage to recurse or not (as needed). Is that OK for memory bounded hash aggregation? Might there be a lot of sub-contexts during hash aggregation? - Heikki
On Mon, Jul 22, 2019 at 11:30:37AM +0300, Heikki Linnakangas wrote: >On 18/07/2019 21:24, Jeff Davis wrote: >>Previous discussion: >>https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop >> >>This patch introduces a way to ask a memory context how much memory it >>currently has allocated. Each time a new block (not an individual >>chunk, but a new malloc'd block) is allocated, it increments a struct >>member; and each time a block is free'd, it decrements the struct >>member. So it tracks blocks allocated by malloc, not what is actually >>used for chunks allocated by palloc. >> >>The purpose is for Memory Bounded Hash Aggregation, but it may be >>useful in more cases as we try to better adapt to and control memory >>usage at execution time. > >Seems handy. > Indeed. >>* I changed it to only update mem_allocated for the current context, >>not recursively for all parent contexts. It's now up to the function >>that reports memory usage to recurse or not (as needed). > >Is that OK for memory bounded hash aggregation? Might there be a lot >of sub-contexts during hash aggregation? > There shouldn't be, at least not since b419865a814abbc. There might be cases where custom aggregates still do that, but I think that's simply a design we should discourage. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 2019-07-22 at 18:16 +0200, Tomas Vondra wrote: > > > * I changed it to only update mem_allocated for the current context, > > > not recursively for all parent contexts. It's now up to the function > > > that reports memory usage to recurse or not (as needed). > > Is that OK for memory bounded hash aggregation? Might there be a lot > > of sub-contexts during hash aggregation? > There shouldn't be, at least not since b419865a814abbc. There might be cases where custom aggregates still do that, but I think that's simply a design we should discourage. Right, I don't think memory-context-per-group is something we should optimize for. Discussion: https://www.postgresql.org/message-id/3839201.Nfa2RvcheX%40techfox.foxi https://www.postgresql.org/message-id/5334D7A5.2000907%40fuzzy.cz Commit link: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1 Regards, Jeff Davis
On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis <pgsql@j-davis.com> wrote:
Previous discussion:
https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop
This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.
Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc and then later, were we to try and check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?
I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.
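[Melanie's scenario, sketched as standalone code with illustrative names (not aset.c itself): a palloc served from the context's chunk freelist never reaches malloc, so the block-level counter stays put even while live chunk usage grows.]

#include <stdlib.h>
#include <stddef.h>

typedef struct FreeChunk
{
    struct FreeChunk *next;
} FreeChunk;

typedef struct Context
{
    size_t      mem_allocated;  /* block-level counter, as in the patch */
    FreeChunk  *freelist;       /* chunks recycled by earlier pfree()s */
} Context;

static void *
chunk_alloc(Context *cxt, size_t chunksize)
{
    if (cxt->freelist != NULL)
    {
        /* served from the freelist: no malloc happens, so the block-level
         * counter does not move even though live chunk usage just grew */
        void       *chunk = cxt->freelist;

        cxt->freelist = cxt->freelist->next;
        return chunk;
    }

    /* only when a fresh block must be malloc'd does the counter move */
    cxt->mem_allocated += chunksize;
    return malloc(chunksize);
}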
The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.
This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?
--
Melanie Plageman
On Tue, Jul 23, 2019 at 06:18:26PM -0700, Melanie Plageman wrote: >On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis <pgsql@j-davis.com> wrote: > >> Previous discussion: >> https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop >> >> This patch introduces a way to ask a memory context how much memory it >> currently has allocated. Each time a new block (not an individual >> chunk, but a new malloc'd block) is allocated, it increments a struct >> member; and each time a block is free'd, it decrements the struct >> member. So it tracks blocks allocated by malloc, not what is actually >> used for chunks allocated by palloc. >> >> >Cool! I like how straight-forward this approach is. It seems easy to >build on, as well. > >Are there cases where we are likely to palloc a lot without needing to >malloc in a certain memory context? For example, do we have a pattern >where, for some kind of memory intensive operator, we might palloc in >a per tuple context and consistently get chunks without having to >malloc and then later, were we to try and check the bytes allocated >for one of these per tuple contexts to decide on some behavior, the >number would not be representative? > I think there's plenty of places where we quickly get into a state with enough chunks in the freelist - the per-tuple contexts are a good example of that, I think. >I think that is basically what Heikki is asking about with HashAgg, >but I wondered if there were other cases that you had already thought >through where this might happen. > I think Heikki was asking about places with a lot of sub-contexts, which is a completely different issue. It used to be the case that some aggregates created a separate context for each group - like array_agg. That would make Jeff's approach to accounting rather inefficient, because checking how much memory is used would be very expensive (having to loop over a large number of contexts). > >> The purpose is for Memory Bounded Hash Aggregation, but it may be >> useful in more cases as we try to better adapt to and control memory >> usage at execution time. >> >> >This approach seems like it would be good for memory intensive >operators which use a large, representative context. I think the >HashTableContext for HashJoin might be one? > Yes, that might be a good candidate (and it would be much simpler than the manual accounting we use now). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
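[The "recurse or not" decision lives in the reporting function. A sketch of what that report-time walk looks like; the function committed with the patch is MemoryContextMemAllocated() in mcxt.c, and the struct here is pared down to just the fields the walk needs.]

#include <stdbool.h>
#include <stddef.h>

typedef struct MemoryContextData
{
    size_t      mem_allocated;
    struct MemoryContextData *firstchild;   /* head of child list */
    struct MemoryContextData *nextchild;    /* next sibling */
} MemoryContextData;

/* lazy accounting: allocation touches only the local counter, and a
 * caller wanting a subtree total walks the children at report time */
static size_t
context_mem_allocated(MemoryContextData *context, bool recurse)
{
    size_t      total = context->mem_allocated;

    if (recurse)
    {
        MemoryContextData *child;

        for (child = context->firstchild; child != NULL;
             child = child->nextchild)
            total += context_mem_allocated(child, true);
    }
    return total;
}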
I wanted to address a couple of questions Jeff asked me off-list about
Greenplum's implementations of Memory Accounting.
Greenplum has two memory accounting sub-systems -- one is the
MemoryContext-based system proposed here.
The other memory accounting system tracks "logical" memory owners in
global accounts. For example, every node in a plan has an account,
however, there are other execution contexts, such as the parser, which
have their own logical memory accounts.
Notably, this logical memory accounting system also tracks chunks instead
of blocks.
The rationale for tracking memory at the logical owner level was that
a logical owner may allocate memory across multiple
contexts, and a single context may contain memory belonging to several
of these logical owners.
More compellingly, many of the allocations done during execution are
done directly in the per query or per tuple context--as opposed to
being done in their own uniquely named context. Arguably, this is
because those logical owners (a Result node, for example) are not
memory-intensive and thus do not require specific memory accounting.
However, when debugging a memory leak or OOM, the specificity of
logical owner accounts was seen as desirable. A discrepancy between
memory allocated and memory freed in the per query context doesn't
provide a lot of hints as to the source of the leak.
At the least, there was no meaningful way to represent MemoryContext
account balances in EXPLAIN ANALYZE. Where would the TopMemoryContext
be represented, for example?
Also, by using logical accounts, each node in the plan could be
assigned a quota at plan time--because many memory intensive operators
will not have relinquished the memory they hold when other nodes are
executing (e.g. Materialize)--so, instead of granting each node
work_mem, work_mem is divided up into quotas for each operator in a
particular way. This was meant to pave the way for work_mem
enforcement. This is a topic that has come up in various ways in other
forums. For example, in the XPRS thread, the discussion of erroring
out for queries with no "escape mechanism" brought up by Thomas Munro
[1] is a kind of work_mem enforcement (this discussion was focused
more on a proposed session-level memory setting, but it is still
enforcement of a memory setting).
It was also discussed at PGCon this year in an unconference session on
OOM-detection and debugging, runaway query termination, and
session-level memory consumption tracking [2].
The motivation for tracking chunks instead of blocks was to understand
the "actual" memory consumption of different components in the
database. Then, eventually, memory consumption patterns would emerge
and improvements could be made to memory allocation strategies to suit
different use cases--perhaps other implementations of the
MemoryContext API similar to Slab and Generation were envisioned.
Apparently, it did lead to the discovery of some memory fragmentation
issues which were tuned.
I bring these up not just to answer Jeff's question but also to
provide some anecdotal evidence that the patch here is a good base for
other memory accounting and tracking schemes.
Even if HashAgg is the only initial consumer of the memory accounting
framework, we know that tuplesort can make use of it in its current
state as well. And, if another node or component requires
chunk-tracking, it could provide a different MemoryContext API
implementation that updates the MemoryContextData->mem_allocated field
per chunk instead of per block in its alloc/free functions.
Ideas like logical memory accounting could leverage the mem_allocated
field and build on top of it.
[1] https://www.postgresql.org/message-id/CA%2BhUKGJEMT7SSZRqt-knu_3iLkdscBCe9M2nrhC259FdE5bX7g%40mail.gmail.com
[2] https://wiki.postgresql.org/wiki/PgCon_2019_Developer_Unconference
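[A sketch of that chunk-tracking idea. This is entirely hypothetical, no such implementation exists in the tree; the point is only that the same mem_allocated field could be charged per chunk handed to callers rather than per malloc'd block.]

#include <stdlib.h>
#include <stddef.h>

typedef struct ChunkContext
{
    size_t      mem_allocated;  /* here: sum of live chunk sizes */
} ChunkContext;

typedef struct ChunkHeader
{
    ChunkContext *context;      /* owning context, for the free path */
    size_t      size;           /* user-visible chunk size */
} ChunkHeader;

static void *
chunk_context_alloc(ChunkContext *context, size_t size)
{
    ChunkHeader *hdr = malloc(sizeof(ChunkHeader) + size);

    if (hdr == NULL)
        return NULL;
    hdr->context = context;
    hdr->size = size;
    context->mem_allocated += size;     /* charge the chunk, not the block */
    return hdr + 1;
}

static void
chunk_context_free(void *ptr)
{
    ChunkHeader *hdr = (ChunkHeader *) ptr - 1;

    hdr->context->mem_allocated -= hdr->size;
    free(hdr);
}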
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote: > Hi Jeff, Hi Soumyadeep and Melanie, Thank you for the review!

> max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
> 2MB              82         302.715    427.554     1.4123978
> 3MB              3474       567.829    896.143     1.578191674
> 7.67MB           8694       2657.972   4903.063    1.844663149

Thank you for collecting data on this. Were you able to find any regression when compared to no memory accounting at all? It looks like you agree with the approach and the results. Did you find any other issues with the patch? I am also including Robert in this thread. He had some concerns the last time around due to a small regression on POWER. Regards, Jeff Davis
On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote: > I think Heikki was asking about places with a lot of sub-contexts, which is a > completely different issue. It used to be the case that some aggregates > created a separate context for each group - like array_agg. That would > make Jeff's approach to accounting rather inefficient, because checking > how much memory is used would be very expensive (having to loop over a > large number of contexts). The patch has been marked as ready for committer for a week or so, but it seems to me that this comment has not been addressed, no? Are we sure that we want this method if it proves to be inefficient when there are many sub-contexts, and shouldn't we at least test such scenarios with a worst-case, custom-made function? -- Michael
On Tue, Sep 24, 2019 at 02:21:40PM +0900, Michael Paquier wrote: >On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote: >> I think Heikki was asking about places with a lot of sub-contexts, which is a >> completely different issue. It used to be the case that some aggregates >> created a separate context for each group - like array_agg. That would >> make Jeff's approach to accounting rather inefficient, because checking >> how much memory is used would be very expensive (having to loop over a >> large number of contexts). > >The patch has been marked as ready for committer for a week or so, but >it seems to me that this comment has not been addressed, no? Are we >sure that we want this method if it proves to be inefficient when >there are many sub-contexts and shouldn't we at least test such >scenarios with a worst-case, custom-made function? I don't think so. Aggregates creating many memory contexts (context for each group) was discussed extensively in the thread about v11 [1] in 2015. And back then the conclusion was that that's a pretty awful pattern anyway, as it uses much more memory (no cross-context freelists), and has various other issues. In a way, those aggregates are wrong and should be fixed just like we fixed array_agg/string_agg (even without the memory accounting). The way I see it we can do either eager or lazy accounting. Eager might work better for aggregates with many contexts, but it does increase the overhead for the "regular" aggregates with just one or two contexts. Considering how rare those many-context aggregates are (I'm not aware of any such aggregate at the moment), it seems reasonable to pick the lazy accounting. (Note: Another factor affecting the lazy vs. eager efficiency is the number of palloc/pfree calls vs. calls to determine amount of mem used, but that's mostly orthogonal and we can ignore it here). So I think the approach Jeff ended up with is sensible - certainly as a first step. We may improve it in the future, of course, once we have more practical experience. Barring objections, I do plan to get this committed by the end of this CF (i.e. sometime later this week). [1] https://www.postgresql.org/message-id/1434311039.4369.39.camel%40jeff-desktop -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
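[The eager/lazy trade-off described above, reduced to code. An illustrative sketch, not the patch itself: eager makes reads O(1) but charges every underlying malloc O(tree depth); lazy touches exactly one counter per allocation and pushes the cost to report time, as sketched earlier in the thread.]

#include <stddef.h>

typedef struct MemoryContextData
{
    size_t      mem_allocated;
    struct MemoryContextData *parent;
} MemoryContextData;

/* eager: every block allocation charges the whole parent chain */
static void
account_block_eager(MemoryContextData *context, size_t blksize)
{
    for (; context != NULL; context = context->parent)
        context->mem_allocated += blksize;
}

/* lazy (the committed approach): allocation touches one counter;
 * a reader wanting a subtree total recurses over children instead */
static void
account_block_lazy(MemoryContextData *context, size_t blksize)
{
    context->mem_allocated += blksize;
}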
On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
> Hi Jeff,
Hi Soumyadeep and Melanie,
Thank you for the review!
> max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
> 2MB 82 302.715 427.554 1.4123978
> 3MB 3474 567.829 896.143 1.578191674
> 7.67MB 8694 2657.972 4903.063 1.844663149
Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?
We didn't spend much time comparing performance with and without
memory accounting, as it seems like this was discussed extensively in
the previous thread.
It looks like you agree with the approach and the results. Did you find
any other issues with the patch?
We didn't observe any other problems with the patch and agree with the
approach. It is a good start.
I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.
I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).
Thanks,
Soumyadeep & Melanie
On Tue, Sep 24, 2019 at 02:05:51PM +0200, Tomas Vondra wrote: > The way I see it we can do either eager or lazy accounting. Eager might > work better for aggregates with many contexts, but it does increase the > overhead for the "regular" aggregates with just one or two contexts. > Considering how rare those many-context aggregates are (I'm not aware of > any such aggregate at the moment), it seems reasonable to pick the lazy > accounting. Okay. > So I think the approach Jeff ended up with is sensible - certainly as a > first step. We may improve it in the future, of course, once we have > more practical experience. > > Barring objections, I do plan to get this committed by the end of this > CF (i.e. sometime later this week). Sounds good to me. Though I have not looked at the patch in detail, the arguments are sensible. Thanks for confirming. -- Michael
On Tue, Sep 24, 2019 at 11:46:49AM -0700, Melanie Plageman wrote: >On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote: > >> On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote: >> > Hi Jeff, >> >> Hi Soumyadeep and Melanie, >> >> Thank you for the review! >>

>> > max_stack_depth  max level  lazy (ms)  eager (ms)  (eager/lazy)
>> > 2MB              82         302.715    427.554     1.4123978
>> > 3MB              3474       567.829    896.143     1.578191674
>> > 7.67MB           8694       2657.972   4903.063    1.844663149

>> Thank you for collecting data on this. Were you able to find any >> regression when compared to no memory accounting at all? >> >> >We didn't spend much time comparing performance with and without >memory accounting, as it seems like this was discussed extensively in >the previous thread. > > >> It looks like you agree with the approach and the results. Did you find >> any other issues with the patch? >> > >We didn't observe any other problems with the patch and agree with the >approach. It is a good start. > > >> >> I am also including Robert in this thread. He had some concerns the >> last time around due to a small regression on POWER. >> > >I think it would be helpful if we could repeat the performance tests >Robert did on that machine with the current patch (unless this version >of the patch is exactly the same as the ones he tested previously). > I agree that would be nice, but I don't have access to any Power machine suitable for this kind of benchmarking :-( Robert, any chance you still have access to that machine? It's worth mentioning that those benchmarks (I'm assuming we're talking about the numbers Robert shared in [1]) were done on patches that used the eager accounting approach (i.e. walking all parent contexts and updating the accounting for them). I'm pretty sure the current "lazy accounting" patches don't have that issue, so unless someone objects and/or can show numbers demonstrating I'm wrong I'll stick to my plan to get this committed soon. regards [1] https://www.postgresql.org/message-id/flat/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com#3e2dc9e70a9f9eb2d695ab94a580c5a2 -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote: > It's worth mentioning that those benchmarks (I'm assuming we're > talking > about the numbers Robert shared in [1]) were done on patches that used > the eager accounting approach (i.e. walking all parent contexts and > updating the accounting for them). > > I'm pretty sure the current "lazy accounting" patches don't have that > issue, so unless someone objects and/or can show numbers > demonstrating > I'm wrong I'll stick to my plan to get this committed soon. That was my conclusion, as well. Regards, Jeff Davis
On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote: >On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote: >>It's worth mentioning that those benchmarks (I'm assuming we're >>talking >>about the numbers Robert shared in [1]) were done on patches that used >>the eager accounting approach (i.e. walking all parent contexts and >>updating the accounting for them). >> >>I'm pretty sure the current "lazy accounting" patches don't have that >>issue, so unless someone objects and/or can show numbers >>demonstrating >>I'm wrong I'll stick to my plan to get this committed soon. > >That was my conclusion, as well. > I was about to commit the patch, but during the final review I've noticed two places that I think are bugs:

1) aset.c (AllocSetDelete)
--------------------------

#ifdef CLOBBER_FREED_MEMORY
    wipe_mem(block, block->freeptr - ((char *) block));
#endif

    if (block != set->keeper)
    {
        context->mem_allocated -= block->endptr - ((char *) block);
        free(block);
    }

2) generation.c (GenerationReset)
---------------------------------

#ifdef CLOBBER_FREED_MEMORY
    wipe_mem(block, block->blksize);
#endif
    context->mem_allocated -= block->blksize;

Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls wipe_mem and then accesses fields of the (wiped) block. Interestingly enough, the regression tests don't seem to exercise these bits - I've tried adding elog(ERROR) and it still passes. For (2) that's not very surprising because Generation context is only really used in logical decoding (and we don't delete the context I think). Not sure about (1) but it might be because AllocSetReset does the right thing and only leaves behind the keeper block. I'm pretty sure a custom function calling the contexts explicitly would fall over, but I haven't tried. Aside from that, I've repeated the REINDEX benchmarks done by Robert in [1] with different scales on two different machines, and I've measured no difference. Both machines are x86_64, I don't have access to any Power machine at the moment, unfortunately. [1] https://www.postgresql.org/message-id/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Sep 29, 2019 at 12:12:49AM +0200, Tomas Vondra wrote: >On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote: >>On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote: >>>It's worth mentioning that those benchmarks (I'm assuming we're >>>talking >>>about the numbers Robert shared in [1]) were done on patches that used >>>the eager accounting approach (i.e. walking all parent contexts and >>>updating the accounting for them). >>> >>>I'm pretty sure the current "lazy accounting" patches don't have that >>>issue, so unless someone objects and/or can show numbers >>>demonstrating >>>I'm wrong I'll stick to my plan to get this committed soon. >> >>That was my conclusion, as well. >> > >I was about to commit the patch, but during the final review I've >noticed two places that I think are bugs: > >1) aset.c (AllocSetDelete) >-------------------------- > >#ifdef CLOBBER_FREED_MEMORY > wipe_mem(block, block->freeptr - ((char *) block)); >#endif > > if (block != set->keeper) > { > context->mem_allocated -= block->endptr - ((char *) block); > free(block); > } > >2) generation.c (GenerationReset) >--------------------------------- > >#ifdef CLOBBER_FREED_MEMORY > wipe_mem(block, block->blksize); >#endif > context->mem_allocated -= block->blksize; > > >Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls >wipe_mem and then accesses fields of the (wiped) block. Interestingly >enough, the regression tests don't seem to exercise these bits - I've >tried adding elog(ERROR) and it still passes. For (2) that's not very >surprising because Generation context is only really used in logical >decoding (and we don't delete the context I think). Not sure about (1) >but it might be because AllocSetReset does the right thing and only >leaves behind the keeper block. > >I'm pretty sure a custom function calling the contexts explicitly would >fall over, but I haven't tried. > Oh, and one more thing - this probably needs to add at least some basic explanation of the accounting to src/backend/mmgr/README. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2019-09-29 at 00:22 +0200, Tomas Vondra wrote: > Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls > wipe_mem and then accesses fields of the (wiped) block. Interestingly > enough, the regression tests don't seem to exercise these bits - I've > tried adding elog(ERROR) and it still passes. For (2) that's not very > surprising because Generation context is only really used in logical > decoding (and we don't delete the context I think). Not sure about (1) > but it might be because AllocSetReset does the right thing and only > leaves behind the keeper block. > > I'm pretty sure a custom function calling the contexts explicitly would > fall over, but I haven't tried. Fixed. I tested with some custom use of memory contexts. The reason AllocSetDelete() didn't fail before is that most memory contexts use the free lists (the list of free memory contexts, not the free list of chunks), so you need to specify a non-default minsize in order to prevent that and trigger the bug. AllocSetReset() worked, but it was reading the header of the keeper block after wiping the contents of the keeper block. It technically worked, because the header of the keeper block was not wiped, but it seems more clear to explicitly save the size of the keeper block. In AllocSetDelete(), saving the keeper size is required, because it wipes the block headers in addition to the contents. > Oh, and one more thing - this probably needs to add at least some basic > explanation of the accounting to src/backend/mmgr/README. Added. Regards, Jeff Davis
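[The ordering bug and the "save the size first" fix, reduced to a standalone illustration. These are simplified stand-ins, not the committed diff; memset plays the role of wipe_mem().]

#include <stdlib.h>
#include <stddef.h>
#include <string.h>

typedef struct Block
{
    char       *freeptr;
    char       *endptr;
} Block;

/* use-after-wipe, as in the pre-fix code: the header is read after it
 * has already been overwritten */
static size_t
block_delete_buggy(Block *block)
{
    size_t      size;

    memset(block, 0x7F, sizeof(Block));         /* wipe_mem() stand-in */
    size = block->endptr - (char *) block;      /* reads clobbered bytes */
    free(block);
    return size;
}

/* the fix: capture the size before wiping anything */
static size_t
block_delete_fixed(Block *block)
{
    size_t      size = block->endptr - (char *) block;

    memset(block, 0x7F, sizeof(Block));
    free(block);
    return size;
}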
On Mon, Sep 30, 2019 at 01:34:13PM -0700, Jeff Davis wrote: >On Sun, 2019-09-29 at 00:22 +0200, Tomas Vondra wrote: >> Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls >> wipe_mem and then accesses fields of the (wiped) block. Interestingly >> enough, the regression tests don't seem to exercise these bits - I've >> tried adding elog(ERROR) and it still passes. For (2) that's not very >> surprising because Generation context is only really used in logical >> decoding (and we don't delete the context I think). Not sure about (1) >> but it might be because AllocSetReset does the right thing and only >> leaves behind the keeper block. >> >> I'm pretty sure a custom function calling the contexts explicitly would >> fall over, but I haven't tried. > >Fixed. > >I tested with some custom use of memory contexts. The reason >AllocSetDelete() didn't fail before is that most memory contexts use >the free lists (the list of free memory contexts, not the free list of >chunks), so you need to specify a non-default minsize in order to >prevent that and trigger the bug. > >AllocSetReset() worked, but it was reading the header of the keeper >block after wiping the contents of the keeper block. It technically >worked, because the header of the keeper block was not wiped, but it >seems more clear to explicitly save the size of the keeper block. In >AllocSetDelete(), saving the keeper size is required, because it wipes >the block headers in addition to the contents. > OK, looks good to me now. >> Oh, and one more thing - this probably needs to add at least some >> basic >> explanation of the accounting to src/backend/mmgr/README. > >Added. > Well, I meant a couple of paragraphs explaining the motivation, and relevant trade-offs considered. So I've written a brief summary of the design as I understand it and pushed it like that. Of course, if you could proof-read it, that'd be good. I had a bit of a hard time deciding who to list as a reviewer - this patch started sometime in ~2015, and it was initially discussed as part of the larger hashagg effort, with plenty of people discussing various ancient versions of the patch. In the end I've included just people from the current thread. If that omits important past reviews, I'm sorry. For the record, results of the benchmarks I've done over the past couple of days are in [1]. It includes both the reindex benchmark used by Robert in 2015, and a regular read-only pgbench. In general, the overhead of the accounting is pretty much indistinguishable from noise (at least on those two machines). In any case, thanks for the perseverance working on this. [1] https://github.com/tvondra/memory-accounting-benchmarks regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So ... why exactly did this patch define MemoryContextData.mem_allocated as int64? That seems to me to be doubly wrong: it is not the right width on 32-bit machines, and it is not the right signedness anywhere. I think that field ought to be of type Size (a/k/a size_t, but memnodes.h always calls it Size). I let this pass when the patch went in, but now I'm on the warpath about it, because since c477f3e449 went in, some of the 32-bit buildfarm members are failing with 2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG: statement: CREATE INDEX text_idx on test__int using gist (a gist__int_ops ); TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533) 2019-10-04 00:42:25.505 CEST [63836:11] LOG: server process (PID 66916) was terminated by signal 6: Abort trap 2019-10-04 00:42:25.505 CEST [63836:12] DETAIL: Failed process was running: CREATE INDEX text_idx on test__int using gist( a gist__int_ops ); What I think is happening is that c477f3e449 allowed this bit in AllocSetRealloc: context->mem_allocated += blksize - oldblksize; to be executed in situations where blksize < oldblksize, where before that was not possible. Of course blksize and oldblksize are of type Size, hence unsigned, so the subtraction result underflows in this case. If mem_allocated is of the same width as Size then this does not matter because the final result wraps around to the proper value, but if we're going to allow mem_allocated to be wider than Size then we will need more logic here to add or subtract as appropriate. (I'm not quite sure why we're not seeing this failure on *all* the 32-bit machines; maybe there's some other factor involved?) I see no value in defining mem_allocated to be wider than Size. Yes, the C standard contemplates the possibility that the total available address space is larger than the largest chunk you can ever ask malloc for; but nobody has built a platform like that in this century, and they sucked to program on back in the dark ages when they did exist. (I speak from experience.) I do not think we need to design Postgres to allow for that. Likewise, there's no evident value in allowing mem_allocated to be negative. I haven't chased down exactly what else would need to change. It might be that s/int64/Size/g throughout the patch is sufficient, but I haven't analyzed it. regards, tom lane
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote: >So ... why exactly did this patch define MemoryContextData.mem_allocated >as int64? That seems to me to be doubly wrong: it is not the right width >on 32-bit machines, and it is not the right signedness anywhere. I think >that field ought to be of type Size (a/k/a size_t, but memnodes.h always >calls it Size). > Yeah, I think that's an oversight. Maybe there's a reason why Jeff used int64, but I can't think of any. >I let this pass when the patch went in, but now I'm on the warpath >about it, because since c477f3e449 went in, some of the 32-bit buildfarm >members are failing with > >2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG: statement: CREATE INDEX text_idx on test__int using gist (a gist__int_ops ); >TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533) >2019-10-04 00:42:25.505 CEST [63836:11] LOG: server process (PID 66916) was terminated by signal 6: Abort trap >2019-10-04 00:42:25.505 CEST [63836:12] DETAIL: Failed process was running: CREATE INDEX text_idx on test__int using gist( a gist__int_ops ); > >What I think is happening is that c477f3e449 allowed this bit in >AllocSetRealloc: > > context->mem_allocated += blksize - oldblksize; > >to be executed in situations where blksize < oldblksize, where before >that was not possible. Of course blksize and oldblksize are of type >Size, hence unsigned, so the subtraction result underflows in this >case. If mem_allocated is of the same width as Size then this does >not matter because the final result wraps around to the proper value, >but if we're going to allow mem_allocated to be wider than Size then >we will need more logic here to add or subtract as appropriate. > >(I'm not quite sure why we're not seeing this failure on *all* the >32-bit machines; maybe there's some other factor involved?) > Interesting failure mode (especially that it does *not* fail on some 32-bit machines). >I see no value in defining mem_allocated to be wider than Size. >Yes, the C standard contemplates the possibility that the total >available address space is larger than the largest chunk you can >ever ask malloc for; but nobody has built a platform like that in >this century, and they sucked to program on back in the dark ages >when they did exist. (I speak from experience.) I do not think >we need to design Postgres to allow for that. > >Likewise, there's no evident value in allowing mem_allocated >to be negative. > >I haven't chased down exactly what else would need to change. >It might be that s/int64/Size/g throughout the patch is >sufficient, but I haven't analyzed it. > I think so too, but I'll take a closer look in the afternoon, unless you beat me to it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 04, 2019 at 10:26:44AM +0200, Tomas Vondra wrote: >On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote: >>I haven't chased down exactly what else would need to change. >>It might be that s/int64/Size/g throughout the patch is >>sufficient, but I haven't analyzed it. >> > >I think so too, but I'll take a closer look in the afternoon, unless you >beat me to it. > I've pushed a fix changing the type to Size and splitting the mem_allocated change into two separate updates (to prevent any underflows in the subtraction). Hopefully this fixes the 32-bit machines ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
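[The shape of that fix, sketched. Illustrative only; the committed code may differ in detail. With unsigned Size operands, "blksize - oldblksize" wraps around when a realloc shrinks the block, which is harmless if the counter has the same width as Size but corrupts a wider one; two separate updates never compute a bogus intermediate difference.]

#include <stddef.h>

static void
account_realloc(size_t *mem_allocated, size_t oldblksize, size_t blksize)
{
    /* instead of: *mem_allocated += blksize - oldblksize; */
    *mem_allocated -= oldblksize;
    *mem_allocated += blksize;
}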
On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote: > On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote: > > So ... why exactly did this patch define > > MemoryContextData.mem_allocated > > as int64? That seems to me to be doubly wrong: it is not the right > > width > > on 32-bit machines, and it is not the right signedness anywhere. I > > think > > that field ought to be of type Size (a/k/a size_t, but memnodes.h > > always > > calls it Size). > > > > Yeah, I think that's an oversight. Maybe there's a reason why Jeff > used > int64, but I can't think of any. I had chosen a 64-bit value to account for the situation Tom mentioned: that, in theory, Size might not be large enough to represent all allocations in a memory context. Apparently, that theoretical situation is not worth being concerned about. The patch has been floating around for a very long time, so I don't remember exactly why I chose a signed value. Sorry. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote: >> Yeah, I think that's an oversight. Maybe there's a reason why Jeff >> used int64, but I can't think of any. > I had chosen a 64-bit value to account for the situation Tom mentioned: > that, in theory, Size might not be large enough to represent all > allocations in a memory context. Apparently, that theoretical situation > is not worth being concerned about. Well, you could also argue it the other way: maybe in our children's time, int64 won't be as wide as Size. (Yeah, I know that sounds ridiculous, but needing pointers wider than 32 bits was a ridiculous idea too when I started in this business.) The committed fix seems OK to me except that I think you should've also changed MemoryContextMemAllocated() to return Size. regards, tom lane
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote: >> What I think is happening is that c477f3e449 allowed this bit in >> AllocSetRealloc: >> context->mem_allocated += blksize - oldblksize; >> to be executed in situations where blksize < oldblksize, where before >> that was not possible. >> ... >> (I'm not quite sure why we're not seeing this failure on *all* the >> 32-bit machines; maybe there's some other factor involved?) > Interesting failure mode (especially that it does *not* fail on some > 32-bit machines). Just to make things even more mysterious, prairiedog finally showed the Assert failure on its fourth run with c477f3e449 included: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41 It's also now apparent that lapwing and locust were failing only sometimes, as well. I totally don't understand why that failure would've been only partially reproducible. Maybe we should dig a bit harder, rather than just deciding that we fixed it. regards, tom lane
I wrote: > Just to make things even more mysterious, prairiedog finally showed > the Assert failure on its fourth run with c477f3e449 included: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41 > It's also now apparent that lapwing and locust were failing only > sometimes, as well. I totally don't understand why that failure > would've been only partially reproducible. Maybe we should dig > a bit harder, rather than just deciding that we fixed it. I did that digging, reproducing the problem on florican's host (again intermittently). Here's a stack trace from the spot where we sometimes downsize a large chunk: #5 0x0851c70a in AllocSetRealloc (context=0x31b35000, pointer=0x319e5020, size=1224) at aset.c:1158 #6 0x085232eb in repalloc (pointer=0x319e5020, size=1224) at mcxt.c:1082 #7 0x31b69591 in resize_intArrayType (num=300, a=0x319e5020) at _int_tool.c:268 #8 resize_intArrayType (a=0x319e5020, num=300) at _int_tool.c:250 #9 0x31b6995d in _int_unique (r=0x319e5020) at _int_tool.c:329 #10 0x31b66a00 in g_int_union (fcinfo=0xffbfcc5c) at _int_gist.c:146 #11 0x084ff9c4 in FunctionCall2Coll (flinfo=0x319e2bb4, collation=100, arg1=834250780, arg2=4290759884) at fmgr.c:1162 #12 0x080db3a3 in gistMakeUnionItVec (giststate=0x319e2820, itvec=0x31bae4a4, len=15, attr=0xffbfce5c, isnull=0xffbfcedc) at gistutil.c:204 #13 0x080e410d in gistunionsubkeyvec (giststate=giststate@entry=0x319e2820, itvec=itvec@entry=0x31bb5ed4, gsvp=gsvp@entry=0xffbfcd4c) at gistsplit.c:64 #14 0x080e416f in gistunionsubkey (giststate=giststate@entry=0x319e2820, itvec=itvec@entry=0x31bb5ed4, spl=spl@entry=0xffbfce3c) at gistsplit.c:91 #15 0x080e4456 in gistSplitByKey (r=<optimized out>, page=<optimized out>, itup=<optimized out>, len=<optimized out>, giststate=<optimized out>, v=<optimized out>, attno=<optimized out>) at gistsplit.c:689 #16 0x080d8797 in gistSplit (r=0x31bbb424, page=0x297e0b80 "", itup=0x31bb5ed4, len=16, giststate=0x319e2820) at gist.c:1432 #17 0x080d8d9c in gistplacetopage (rel=<optimized out>, freespace=<optimized out>, giststate=<optimized out>, buffer=<optimized out>, itup=<optimized out>, ntup=<optimized out>, oldoffnum=<optimized out>, newblkno=<optimized out>, leftchildbuf=<optimized out>, splitinfo=<optimized out>, markfollowright=<optimized out>, heapRel=<optimized out>, is_build=<optimized out>) at gist.c:299 So the potential downsize is expected, triggered by _int_unique being able to remove some duplicate entries from a GIST union key. AFAICT the fact that it happens only intermittently must boil down to the randomized insertion choices that gistchoose() sometimes makes. In short: nothing to see here, move along. regards, tom lane
On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis <pgsql@j-davis.com> wrote: > The patch has been floating around for a very long time, so I don't > remember exactly why I chose a signed value. Sorry. I am reminded of the fact that int64 is used to size buffers within tuplesort.c, because it needs to support negative availMem sizes -- when huge allocations were first supported, tuplesort.c briefly used "Size", which didn't work. Perhaps it had something to do with that. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis <pgsql@j-davis.com> wrote: >> The patch has been floating around for a very long time, so I don't >> remember exactly why I chose a signed value. Sorry. > I am reminded of the fact that int64 is used to size buffers within > tuplesort.c, because it needs to support negative availMem sizes -- > when huge allocations were first supported, tuplesort.c briefly used > "Size", which didn't work. Perhaps it had something to do with that. I wonder if we should make that use ssize_t instead. Probably not worth the trouble. regards, tom lane
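[For reference, the pattern that forces signedness in tuplesort.c looks roughly like this. A simplified sketch of the idea behind its USEMEM/LACKMEM bookkeeping, with abbreviated names: the balance is allowed to go below zero, and "out of budget" is detected by the sign, which an unsigned Size cannot represent.]

#include <stdbool.h>
#include <stdint.h>

typedef struct SortMemState
{
    int64_t     availMem;       /* remaining budget; may go negative */
} SortMemState;

/* charge an allocation against the budget; running out is detected by
 * the sign, so the counter must be signed */
static bool
use_mem(SortMemState *state, int64_t nbytes)
{
    state->availMem -= nbytes;
    return state->availMem < 0;     /* LACKMEM()-style test */
}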
On Tue, Sep 24, 2019 at 2:47 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I think it would be helpful if we could repeat the performance tests > Robert did on that machine with the current patch (unless this version > of the patch is exactly the same as the ones he tested previously). I still have access to a POWER machine, but it's currently being used by somebody else for a benchmarking project, so I can't test this immediately. It's probably worth noting that, in addition to whatever's changed in this patch, tuplesort.c's memory management has been altered significantly since 2015 - see 0011c0091e886b874e485a46ff2c94222ffbf550 and e94568ecc10f2638e542ae34f2990b821bbf90ac. I'm not immediately sure how that would affect the REINDEX case that I tested back then, but it seems at least possible that they would have the effect of making palloc overhead less significant. More generally, so much of the sorting machinery has been overhauled by Peter Geoghegan since then that what happens now may just not be very comparable to what happened back then. I do agree that this approach looks pretty light weight. Tomas's point about the difference between updating only the current context and updating all parent contexts seems right on target to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company