v12 "won't fix" item regarding memory leak in "ATTACH PARTITIONwithout AEL"; (or, relcache ref counting) - Mailing list pgsql-hackers
From | Justin Pryzby
---|---
Subject | v12 "won't fix" item regarding memory leak in "ATTACH PARTITION without AEL"; (or, relcache ref counting)
Date |
Msg-id | 20200223220143.GA31506@telsasoft.com
Responses | Re: v12 "won't fix" item regarding memory leak in "ATTACH PARTITION without AEL"; (or, relcache ref counting)
List | pgsql-hackers
This links to a long thread, from which I've tried to quote some of the
most important mails, below.
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Won.27t_Fix

I wondered if there's an effort to pursue a resolution for v13?

On Fri, Apr 12, 2019 at 11:42:24AM -0400, Tom Lane wrote in <31027.1555083744@sss.pgh.pa.us>:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote:
> >> The problem lies in all branches that have partitioning, so it should be
> >> listed under Older Bugs, right? You may have noticed that I posted
> >> patches for all branches down to 10.
>
> > I have noticed. The message from Tom upthread outlined that an open
> > item should be added, but this is not one. That's what I wanted to
> > emphasize. Thanks for making it clearer.
>
> To clarify a bit: there's more than one problem here. Amit added an
> open item about pre-existing leaks related to rd_partcheck. (I'm going
> to review and hopefully push his fix for that today.) However, there's
> a completely separate leak associated with mismanagement of rd_pdcxt,
> as I showed in [1], and it doesn't seem like we have consensus about
> what to do about that one. AFAIK that's a new bug in 12 (caused by
> 898e5e329) and so it ought to be in the main open-items list.
>
> This thread has discussed a bunch of possible future changes like
> trying to replace copying of relcache-provided data structures
> with reference-counting, but I don't think any such thing could be
> v12 material at this point. We do need to fix the newly added
> leak though.
>
> 		regards, tom lane
>
> [1] https://www.postgresql.org/message-id/10797.1552679128%40sss.pgh.pa.us

On Fri, Mar 15, 2019 at 05:41:47PM -0400, Robert Haas wrote in <CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com>:
> On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > More to the point, we turned *one* rebuild = false situation into
> > a bunch of rebuild = true situations. I haven't studied it closely,
> > but I think a CCA animal would probably see O(N^2) rebuild = true
> > invocations in a query with N partitions, since each time
> > expand_partitioned_rtentry opens another child table, we'll incur
> > an AcceptInvalidationMessages call which leads to forced rebuilds
> > of the previously-pinned rels. In non-CCA situations, almost always
> > nothing happens with the previously-examined relcache entries.
>
> That's rather unfortunate. I start to think that clobbering the cache
> "always" is too hard a line.
>
> > I agree that copying data isn't great. What I don't agree is that this
> > solution is better. In particular, copying data out of the relcache
> > rather than expecting the relcache to hold still over long periods
> > is the way we've done things everywhere else, cf RelationGetIndexList,
> > RelationGetStatExtList, RelationGetIndexExpressions,
> > RelationGetIndexPredicate, RelationGetIndexAttrBitmap,
> > RelationGetExclusionInfo, GetRelationPublicationActions. I don't care
> > for a patch randomly deciding to do things differently on the basis of an
> > unsupported-by-evidence argument that it might cost too much to copy the
> > data. If we're going to make a push to reduce the amount of copying of
> > that sort that we do, it should be a separately (and carefully) designed
> > thing that applies to all the relcache substructures that have the issue,
> > not one-off hacks that haven't been reviewed thoroughly.
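(For readers who haven't seen those accessors: the copy-out-of-relcache convention Tom cites looks like the sketch below from a caller's perspective. RelationGetIndexList() and list_free() are the real APIs; the surrounding function is invented for illustration.)

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "utils/rel.h"
    #include "utils/relcache.h"

    /*
     * Illustrative sketch only: RelationGetIndexList() returns a copy of
     * the relcache's index-OID list, so the caller can keep using it even
     * if an invalidation rebuilds the relcache entry in the meantime.
     */
    static void
    example_use_index_list(Relation rel)
    {
        List     *indexoidlist = RelationGetIndexList(rel);
        ListCell *lc;

        foreach(lc, indexoidlist)
        {
            Oid indexoid = lfirst_oid(lc);

            /* ... safe to use; no dependence on relcache stability ... */
            (void) indexoid;
        }

        list_free(indexoidlist); /* caller owns, and frees, the copy */
    }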
> That's not an unreasonable argument. On the other hand, if you never
> try new stuff, you lose opportunities to explore things that might
> turn out to be better and worth adopting more widely.
>
> I am not very convinced that it makes sense to lump something like
> RelationGetIndexAttrBitmap in with something like
> RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a
> single Bitmapset, whereas the data structure RelationGetPartitionDesc
> is vastly larger and more complex. You can't say "well, if it's best
> to copy 32 bytes of data out of the relcache every time we need it, it
> must also be right to copy 10k or 100k of data out of the relcache
> every time we need it."
>
> There is another difference as well: there's a good chance that
> somebody is going to want to mutate a Bitmapset, whereas they had
> BETTER NOT think that they can mutate the PartitionDesc. So returning
> an uncopied Bitmapset is kinda risky in a way that returning an
> uncopied PartitionDesc is not.
>
> If we want an at-least-somewhat unified solution here, I think we need
> to bite the bullet and make the planner hold a reference to the
> relcache throughout planning. (The executor already keeps it open, I
> believe.) Otherwise, how is the relcache supposed to know when it can
> throw stuff away and when it can't? The only alternative seems to be
> to have each subsystem hold its own reference count, as I did with the
> PartitionDirectory stuff, which is not something we'd want to scale
> out.
>
> > I especially don't care for the much-less-than-half-baked kluge of
> > chaining the old rd_pdcxt onto the new one and hoping that it will go away
> > at a suitable time.
>
> It will go away at a suitable time, but maybe not at the soonest
> suitable time. It wouldn't be hard to improve that, though. The
> easiest thing to do, I think, would be to have an rd_oldpdcxt and
> stuff the old context there. If there already is one there, parent
> the new one under it. When RelationDecrementReferenceCount reduces
> the reference count to zero, destroy anything found in rd_oldpdcxt.
> With that change, things get destroyed at the earliest time at which
> we know the old things aren't referenced, instead of the earliest time
> at which they are not referenced + an invalidation arrives.
>
> > regression=# create table parent (a text, b int) partition by list (a);
> > CREATE TABLE
> > regression=# create table child (a text, b int);
> > CREATE TABLE
> > regression=# do $$
> > regression$# begin
> > regression$# for i in 1..10000000 loop
> > regression$# alter table parent attach partition child for values in ('x');
> > regression$# alter table parent detach partition child;
> > regression$# end loop;
> > regression$# end $$;
>
> Interesting example.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
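(To make Robert's rd_oldpdcxt scheme concrete, here is a minimal sketch of the mechanism he describes above. The rd_oldpdcxt field and both helper functions are invented for illustration, and this is not the code that was ultimately committed; MemoryContextSetParent(), MemoryContextDelete(), RelationDecrementReferenceCount(), and RelationHasReferenceCountZero() are the real APIs.)

    #include "postgres.h"
    #include "utils/memutils.h"
    #include "utils/rel.h"
    #include "utils/relcache.h"

    /*
     * Sketch: when a rebuild produces a new PartitionDesc while the old
     * one may still be referenced, park the old context under the
     * (hypothetical) rd_oldpdcxt field instead of deleting it.
     */
    static void
    park_old_partdesc_context(Relation rel, MemoryContext oldpdcxt)
    {
        if (rel->rd_oldpdcxt == NULL)
            rel->rd_oldpdcxt = oldpdcxt;
        else
            MemoryContextSetParent(oldpdcxt, rel->rd_oldpdcxt);
    }

    /*
     * Sketch of the cleanup Robert describes for
     * RelationDecrementReferenceCount: once the refcount reaches zero,
     * nothing can still point into the parked PartitionDescs, so the
     * whole tree of old contexts can be destroyed at once.
     */
    static void
    unpin_and_reap(Relation rel)
    {
        RelationDecrementReferenceCount(rel);
        if (RelationHasReferenceCountZero(rel) && rel->rd_oldpdcxt != NULL)
        {
            MemoryContextDelete(rel->rd_oldpdcxt); /* children go with it */
            rel->rd_oldpdcxt = NULL;
        }
    }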
You can't say "well, if it's best > > to copy 32 bytes of data out of the relcache every time we need it, it > > must also be right to copy 10k or 100k of data out of the relcache > > every time we need it." > > I did not say that. What I did say is that they're both correct > solutions. Copying large data structures is clearly a potential > performance problem, but that doesn't mean we can take correctness > shortcuts. > > > If we want an at-least-somewhat unified solution here, I think we need > > to bite the bullet and make the planner hold a reference to the > > relcache throughout planning. (The executor already keeps it open, I > > believe.) Otherwise, how is the relcache supposed to know when it can > > throw stuff away and when it can't? > > The real problem with that is that we *still* won't know whether it's > okay to throw stuff away or not. The issue with these subsidiary > data structures is exactly that we're trying to reduce the lock strength > required for changing one of them, and as soon as you make that lock > strength less than AEL, you have the problem that that value may need > to change in relcache entries despite them being pinned. The code I'm > complaining about is trying to devise a shortcut solution to exactly > that situation ... and it's not a good shortcut. > > > The only alternative seems to be to have each subsystem hold its own > > reference count, as I did with the PartitionDirectory stuff, which is > > not something we'd want to scale out. > > Well, we clearly don't want to devise a separate solution for every > subsystem, but it doesn't seem out of the question to me that we could > build a "reference counted cache sub-structure" mechanism and apply it > to each of these relcache fields. Maybe it could be unified with the > existing code in the typcache that does a similar thing. Sure, this'd > be a fair amount of work, but we've done it before. Syscache entries > didn't use to have reference counts, for example. > > BTW, the problem I have with the PartitionDirectory stuff is exactly > that it *isn't* a reference-counted solution. If it were, we'd not > have this problem of not knowing when we could free rd_pdcxt. > > >> I especially don't care for the much-less-than-half-baked kluge of > >> chaining the old rd_pdcxt onto the new one and hoping that it will go away > >> at a suitable time. > > > It will go away at a suitable time, but maybe not at the soonest > > suitable time. It wouldn't be hard to improve that, though. The > > easiest thing to do, I think, would be to have an rd_oldpdcxt and > > stuff the old context there. If there already is one there, parent > > the new one under it. When RelationDecrementReferenceCount reduces > > the reference count to zero, destroy anything found in rd_oldpdcxt. > > Meh. While it seems likely that that would mask most practical problems, > it still seems like covering up a wart with a dirty bandage. In > particular, that fix doesn't fix anything unless relcache reference counts > go to zero pretty quickly --- which I'll just note is directly contrary > to your enthusiasm for holding relcache pins longer. > > I think that what we ought to do for v12 is have PartitionDirectory > copy the data, and then in v13 work on creating real reference-count > infrastructure that would allow eliminating the copy steps with full > safety. The $64 question is whether that really would cause unacceptable > performance problems. To look into that, I made the attached WIP patches. 
> (These are functionally complete, but I didn't bother for instance with
> removing the hunk that 898e5e329 added to relcache.c, and the comments
> need work, etc.) The first one just changes the PartitionDirectory
> code to do that, and then the second one micro-optimizes
> partition_bounds_copy() to make it somewhat less expensive, mostly by
> collapsing lots of small palloc's into one big one.
>
> What I get for test cases like [1] is
>
> single-partition SELECT, hash partitioning:
>
> N        tps, HEAD       tps, patch
> 2        11426.243754    11448.615193
> 8        11254.833267    11374.278861
> 32       11288.329114    11371.942425
> 128      11222.329256    11185.845258
> 512      11001.177137    10572.917288
> 1024     10612.456470    9834.172965
> 4096     8819.110195     7021.864625
> 8192     7372.611355     5276.130161
>
> single-partition SELECT, range partitioning:
>
> N        tps, HEAD       tps, patch
> 2        11037.855338    11153.595860
> 8        11085.218022    11019.132341
> 32       10994.348207    10935.719951
> 128      10884.417324    10532.685237
> 512      10635.583411    9578.108915
> 1024     10407.286414    8689.585136
> 4096     8361.463829     5139.084405
> 8192     7075.880701     3442.542768
>
> Now certainly these numbers suggest that avoiding the copy could be worth
> our trouble, but these results are still several orders of magnitude
> better than where we were two weeks ago [2]. Plus, this is an extreme
> case that's not really representative of real-world usage, since the test
> tables have neither indexes nor any data. In practical situations the
> baseline would be lower and the dropoff less bad. So I don't feel bad
> about shipping v12 with these sorts of numbers and hoping for more
> improvement later.
>
> 		regards, tom lane
>
> [1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us
>
> [2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24
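(The partition_bounds_copy() micro-optimization Tom mentions, collapsing lots of small palloc's into one big one, follows the pattern sketched below. The helper and its signature are invented, not the actual patch; only palloc() and the general technique are real. One pfree of the returned pointer releases everything at once.)

    #include "postgres.h"

    /*
     * Sketch: copy 'ndatums' arrays of 'natts' Datums each. A naive copy
     * performs 1 + ndatums allocations; here a single palloc holds the
     * pointer array followed by all of the datum arrays.
     */
    static Datum **
    copy_datum_arrays(Datum **src, int ndatums, int natts)
    {
        Size    ptrsize = ndatums * sizeof(Datum *);
        Size    arrsize = natts * sizeof(Datum);
        char   *space = palloc(ptrsize + ndatums * arrsize);
        Datum **dst = (Datum **) space;
        int     i;

        space += ptrsize;
        for (i = 0; i < ndatums; i++)
        {
            dst[i] = (Datum *) space;
            memcpy(dst[i], src[i], arrsize);
            space += arrsize;
        }
        return dst;
    }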
On Wed, May 01, 2019 at 01:09:07PM -0400, Robert Haas wrote in <CA+Tgmob-cska+-WUC7T-G4zkSJp7yum6M_bzYd4YFzwQ51qiow@mail.gmail.com>:
> On Wed, May 1, 2019 at 11:59 AM Andres Freund <andres@anarazel.de> wrote:
> > The message I'm replying to is marked as an open item. Robert, what do
> > you think needs to be done here before release? Could you summarize,
> > so we then can see what others think?
>
> The original issue on this thread was that hyrax started running out
> of memory when it hadn't been doing so previously. That happened
> because, for complicated reasons, commit
> 898e5e3290a72d288923260143930fb32036c00c resulted in the leak being
> hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once.
> Commits 2455ab48844c90419714e27eafd235a85de23232 and
> d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem.
>
> In the email at issue, Tom complains about two things. First, he
> complains about the fact that I try to arrange things so that relcache
> data remains valid for as long as required instead of just copying it.
> Second, he complains about the fact that repeated ATTACH and DETACH
> PARTITION operations can produce a slow session-lifespan memory leak.
>
> I think it's reasonable to fix the second issue for v12. I am not
> sure how important it is, because (1) the leak is small, (2) it seems
> unlikely that anyone would execute enough ATTACH/DETACH PARTITION
> operations in one backend for the leak to amount to anything
> significant, and (3) if a relcache flush ever happens when you don't
> have the relation open, all of the "leaked" memory will be un-leaked.
> However, I believe that we could fix it as follows. First, invent
> rd_pdoldcxt and put the first old context there; if that pointer is
> already in use, then parent the new context under the old one.
> Second, in RelationDecrementReferenceCount, if the refcount hits 0,
> nuke rd_pdoldcxt and set the pointer back to NULL. With that change,
> you would only keep the old PartitionDesc around until the ref count
> hits 0, whereas at present, you keep the old PartitionDesc around
> until an invalidation happens while the ref count is 0.
>
> I think the first issue is not v12 material. Tom proposed to fix it
> by copying all the relevant data out of the relcache, but his own
> performance results show a pretty significant hit, and AFAICS he
> hasn't pointed to anything that's actually broken by the current
> approach. What I think should be done is actually to generalize the
> approach I took in this patch, so that instead of the partition
> directory holding a reference count, the planner or executor holds
> one. Then not only could people who want information about the
> PartitionDesc avoid copying stuff from the relcache, but maybe other
> things as well. I think that would be significantly better than
> continuing to double-down on the current copy-everything approach,
> which really only works well in a world where a table can't have all
> that much metadata, which is clearly not true for PostgreSQL any more.
> I'm not sure that Tom is actually opposed to this approach -- although
> I may have misunderstood his position -- but where we disagree, I
> think, is that I see what I did in this commit as a stepping-stone
> toward a better world, and he sees it as something that should be
> killed with fire until that better world has fully arrived.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
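(For context, the PartitionDirectory mechanism Robert refers to is used roughly as sketched below, against the real v12 API in partitioning/partdesc.h; the wrapper function is invented and all error handling is omitted. The directory pins what it hands out, so the PartitionDesc stays valid for the directory's whole lifetime even if concurrent DDL rebuilds the relcache entry.)

    #include "postgres.h"
    #include "partitioning/partdesc.h"
    #include "utils/rel.h"

    /* Sketch: stable partition-descriptor access during planning. */
    static void
    example_partition_lookup(Relation rel)
    {
        PartitionDirectory pdir;
        PartitionDesc pdesc;

        pdir = CreatePartitionDirectory(CurrentMemoryContext);
        pdesc = PartitionDirectoryLookup(pdir, rel);

        /* ... plan or execute against pdesc; it cannot change under us ... */
        (void) pdesc;

        DestroyPartitionDirectory(pdir); /* releases the pins */
    }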
On Tue, Jun 11, 2019 at 01:57:16PM -0400, Tom Lane wrote in <18286.1560275836@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09@gmail.com> wrote:
> >> Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch,
> >> which seems to fix this issue for me.
>
> > Yeah, that looks right. I think my patch was full of fuzzy thinking
> > and inadequate testing; thanks for checking it over and coming up with
> > the right solution.
>
> > Anyone else want to look/comment?
>
> I think the existing code is horribly ugly and this is even worse.
> It adds cycles to RelationDecrementReferenceCount, which is a hotspot
> that has no business dealing with this; the invariants are unclear;
> and there's no strong reason to think there aren't still cases where
> we accumulate lots of copies of old partition descriptors during a
> sequence of operations. Basically you're just doubling down on a
> wrong design.
>
> As I said upthread, my current inclination is to do nothing in this
> area for v12 and then try to replace the whole thing with proper
> reference counting in v13. I think the cases where we have a major
> leak are corner-case-ish enough that we can leave it as-is for one
> release.
>
> 		regards, tom lane

On Wed, Jun 12, 2019 at 03:11:56PM -0400, Tom Lane wrote in <3800.1560366716@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > I think the change is responsive to your previous complaint that the
> > timing of stuff getting freed is not very well pinned down. With this
> > change, it's much more tightly pinned down: it happens when the
> > refcount goes to 0. That is definitely not perfect, but I think that
> > it is a lot easier to come up with scenarios where the leak
> > accumulates because no cache flush happens while the refcount is 0
> > than it is to come up with scenarios where the refcount never reaches
> > 0. I agree that the latter type of scenario probably exists, but I
> > don't think we've come up with one yet.
>
> I don't know why you think that's improbable, given that the changes
> around PartitionDirectory-s cause relcache entries to be held open much
> longer than before (something I've also objected to on this thread).
>
> >> As I said upthread, my current inclination is to do nothing in this
> >> area for v12 and then try to replace the whole thing with proper
> >> reference counting in v13. I think the cases where we have a major
> >> leak are corner-case-ish enough that we can leave it as-is for one
> >> release.
>
> > Is this something you're planning to work on yourself?
>
> Well, I'd rather farm it out to somebody else, but ...
>
> > Do you have a design in mind? Is the idea to reference-count the
> > PartitionDesc?
>
> What we discussed upthread was refcounting each of the various
> large sub-objects of relcache entries, not just the partdesc.
> I think if we're going to go this way we should bite the bullet and fix
> them all. I really want to burn down RememberToFreeTupleDescAtEOX() in
> particular ... it seems likely to me that that's also a source of
> unpleasant memory leaks.
>
> 		regards, tom lane
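(The "proper reference counting" Tom wants for v13 would presumably have the shape sketched below. Every name here is invented and nothing like this exists in the tree; the point is the invariant. Invalidation marks a superseded version dead but never frees it while a pin remains, and the last unpin frees dead versions, so the "when can we free rd_pdcxt?" question answers itself.)

    #include "postgres.h"
    #include "utils/memutils.h"

    /*
     * Invented illustration of a reference-counted relcache
     * sub-structure. The header is assumed to live apart from the
     * payload context (say, in CacheMemoryContext), so deleting the
     * payload does not free the header out from under us.
     */
    typedef struct CachedSubStruct
    {
        int           refcount; /* pins held by planner, executor, ... */
        bool          dead;     /* superseded by a newer version? */
        MemoryContext cxt;      /* holds the payload, e.g. a PartitionDesc */
    } CachedSubStruct;

    static void
    substruct_pin(CachedSubStruct *s)
    {
        s->refcount++;
    }

    static void
    substruct_unpin(CachedSubStruct *s)
    {
        Assert(s->refcount > 0);
        if (--s->refcount == 0 && s->dead)
            MemoryContextDelete(s->cxt); /* last user gone: safe to free */
    }

    /* Invalidation path: never frees while somebody still holds a pin. */
    static void
    substruct_supersede(CachedSubStruct *s)
    {
        s->dead = true;
        if (s->refcount == 0)
            MemoryContextDelete(s->cxt);
    }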