v12 "won't fix" item regarding memory leak in "ATTACH PARTITIONwithout AEL"; (or, relcache ref counting) - Mailing list pgsql-hackers
From | Justin Pryzby
---|---
Subject | v12 "won't fix" item regarding memory leak in "ATTACH PARTITION without AEL"; (or, relcache ref counting)
Date |
Msg-id | 20200223220143.GA31506@telsasoft.com
Responses | Re: v12 "won't fix" item regarding memory leak in "ATTACH PARTITION without AEL"; (or, relcache ref counting)
List | pgsql-hackers
This links to a long thread, from which I've tried to quote some of the
most important mails, below.
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Won.27t_Fix

I wondered if there's an effort to pursue a resolution for v13?

On Fri, Apr 12, 2019 at 11:42:24AM -0400, Tom Lane wrote in <31027.1555083744@sss.pgh.pa.us>:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Wed, Apr 10, 2019 at 05:03:21PM +0900, Amit Langote wrote:
> >> The problem lies in all branches that have partitioning, so it should be
> >> listed under Older Bugs, right? You may have noticed that I posted
> >> patches for all branches down to 10.
>
> > I have noticed. The message from Tom upthread outlined that an open
> > item should be added, but this is not one. That's what I wanted to
> > emphasize. Thanks for making it clearer.
>
> To clarify a bit: there's more than one problem here. Amit added an
> open item about pre-existing leaks related to rd_partcheck. (I'm going
> to review and hopefully push his fix for that today.) However, there's
> a completely separate leak associated with mismanagement of rd_pdcxt,
> as I showed in [1], and it doesn't seem like we have consensus about
> what to do about that one. AFAIK that's a new bug in 12 (caused by
> 898e5e329) and so it ought to be in the main open-items list.
>
> This thread has discussed a bunch of possible future changes like
> trying to replace copying of relcache-provided data structures
> with reference-counting, but I don't think any such thing could be
> v12 material at this point. We do need to fix the newly added
> leak though.
>
> 		regards, tom lane
>
> [1] https://www.postgresql.org/message-id/10797.1552679128%40sss.pgh.pa.us

On Fri, Mar 15, 2019 at 05:41:47PM -0400, Robert Haas wrote in <CA+Tgmoa5rT+ZR+Vv+q1XLwQtDMCqkNL6B4PjR4V6YAC9K_LBxw@mail.gmail.com>:
> On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > More to the point, we turned *one* rebuild = false situation into
> > a bunch of rebuild = true situations. I haven't studied it closely,
> > but I think a CCA animal would probably see O(N^2) rebuild = true
> > invocations in a query with N partitions, since each time
> > expand_partitioned_rtentry opens another child table, we'll incur
> > an AcceptInvalidationMessages call which leads to forced rebuilds
> > of the previously-pinned rels. In non-CCA situations, almost always
> > nothing happens with the previously-examined relcache entries.
>
> That's rather unfortunate. I start to think that clobbering the cache
> "always" is too hard a line.
>
> > I agree that copying data isn't great. What I don't agree is that this
> > solution is better. In particular, copying data out of the relcache
> > rather than expecting the relcache to hold still over long periods
> > is the way we've done things everywhere else, cf RelationGetIndexList,
> > RelationGetStatExtList, RelationGetIndexExpressions,
> > RelationGetIndexPredicate, RelationGetIndexAttrBitmap,
> > RelationGetExclusionInfo, GetRelationPublicationActions. I don't care
> > for a patch randomly deciding to do things differently on the basis of an
> > unsupported-by-evidence argument that it might cost too much to copy the
> > data. If we're going to make a push to reduce the amount of copying of
> > that sort that we do, it should be a separately (and carefully) designed
> > thing that applies to all the relcache substructures that have the issue,
> > not one-off hacks that haven't been reviewed thoroughly.
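(For readers who haven't seen those accessors: the copy-out-of-relcache convention Tom cites looks like the sketch below from a caller's perspective. RelationGetIndexList() and list_free() are the real APIs; the surrounding function is invented for illustration.)

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "utils/rel.h"
    #include "utils/relcache.h"

    /*
     * Illustrative sketch only: RelationGetIndexList() returns a copy of
     * the relcache's index-OID list, so the caller can keep using it even
     * if an invalidation rebuilds the relcache entry in the meantime.
     */
    static void
    example_use_index_list(Relation rel)
    {
        List     *indexoidlist = RelationGetIndexList(rel);
        ListCell *lc;

        foreach(lc, indexoidlist)
        {
            Oid indexoid = lfirst_oid(lc);

            /* ... safe to use; no dependence on relcache stability ... */
            (void) indexoid;
        }

        list_free(indexoidlist); /* caller owns, and frees, the copy */
    }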
> That's not an unreasonable argument. On the other hand, if you never
> try new stuff, you lose opportunities to explore things that might
> turn out to be better and worth adopting more widely.
>
> I am not very convinced that it makes sense to lump something like
> RelationGetIndexAttrBitmap in with something like
> RelationGetPartitionDesc. RelationGetIndexAttrBitmap is returning a
> single Bitmapset, whereas the data structure RelationGetPartitionDesc
> is vastly larger and more complex. You can't say "well, if it's best
> to copy 32 bytes of data out of the relcache every time we need it, it
> must also be right to copy 10k or 100k of data out of the relcache
> every time we need it."
>
> There is another difference as well: there's a good chance that
> somebody is going to want to mutate a Bitmapset, whereas they had
> BETTER NOT think that they can mutate the PartitionDesc. So returning
> an uncopied Bitmapset is kinda risky in a way that returning an
> uncopied PartitionDesc is not.
>
> If we want an at-least-somewhat unified solution here, I think we need
> to bite the bullet and make the planner hold a reference to the
> relcache throughout planning. (The executor already keeps it open, I
> believe.) Otherwise, how is the relcache supposed to know when it can
> throw stuff away and when it can't? The only alternative seems to be
> to have each subsystem hold its own reference count, as I did with the
> PartitionDirectory stuff, which is not something we'd want to scale
> out.
>
> > I especially don't care for the much-less-than-half-baked kluge of
> > chaining the old rd_pdcxt onto the new one and hoping that it will go away
> > at a suitable time.
>
> It will go away at a suitable time, but maybe not at the soonest
> suitable time. It wouldn't be hard to improve that, though. The
> easiest thing to do, I think, would be to have an rd_oldpdcxt and
> stuff the old context there. If there already is one there, parent
> the new one under it. When RelationDecrementReferenceCount reduces
> the reference count to zero, destroy anything found in rd_oldpdcxt.
> With that change, things get destroyed at the earliest time at which
> we know the old things aren't referenced, instead of the earliest time
> at which they are not referenced + an invalidation arrives.
>
> > regression=# create table parent (a text, b int) partition by list (a);
> > CREATE TABLE
> > regression=# create table child (a text, b int);
> > CREATE TABLE
> > regression=# do $$
> > regression$# begin
> > regression$# for i in 1..10000000 loop
> > regression$# alter table parent attach partition child for values in ('x');
> > regression$# alter table parent detach partition child;
> > regression$# end loop;
> > regression$# end $$;
>
> Interesting example.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
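(To make Robert's rd_oldpdcxt scheme concrete, here is a minimal sketch of the mechanism he describes above. The rd_oldpdcxt field and both helper functions are invented for illustration, and this is not the code that was ultimately committed; MemoryContextSetParent(), MemoryContextDelete(), RelationDecrementReferenceCount(), and RelationHasReferenceCountZero() are the real APIs.)

    #include "postgres.h"
    #include "utils/memutils.h"
    #include "utils/rel.h"
    #include "utils/relcache.h"

    /*
     * Sketch: when a rebuild produces a new PartitionDesc while the old
     * one may still be referenced, park the old context under the
     * (hypothetical) rd_oldpdcxt field instead of deleting it.
     */
    static void
    park_old_partdesc_context(Relation rel, MemoryContext oldpdcxt)
    {
        if (rel->rd_oldpdcxt == NULL)
            rel->rd_oldpdcxt = oldpdcxt;
        else
            MemoryContextSetParent(oldpdcxt, rel->rd_oldpdcxt);
    }

    /*
     * Sketch of the cleanup Robert describes for
     * RelationDecrementReferenceCount: once the refcount reaches zero,
     * nothing can still point into the parked PartitionDescs, so the
     * whole tree of old contexts can be destroyed at once.
     */
    static void
    unpin_and_reap(Relation rel)
    {
        RelationDecrementReferenceCount(rel);
        if (RelationHasReferenceCountZero(rel) && rel->rd_oldpdcxt != NULL)
        {
            MemoryContextDelete(rel->rd_oldpdcxt); /* children go with it */
            rel->rd_oldpdcxt = NULL;
        }
    }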
You can't say "well, if it's best > > to copy 32 bytes of data out of the relcache every time we need it, it > > must also be right to copy 10k or 100k of data out of the relcache > > every time we need it." > > I did not say that. What I did say is that they're both correct > solutions. Copying large data structures is clearly a potential > performance problem, but that doesn't mean we can take correctness > shortcuts. > > > If we want an at-least-somewhat unified solution here, I think we need > > to bite the bullet and make the planner hold a reference to the > > relcache throughout planning. (The executor already keeps it open, I > > believe.) Otherwise, how is the relcache supposed to know when it can > > throw stuff away and when it can't? > > The real problem with that is that we *still* won't know whether it's > okay to throw stuff away or not. The issue with these subsidiary > data structures is exactly that we're trying to reduce the lock strength > required for changing one of them, and as soon as you make that lock > strength less than AEL, you have the problem that that value may need > to change in relcache entries despite them being pinned. The code I'm > complaining about is trying to devise a shortcut solution to exactly > that situation ... and it's not a good shortcut. > > > The only alternative seems to be to have each subsystem hold its own > > reference count, as I did with the PartitionDirectory stuff, which is > > not something we'd want to scale out. > > Well, we clearly don't want to devise a separate solution for every > subsystem, but it doesn't seem out of the question to me that we could > build a "reference counted cache sub-structure" mechanism and apply it > to each of these relcache fields. Maybe it could be unified with the > existing code in the typcache that does a similar thing. Sure, this'd > be a fair amount of work, but we've done it before. Syscache entries > didn't use to have reference counts, for example. > > BTW, the problem I have with the PartitionDirectory stuff is exactly > that it *isn't* a reference-counted solution. If it were, we'd not > have this problem of not knowing when we could free rd_pdcxt. > > >> I especially don't care for the much-less-than-half-baked kluge of > >> chaining the old rd_pdcxt onto the new one and hoping that it will go away > >> at a suitable time. > > > It will go away at a suitable time, but maybe not at the soonest > > suitable time. It wouldn't be hard to improve that, though. The > > easiest thing to do, I think, would be to have an rd_oldpdcxt and > > stuff the old context there. If there already is one there, parent > > the new one under it. When RelationDecrementReferenceCount reduces > > the reference count to zero, destroy anything found in rd_oldpdcxt. > > Meh. While it seems likely that that would mask most practical problems, > it still seems like covering up a wart with a dirty bandage. In > particular, that fix doesn't fix anything unless relcache reference counts > go to zero pretty quickly --- which I'll just note is directly contrary > to your enthusiasm for holding relcache pins longer. > > I think that what we ought to do for v12 is have PartitionDirectory > copy the data, and then in v13 work on creating real reference-count > infrastructure that would allow eliminating the copy steps with full > safety. The $64 question is whether that really would cause unacceptable > performance problems. To look into that, I made the attached WIP patches. 
> (These are functionally complete, but I didn't bother for instance with
> removing the hunk that 898e5e329 added to relcache.c, and the comments
> need work, etc.) The first one just changes the PartitionDirectory
> code to do that, and then the second one micro-optimizes
> partition_bounds_copy() to make it somewhat less expensive, mostly by
> collapsing lots of small palloc's into one big one.
>
> What I get for test cases like [1] is
>
> single-partition SELECT, hash partitioning:
>
> N        tps, HEAD       tps, patch
> 2        11426.243754    11448.615193
> 8        11254.833267    11374.278861
> 32       11288.329114    11371.942425
> 128      11222.329256    11185.845258
> 512      11001.177137    10572.917288
> 1024     10612.456470    9834.172965
> 4096     8819.110195     7021.864625
> 8192     7372.611355     5276.130161
>
> single-partition SELECT, range partitioning:
>
> N        tps, HEAD       tps, patch
> 2        11037.855338    11153.595860
> 8        11085.218022    11019.132341
> 32       10994.348207    10935.719951
> 128      10884.417324    10532.685237
> 512      10635.583411    9578.108915
> 1024     10407.286414    8689.585136
> 4096     8361.463829     5139.084405
> 8192     7075.880701     3442.542768
>
> Now certainly these numbers suggest that avoiding the copy could be worth
> our trouble, but these results are still several orders of magnitude
> better than where we were two weeks ago [2]. Plus, this is an extreme
> case that's not really representative of real-world usage, since the test
> tables have neither indexes nor any data. In practical situations the
> baseline would be lower and the dropoff less bad. So I don't feel bad
> about shipping v12 with these sorts of numbers and hoping for more
> improvement later.
>
> 		regards, tom lane
>
> [1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us
>
> [2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24
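(The partition_bounds_copy() micro-optimization Tom mentions, collapsing lots of small palloc's into one big one, follows the pattern sketched below. The helper and its signature are invented, not the actual patch; only palloc() and the general technique are real. One pfree of the returned pointer releases everything at once.)

    #include "postgres.h"

    /*
     * Sketch: copy 'ndatums' arrays of 'natts' Datums each. A naive copy
     * performs 1 + ndatums allocations; here a single palloc holds the
     * pointer array followed by all of the datum arrays.
     */
    static Datum **
    copy_datum_arrays(Datum **src, int ndatums, int natts)
    {
        Size    ptrsize = ndatums * sizeof(Datum *);
        Size    arrsize = natts * sizeof(Datum);
        char   *space = palloc(ptrsize + ndatums * arrsize);
        Datum **dst = (Datum **) space;
        int     i;

        space += ptrsize;
        for (i = 0; i < ndatums; i++)
        {
            dst[i] = (Datum *) space;
            memcpy(dst[i], src[i], arrsize);
            space += arrsize;
        }
        return dst;
    }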
On Wed, May 01, 2019 at 01:09:07PM -0400, Robert Haas wrote in <CA+Tgmob-cska+-WUC7T-G4zkSJp7yum6M_bzYd4YFzwQ51qiow@mail.gmail.com>:
> On Wed, May 1, 2019 at 11:59 AM Andres Freund <andres@anarazel.de> wrote:
> > The message I'm replying to is marked as an open item. Robert, what do
> > you think needs to be done here before release? Could you summarize,
> > so we then can see what others think?
>
> The original issue on this thread was that hyrax started running out
> of memory when it hadn't been doing so previously. That happened
> because, for complicated reasons, commit
> 898e5e3290a72d288923260143930fb32036c00c resulted in the leak being
> hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once.
> Commits 2455ab48844c90419714e27eafd235a85de23232 and
> d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem.
>
> In the email at issue, Tom complains about two things. First, he
> complains about the fact that I try to arrange things so that relcache
> data remains valid for as long as required instead of just copying it.
> Second, he complains about the fact that repeated ATTACH and DETACH
> PARTITION operations can produce a slow session-lifespan memory leak.
>
> I think it's reasonable to fix the second issue for v12. I am not
> sure how important it is, because (1) the leak is small, (2) it seems
> unlikely that anyone would execute enough ATTACH/DETACH PARTITION
> operations in one backend for the leak to amount to anything
> significant, and (3) if a relcache flush ever happens when you don't
> have the relation open, all of the "leaked" memory will be un-leaked.
> However, I believe that we could fix it as follows. First, invent
> rd_pdoldcxt and put the first old context there; if that pointer is
> already in use, then parent the new context under the old one.
> Second, in RelationDecrementReferenceCount, if the refcount hits 0,
> nuke rd_pdoldcxt and set the pointer back to NULL. With that change,
> you would only keep the old PartitionDesc around until the ref count
> hits 0, whereas at present, you keep the old PartitionDesc around
> until an invalidation happens while the ref count is 0.
>
> I think the first issue is not v12 material. Tom proposed to fix it
> by copying all the relevant data out of the relcache, but his own
> performance results show a pretty significant hit, and AFAICS he
> hasn't pointed to anything that's actually broken by the current
> approach. What I think should be done is actually to generalize the
> approach I took in this patch, so that instead of the partition
> directory holding a reference count, the planner or executor holds
> one. Then not only could people who want information about the
> PartitionDesc avoid copying stuff from the relcache, but maybe other
> things as well. I think that would be significantly better than
> continuing to double-down on the current copy-everything approach,
> which really only works well in a world where a table can't have all
> that much metadata, which is clearly not true for PostgreSQL any more.
> I'm not sure that Tom is actually opposed to this approach -- although
> I may have misunderstood his position -- but where we disagree, I
> think, is that I see what I did in this commit as a stepping-stone
> toward a better world, and he sees it as something that should be
> killed with fire until that better world has fully arrived.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
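(For context, the PartitionDirectory mechanism Robert refers to is used roughly as sketched below, against the real v12 API in partitioning/partdesc.h; the wrapper function is invented and all error handling is omitted. The directory pins what it hands out, so the PartitionDesc stays valid for the directory's whole lifetime even if concurrent DDL rebuilds the relcache entry.)

    #include "postgres.h"
    #include "partitioning/partdesc.h"
    #include "utils/rel.h"

    /* Sketch: stable partition-descriptor access during planning. */
    static void
    example_partition_lookup(Relation rel)
    {
        PartitionDirectory pdir;
        PartitionDesc pdesc;

        pdir = CreatePartitionDirectory(CurrentMemoryContext);
        pdesc = PartitionDirectoryLookup(pdir, rel);

        /* ... plan or execute against pdesc; it cannot change under us ... */
        (void) pdesc;

        DestroyPartitionDirectory(pdir); /* releases the pins */
    }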
On Tue, Jun 11, 2019 at 01:57:16PM -0400, Tom Lane wrote in <18286.1560275836@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Jun 6, 2019 at 2:48 AM Amit Langote <amitlangote09@gmail.com> wrote:
> >> Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch,
> >> which seems to fix this issue for me.
>
> > Yeah, that looks right. I think my patch was full of fuzzy thinking
> > and inadequate testing; thanks for checking it over and coming up with
> > the right solution.
>
> > Anyone else want to look/comment?
>
> I think the existing code is horribly ugly and this is even worse.
> It adds cycles to RelationDecrementReferenceCount, which is a hotspot
> that has no business dealing with this; the invariants are unclear;
> and there's no strong reason to think there aren't still cases where
> we accumulate lots of copies of old partition descriptors during a
> sequence of operations. Basically you're just doubling down on a
> wrong design.
>
> As I said upthread, my current inclination is to do nothing in this
> area for v12 and then try to replace the whole thing with proper
> reference counting in v13. I think the cases where we have a major
> leak are corner-case-ish enough that we can leave it as-is for one
> release.
>
> 		regards, tom lane

On Wed, Jun 12, 2019 at 03:11:56PM -0400, Tom Lane wrote in <3800.1560366716@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Jun 11, 2019 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > I think the change is responsive to your previous complaint that the
> > timing of stuff getting freed is not very well pinned down. With this
> > change, it's much more tightly pinned down: it happens when the
> > refcount goes to 0. That is definitely not perfect, but I think that
> > it is a lot easier to come up with scenarios where the leak
> > accumulates because no cache flush happens while the refcount is 0
> > than it is to come up with scenarios where the refcount never reaches
> > 0. I agree that the latter type of scenario probably exists, but I
> > don't think we've come up with one yet.
>
> I don't know why you think that's improbable, given that the changes
> around PartitionDirectory-s cause relcache entries to be held open much
> longer than before (something I've also objected to on this thread).
>
> >> As I said upthread, my current inclination is to do nothing in this
> >> area for v12 and then try to replace the whole thing with proper
> >> reference counting in v13. I think the cases where we have a major
> >> leak are corner-case-ish enough that we can leave it as-is for one
> >> release.
>
> > Is this something you're planning to work on yourself?
>
> Well, I'd rather farm it out to somebody else, but ...
>
> > Do you have a design in mind? Is the idea to reference-count the
> > PartitionDesc?
>
> What we discussed upthread was refcounting each of the various
> large sub-objects of relcache entries, not just the partdesc.
> I think if we're going to go this way we should bite the bullet and fix
> them all. I really want to burn down RememberToFreeTupleDescAtEOX() in
> particular ... it seems likely to me that that's also a source of
> unpleasant memory leaks.
>
> 		regards, tom lane
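(The "proper reference counting" Tom wants for v13 would presumably have the shape sketched below. Every name here is invented and nothing like this exists in the tree; the point is the invariant. Invalidation marks a superseded version dead but never frees it while a pin remains, and the last unpin frees dead versions, so the "when can we free rd_pdcxt?" question answers itself.)

    #include "postgres.h"
    #include "utils/memutils.h"

    /*
     * Invented illustration of a reference-counted relcache
     * sub-structure. The header is assumed to live apart from the
     * payload context (say, in CacheMemoryContext), so deleting the
     * payload does not free the header out from under us.
     */
    typedef struct CachedSubStruct
    {
        int           refcount; /* pins held by planner, executor, ... */
        bool          dead;     /* superseded by a newer version? */
        MemoryContext cxt;      /* holds the payload, e.g. a PartitionDesc */
    } CachedSubStruct;

    static void
    substruct_pin(CachedSubStruct *s)
    {
        s->refcount++;
    }

    static void
    substruct_unpin(CachedSubStruct *s)
    {
        Assert(s->refcount > 0);
        if (--s->refcount == 0 && s->dead)
            MemoryContextDelete(s->cxt); /* last user gone: safe to free */
    }

    /* Invalidation path: never frees while somebody still holds a pin. */
    static void
    substruct_supersede(CachedSubStruct *s)
    {
        s->dead = true;
        if (s->refcount == 0)
            MemoryContextDelete(s->cxt);
    }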