Thread: Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> I've seen this before while doing SET STATISTICS on a larger number of columns
> using xargs, but just came up while doing ADD of a large number of columns.
> Seems to be roughly linear in number of children but superlinear WRT columns.
> I think having to do with catalog update / cache invalidation with many
> ALTERs*children*columns?

I poked into this a bit.  The operation is necessarily roughly O(N^2) in
the number of columns, because we rebuild the affected table's relcache
entry after each elementary ADD COLUMN operation, and one of the principal
components of that cost is reading all the pg_attribute entries.  However,
that should only be a time cost not a space cost.  Eventually I traced the
O(N^2) space consumption to RememberToFreeTupleDescAtEOX, which seems to
have been introduced in Simon's commit e5550d5fe, and which strikes me as
a kluge of the first magnitude.  Unless I am missing something, that
function's design concept can fairly be characterized as "let's leak
memory like there's no tomorrow, on the off chance that somebody somewhere
is ignoring basic coding rules".

I tried ripping that out, and the peak space consumption of your example
(with 20 child tables and 1600 columns) decreased from ~3GB to ~200MB.
Moreover, the system still passes make check-world, so it's not clear
to me what excuse this code has to live.

It's probably a bit late in the v10 cycle to be taking any risks in
this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
as soon as the v11 cycle opens, unless someone can show an example
of non-broken coding that requires it.  (And if so, there ought to
be a regression test incorporating that.)

regards, tom lane
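(For reference, a minimal psql reproduction along the lines of the example Tom measured might look like the following sketch. Table and column names are invented; \gexec is just used to build the 20-child inheritance tree and one multi-command ALTER that adds columns up toward the 1600-column limit:)

    CREATE TABLE parent (id int);

    SELECT format('CREATE TABLE child_%s () INHERITS (parent)', i)
    FROM generate_series(1, 20) AS i
    \gexec

    -- one statement with many ADD COLUMN subcommands, each of which
    -- rebuilds the relcache entries of the parent and all children
    SELECT 'ALTER TABLE parent ' ||
           string_agg(format('ADD COLUMN c%s int', i), ', ')
    FROM generate_series(1, 1500) AS i
    \gexec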
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Craig Ringer
Date:
On 19 July 2017 at 07:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Justin Pryzby <pryzby@telsasoft.com> writes:
> I've seen this before while doing SET STATISTICS on a larger number of columns
> using xargs, but just came up while doing ADD of a large number of columns.
> Seems to be roughly linear in number of children but superlinear WRT columns.
> I think having to do with catalog update / cache invalidation with many
> ALTERs*children*columns?
I poked into this a bit. The operation is necessarily roughly O(N^2) in
the number of columns, because we rebuild the affected table's relcache
entry after each elementary ADD COLUMN operation, and one of the principal
components of that cost is reading all the pg_attribute entries. However,
that should only be a time cost not a space cost. Eventually I traced the
O(N^2) space consumption to RememberToFreeTupleDescAtEOX, which seems to
have been introduced in Simon's commit e5550d5fe, and which strikes me as
a kluge of the first magnitude. Unless I am missing something, that
function's design concept can fairly be characterized as "let's leak
memory like there's no tomorrow, on the off chance that somebody somewhere
is ignoring basic coding rules".
I tried ripping that out, and the peak space consumption of your example
(with 20 child tables and 1600 columns) decreased from ~3GB to ~200MB.
Moreover, the system still passes make check-world, so it's not clear
to me what excuse this code has to live.
It's probably a bit late in the v10 cycle to be taking any risks in
this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
as soon as the v11 cycle opens, unless someone can show an example
of non-broken coding that requires it. (And if so, there ought to
be a regression test incorporating that.)
Just FYI, I believe Simon's currently on holiday, so may not notice this discussion as promptly as usual.
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Greg Stark
Date:
On 19 July 2017 at 00:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's probably a bit late in the v10 cycle to be taking any risks in
> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
> as soon as the v11 cycle opens, unless someone can show an example
> of non-broken coding that requires it.  (And if so, there ought to
> be a regression test incorporating that.)

Would it be useful to keep in one of the memory checking assertion builds?

--
greg
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On 19 July 2017 at 00:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's probably a bit late in the v10 cycle to be taking any risks in
>> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
>> as soon as the v11 cycle opens, unless someone can show an example
>> of non-broken coding that requires it.  (And if so, there ought to
>> be a regression test incorporating that.)

> Would it be useful to keep in one of the memory checking assertion builds?

Why?  Code that expects to continue accessing a relcache entry's tupdesc
after closing the relcache entry is broken, independently of whether it's
in a debug build or not.

regards, tom lane
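(To make the coding rule concrete, a schematic sketch of the pattern being called broken; the function is invented and the heap_open/heap_close calls are the era-appropriate relation-access API, not taken from any actual caller:)

    #include "postgres.h"
    #include "access/heapam.h"
    #include "utils/rel.h"

    void
    broken_tupdesc_use(Oid relid)
    {
        Relation    rel = heap_open(relid, AccessShareLock);
        TupleDesc   tupdesc = RelationGetDescr(rel);

        heap_close(rel, AccessShareLock);

        /*
         * BROKEN: after heap_close, a cache invalidation can rebuild or
         * destroy the relcache entry, leaving tupdesc dangling.
         */
        elog(LOG, "relation has %d attributes", tupdesc->natts);
    }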
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Amit Langote
Date:
On 2017/07/20 22:19, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> On 19 July 2017 at 00:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> It's probably a bit late in the v10 cycle to be taking any risks in
>>> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
>>> as soon as the v11 cycle opens, unless someone can show an example
>>> of non-broken coding that requires it.  (And if so, there ought to
>>> be a regression test incorporating that.)
>
>> Would it be useful to keep in one of the memory checking assertion builds?
>
> Why?  Code that expects to continue accessing a relcache entry's tupdesc
> after closing the relcache entry is broken, independently of whether it's
> in a debug build or not.

Am I wrong in thinking that TupleDesc refcounting (along with resowner
tracking) allows one to use a TupleDesc without worrying about the lifetime
of its parent relcache entry?  I'm asking this independently of the concerns
being discussed in this thread about having the
RememberToFreeTupleDescAtEOX() mechanism on top of TupleDesc refcounting.

Thanks,
Amit
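(A sketch of the refcounting pattern Amit is asking about; the function is invented, but the Pin/Release macros are the existing ones from access/tupdesc.h:)

    #include "postgres.h"
    #include "access/tupdesc.h"
    #include "utils/rel.h"

    /*
     * PinTupleDesc bumps the descriptor's reference count (tracked by the
     * current resource owner), so the descriptor survives even if the
     * parent relcache entry is rebuilt underneath us.
     */
    void
    refcounted_tupdesc_use(Relation rel)
    {
        TupleDesc   tupdesc = RelationGetDescr(rel);

        PinTupleDesc(tupdesc);

        /* ... use tupdesc here, even across cache invalidations ... */

        ReleaseTupleDesc(tupdesc);
    }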
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Greg Stark
Date:
On 20 July 2017 at 14:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>
>> Would it be useful to keep in one of the memory checking assertion builds?
>
> Why?  Code that expects to continue accessing a relcache entry's tupdesc
> after closing the relcache entry is broken, independently of whether it's
> in a debug build or not.

Mea culpa.  I hadn't actually read the code, and assumed it would be
sensible to change from something that freed these tupdescs into something
that raised an assertion if they were still unfreed at end of transaction.
Reading the code, I see that it's only there to *avoid* freeing the
tupledesc just in case there's something still using it.  If we just free
it, then the normal memory-checking builds would catch cases where they're
used after being freed.

But what I still don't understand is whether the fact that it still passes
the regression tests means anything.  Unless there happened to be a sinval
message at the wrong time, the regression tests wouldn't uncover any problem
cases.  If it passed the tests on a CLOBBER_CACHE_ALWAYS build then perhaps
that would prove it's safe?  Or perhaps, if the columns haven't actually
been altered, it would still fail to fail?

--
greg
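(For reference, one conventional way to produce such a build is to define the symbol at configure time; the flags here are illustrative, not prescribed by the thread:)

    $ ./configure --enable-cassert --enable-debug CPPFLAGS='-DCLOBBER_CACHE_ALWAYS'
    $ make && make check-world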
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Justin Pryzby
Date:
Hi,

I just ran into this again in another context (see original discussion,
quoted below).

Some time ago, while initially introducing non-default stats targets, I set
our non-filtering columns to "STATISTICS 10", lower than the default, to
minimize the size of pg_statistic, which (at least at one point) I perceived
to have become bloated and to be causing issues (partially due to having an
excessive number of "daily" granularity partitions, a problem I've since
mitigated).

The large number of columns with a non-default stats target was (I think)
causing pg_dump --section=pre-data to take 10+ minutes, which makes
pg_upgrade more disruptive than necessary, so now I'm going back and fixing
it.

[pryzbyj@database ~]$ time sed '/SET STATISTICS 10;$/!d; s//SET STATISTICS -1;/' /srv/cdrperfbackup/ts/2017-10-17/pg_dump-section\=pre-data|psql -1q ts
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

[pryzbyj@database ~]$ dmesg |tail -n2
Out of memory: Kill process 6725 (postmaster) score 550 or sacrifice child
Killed process 6725, UID 26, (postmaster) total-vm:13544792kB, anon-rss:8977764kB, file-rss:8kB

So I'm hoping to encourage someone to commit the change contemplated
earlier.  Thanks in advance.

Justin

On Tue, Jul 18, 2017 at 07:26:30PM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > I've seen this before while doing SET STATISTICS on a larger number of columns
> > using xargs, but just came up while doing ADD of a large number of columns.
> > Seems to be roughly linear in number of children but superlinear WRT columns.
> > I think having to do with catalog update / cache invalidation with many
> > ALTERs*children*columns?
>
> I poked into this a bit.  The operation is necessarily roughly O(N^2) in
> the number of columns, because we rebuild the affected table's relcache
> entry after each elementary ADD COLUMN operation, and one of the principal
> components of that cost is reading all the pg_attribute entries.  However,
> that should only be a time cost not a space cost.  Eventually I traced the
> O(N^2) space consumption to RememberToFreeTupleDescAtEOX, which seems to
> have been introduced in Simon's commit e5550d5fe, and which strikes me as
> a kluge of the first magnitude.  Unless I am missing something, that
> function's design concept can fairly be characterized as "let's leak
> memory like there's no tomorrow, on the off chance that somebody somewhere
> is ignoring basic coding rules".
>
> I tried ripping that out, and the peak space consumption of your example
> (with 20 child tables and 1600 columns) decreased from ~3GB to ~200MB.
> Moreover, the system still passes make check-world, so it's not clear
> to me what excuse this code has to live.
>
> It's probably a bit late in the v10 cycle to be taking any risks in
> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
> as soon as the v11 cycle opens, unless someone can show an example
> of non-broken coding that requires it.  (And if so, there ought to
> be a regression test incorporating that.)
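(Since the space registered by RememberToFreeTupleDescAtEOX is, as the name suggests, only released at end of transaction, one plausible stopgap until the code is changed is to drop psql's -1 flag so that each ALTER commits, and is cleaned up, in its own transaction. A sketch, with the dump path abbreviated:)

    $ sed '/SET STATISTICS 10;$/!d; s//SET STATISTICS -1;/' pre-data.sql | psql -q ts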
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Justin Pryzby
Date:
On Tue, Jul 18, 2017 at 07:26:30PM -0400, Tom Lane wrote:
> > It's probably a bit late in the v10 cycle to be taking any risks in
> > this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
> > as soon as the v11 cycle opens, unless someone can show an example
> > of non-broken coding that requires it.  (And if so, there ought to
> > be a regression test incorporating that.)

I'm resending this in case it's been forgotten about and in case there's
still time this cycle to follow through in removing
RememberToFreeTupleDescAtEOX.

...And because I ran into it again earlier this month, ALTERing a
1600-column table with 500 children (actually while rewriting to reduce it
to 12 children), on a dedicated DB VM with 8GB RAM:

Mar  7 11:44:52 telsaDB kernel: Out of memory: Kill process 47490 (postmaster) score 644 or sacrifice child
Mar  7 11:44:52 telsaDB kernel: Killed process 47490, UID 26, (postmaster) total-vm:6813528kB, anon-rss:5212288kB, file-rss:2296kB

On Wed, Oct 18, 2017 at 02:54:54PM -0500, Justin Pryzby wrote:
> Hi,
>
> I just ran into this again in another context (see original discussion,
> quoted below).
>
> Some time ago, while initially introducing non-default stats targets, I set
> our non-filtering columns to "STATISTICS 10", lower than the default, to
> minimize the size of pg_statistic, which (at least at one point) I
> perceived to have become bloated and to be causing issues (partially due to
> having an excessive number of "daily" granularity partitions, a problem
> I've since mitigated).
>
> The large number of columns with a non-default stats target was (I think)
> causing pg_dump --section=pre-data to take 10+ minutes, which makes
> pg_upgrade more disruptive than necessary, so now I'm going back and
> fixing it.
>
> [pryzbyj@database ~]$ time sed '/SET STATISTICS 10;$/!d; s//SET STATISTICS -1;/' /srv/cdrperfbackup/ts/2017-10-17/pg_dump-section\=pre-data|psql -1q ts
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> connection to server was lost
>
> [pryzbyj@database ~]$ dmesg |tail -n2
> Out of memory: Kill process 6725 (postmaster) score 550 or sacrifice child
> Killed process 6725, UID 26, (postmaster) total-vm:13544792kB, anon-rss:8977764kB, file-rss:8kB
>
> So I'm hoping to encourage someone to commit the change contemplated
> earlier.
Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy
From
Amit Langote
Date:
On 2018/04/29 1:00, Justin Pryzby wrote:
> On Tue, Jul 18, 2017 at 07:26:30PM -0400, Tom Lane wrote:
>>> It's probably a bit late in the v10 cycle to be taking any risks in
>>> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
>>> as soon as the v11 cycle opens, unless someone can show an example
>>> of non-broken coding that requires it.  (And if so, there ought to
>>> be a regression test incorporating that.)

FWIW, I too thought it should be possible by now to get rid of
RememberToFreeTupleDescAtEOX, but I was able to come up with at least one
case where its existence prevents a crash.  The test case is based on one
provided by Noah Misch a while ago, when the code in question was
introduced as a measure for the problem he described:

https://www.postgresql.org/message-id/20130801005351.GA325106%40tornado.leadboat.com

Suppose in get_relation_constraints() we're processing a table that has two
constraints, one which has ccvalid=true and another which has ccvalid=false.
Further suppose a concurrent session performs a VALIDATE CONSTRAINT on the
second constraint, which requires just ShareUpdateExclusiveLock and so it
succeeds, sending out a sinval message.  Back in the first session,
eval_const_expressions() performed on the first constraint would end up in
RelationClearRelation being called on the table, wherein, since the second
constraint is now marked as valid by the concurrent session, rd_att would be
swapped.  Thus the pointer into rd_att that get_relation_constraints()
possesses would be a dangling pointer, if not for the call to
RememberToFreeTupleDescAtEOX() from RelationDestroyRelation().

If however get_relation_constraints() had incremented the reference count
of rd_att to begin with, this problem wouldn't have arisen.  IOW, we'd need
to replace all the direct accesses to rd_att (including those via
RelationGetDescr, of course) by something that increments rd_att's
reference count, before concluding that RememberToFreeTupleDescAtEOX() is
unnecessary.  I'm afraid that's a lot of places.

Thanks,
Amit
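(The interleaving Amit describes might be set up like this; a schematic two-session sketch with invented table and constraint names, since the dangerous timing window itself can't be forced from plain SQL:)

    -- setup: one valid and one NOT VALID check constraint
    CREATE TABLE t (a int, b int);
    ALTER TABLE t ADD CONSTRAINT t_a_check CHECK (a > 0);
    ALTER TABLE t ADD CONSTRAINT t_b_check CHECK (b > 0) NOT VALID;

    -- session 1: planning this walks the constraint list in
    -- get_relation_constraints()
    SELECT * FROM t WHERE a = 1;

    -- session 2, concurrently: needs only ShareUpdateExclusiveLock, so it
    -- does not block session 1, and its commit broadcasts a sinval message
    -- that can make session 1 rebuild the relcache entry mid-planning
    ALTER TABLE t VALIDATE CONSTRAINT t_b_check;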