Thread: Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy

Justin Pryzby <pryzby@telsasoft.com> writes:
> I've seen this before while doing SET STATISTICS on a larger number of columns
> using xargs, but just came up while doing ADD of a large number of columns.
> Seems to be roughly linear in number of children but superlinear WRT columns.
> I think having to do with catalog update / cache invalidation with many
> ALTERs*children*columns?

I poked into this a bit.  The operation is necessarily roughly O(N^2) in
the number of columns, because we rebuild the affected table's relcache
entry after each elementary ADD COLUMN operation, and one of the principal
components of that cost is reading all the pg_attribute entries.  However,
that should only be a time cost not a space cost.  Eventually I traced the
O(N^2) space consumption to RememberToFreeTupleDescAtEOX, which seems to
have been introduced in Simon's commit e5550d5fe, and which strikes me as
a kluge of the first magnitude.  Unless I am missing something, that
function's design concept can fairly be characterized as "let's leak
memory like there's no tomorrow, on the off chance that somebody somewhere
is ignoring basic coding rules".
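
For anyone following along, the mechanism amounts to roughly this (a
simplified sketch from my reading of relcache.c, not verbatim; see
RelationDestroyRelation and AtEOXact_RelationCache for the full story).
Each relcache rebuild parks the old tuple descriptor in a
transaction-lifetime array instead of freeing it, so N rebuilds of an
N-column table retain O(N^2) worth of attribute data until commit:

    static TupleDesc *EOXactTupleDescArray;    /* in CacheMemoryContext */
    static int  NextEOXactTupleDescNum = 0;
    static int  EOXactTupleDescArrayLen = 0;

    static void
    RememberToFreeTupleDescAtEOX(TupleDesc td)
    {
        if (EOXactTupleDescArray == NULL)
        {
            /* first use this transaction: allocate a small array */
            MemoryContext oldcxt = MemoryContextSwitchTo(CacheMemoryContext);

            EOXactTupleDescArray = (TupleDesc *) palloc(16 * sizeof(TupleDesc));
            EOXactTupleDescArrayLen = 16;
            NextEOXactTupleDescNum = 0;
            MemoryContextSwitchTo(oldcxt);
        }
        else if (NextEOXactTupleDescNum >= EOXactTupleDescArrayLen)
        {
            /* double the array; nothing is freed before end of xact */
            EOXactTupleDescArrayLen *= 2;
            EOXactTupleDescArray = (TupleDesc *)
                repalloc(EOXactTupleDescArray,
                         EOXactTupleDescArrayLen * sizeof(TupleDesc));
        }

        /* park the old tupdesc; AtEOXact_RelationCache frees the lot */
        EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
    }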

I tried ripping that out, and the peak space consumption of your example
(with 20 child tables and 1600 columns) decreased from ~3GB to ~200MB.
Moreover, the system still passes make check-world, so it's not clear
to me what excuse this code has to live.

It's probably a bit late in the v10 cycle to be taking any risks in
this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
as soon as the v11 cycle opens, unless someone can show an example
of non-broken coding that requires it.  (And if so, there ought to
be a regression test incorporating that.)
        regards, tom lane



On 19 July 2017 at 07:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Eventually I traced the O(N^2) space consumption to
> RememberToFreeTupleDescAtEOX, which seems to have been introduced in
> Simon's commit e5550d5fe, and which strikes me as a kluge of the first
> magnitude.
>
> It's probably a bit late in the v10 cycle to be taking any risks in
> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
> as soon as the v11 cycle opens, unless someone can show an example
> of non-broken coding that requires it.  (And if so, there ought to
> be a regression test incorporating that.)

Just FYI, I believe Simon's currently on holiday, so he may not notice this discussion as promptly as usual.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 19 July 2017 at 00:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> It's probably a bit late in the v10 cycle to be taking any risks in
> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
> as soon as the v11 cycle opens, unless someone can show an example
> of non-broken coding that requires it.  (And if so, there ought to
> be a regression test incorporating that.)

Would it be useful to keep in one of the memory checking assertion builds?

-- 
greg



Greg Stark <stark@mit.edu> writes:
> On 19 July 2017 at 00:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's probably a bit late in the v10 cycle to be taking any risks in
>> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
>> as soon as the v11 cycle opens, unless someone can show an example
>> of non-broken coding that requires it.  (And if so, there ought to
>> be a regression test incorporating that.)

> Would it be useful to keep in one of the memory checking assertion builds?

Why?  Code that expects to continue accessing a relcache entry's tupdesc
after closing the relcache entry is broken, independently of whether it's
in a debug build or not.
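
To illustrate the rule (a hypothetical function, not code from the
tree; heap_open/heap_close and CreateTupleDescCopy are the usual APIs):

    #include "postgres.h"
    #include "access/heapam.h"
    #include "access/tupdesc.h"
    #include "utils/rel.h"

    static void
    tupdesc_lifetime_example(Oid relid)
    {
        Relation    rel = heap_open(relid, AccessShareLock);
        TupleDesc   tupdesc = RelationGetDescr(rel);

        /* Fine: rd_att is certainly live while the entry is open. */
        elog(DEBUG1, "relation has %d attributes", tupdesc->natts);

        heap_close(rel, AccessShareLock);

        /*
         * BROKEN: after heap_close(), any catalog access can process a
         * sinval message, rebuild the relcache entry, and free the old
         * rd_att, leaving "tupdesc" dangling.  That is the only coding
         * pattern RememberToFreeTupleDescAtEOX "protects".
         */

        /*
         * Correct, if the descriptor must outlive the relcache entry:
         * take a private copy while the entry is still open.
         */
        rel = heap_open(relid, AccessShareLock);
        tupdesc = CreateTupleDescCopy(RelationGetDescr(rel));
        heap_close(rel, AccessShareLock);

        elog(DEBUG1, "private copy has %d attributes", tupdesc->natts);
        FreeTupleDesc(tupdesc);
    }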
        regards, tom lane



On 2017/07/20 22:19, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> On 19 July 2017 at 00:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> It's probably a bit late in the v10 cycle to be taking any risks in
>>> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
>>> as soon as the v11 cycle opens, unless someone can show an example
>>> of non-broken coding that requires it.  (And if so, there ought to
>>> be a regression test incorporating that.)
> 
>> Would it be useful to keep in one of the memory checking assertion builds?
> 
> Why?  Code that expects to continue accessing a relcache entry's tupdesc
> after closing the relcache entry is broken, independently of whether it's
> in a debug build or not.

Am I wrong in thinking that TupleDesc refcounting (along with resowner
tracking) allows one to use a TupleDesc without worrying about the
lifetime of its parent relcache entry?
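
Concretely, I have in mind a pattern like the following (just a sketch,
with a hypothetical function name; IncrTupleDescRefCount and
DecrTupleDescRefCount are the resowner-tracked pin/unpin calls in
tupdesc.c):

    static void
    use_pinned_tupdesc(Oid relid)
    {
        Relation    rel = heap_open(relid, AccessShareLock);
        TupleDesc   tupdesc = RelationGetDescr(rel);

        /*
         * Pin the descriptor: the refcount keeps it alive even if the
         * parent relcache entry is rebuilt meanwhile, and the resource
         * owner releases the pin on error, so it can't leak.
         */
        IncrTupleDescRefCount(tupdesc);
        heap_close(rel, AccessShareLock);

        /* ... safe to use tupdesc here, even across invalidations ... */

        DecrTupleDescRefCount(tupdesc);
    }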

I'm asking this independently of the concerns being discussed in this
thread about having the RememberToFreeTupleDescAtEOX() mechanism on top of
TupleDesc refcounting.

Thanks,
Amit




On 20 July 2017 at 14:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>
>> Would it be useful to keep in one of the memory checking assertion builds?
>
> Why?  Code that expects to continue accessing a relcache entry's tupdesc
> after closing the relcache entry is broken, independently of whether it's
> in a debug build or not.

Mea culpa. I hadn't actually read the code; I assumed it would be
sensible to change it from something that frees these tupdescs into
something that raises an assertion if they are still unfreed at end
of transaction.

Reading the code, I see that it's only there to *avoid* freeing the
tupdesc, just in case something is still using it. If we just freed
it, the normal memory-checking builds would catch cases where it's
used after being freed.

But what I still don't understand is whether the fact that it still
passes the regression tests means anything. Unless there happened to
be a sinval message at the wrong time, the regression tests wouldn't
uncover any problem cases. If it passed the tests on a
CLOBBER_CACHE_ALWAYS build, then perhaps that would prove it's safe?
Or perhaps, if the columns haven't actually been altered, it would
still fail to fail?

-- 
greg



Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table hierarchy

From: Justin Pryzby
Hi,

I just ran into this again in another context (see original discussion, quoted
below).

Some time ago, while initially introducing non-default stats targets, I set our
non-filtering columns to "STATISTICS 10", lower than the default, to minimize the
size of pg_statistic, which (at least at one point) I perceived to have become
bloated and to be causing issues (partially due to having an excessive number of
"daily" granularity partitions, a problem I've since mitigated).

The large number of columns with non-default stats targets was (I think) causing
pg_dump --section=pre-data to take 10+ minutes, which makes pg_upgrade more
disruptive than necessary, so now I'm going back and fixing it.

[pryzbyj@database ~]$ time sed '/SET STATISTICS 10;$/!d; s//SET STATISTICS -1;/' /srv/cdrperfbackup/ts/2017-10-17/pg_dump-section\=pre-data |psql -1q ts

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

[pryzbyj@database ~]$ dmesg |tail -n2
Out of memory: Kill process 6725 (postmaster) score 550 or sacrifice child
Killed process 6725, UID 26, (postmaster) total-vm:13544792kB, anon-rss:8977764kB, file-rss:8kB

So I'm hoping to encourage someone to commit the change contemplated earlier.

Thanks in advance.

Justin

On Tue, Jul 18, 2017 at 07:26:30PM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > I've seen this before while doing SET STATISTICS on a larger number of columns
> > using xargs, but just came up while doing ADD of a large number of columns.
> > Seems to be roughly linear in number of children but superlinear WRT columns.
> > I think having to do with catalog update / cache invalidation with many
> > ALTERs*children*columns?
> 
> I poked into this a bit.  The operation is necessarily roughly O(N^2) in
> the number of columns, because we rebuild the affected table's relcache
> entry after each elementary ADD COLUMN operation, and one of the principal
> components of that cost is reading all the pg_attribute entries.  However,
> that should only be a time cost not a space cost.  Eventually I traced the
> O(N^2) space consumption to RememberToFreeTupleDescAtEOX, which seems to
> have been introduced in Simon's commit e5550d5fe, and which strikes me as
> a kluge of the first magnitude.  Unless I am missing something, that
> function's design concept can fairly be characterized as "let's leak
> memory like there's no tomorrow, on the off chance that somebody somewhere
> is ignoring basic coding rules".
> 
> I tried ripping that out, and the peak space consumption of your example
> (with 20 child tables and 1600 columns) decreased from ~3GB to ~200MB.
> Moreover, the system still passes make check-world, so it's not clear
> to me what excuse this code has to live.
> 
> It's probably a bit late in the v10 cycle to be taking any risks in
> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
> as soon as the v11 cycle opens, unless someone can show an example
> of non-broken coding that requires it.  (And if so, there ought to
> be a regression test incorporating that.)



Re: [GENERAL] huge RAM use in multi-command ALTER of table hierarchy

From: Justin Pryzby
On Tue, Jul 18, 2017 at 07:26:30PM -0400, Tom Lane wrote:
> > It's probably a bit late in the v10 cycle to be taking any risks in
> > this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
> > as soon as the v11 cycle opens, unless someone can show an example
> > of non-broken coding that requires it.  (And if so, there ought to
> > be a regression test incorporating that.)

I'm resending this in case it's been forgotten about and in case there's still
time this cycle to follow through on removing RememberToFreeTupleDescAtEOX.

...And because I ran into it again earlier this month, ALTERing a 1600-column
table with 500 children (actually while rewriting to reduce it to 12 children), on
a dedicated DB VM with 8GB RAM:

Mar  7 11:44:52 telsaDB kernel: Out of memory: Kill process 47490 (postmaster) score 644 or sacrifice child
Mar  7 11:44:52 telsaDB kernel: Killed process 47490, UID 26, (postmaster) total-vm:6813528kB, anon-rss:5212288kB, file-rss:2296kB

On Wed, Oct 18, 2017 at 02:54:54PM -0500, Justin Pryzby wrote:
> Hi,
> 
> I just ran into this again in another context (see original discussion, quoted
> below).
> 
> Some time ago, while initially introducing non-default stats targets, I set our
> non-filtering columns to "STATISTICS 10", lower than the default, to minimize the
> size of pg_statistic, which (at least at one point) I perceived to have become
> bloated and to be causing issues (partially due to having an excessive number of
> "daily" granularity partitions, a problem I've since mitigated).
> 
> The large number of columns with non-default stats targets was (I think) causing
> pg_dump --section=pre-data to take 10+ minutes, which makes pg_upgrade more
> disruptive than necessary, so now I'm going back and fixing it.
> 
> [pryzbyj@database ~]$ time sed '/SET STATISTICS 10;$/!d; s//SET STATISTICS -1;/' /srv/cdrperfbackup/ts/2017-10-17/pg_dump-section\=pre-data |psql -1q ts
> 
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> connection to server was lost
> 
> [pryzbyj@database ~]$ dmesg |tail -n2
> Out of memory: Kill process 6725 (postmaster) score 550 or sacrifice child
> Killed process 6725, UID 26, (postmaster) total-vm:13544792kB, anon-rss:8977764kB, file-rss:8kB
> 
> So I'm hoping to encourage someone to commit the change contemplated earlier.


Re: [GENERAL] huge RAM use in multi-command ALTER of table hierarchy

From: Amit Langote
On 2018/04/29 1:00, Justin Pryzby wrote:
> On Tue, Jul 18, 2017 at 07:26:30PM -0400, Tom Lane wrote:
>>> It's probably a bit late in the v10 cycle to be taking any risks in
>>> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
>>> as soon as the v11 cycle opens, unless someone can show an example
>>> of non-broken coding that requires it.  (And if so, there ought to
>>> be a regression test incorporating that.)

Fwiw, I too thought it should be possible by now to get rid of
RememberToFreeTupleDescAtEOX, but I was able to come up with at least one
case where its existence prevents a crash.  The test case is based on one
that Noah Misch provided a while ago, when the code in question was
introduced as a measure against the problem he described:

https://www.postgresql.org/message-id/20130801005351.GA325106%40tornado.leadboat.com

Suppose get_relation_constraints() is processing a table that has 2
constraints, one which has ccvalid=true and another which has
ccvalid=false.  Further suppose a concurrent session performs a VALIDATE
CONSTRAINT on the second constraint, which requires just
ShareUpdateExclusiveLock and so it succeeds, sending out a sinval.  Back
in the first session, eval_const_expressions() performed on the first
constraint would end up in RelationClearRelation being called on the
table, wherein, since the 2nd constraint is now marked as valid by the
concurrent session, rd_att would be swapped out.  Thus the pointer into
rd_att that get_relation_constraints() holds would be a dangling
pointer, if not for the call to RememberToFreeTupleDescAtEOX() from
RelationDestroyRelation().
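
Roughly, the hazardous shape in get_relation_constraints() is this (a
simplified sketch of the plancat.c logic as I read it, not verbatim,
and with a hypothetical function name):

    static List *
    constraint_exprs_sketch(PlannerInfo *root, Relation relation)
    {
        TupleConstr *constr = relation->rd_att->constr;  /* into rd_att */
        List       *result = NIL;
        int         i;

        for (i = 0; i < constr->num_check; i++)
        {
            ConstrCheck *check = &constr->check[i];
            Node       *cexpr;

            if (!check->ccvalid)
                continue;       /* skip not-yet-validated constraints */

            cexpr = stringToNode(check->ccbin);

            /*
             * This can open catalogs and accept the sinval from the
             * concurrent VALIDATE CONSTRAINT, so RelationClearRelation
             * may rebuild the entry and swap out rd_att; "constr" and
             * "check" would then point into freed memory, except that
             * RememberToFreeTupleDescAtEOX() has parked the old tupdesc.
             */
            cexpr = eval_const_expressions(root, cexpr);

            result = lappend(result, cexpr);
        }
        return result;
    }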

If, however, get_relation_constraints() had incremented the reference count
of rd_att to begin with, this problem wouldn't have arisen.  IOW, we'd
need to replace all the direct accesses to rd_att (including those via
RelationGetDescr, of course) by something that increments rd_att's
reference count, before concluding that RememberToFreeTupleDescAtEOX() is
unnecessary.  I'm afraid that's a lot of places.
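
The fix I'm imagining would look something like this fragment (again
just a sketch):

    TupleDesc   tupdesc = RelationGetDescr(relation);

    IncrTupleDescRefCount(tupdesc);     /* resowner-tracked pin */

    /*
     * ... process tupdesc->constr->check[] exactly as before.  Even if
     * eval_const_expressions() triggers a relcache rebuild, the pinned
     * descriptor (including tupdesc->constr) stays valid; the one rule
     * is to keep using "tupdesc" rather than re-fetching
     * relation->rd_att after the pin.
     */

    DecrTupleDescRefCount(tupdesc);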

Thanks,
Amit