autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Hi,

I've been running PostgreSQL on Ubuntu for many years, but now I'm in big trouble.

My system has 24 GB of real memory, but after some hours one autovacuum worker
is using 80-90% of memory; the OOM killer (out-of-memory killer) kills the
process with kill -9, and the PostgreSQL server restarts because of that
problem.

I changed the base configuration to use very small buffers and restarted the
server twice, but the problem still exists.

I think it's always the same table, and that table is huge: 111 GB of data and
3 indexes of 4 GB, 128 GB, and 12 GB. It's the planet_osm_ways table from
OpenStreetMap; maybe that helps.

With the server not running, 8 GB is used by other programs and 16 GB is free.

No idea what's going on.

regards
walter

--
View this message in context: http://postgresql.nabble.com/autovacuum-worker-running-amok-and-me-too-tp5840299.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: autovacuum worker running amok - and me too ;)

From
Tom Lane
Date:
wambacher <wnordmann@gmx.de> writes:
> My system has 24GB of real memory but after some hours one autovacuum worker
> is using 80-90% of  memory, the OOM-Killer (out of memory killer) kills the
> process with kill -9 and the postgresql-server is restarting because of that
> problem.

> i changed the base configuration to use very small buffers, restartetd the
> server twice but the problem still exists.

> i think, it's allways the same table and that table is huge: 111GB data and
> 3 Indices with 4GB, 128 GB and 12 GB. It's the table planet_osm_ways from
> openstreetmap. maybe that helps.

Maybe you could reduce the statistics targets for that table.

I think we've heard that the analyze functions for PostGIS data types are
memory hogs, too --- maybe it's worth inquiring about that on the postgis
mailing lists.

            regards, tom lane


Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Tom Lane-2 wrote
> Maybe you could reduce the statistics targets for that table.

I don't understand what you mean. Do you mean how often that table is
autovacuumed? At the moment it's about once a day or once every two days, I think.

> I think we've heard that the analyze functions for PostGIS data types are
> memory hogs, too --- maybe it's worth inquiring about that on the postgis
> mailing listl

Good idea, and yes, it's a table with PostGIS geometries and the
corresponding indexes.

I will ask the PostGIS guys too.

At the moment I have disabled autovacuum, but I'll need it again soon, because
the server runs 24/7 and the data changes permanently.

thanks
walter

BTW: no big problems with my mini-config, OK?





Re: autovacuum worker running amok - and me too ;)

From
Tom Lane
Date:
wambacher <wnordmann@gmx.de> writes:
> Tom Lane-2 wrote
>> Maybe you could reduce the statistics targets for that table.

> don't understand what you mean. do you mean how often that table is
> autovacuumed? at the moment about once a day or once in two days, i think.

No, I mean the amount of stats detail that ANALYZE tries to collect.
I'm guessing that it's not auto-vacuum but auto-analyze that's getting
OOMed.

See ALTER TABLE ... ALTER COLUMN ... SET STATISTICS.
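For a single large column, that looks something like the following (the
column name here is taken from later in this thread and is an assumption):

```sql
-- Lower the per-column statistics target (the default is 100),
-- then re-analyze so the change takes effect.
ALTER TABLE planet_osm_ways ALTER COLUMN nodes SET STATISTICS 10;
ANALYZE planet_osm_ways;
```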

            regards, tom lane


Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Tom Lane-2 wrote
> See ALTER TABLE SET STATISTICS TARGET.

Thanks, I will try it.

regards
walter

BTW: the PostGIS analyze problem was fixed more than a year ago, but
I'll ask them too.






Re: autovacuum worker running amok - and me too ;)

From
Roxanne Reid-Bennett
Date:
On 3/3/2015 6:17 PM, Tom Lane wrote:
> wambacher <wnordmann@gmx.de> writes:
>> My system has 24GB of real memory but after some hours one autovacuum worker
>> is using 80-90% of  memory, the OOM-Killer (out of memory killer) kills the
>> process with kill -9 and the postgresql-server is restarting because of that
>> problem.
>> i changed the base configuration to use very small buffers, restartetd the
>> server twice but the problem still exists.
>> i think, it's allways the same table and that table is huge: 111GB data and
>> 3 Indices with 4GB, 128 GB and 12 GB. It's the table planet_osm_ways from
>> openstreetmap. maybe that helps.
> Maybe you could reduce the statistics targets for that table.
>
> I think we've heard that the analyze functions for PostGIS data types are
> memory hogs, too --- maybe it's worth inquiring about that on the postgis
> mailing lists.

Most definitely ask on the PostGIS list.  Identify the full PostGIS and
Postgres versions as well.
We had a case on a trial upgrade (9.1 to 9.3, with a PostGIS upgrade from
2.0 to 2.1) where analyze was running amok memory-wise on 3 tables with
geometry types (posted on this list).
Unfortunately the VM that system was on got corrupted and I wasn't able
to provide the data for a test scenario to Paul Ramsey, so I never filed
the bug report.
(Perhaps the VM was the issue and NOT the upgrade...)  At the time, we
ended up NOT upgrading the production box due to this issue potentially
being a show-stopper.  I've also never tried to re-create the upgrade
stack to test it out on a current copy of the data.

Roxanne
>
>             regards, tom lane
>
>


--
[At other schools] I think the most common fault in general is to teach students how to pass exams instead of teaching
them the science.
Donald Knuth



Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Roxanne Reid-Bennett wrote
> Most definitely ask on the Postgis list.  Identify the full Postgis
> version and Postgres versions as well.

Hi Roxanne,

Seconds before sending it to the PostGIS list, I checked the
planet_osm_ways table, and there is no geometry column in it:

So this can't be a PostGIS problem. I'll check fastupdate and the other hints
I got yesterday.

But here is the missing info:


Regards
walter






Re: autovacuum worker running amok - and me too ;)

From
Paul Ramsey
Date:
Circling back on this one: I had a look at our analyze code. I found
one place where *maybe* we weren't freeing memory, and freed it, but
analyzing a 2M-record table I barely see any bump in memory usage
(from 22 MB up to 24 MB at peak) during analyze. And the change I made
didn't appear to alter that (though the objects were probably all
small enough that they weren't being detoasted into copies in any
event). Though maybe with a really big table? (With really big
objects?) Still, doesn't analyze just pull a limited sample
(approx. 30K rows max), so why would table size make any difference past a
certain point?

P.

On Tue, Mar 3, 2015 at 3:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> wambacher <wnordmann@gmx.de> writes:
>> My system has 24GB of real memory but after some hours one autovacuum worker
>> is using 80-90% of  memory, the OOM-Killer (out of memory killer) kills the
>> process with kill -9 and the postgresql-server is restarting because of that
>> problem.
>
>> i changed the base configuration to use very small buffers, restartetd the
>> server twice but the problem still exists.
>
>> i think, it's allways the same table and that table is huge: 111GB data and
>> 3 Indices with 4GB, 128 GB and 12 GB. It's the table planet_osm_ways from
>> openstreetmap. maybe that helps.
>
> Maybe you could reduce the statistics targets for that table.
>
> I think we've heard that the analyze functions for PostGIS data types are
> memory hogs, too --- maybe it's worth inquiring about that on the postgis
> mailing lists.
>
>                         regards, tom lane
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general


Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Paul Ramsey wrote
> Though maybe with a really big table? (with really big
> objects?) Though still, doesn't analyze just pull a limited sample
> (30K approx max) so why would table size make any difference after a
> certain point?

Hi Paul, "my" table is quite big (about 293,049,000 records), but the objects
are not.

nodes[] contains at most 2,000 bigints, and tags[] up to a few hundred
characters, sometimes a few thousand.

Watching the memory usage of the autovacuum process: it was getting bigger
and bigger at a nearly constant rate, some MB per minute, IIRC.

I'm just recreating planet_osm_ways_nodes without "fastupdate=off".
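A sketch of that recreate (the exact index definition is an assumption,
based on the standard osm2pgsql schema):

```sql
-- Rebuild the GIN index with the default fastupdate = on
DROP INDEX IF EXISTS planet_osm_ways_nodes;
CREATE INDEX planet_osm_ways_nodes ON planet_osm_ways USING gin (nodes);
```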

regards
walter





Re: autovacuum worker running amok - and me too ;)

From
Kevin Grittner
Date:
wambacher <wnordmann@gmx.de> wrote:
> watching the memory usage of the autovaccum process: is was getting bigger

> and bigger at nearly constant speed. some MB per minute, iir.


What are your settings for maintenance_work_mem and autovacuum_max_workers?
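For reference, both can be read from any psql session:

```sql
-- Current values of the two settings in question
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('maintenance_work_mem', 'autovacuum_max_workers');
```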


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Hi,

In my first post you can see all the parameters:

maintenance_work_mem = 64MB and two workers. I configured my system to the
absolute minimum (I have 24 GB of real memory), and the problem was still
there.

Last night I rebuilt one index (122 GB in size), and just now I started a
manual "analyze verbose planet_osm_ways" to see what's happening.
This will take some hours.

Regards
walter





Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
> ... this will need some hours.

Done after 30 minutes :)

Nearly 50% dead rows - strange.
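For anyone reproducing this: the dead-row estimate can also be read from the
statistics views without running ANALYZE again, e.g.:

```sql
-- Dead vs. live row estimates from the statistics collector
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'planet_osm_ways';
```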

Now I'll run a "vacuum verbose planet_osm_ways", because the system crashed
during the autovacuum.

Fingers crossed.

Walter







Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
crashed:

no idea what to do now.

walter





Re: autovacuum worker running amok - and me too ;)

From
Jim Nasby
Date:
On 3/5/15 2:06 PM, wambacher wrote:
> crashed:
>
> no idea what to do now.

Crashed? Or hit by the OOM killer? What's the log say?

While this is going on you might as well disable autovac for that table.
It'll keep crashing and will interfere with your manual vacuums.
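For that one table, a per-table storage parameter does it (re-enable later
with autovacuum_enabled = true):

```sql
-- Turn autovacuum off for just this table while debugging
ALTER TABLE planet_osm_ways SET (autovacuum_enabled = false);
```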

It sounds at this point like the problem is in vacuuming, not analyze.
Can you confirm? If so, please forgo analyzing the table until we can
get vacuum figured out.

What's the largest memory size that a vacuum/autovac against that table
gets to, compared to other backends? You mentioned 80-90% of memory
before, but I don't know if that was for analyze or what.

I wonder if we have some kind of memory leak in GIN's vacuum support...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Jim Nasby-5 wrote
> On 3/5/15 2:06 PM, wambacher wrote:
> Crashed? Or hit by the OOM killer? What's the log say?

Killed by OOM, but it had only 1.2 GB of memory, which looks OK to me.


> While this is going on you might as well disable autovac for that table.
> It'll keep crashing and will interfere with your manual vacuums.

I did that this morning; the crash happened while running
"vacuum verbose planet_osm_ways" from the CLI.

> It sounds at this point like the problem is in vacuuming, not analyze.
> Can you confirm? If so, please forgo analyzing the table until we can
> get vacuum figured out.

yes, it's the vacuum task.

> What's the largest memory size that a vacuum/autovac against that table
> gets to compared to other backends? You meantioned 80-90% of memory
> before, but I don't know if that was for analyze or what.

vacuum

> I wonder if we have some kind of memory leak in GIN's vacuum support...

may be.

At least I did the following:

- dropped the GIN index
- CLUSTER
- ANALYZE
- VACUUM

all without any problems.

Now I'll add the index again and do another vacuum by hand tomorrow.

It's 2:30 in Germany, and I'm feeling tired ;)

Regards
walter








Re: autovacuum worker running amok - and me too ;)

From
Jim Nasby
Date:
On 3/5/15 7:36 PM, wambacher wrote:
> Jim Nasby-5 wrote
>> >On 3/5/15 2:06 PM, wambacher wrote:
>> >Crashed? Or hit by the OOM killer? What's the log say?
> killed by OOM, but has only 1.2 GB mem, which is ok to me.

Ok, but...

>> >What's the largest memory size that a vacuum/autovac against that table
>> >gets to compared to other backends? You meantioned 80-90% of memory
>> >before, but I don't know if that was for analyze or what.
> vacuum

Which is it? Is the vacuum process using 1.2 GB (5% of memory), or is
it using 90% (~22 GB)?

BTW, with 1 GB shared buffers and 64 MB maintenance_work_mem, top reporting
a size of 1.2 GB doesn't surprise me at all (assuming it's including
shared memory in there).

This is starting to sound like a regular OOM problem. Have you tried the
steps in
http://postgresql.org/docs/9.4/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Jim Nasby-5 wrote
> Which is it? Is the vacuum process is using 1.2GB (5% of memory) or is
> it using 90% (~22GB)?

I ran the job 2-3 times.

- First with 18 GB of swap: I heard it thrashing, performance went down
extremely, and after 2 hours I killed the job (rebooted the system; there was
no other way to do it).

- Next without swap: I monitored the system with htop, and the vacuum task
was getting bigger and bigger until OOM killed it. VIRT was at about 20.x GB,
MEM% at 80-90%.

At this point I called for help.

- Next: rebuilt the GIN index without fastupdate=off, to use the default.
- vacuum planet_osm_ways on the console
- VIRT about 1.2 GB, MEM% about 3.4% in htop
- Crashed again; the attached system logs say "OOM killed it", but it was using
about 1.2 GB, which is fine to me (and you).

- Dropped the index, clustered, vacuumed --> no problems.
- Recreating the GIN index is still running: 96/121 GB, some hours to go.

Waiting for the next test.

waiting for next test.

> reporting
> a size of 1.2GB doesn't surprise me at all (assuming it's including
> shared memory in there).
>
> This is starting to sound like a regular OOM problem. Have you tried the
> steps in
> http://postgresql.org/docs/9.4/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

not yet, but i'll check it right now.

Regards
walter






Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
hi,

While waiting for the index (104/121 GB), I read the document
http://www.postgresql.org/docs/9.4/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
and will make some changes before the next analyze.

Some comments:

- The OOM killer did not kill the postmaster but the analyze job.
- I started with 24 GB real memory and 18 GB swap - that must be enough! --> killed

- I will reduce shared_mem from 1 GB to 512 MB.
- I will reduce max_connections to 100 (although no user was active; the
applications are down).
- I will do "sysctl -w vm.overcommit_memory=2" (just did it).
- I may do "echo -1000 > /proc/self/oom_score_adj", but only if the other
actions fail.
- The last steps for older kernels are not relevant; I'm running Ubuntu
14.04 LTS, which is kernel 3.13.
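To make the overcommit setting persist across reboots, the equivalent lines
would go into /etc/sysctl.conf (the ratio below is an illustrative value; the
kernel default is 50):

```
# /etc/sysctl.conf: strict overcommit accounting
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
```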

i'll keep you informed.

regards
walter








Re: autovacuum worker running amok - and me too ;)

From
Karsten Hilbert
Date:
On Fri, Mar 06, 2015 at 02:39:34AM -0700, wambacher wrote:

> some comments:
>
> - the OOM did not kill the Postmaster but the Analyze-Job.
> - started with 24GB real and 18GB Swap - that must be enought! --> killed

"Back in the days" it was conventional wisdom to have twice
as much swap as you've got physical memory. Of course, I am
not suggesting you provide 48GB of swap and your problem is
magically solved _but_ one thing we might take away from that
old adage is that one might "hope things to work better"
(say, while debugging) if there is at least as much swap as
there is physical RAM based on the naive assumption that in
this case '"everything" can be swapped out'.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Karsten Hilbert wrote
> Of course, I am
> not suggesting you provide 48GB of swap and your problem is
> magically solved _but_ one thing we might take away from that
> old adage is that one might "hope things to work better"
> (say, while debugging) if there is at least as much swap as
> there is physical RAM based on the naive assumption that in
> this case '"everything" can be swapped out'.

No problem at all; I've got terabytes of free disk space.






Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
wambacher wrote
> hi,
>
> waiting for the index (104/121GB), i read his document
> http://www.postgresql.org/docs/9.4/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
> and will do some changes before the next analyze:
>
> some comments:
>
> - the OOM did not kill the Postmaster but the Analyze-Job.
> - started with 24GB real and 18GB Swap - that must be enought! --> killed
>
> - will reduce shared_mem from 1GB to 512 MB
> - will reduce max_connections to 100 (but no user was actice, the
> applications are down)
> - will do "sysctl -w vm.overcommit_memory=2" (just did it)
> - may do "echo -1000 > /proc/self/oom_score_adj" but only if the other
> actions fail
> - the last steps for older kernels are not relevant, i'm running ubuntu
> 14.04 LTS, which is 3.13
>
> i'll keep you informed.
>
> regards
> walter

Very strange: after clustering, recreating the GIN index, and changing
nothing else, there are no problems at all.

Maybe the index was corrupt? Or there was so much reorganization needed
that analyze ran into trouble?

I'll do most of the changes now, reboot, and bring my system live again.
Let's see what happens over the next days.

Thanks to all
walter





Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Hi,

Some final results:

I monitored the vacuum process and logged some data, using one big table and
doing analyze/vacuum by hand. The table has two btree indexes and one GIN index.
maintenance_work_mem was 1 GB.

The analyze job used about 1.2 GB of virtual memory during the whole task; no
problems at all.

The vacuum job started with 3.3 GB and, after processing the two "simple"
indexes, used up to 5.5 GB of vmem, going down to 2.3 GB for the final
work.
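While a vacuum like this runs, the backend doing it can be identified from
SQL (column names as of PostgreSQL 9.2+):

```sql
-- Manual and autovacuum backends currently running VACUUM
SELECT pid, backend_start, query
FROM pg_stat_activity
WHERE query ILIKE 'vacuum%'
   OR query ILIKE 'autovacuum:%';
```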

<http://postgresql.nabble.com/file/n5840914/pidstat.png>

This led to out-of-memory problems during the last days. The vacuum of the
first table (planet_osm_ways) used *up to 12 GB* until the OOM killer killed
it.

Sorry, but *why do analyze and vacuum ignore maintenance_work_mem?* I have
no control over memory usage and ran into big trouble. Now I have added 50 GB
of swap to my 24 GB system and have to cross my fingers.

regards
walter






Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
Hi Paul,

I just found my system (24 GB mem, 72 GB swap) running nearly at its limits:

the current vacuum is using 66.7 GB (sixty-six point seven) of memory, and
my system is nearly down.

I'm sorry, but that must be a bug. Remember: it's the analyze of a
GIN index that is causing these problems. Various tables, same problem.

Regards
walter

<http://postgresql.nabble.com/file/n5841074/top.png>
During the last 10 minutes vacuum took one more GB; now it's using 67.5 GB of
mem.

Should I ask on the dev list? Open a ticket?





Re: autovacuum worker running amok - and me too ;)

From
wambacher
Date:
sorry, 64 GB swap





Re: autovacuum worker running amok - and me too ;)

From
Jim Nasby
Date:
On 3/9/15 3:56 AM, wambacher wrote:
> Hi paul
>
> just found my system (24 GB Mem, 72 GB Swap) running nearly at it's limits:
>
> The current vaccum is using 66.7 GB (sixty-six dot seven) GB of memory and
> my System is nearly down.
>
> I'm sorry, but that must be an bug. Remember: It's the Analyze of an
> GIN-Index that is making that problems. Various tables - same Problem.
>
> Regards
> walter
>
> <http://postgresql.nabble.com/file/n5841074/top.png>
> during the last 10 Minutes vaccum took one more GB, now it's using 67.5 if
> mem.
>
> should i ask at the dev-list? Open a ticket?

Sorry for the late reply.

Yes, that sounds like a bug in the GIN code. Please post to pgsql-bugs
or hit http://www.postgresql.org/support/submitbug/
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com