Thread: memory usage of group by select

memory usage of group by select

From: Anthony
Date: Tue, Dec 29, 2009 at 3:41 PM

Hi all,

I'm running a group by query on a table with over a billion rows and my memory usage is seemingly growing without bounds.  Eventually the mem usage exceeds my physical memory and everything starts swapping.  Here is what I gather to be the relevant info:

My machine has 768 megs of ram.

shared_buffers = 128MB
work_mem = 8MB # this was originally higher, but I brought it down to try to fix the problem - it hasn't
maintenance_work_mem = 256MB
fsync = off
checkpoint_segments = 30
effective_cache_size = 256MB #this was originally 512MB but I just recently brought it down - as I expected that didn't affect anything
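
For reference, the values the running server actually has in effect can be double-checked from psql with SHOW, e.g.:

data=# SHOW work_mem;
data=# SHOW shared_buffers;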

data=# explain select pid, min(oid) into nd_min from nd group by pid;
                               QUERY PLAN
------------------------------------------------------------------------
 HashAggregate  (cost=28173891.00..28174955.26 rows=85141 width=8)
   ->  Seq Scan on nd  (cost=0.00..21270355.00 rows=1380707200 width=8)
(2 rows)

data=# \d+ nd
            Table "fullplanet091207osm.nd"
 Column |  Type   | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
 oid    | integer | not null  | plain   |
 pid    | integer | not null  | plain   |
 ref    | integer |           | plain   |
Indexes:
    "nd_pkey" PRIMARY KEY, btree (pid, oid)
Has OIDs: no

VERSION = 'PostgreSQL 8.4.1 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.1-3ubuntu3) 4.4.1, 64-bit'
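
For what it's worth, the rows=85141 on the HashAggregate node in the plan above is the planner's estimate of the number of distinct pid values.  A rough way to compare that estimate against reality, using the names above (the count itself is an expensive query on ~1.4 billion rows):

data=# SELECT n_distinct FROM pg_stats
data-#  WHERE schemaname = 'fullplanet091207osm' AND tablename = 'nd' AND attname = 'pid';
data=# SELECT count(DISTINCT pid) FROM nd;  -- slow, but gives the true group count

A positive n_distinct in pg_stats is the estimated number of distinct values; a negative value means a fraction of the row count.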

Re: memory usage of group by select

From: Anthony
Date:

On Tue, Dec 29, 2009 at 3:41 PM, Anthony <osm@inbox.org> wrote:
> I'm running a group by query on a table with over a billion rows and my memory usage is seemingly growing without bounds.  Eventually the mem usage exceeds my physical memory and everything starts swapping.

I guess I didn't ask my question.  Is this expected behavior?  Is there any way for me to adjust my settings to avoid using so much memory?

Re: memory usage of group by select

From: Alvaro Herrera
Date: Tue, Dec 29, 2009 at 4:09 PM

Anthony wrote:
> On Tue, Dec 29, 2009 at 3:41 PM, Anthony <osm@inbox.org> wrote:
>
> > I'm running a group by query on a table with over a billion rows and my
> > memory usage is seemingly growing without bounds.  Eventually the mem usage
> > exceeds my physical memory and everything starts swapping.
> >
>
> I guess I didn't ask my question.  Is this expected behavior?  Is there any
> way for me to adjust my settings to avoid using so much memory?

It's expecting 85k distinct groups.  If that's not accurate, then
HashAggregate would use more memory than expected.  See if you can make
it work by setting enable_hashagg = off.

If that works, good -- the real solution is different.  Maybe you need
to ANALYZE more, or increase the size of the stats bin for this column.
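
In psql that might look something like this, reusing the query from the original post (a sketch):

data=# SET enable_hashagg = off;
data=# EXPLAIN SELECT pid, min(oid) INTO nd_min FROM nd GROUP BY pid;

The plan should then show a GroupAggregate over a Sort.  Unlike the 8.4 HashAggregate, which keeps its entire hash table in memory, the sort spills to disk once it exceeds work_mem, so memory use stays bounded.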

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: memory usage of group by select

From: Tom Lane
Date: Tue, Dec 29, 2009 at 5:04 PM

Alvaro Herrera <alvherre@commandprompt.com> writes:
> It's expecting 85k distinct groups.  If that's not accurate, then
> HashAggregate would use more memory than expected.  See if you can make
> it work by setting enable_hashagg = off.

> If that works, good -- the real solution is different.  Maybe you need
> to ANALYZE more, or increase the size of the stats bin for this column.

If ANALYZE consistently underestimates the number of distinct values,
you may have to force matters with ALTER TABLE SET STATISTICS DISTINCT.

            regards, tom lane

Re: memory usage of group by select

From: Anthony
Date:

On Tue, Dec 29, 2009 at 4:09 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> It's expecting 85k distinct groups.  If that's not accurate, then
> HashAggregate would use more memory than expected.

Great diagnosis.  There are actually about 76 million distinct groups.
 
> See if you can make it work by setting enable_hashagg = off.

Will do.  Maybe overnight tonight.  The explain is estimating that to take 10 times as long, and I can't afford to do that right now.

> If that works, good -- the real solution is different.  Maybe you need
> to ANALYZE more, or increase the size of the stats bin for this column.

On Tue, Dec 29, 2009 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If ANALYZE consistently underestimates the number of distinct values,
> you may have to force matters with ALTER TABLE SET STATISTICS DISTINCT.

Seems to be an 8.5 feature.  Otherwise, it'd be perfect since I happen to know the actual number (it's the row count of another table).
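
For anyone on a later release: the feature in question was eventually released in 9.0 (the development branch then called 8.5), where the estimate can be hard-wired per column and is picked up at the next ANALYZE.  Roughly, with 76000000 standing in for the ~76 million actual groups mentioned above:

data=# ALTER TABLE nd ALTER COLUMN pid SET (n_distinct = 76000000);
data=# ANALYZE nd (pid);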

I've run the analyze again and it's still estimating around 85K distinct pids.  That's with the default setting of 100 for default_statistics_target, but I'm not sure I want to mess with that right now (this table is just going to be dropped in a few days after I manipulate it a bit more).
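
If it ever did become worth it, the statistics target can also be raised for just this column rather than through default_statistics_target; a sketch, with 1000 as an arbitrary value:

data=# ALTER TABLE nd ALTER COLUMN pid SET STATISTICS 1000;
data=# ANALYZE nd (pid);

ANALYZE samples only on the order of 300 rows per point of statistics target, though, so even a much larger target can keep underestimating n_distinct on a table this size.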

The only question I really have is this: if I fix this number and leave enable_hashagg on, is it just going to have the same effect (for this one query) as turning enable_hashagg off?  Because if I'm just going to have to bite the bullet and run the query with GroupAggregate (cost=287511359.15..297867743.71), I guess I'll just have to do that.  Or try to figure out another way to get what I'm trying to get (I should have just imported the right numbers from the beginning, but the import took 3 days, so I don't feel like doing that again).
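
For what it's worth, one way to try the GroupAggregate plan without leaving the setting changed for the rest of the session is to scope it to a single transaction:

data=# BEGIN;
data=# SET LOCAL enable_hashagg = off;
data=# SELECT pid, min(oid) INTO nd_min FROM nd GROUP BY pid;
data=# COMMIT;

SET LOCAL reverts automatically at COMMIT or ROLLBACK.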

In any case, thanks a lot for the help, both of you.