Re: Sort performance on large tables - Mailing list pgsql-performance

From Ron Peacetree
Subject Re: Sort performance on large tables
Date
Msg-id 11779240.1131560773750.JavaMail.root@elwamui-hound.atl.sa.earthlink.net
In response to Sort performance on large tables  (Charlie Savage <cfis@interserv.com>)
List pgsql-performance
...and on those notes, let me repeat my often-stated advice that a DB server should be configured with as much RAM as is feasible.  4GB or more is strongly recommended.

I'll add that the HW you are using for a DB server should be able to hold _at least_ 4GB of RAM (note that modern _laptops_ can hold 2GB, and next year's are likely to hold 4GB).  I can't casually find specs on the D3000, but if it can't be upgraded to at least 4GB, you should be looking for new DB server HW.

At this writing, 4 1GB DIMMs (4GB) should set you back ~$300 or less.  4 2GB DIMMs (8GB) should cost ~$600.
As of now, very few mainboards support 4GB DIMMs, and I doubt the D3000 has such a mainboard.  If you can use them, 4 4GB DIMMs (16GB) will currently set you back ~$1600-$2400.

However you do it, it's well worth the money to have at least 4GB of RAM in a DB server.  It makes all kinds of problems just not exist.
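
Of course, the extra RAM only helps once PostgreSQL's memory settings are raised to match it.  As a rough sketch only (the right targets depend on the workload, and on 8.1 these are plain integers in postgresql.conf), the knobs to revisit after a RAM upgrade, and a quick way to check their current values:

  SHOW shared_buffers;        -- shared buffer cache, counted in 8kB buffers on 8.1
  SHOW effective_cache_size;  -- planner's estimate of the OS cache, in 8kB pages
  SHOW work_mem;              -- per-sort / per-hash memory, in kB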

Ron


-----Original Message-----
From: Simon Riggs <simon@2ndquadrant.com>
Sent: Nov 9, 2005 4:35 AM
To: Charlie Savage <cfis@interserv.com>, Luke Lonergan <llonergan@greenplum.com>
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Sort performance on large tables

On Tue, 2005-11-08 at 00:05 -0700, Charlie Savage wrote:

> Setup: Dell Dimension 3000, Suse 10, 1GB ram, PostgreSQL 8.1 RC 1 with

> I want to extract data out of the file, with the most important values
> being stored in a column called tlid.  The tlid field is an integer, and
> the values are 98% unique.  There is a second column called ogc_fid
> which is unique (it is a serial field).  I need to extract out unique
> TLID's (doesn't matter which duplicate I get rid of).  To do this I am
> running this query:
>
> SELECT tlid, min(ogc_fid)
> FROM completechain
> GROUP BY tlid;
>
> The results from explain analyze are:
>
> "GroupAggregate  (cost=10400373.80..11361807.88 rows=48071704 width=8)
> (actual time=7311682.715..8315746.835 rows=47599910 loops=1)"
> "  ->  Sort  (cost=10400373.80..10520553.06 rows=48071704 width=8)
> (actual time=7311682.682..7972304.777 rows=48199165 loops=1)"
> "        Sort Key: tlid"
> "        ->  Seq Scan on completechain  (cost=0.00..2228584.04
> rows=48071704 width=8) (actual time=27.514..773245.046 rows=48199165
> loops=1)"
> "Total runtime: 8486057.185 ms"

> Should I expect results like this?  I realize that the computer is quite
> low-end and is very IO bound for this query, but I'm still surprised
> that the sort operation takes so long.
>
> Out of curiosity, I setup an Oracle database on the same machine with
> the same data and ran the same query.  Oracle was over an order of
> magnitude faster. Looking at its query plan, it avoided the sort by
> using "HASH GROUP BY."  Does such a construct exist in PostgreSQL (I see
> only hash joins)?

PostgreSQL can do HashAggregates as well as GroupAggregates, just like
Oracle. HashAggs avoid the sort phase, so would improve performance
considerably. The difference in performance you are getting is because
of the different plan used. Did you specifically do anything to Oracle
to help it get that plan, or was it a pure out-of-the-box install (or
maybe even a "set this up for Data Warehousing" install)?

To get a HashAgg plan, you need to be able to fit all of the unique
values in memory. That would be 98% of 48071704 rows, each 8+ bytes
wide, giving a HashAgg memory sizing of over 375MB. You must allocate
memory of the next power of two above the level you want, so we would
need to allocate 512MB to work_mem before it would consider using a
HashAgg.
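
To put rough numbers on that (a back-of-envelope sketch only; the planner's real estimate also includes per-entry hash table overhead beyond the bare 8-byte key width):

  SELECT 48071704 * 0.98 * 8 AS raw_key_bytes;  -- roughly 377 million bytes
  -- with per-entry overhead that comfortably exceeds 375MB, and the next
  -- power of two above that is 512MB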

Can you let us know how high you have to set work_mem before an EXPLAIN
(not EXPLAIN ANALYZE) chooses the HashAgg plan?
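
For anyone reproducing this, the test looks something like the sketch below (plain EXPLAIN is enough because we only care about which plan is chosen, not how long it runs; on 8.1 work_mem is set in kB, so 524288 = 512MB):

  SET work_mem = 524288;        -- 512MB; adjust up or down to find the threshold
  EXPLAIN
  SELECT tlid, min(ogc_fid)
  FROM completechain
  GROUP BY tlid;
  -- look for a top-level "HashAggregate" node instead of
  -- "GroupAggregate" over a "Sort"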

Please be aware that publishing Oracle performance results is against
the terms of their licence and we seek to be both fair and legitimate,
especially within this public discussion forum.

Best Regards, Simon Riggs



