Re: Large Database Performance suggestions - Mailing list pgsql-performance

From Joshua Marsh
Subject Re: Large Database Performance suggestions
Date
Msg-id 38242de904102608246d060f7d@mail.gmail.com
In response to Re: Large Database Performance suggestions  (Dave Cramer <pg@fastcrypt.com>)
Responses Re: Large Database Performance suggestions  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Large Database Performance suggestions  (Andrew Hammond <ahammond@ca.afilias.info>)
List pgsql-performance
Thanks for all of your help so far.  Here is some of the information
you guys were asking for:

Test System:
2x AMD Opteron 244 (1.8GHz)
8GB RAM
7x 72GB SCSI HDD (RAID 5)

postgresql.conf information:
#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 1000           # min 16, at least max_connections*2, 8KB each
#sort_mem = 1024                # min 64, size in KB
#vacuum_mem = 8192              # min 1024, size in KB
sort_mem = 4096000
vacuum_mem = 1024000

# - Free Space Map -

#max_fsm_pages = 20000          # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000       # min 100, ~50 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000   # min 25
#preload_libraries = ''

#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

#fsync = true                   # turns forced synchronization on or off
#wal_sync_method = fsync        # the default varies across platforms:
                                # fsync, fdatasync, open_sync, or open_datasync
#wal_buffers = 8                # min 4, 8KB each

# - Checkpoints -

#checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300       # range 30-3600, in seconds
#checkpoint_warning = 30        # 0 is off, in seconds
#commit_delay = 0               # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000

Everything else is at its default.  I actually think the WAL options
are at their defaults as well, but I don't recall exactly :)
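
For context on the units in that file: shared_buffers is counted in 8KB
buffers and sort_mem/vacuum_mem in KB, with sort_mem allocated per sort
operation.  Purely as an illustration (assumed values, not advice taken
from this thread), a more conventional starting point for an 8GB machine
on 7.4-era settings might look like:

# Illustrative sketch only -- assumed values, adjust for the real workload
shared_buffers = 50000           # 8KB each, so roughly 400MB
sort_mem = 65536                 # KB, allocated per sort operation
vacuum_mem = 262144              # KB used by VACUUM, roughly 256MB
effective_cache_size = 786432    # 8KB pages (~6GB), hint about the OS cache

By contrast, sort_mem = 4096000 permits roughly 4GB for each concurrent
sort, and shared_buffers = 1000 is only about 8MB of buffer cache.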

As for the queries and tables: the data we store is confidential, but
it is essentially an account number with a bunch of boolean fields
that specify whether a person meets certain criteria.  So a query may
look something like this:

SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
AND state = 'OH';

which is explained as something like this:
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on view_of_data (cost=0.00..25304.26 rows=22054 width=11)
   Filter: (has_name AND is_active_member AND ((state)::text = 'OH'::text))
(2 rows)
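
As an illustrative aside (not part of the original message): with a schema
along those lines, a partial index is one way a filter like that can use an
index scan instead of a full sequential scan when only a small fraction of
the table matches.  The table and column names come from the query above;
the index definition itself is an assumed sketch:

-- Hypothetical sketch; assumes view_of_data is a plain table (not a view)
-- with these columns.  Restricting the index to rows that satisfy the
-- boolean criteria keeps it small and lets the planner use it for queries
-- that repeat those conditions.
CREATE INDEX view_of_data_active_state_idx
    ON view_of_data (state)
    WHERE has_name AND is_active_member;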

Occasionally, because we store data from several sources, we will have
requests restricted to particular sources.  We simply intersect the
view_of_data table with a sources table that lists which acctnos belong
to which source.  Such a query would look something like this:

SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
AND state = 'OH' INTERSECT SELECT acctno FROM sources_data WHERE
source = 175;

which is explained as follows:
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 SetOp Intersect  (cost=882226.14..885698.20 rows=69441 width=11)
   ->  Sort  (cost=882226.14..883962.17 rows=694411 width=11)
         Sort Key: acctno
         ->  Append  (cost=0.00..814849.42 rows=694411 width=11)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..25524.80 rows=22054 width=11)
                     ->  Seq Scan on view_of_data  (cost=0.00..25304.26 rows=22054 width=11)
                           Filter: (has_name AND is_active_member AND ((state)::text = 'OH'::text))
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..789324.62 rows=672357 width=11)
                     ->  Seq Scan on sources_data  (cost=0.00..782601.05 rows=672357 width=11)
                           Filter: (source = 23)
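
For illustration only (not from the original message): the INTERSECT forces
both inputs to be appended, sorted, and de-duplicated, which is exactly the
Append/Sort/SetOp shape of the plan above.  A commonly suggested alternative
is to express the source restriction as an EXISTS subquery or a join; the
sketch below reuses the column names from the queries above and assumes
acctno identifies the same account in both tables:

-- Hypothetical rewrite of the INTERSECT query; note that INTERSECT also
-- removes duplicate acctno values, which EXISTS by itself does not.
SELECT v.acctno
FROM view_of_data v
WHERE v.has_name
  AND v.is_active_member
  AND v.state = 'OH'
  AND EXISTS (SELECT 1
              FROM sources_data s
              WHERE s.acctno = v.acctno
                AND s.source = 175);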


Again, we see our biggest bottlenecks when we get over about 50
million records.  The time to execute grows exponentially from that
point.

Thanks again for all of your help!

-Josh


On Fri, 22 Oct 2004 07:38:49 -0400, Dave Cramer <pg@fastcrypt.com> wrote:
> Josh,
>
> Your hardware setup would be useful too. It's surprising how slow some
> big name servers really are.
> If you are seriously considering memory sizes over 4G you may want to
> look at an opteron.
>
> Dave
>
>
>
> Joshua Marsh wrote:
>
> >Hello everyone,
> >
> >I am currently working on a data project that uses PostgreSQL
> >extensively to store, manage and maintain the data.  We haven't had
> >any problems regarding database size until recently.  The three major
> >tables we use never get bigger than 10 million records.  With this
> >size, we can do things like storing the indexes or even the tables in
> >memory to allow faster access.
> >
> >Recently, we have found customers who are wanting to use our service
> >with data files between 100 million and 300 million records.  At that
> >size, each of the three major tables will hold between 150 million and
> >700 million records.  At this size, I can't expect it to run queries
> >in 10-15 seconds (what we can do with 10 million records), but would
> >prefer to keep them all under a minute.
> >
> >We did some initial testing with a server with 8GB of RAM and found
> >we can do operations on data files up to 50 million records fairly
> >well, but performance drops dramatically after that.  Does anyone
> >have any suggestions on a good way to improve performance for these
> >extra-large tables?  Things that have come to mind are replication
> >and Beowulf clusters, but from what I have recently studied, these
> >don't do so well with singular processes.  We will have parallel
> >processes running, but it's more important that each individual
> >process be fast than that several run in parallel at once.
> >
> >Any help would be greatly appreciated!
> >
> >Thanks,
> >
> >Joshua Marsh
> >
> >P.S. Off-topic, I have a few invitations to gmail.  If anyone would
> >like one, let me know.
> >
> >
> >
> >
> >
>
> --
> Dave Cramer
> http://www.postgresintl.com
> 519 939 0336
> ICQ#14675561
>
>
