Thread: Large Database Performance suggestions

Large Database Performance suggestions

From: Joshua Marsh
Hello everyone,

I am currently working on a data project that uses PostgreSQL
extensively to store, manage and maintain the data.  We haven't had
any problems regarding database size until recently.  The three major
tables we use never get bigger than 10 million records.  With this
size, we can do things like storing the indexes or even the tables in
memory to allow faster access.

Recently, we have found customers who are wanting to use our service
with data files between 100 million and 300 million records.  At that
size, each of the three major tables will hold between 150 million and
700 million records.  At this size, I can't expect it to run queries
in 10-15 seconds (what we can do with 10 million records), but would
prefer to keep them all under a minute.

We did some initial testing with a server with 8GB of RAM and found
we can do operations on data files of up to 50 million records fairly
well, but performance drops dramatically after that.  Does anyone have
any suggestions on a good way to improve performance for these extra
large tables?  Things that have come to mind are replication and
Beowulf clusters, but from what I have recently studied, these don't
do so well with single processes.  We will have parallel processes
running, but it's more important that each individual process be fast
than that several run in parallel at once.

Any help would be greatly appreciated!

Thanks,

Joshua Marsh

P.S. Off-topic, I have a few invitations to gmail.  If anyone would
like one, let me know.

Re: Large Database Performance suggestions

From: Gavin Sherry
On Thu, 21 Oct 2004, Joshua Marsh wrote:

> Recently, we have found customers who are wanting to use our service
> with data files between 100 million and 300 million records.  At that
> size, each of the three major tables will hold between 150 million and
> 700 million records.  At this size, I can't expect it to run queries
> in 10-15 seconds (what we can do with 10 million records), but would
> prefer to keep them all under a minute.

To provide any useful information, we'd need to look at your table schemas
and sample queries.

The values for sort_mem and shared_buffers will also be useful.

Are you VACUUMing and ANALYZEing? (Or is the data read-only?)
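For example, something like this would show the current values and
refresh the planner statistics (plain VACUUM, without FULL, doesn't
block readers):

SHOW shared_buffers;
SHOW sort_mem;
VACUUM ANALYZE;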

gavin

Re: Large Database Performance suggestions

From: Tom Lane
Joshua Marsh <icub3d@gmail.com> writes:
> ... We did some initial testing with a server with 8GB of RAM and
> found we can do operations on data files of up to 50 million records
> fairly well, but performance drops dramatically after that.

What you have to ask is *why* does it drop dramatically?  There aren't
any inherent limits in Postgres that are going to kick in at that level.
I'm suspicious that you could improve the situation by adjusting
sort_mem and/or other configuration parameters; but there's not enough
info here to make specific recommendations.  I would suggest posting
EXPLAIN ANALYZE results for your most important queries both in the size
range where you are getting good results, and the range where you are not.
Then we'd have something to chew on.
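
For instance (table and column names here are only placeholders for
one of your real queries):

EXPLAIN ANALYZE SELECT * FROM your_big_table WHERE some_column = 42;

Keep in mind that EXPLAIN ANALYZE actually executes the query, so it
will take roughly as long as the query itself.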

            regards, tom lane

Re: Large Database Performance suggestions

From: Scott Marlowe
On Thu, 2004-10-21 at 21:14, Joshua Marsh wrote:
> Hello everyone,
>
> I am currently working on a data project that uses PostgreSQL
> extensively to store, manage and maintain the data.  We haven't had
> any problems regarding database size until recently.  The three major
> tables we use never get bigger than 10 million records.  With this
> size, we can do things like storing the indexes or even the tables in
> memory to allow faster access.
>
> Recently, we have found customers who are wanting to use our service
> with data files between 100 million and 300 million records.  At that
> size, each of the three major tables will hold between 150 million and
> 700 million records.  At this size, I can't expect it to run queries
> in 10-15 seconds (what we can do with 10 million records), but would
> prefer to keep them all under a minute.
>
> We did some initial testing with a server with 8GB of RAM and found
> we can do operations on data files of up to 50 million records fairly
> well, but performance drops dramatically after that.  Does anyone have
> any suggestions on a good way to improve performance for these extra
> large tables?  Things that have come to mind are replication and
> Beowulf clusters, but from what I have recently studied, these don't
> do so well with single processes.  We will have parallel processes
> running, but it's more important that each individual process be fast
> than that several run in parallel at once.


I'd assume that what's happening is that up to a certain data set size
it all fits in memory, and beyond that you go from being CPU/memory
bandwidth limited to being I/O limited.  If this is the case, then a
faster storage subsystem is the only real answer.  If the database is
mostly read, then a large RAID 5 or RAID 1+0 array should help quite a
bit.
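
One rough way to see how much is actually being read from disk rather
than found in the buffer cache is the block-level statistics view
(assuming stats_block_level is enabled in postgresql.conf; note that a
block "read" here may still be served from the OS cache):

SELECT relname, heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC;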

You might want to post EXPLAIN ANALYZE output for the queries that slow
down at the larger sizes, along with the schema for those tables, etc.


Re: Large Database Performance suggestions

From: Dave Cramer
Josh,

Your hardware setup would be useful too. It's surprising how slow some
big-name servers really are.
If you are seriously considering memory sizes over 4G, you may want to
look at an Opteron.

Dave

Joshua Marsh wrote:

>Hello everyone,
>
>I am currently working on a data project that uses PostgreSQL
>extensively to store, manage and maintain the data.  We haven't had
>any problems regarding database size until recently.  The three major
>tables we use never get bigger than 10 million records.  With this
>size, we can do things like storing the indexes or even the tables in
>memory to allow faster access.
>
>Recently, we have found customers who are wanting to use our service
>with data files between 100 million and 300 million records.  At that
>size, each of the three major tables will hold between 150 million and
>700 million records.  At this size, I can't expect it to run queries
>in 10-15 seconds (what we can do with 10 million records), but would
>prefer to keep them all under a minute.
>
>We did some initial testing with a server with 8GB of RAM and found
>we can do operations on data files of up to 50 million records fairly
>well, but performance drops dramatically after that.  Does anyone have
>any suggestions on a good way to improve performance for these extra
>large tables?  Things that have come to mind are replication and
>Beowulf clusters, but from what I have recently studied, these don't
>do so well with single processes.  We will have parallel processes
>running, but it's more important that each individual process be fast
>than that several run in parallel at once.
>
>Any help would be greatly appreciated!
>
>Thanks,
>
>Joshua Marsh
>
>P.S. Off-topic, I have a few invitations to gmail.  If anyone would
>like one, let me know.
>

--
Dave Cramer
http://www.postgresintl.com
519 939 0336
ICQ#14675561


Re: Large Database Performance suggestions

From: Joshua Marsh
Thanks for all of your help so far.  Here is some of the information
you guys were asking for:

Test System:
2x AMD Opteron 244 (1.8GHz)
8GB RAM
7x 72GB SCSI HDD (RAID 5)

postgresql.conf information:
#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 1000           # min 16, at least max_connections*2, 8KB each
#sort_mem = 1024                # min 64, size in KB
#vacuum_mem = 8192              # min 1024, size in KB
sort_mem = 4096000
vacuum_mem = 1024000

# - Free Space Map -

#max_fsm_pages = 20000          # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000       # min 100, ~50 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000   # min 25
#preload_libraries = ''

#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

#fsync = true                   # turns forced synchronization on or off
#wal_sync_method = fsync        # the default varies across platforms:
                                # fsync, fdatasync, open_sync, or open_datasync
#wal_buffers = 8                # min 4, 8KB each

# - Checkpoints -

#checkpoint_segments = 3        # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300       # range 30-3600, in seconds
#checkpoint_warning = 30        # 0 is off, in seconds
#commit_delay = 0               # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000

Everything else is at its default.  I actually think the WAL
options are at their defaults as well, but I don't recall exactly :)

As for the queries and tables: the data we store is confidential, but
it is essentially an account number with a bunch of boolean fields
that specify whether a person matches certain criteria.  So a query
may look something like:

SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
AND state = 'OH';

which is explained as something like this:
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on view_of_data (cost=0.00..25304.26 rows=22054 width=11)
   Filter: (has_name AND is_active_member AND ((state)::text = 'OH'::text))
(2 rows)
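
(For context, the table behind that plan is shaped roughly like this;
the column list is illustrative, not our real schema:)

CREATE TABLE view_of_data (
    acctno            varchar(11),
    has_name          boolean,
    is_active_member  boolean,
    state             varchar(2)
    -- ...plus many more boolean criteria columns
);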

Occasionally, because we store data from several sources, we will have
requests for data from particular sources.  We simply intersect the
view_of_data table with a sources table that lists which acctnos belong
to which source.  This query would look something like this:

SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
AND state = 'OH' INTERSECT SELECT acctno FROM sources_data WHERE
source = 175;

which is explained as follows:
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 SetOp Intersect  (cost=882226.14..885698.20 rows=69441 width=11)
   ->  Sort  (cost=882226.14..883962.17 rows=694411 width=11)
         Sort Key: acctno
         ->  Append  (cost=0.00..814849.42 rows=694411 width=11)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..25524.80 rows=22054 width=11)
                     ->  Seq Scan on view_of_data  (cost=0.00..25304.26 rows=22054 width=11)
                           Filter: (has_name AND is_active_member AND ((state)::text = 'OH'::text))
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..789324.62 rows=672357 width=11)
                     ->  Seq Scan on sources_data  (cost=0.00..782601.05 rows=672357 width=11)
                           Filter: (source = 23)


Again, we see our biggest bottlenecks when we get over about 50
million records.  The time to execute grows exponentially from that
point.

Thanks again for all of your help!

-Josh


On Fri, 22 Oct 2004 07:38:49 -0400, Dave Cramer <pg@fastcrypt.com> wrote:
> Josh,
>
> Your hardware setup would be useful too. It's surprising how slow some
> big name servers really are.
> If you are seriously considering memory sizes over 4G you may want to
> look at an opteron.
>
> Dave
>
>
>
> Joshua Marsh wrote:
>
> >Hello everyone,
> >
> >I am currently working on a data project that uses PostgreSQL
> >extensively to store, manage and maintain the data.  We haven't had
> >any problems regarding database size until recently.  The three major
> >tables we use never get bigger than 10 million records.  With this
> >size, we can do things like storing the indexes or even the tables in
> >memory to allow faster access.
> >
> >Recently, we have found customers who are wanting to use our service
> >with data files between 100 million and 300 million records.  At that
> >size, each of the three major tables will hold between 150 million and
> >700 million records.  At this size, I can't expect it to run queries
> >in 10-15 seconds (what we can do with 10 million records), but would
> >prefer to keep them all under a minute.
> >
> >We did some initial testing with a server with 8GB of RAM and found
> >we can do operations on data files of up to 50 million records fairly
> >well, but performance drops dramatically after that.  Does anyone have
> >any suggestions on a good way to improve performance for these extra
> >large tables?  Things that have come to mind are replication and
> >Beowulf clusters, but from what I have recently studied, these don't
> >do so well with single processes.  We will have parallel processes
> >running, but it's more important that each individual process be fast
> >than that several run in parallel at once.
> >
> >Any help would be greatly appreciated!
> >
> >Thanks,
> >
> >Joshua Marsh
> >
> >P.S. Off-topic, I have a few invitations to gmail.  If anyone would
> >like one, let me know.
> >
>
> --
> Dave Cramer
> http://www.postgresintl.com
> 519 939 0336
> ICQ#14675561
>
>

Re: Large Database Performance suggestions

From: Tom Lane
Joshua Marsh <icub3d@gmail.com> writes:
> shared_buffers = 1000           # min 16, at least max_connections*2, 8KB each

This is on the small side for an 8G machine.  I'd try 10000 or so.

> sort_mem = 4096000

Yikes.  You do realize you just said that *each sort operation* can use 4G?
(Actually, it's probably overflowing internally; I dunno what amount of
sort space you are really ending up with but it could be small.)  Try
something saner, maybe in the 10 to 100MB range.

> vacuum_mem = 1024000

This is probably excessive as well.
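
For example, something in this neighborhood would be a saner starting
point (rough numbers to tune from, not tested recommendations):

shared_buffers = 10000          # 8KB each, so about 80MB
sort_mem = 65536                # in KB, so 64MB per sort step
vacuum_mem = 131072             # in KB, so 128MB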

> #max_fsm_pages = 20000          # min max_fsm_relations*16, 6 bytes each
> #max_fsm_relations = 1000       # min 100, ~50 bytes each

You will need to bump these up a good deal to avoid database bloat.
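
As a very rough illustration only (the right values depend on how much
of the data changes between vacuums):

max_fsm_pages = 1000000         # 6 bytes each, ~6MB of shared memory
max_fsm_relations = 2000        # ~50 bytes each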

> Occasionally, because we store data from several sources, we will have
> requests for data from particular sources.  We simply intersect the
> view_of_data table with a sources table that lists which acctnos belong
> to which source.  This query would look something like this:

> SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
> AND state = 'OH' INTERSECT SELECT acctno FROM sources_data WHERE
> source = 175;

IMHO you need to rethink your table layout.  There is simply no way that
that query is going to be fast.  Adding a source column to view_of_data
would work much better.
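
Roughly (an untested sketch; the index name is just an example):

ALTER TABLE view_of_data ADD COLUMN source integer;
-- populate it from sources_data, then:
CREATE INDEX view_of_data_source_idx ON view_of_data (source);

after which the request becomes a single scan of one table:

SELECT acctno FROM view_of_data
WHERE has_name AND is_active_member AND state = 'OH' AND source = 175;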

If you're not in a position to redo the tables, you might try it as a
join:

SELECT acctno FROM view_of_data JOIN sources_data USING (acctno)
WHERE has_name AND is_active_member AND state = 'OH'
      AND source = 175;

but I'm not really sure if that will be better or not.

            regards, tom lane

Re: Large Database Performance suggestions

From: Andrew Hammond
Joshua Marsh wrote:
| Thanks for all of your help so far.  Here is some of the information
| you guys were asking for:
|
| Test System:
| 2x AMD Opteron 244 (1.8GHz)
| 8GB RAM
| 7x 72GB SCSI HDD (RAID 5)

You probably want to look at investing in a SAN.

--
Andrew Hammond    416-673-4138    ahammond@ca.afilias.info
Database Administrator, Afilias Canada Corp.
CB83 2838 4B67 D40F D086 3568 81FC E7E5 27AF 4A9A