Thread: very large db performance question
Hello All,

We will have a very large database to store microarray data (it may
exceed 80-100 GB some day). Currently we have 1 GB RAM, a 2 GHz
Pentium 4 (single CPU), and enough hard disk space.

I have never worked with such a large database before. I asked
several DBAs whether the hardware is adequate; some said it is fine
for querying, but I am not convinced, because I checked the mailing
list and learned that it is not unreasonable for a query to take
several minutes. I want queries to be as fast as possible.

Could anybody tell me whether our hardware is an issue? Do we really
need better hardware to make a real difference?

Regards,
William
LIANHE SHAO <lshao2@jhmi.edu> writes:
> We will have a very large database to store microarray data (it may
> exceed 80-100 GB some day). Currently we have 1 GB RAM, a 2 GHz
> Pentium 4 (single CPU), and enough hard disk space.

> Could anybody tell me whether our hardware is an issue?

IMHO the size of the DB is less relevant than the query workload. For
example, if you're storing 100 GB of data but only doing a single
index scan on it every 10 seconds, any modern machine with enough HD
space should be fine.

If you give us an idea of the number of queries you expect per second,
the approximate mix of reads and writes, and some idea of how complex
the queries are, we might be able to give you some better advice.

-Neil
Thanks for the reply. Actually, our database will only be used by a
few scientists (we predict), so workload is not a problem: updates
are very infrequent and the queries are not complex. The problem is
that we have one table that stores most of the data (about 200
million rows). That table has a text column on which we need to do a
full-text search for each row, and the result is then joined against
another table with 30,000 rows. Right now that query runs almost
forever.

I tried a small table with 2 million rows using the following simple
command, and it takes about 6 seconds to get the result back. So I am
confused; that is why I ask whether it is a hardware problem or
something else. (I vacuumed the whole database yesterday.)

PGA=> select count(*) from expressiondata;
  count
---------
 2197497
(1 row)

PGA=> explain select count(*) from expressiondata;
                                   QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=46731.71..46731.71 rows=1 width=0)
   ->  Seq Scan on expressiondata  (cost=0.00..41237.97 rows=2197497 width=0)
(2 rows)

Regards,
William

----- Original Message -----
From: Neil Conway <neilc@samurai.com>
Date: Wednesday, November 26, 2003 10:03 pm
Subject: Re: [PERFORM] very large db performance question

> LIANHE SHAO <lshao2@jhmi.edu> writes:
> > We will have a very large database to store microarray data (it may
> > exceed 80-100 GB some day). Currently we have 1 GB RAM, a 2 GHz
> > Pentium 4 (single CPU), and enough hard disk space.
>
> > Could anybody tell me whether our hardware is an issue?
>
> IMHO the size of the DB is less relevant than the query workload. For
> example, if you're storing 100 GB of data but only doing a single
> index scan on it every 10 seconds, any modern machine with enough HD
> space should be fine.
>
> If you give us an idea of the number of queries you expect per second,
> the approximate mix of reads and writes, and some idea of how complex
> the queries are, we might be able to give you some better advice.
>
> -Neil
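A note on the output above: plain EXPLAIN shows only the planner's
cost estimates. EXPLAIN ANALYZE actually executes the query and
reports real row counts and timings, which makes it much easier to
see where the 6 seconds go. A minimal sketch against the same table:

    -- Executes the query and prints the actual time spent in each plan node
    EXPLAIN ANALYZE SELECT count(*) FROM expressiondata;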
> Thanks for the reply. Actually, our database will only be used by a
> few scientists (we predict), so workload is not a problem: updates
> are very infrequent and the queries are not complex. The problem is
> that we have one table that stores most of the data (about 200
> million rows). That table has a text column on which we need to do a
> full-text search for each row, and the result is then joined against
> another table with 30,000 rows. Right now that query runs almost
> forever.

Use TSearch2.

> I tried a small table with 2 million rows using the following simple
> command, and it takes about 6 seconds to get the result back. So I am
> confused; that is why I ask whether it is a hardware problem or
> something else. (I vacuumed the whole database yesterday.)
>
> PGA=> select count(*) from expressiondata;
>   count
> ---------
>  2197497
> (1 row)

select count(*) on a Postgres table ALWAYS does a sequential scan;
there are technical reasons (MVCC) why this is so. It's a bad "test".
Just don't do it.

Chris
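For reference, a minimal TSearch2 sketch (from contrib/tsearch2),
assuming the searched text column on expressiondata is called
description; that column name is hypothetical:

    -- Add a tsvector column and fill it from the (hypothetical) text column
    ALTER TABLE expressiondata ADD COLUMN description_idx tsvector;
    UPDATE expressiondata SET description_idx = to_tsvector('default', description);

    -- A GiST index on the tsvector lets searches avoid scanning all 200M rows
    CREATE INDEX expressiondata_ftx ON expressiondata USING gist (description_idx);
    VACUUM ANALYZE expressiondata;

    -- Full-text search through the index instead of a sequential scan
    SELECT *
    FROM expressiondata
    WHERE description_idx @@ to_tsquery('default', 'receptor & kinase');

    -- Keep the tsvector current on future inserts and updates
    CREATE TRIGGER description_tsv_update
        BEFORE INSERT OR UPDATE ON expressiondata
        FOR EACH ROW
        EXECUTE PROCEDURE tsearch2(description_idx, description);

And on count(*): if an approximate row count is enough, the planner's
estimate in pg_class is instant (though only as fresh as the last
VACUUM or ANALYZE):

    SELECT reltuples FROM pg_class WHERE relname = 'expressiondata';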
Hi,

I have done some performance tests on 1 GB and 4 GB databases on a
single Pentium 4, 1 GB RAM, with an IDE disk, SCSI disks, and a RAID0
LUN on a DAS 5300, under Linux Red Hat 7.3. In each case my tests do
selects, updates, and inserts. One of them is pgbench; you can find it
in postgres/contrib/pgbench. The other is DBT1 from OSDL; I have
ported it to Postgres and you can find it on SourceForge. Jenny Wang
is writing a better Postgres DBT1 based on C transactions instead of
PL/pgSQL transactions.

With databases of this size, even after fine tuning of Postgres, the
bottleneck is I/O wait. So in your case, at 100 GB, you will be
I/O-bound. To summarize my observations on disk performance:

1) IDE disks are the slowest.
2) SCSI disks are a little faster, and you can cut I/O wait by a
   further 25% by creating a striped volume group across 3 disks.
3) A RAID0 across 5 DAS 5300 disks improves performance by another
   20%, as the DAS storage processor uses internal caches.

One thing that was very important in my case was the time for a (hot)
backup/restore. The pgbench database schema is too simple to give a
good idea, but the DBT1 schema is complex enough: on the RAID0 LUN,
the backup takes 12 min, while the restore takes 16 min, plus 10 min
to recreate the indexes and 255 min to recreate the foreign keys. That
is 4h41 for a 4 GB database. Extrapolating to a 100 GB database whose
schema has foreign keys and indexes: about 5 hours to back up and 117
hours to restore (~5 days). So if yours is a critical database, it is
better to use cold backups with snapshot tools.

Regards,
Thierry Missimilly

LIANHE SHAO wrote:
> Hello All,
> We will have a very large database to store microarray data (it may
> exceed 80-100 GB some day). Currently we have 1 GB RAM, a 2 GHz
> Pentium 4 (single CPU), and enough hard disk space.
>
> I have never worked with such a large database before. I asked
> several DBAs whether the hardware is adequate; some said it is fine
> for querying, but I am not convinced, because I checked the mailing
> list and learned that it is not unreasonable for a query to take
> several minutes. I want queries to be as fast as possible.
>
> Could anybody tell me whether our hardware is an issue? Do we really
> need better hardware to make a real difference?
>
> Regards,
> William
> IMHO the size of the DB is less relevant than the query workload. For
> example, if you're storing 100 GB of data but only doing a single
> index scan on it every 10 seconds, any modern machine with enough HD
> space should be fine.

I agree that the workload is likely to be the main issue in most
situations. However, if your queries involve lots of counting and
aggregating, your database contains several gigabytes of data, and you
are using commodity hardware, be prepared to wait anywhere from
minutes to hours, even if you are the only user.
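Since updates here are very infrequent, one common workaround for slow
repeated aggregation is to precompute the aggregates into a small
summary table and refresh it after each batch of updates. A sketch,
where the grouping column experiment_id is purely hypothetical:

    -- Precompute per-experiment row counts once, instead of on every query
    CREATE TABLE expression_summary AS
        SELECT experiment_id, count(*) AS n_rows
        FROM expressiondata
        GROUP BY experiment_id;

    -- Queries then hit the small table rather than the 200M-row one
    SELECT n_rows FROM expression_summary WHERE experiment_id = 42;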