Question about disk IO and index use and seeking advice - Mailing list pgsql-performance

From: Nikolas Everett
Subject: Question about disk IO and index use and seeking advice
Msg-id: d4e11e980804240659i2ab46b7fp3e90467d840cf6d4@mail.gmail.com
List: pgsql-performance
I have a question about index use and IO and am seeking advice.

We are running Postgres 8.2.  We have two very large tables.  Our ~600,000,000 row table is changed very infrequently and, for historical reasons, sits on a 12 disk software RAID-6 attached to an LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS controller.  Our ~50,000,000 row staging table is on a 12 disk hardware RAID-10 behind a Dell PowerEdge Expandable RAID Controller 5.  Every row in the staging table is changed at least once and then deleted and recreated in the bigger table.  All of the staging table's indexes are on the RAID-10.  The Postgres data directory itself is on the RAID-6.  I think all the disks are 10K SATA.  The setup is kind of a beast.

So my disk IO and index question.  When I issue a query on the big table like this:
SELECT    column, count(*)
FROM    bigtable
GROUP BY column
ORDER BY count DESC
When I run dstat to see my disk IO, I see the software RAID-6 consistently holding over 70 MB/sec.  This is fine with me, but I generally don't like to run queries that table scan 600,000,000 rows.  So I do:
SELECT    column, count(*)
FROM    bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC
When I run dstat I see only around 2 MB/sec, and it is not consistent at all.

So my question is: why do I see such a low IO load for the index scan version?  And if I could tweak some setting to make Postgres use IO more aggressively, would it actually make the query faster?  The column I'm scanning has a correlation of .960858.  I haven't vacuumed since importing any of the data that I'm scanning, but the correlation should still be very high.  When I run a similar pair of queries on the hardware RAID I see similar behavior, except that both numbers are more than doubled.
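In case it's useful context, the .960858 figure is the planner's physical-ordering statistic from pg_stats.  Here is a minimal sketch of how I read it back and refresh it (assuming the column really is named "date" on "bigtable" -- our real names differ):

-- Read the planner's correlation statistic for the indexed column.
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'bigtable' AND attname = 'date';

-- Refresh statistics after a big import without paying for a full vacuum:
ANALYZE bigtable;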

Here is the explain output for the queries:
SELECT    column, count(*)
FROM    bigtable
GROUP BY column
ORDER BY count DESC
"Sort  (cost=74404440.58..74404444.53 rows=1581 width=10)"
"  Sort Key: count(*)"
"  ->  HashAggregate  (cost=74404336.81..74404356.58 rows=1581 width=10)"
"        ->  Seq Scan on bigtable (cost=0.00..71422407.21 rows=596385921 width=10)"
---------------
SELECT    column, count(*)
FROM    bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC
"Sort  (cost=16948.80..16948.81 rows=1 width=10)"
"  Sort Key: count(*)"
"  ->  HashAggregate  (cost=16948.78..16948.79 rows=1 width=10)"
"        ->  Index Scan using date_idx on bigtable (cost=0.00..16652.77 rows=59201 width=10)"
"              Index Cond: (date > '2008-04-21 00:00:00'::timestamp without time zone)"

So now the asking-for-advice part.  I have a few questions:
What is the fastest way to copy data from the smaller table to the larger table?
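For context, here is a minimal sketch of the kind of move we do today (table and column names are placeholders, not our real schema):

-- Move finished rows from staging into the big table, then clear staging.
-- "val", "date", "payload", and "done" are made-up placeholder names.
BEGIN;
INSERT INTO bigtable (val, date, payload)
    SELECT val, date, payload FROM stagingtable WHERE done;
DELETE FROM stagingtable WHERE done;
COMMIT;

Right now it's essentially INSERT ... SELECT followed by a DELETE; I'm wondering whether COPY or some other approach would beat that.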

We plan to rearrange the setup when we move to Postgres 8.3.  We'll probably move all the storage over to a SAN and slice the larger table into monthly or weekly tables.  Can someone point me to a good page on partitioning?  My gut tells me it should be better, but I'd like to learn more about why.
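In case it helps the discussion: my understanding of partitioning on 8.x is table inheritance plus CHECK constraints with constraint exclusion, roughly like this (a sketch with made-up names, untested against our schema):

-- Parent holds no rows; each child covers one month and carries a CHECK
-- constraint so the planner can skip it when the WHERE clause excludes it.
CREATE TABLE bigtable_parent (
    val   text,
    date  timestamp without time zone
);
CREATE TABLE bigtable_2008_04 (
    CHECK (date >= '2008-04-01' AND date < '2008-05-01')
) INHERITS (bigtable_parent);
CREATE INDEX bigtable_2008_04_date_idx ON bigtable_2008_04 (date);
-- Needed so the planner actually prunes non-matching children:
SET constraint_exclusion = on;

Part of the appeal, as I understand it, is that expiring a month becomes a DROP TABLE instead of a giant DELETE.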
Does anyone have experience migrating large databases to a SAN?  I understand that it'll give me better failover capabilities as long as the SAN itself doesn't go out, but are we going to be sacrificing performance for this?  And that doesn't even touch on the cost....

Thanks so much for reading through all this,

--Nik
