Thread: help tuning queries on large database

From
peter royal
Date:
Howdy.

I'm running into scaling problems when testing with a 16GB (data +
indexes) database.

I can run a query, and it returns in a few seconds. If I run it
again, it returns in a few milliseconds. I realize this is because
during subsequent runs, the necessary disk pages have been cached by
the OS.

I have experimented with having all 8 disks in a single RAID0 set, a
single RAID10 set, and currently 4 RAID0 sets of 2 disks each. There
hasn't been an appreciable difference in the overall performance of
my test suite, which randomly generates queries like the samples
below as well as a few other types; the problem manifests itself on
those other queries as well.

So, my question is: is there anything I can do to boost performance
with what I've got, or am I in a position where the only 'fix' is
more, faster disks? I can't think of any schema/index changes that
would help, since everything looks pretty optimal from the 'explain
analyze' output. I'd like to get a 10x improvement when querying from
the 'cold' state.

Thanks for any assistance. The advice gleaned from reading this list
has been invaluable in getting me to where I am now.
-peter


Configuration:

PostgreSQL 8.1.1

shared_buffers = 10000  # (it was higher, 50k, but that didn't help
any, so I brought it down to free RAM for the disk cache)
work_mem = 8196
random_page_cost = 3
effective_cache_size = 250000
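As a side note, both shared_buffers and effective_cache_size are counted in 8kB pages, so a quick back-of-the-envelope check of what the figures above amount to (the arithmetic here is purely illustrative):

```shell
# shared_buffers and effective_cache_size are in 8kB pages.
PAGE_KB=8
SHARED_BUFFERS=10000      # pages, from the config above
EFFECTIVE_CACHE=250000    # pages, from the config above
echo "shared_buffers       ~ $((SHARED_BUFFERS * PAGE_KB / 1024)) MB"
echo "effective_cache_size ~ $((EFFECTIVE_CACHE * PAGE_KB / 1024)) MB"
```

That is roughly 78MB of dedicated buffers and an assumed ~1.9GB of OS cache on a 3GB box.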


Hardware:

CentOS 4.2 (Linux 2.6.9-22.0.1.ELsmp)
Areca ARC-1220 8-port PCI-E controller
8 x Hitachi Deskstar 7K80 (SATA2) (7200rpm)
2 x Opteron 242 @ 1.6ghz
3gb RAM (should be 4gb, but a separate Linux issue is preventing us
from seeing all of it)
Tyan Thunder K8WE


RAID Layout:

4 2-disk RAID0 sets created

Each RAID set is a tablespace, formatted ext3. The majority of the
database is in the primary tablespace, and the popular object_data
table is in its own tablespace.


Sample 1:

triple_store=# explain analyze SELECT DISTINCT O.subject AS oid FROM
object_data O, object_tags T1, tags T2 WHERE O.type = 179 AND
O.subject = T1.object_id AND T1.tag_id = T2.tag_id AND T2.tag =
'transmitter\'s' LIMIT 1000;

         QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-------------------------
Limit  (cost=1245.07..1245.55 rows=97 width=4) (actual
time=3702.697..3704.665 rows=206 loops=1)
    ->  Unique  (cost=1245.07..1245.55 rows=97 width=4) (actual
time=3702.691..3703.900 rows=206 loops=1)
          ->  Sort  (cost=1245.07..1245.31 rows=97 width=4) (actual
time=3702.686..3703.056 rows=206 loops=1)
                Sort Key: o.subject
                ->  Nested Loop  (cost=2.82..1241.87 rows=97 width=4)
(actual time=97.166..3701.970 rows=206 loops=1)
                      ->  Nested Loop  (cost=2.82..678.57 rows=186
width=4) (actual time=59.903..1213.170 rows=446 loops=1)
                            ->  Index Scan using tags_tag_key on tags
t2  (cost=0.00..5.01 rows=1 width=4) (actual time=13.139..13.143
rows=1 loops=1)
                                  Index Cond: (tag =
'transmitter''s'::text)
                            ->  Bitmap Heap Scan on object_tags t1
(cost=2.82..670.65 rows=233 width=8) (actual time=46.751..1198.198
rows=446 loops=1)
                                  Recheck Cond: (t1.tag_id =
"outer".tag_id)
                                  ->  Bitmap Index Scan on
object_tags_tag_id_object_id  (cost=0.00..2.82 rows=233 width=0)
(actual time=31.571..31.571 rows=446 loops=1)
                                        Index Cond: (t1.tag_id =
"outer".tag_id)
                      ->  Index Scan using object_data_pkey on
object_data o  (cost=0.00..3.02 rows=1 width=4) (actual
time=5.573..5.574 rows=0 loops=446)
                            Index Cond: (o.subject = "outer".object_id)
                            Filter: ("type" = 179)
Total runtime: 3705.166 ms
(16 rows)

triple_store=# explain analyze SELECT DISTINCT O.subject AS oid FROM
object_data O, object_tags T1, tags T2 WHERE O.type = 179 AND
O.subject = T1.object_id AND T1.tag_id = T2.tag_id AND T2.tag =
'transmitter\'s' LIMIT 1000;

        QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-----------------------
Limit  (cost=1245.07..1245.55 rows=97 width=4) (actual
time=11.037..12.923 rows=206 loops=1)
    ->  Unique  (cost=1245.07..1245.55 rows=97 width=4) (actual
time=11.031..12.190 rows=206 loops=1)
          ->  Sort  (cost=1245.07..1245.31 rows=97 width=4) (actual
time=11.027..11.396 rows=206 loops=1)
                Sort Key: o.subject
                ->  Nested Loop  (cost=2.82..1241.87 rows=97 width=4)
(actual time=0.430..10.461 rows=206 loops=1)
                      ->  Nested Loop  (cost=2.82..678.57 rows=186
width=4) (actual time=0.381..3.479 rows=446 loops=1)
                            ->  Index Scan using tags_tag_key on tags
t2  (cost=0.00..5.01 rows=1 width=4) (actual time=0.058..0.061 rows=1
loops=1)
                                  Index Cond: (tag =
'transmitter''s'::text)
                            ->  Bitmap Heap Scan on object_tags t1
(cost=2.82..670.65 rows=233 width=8) (actual time=0.310..1.730
rows=446 loops=1)
                                  Recheck Cond: (t1.tag_id =
"outer".tag_id)
                                  ->  Bitmap Index Scan on
object_tags_tag_id_object_id  (cost=0.00..2.82 rows=233 width=0)
(actual time=0.199..0.199 rows=446 loops=1)
                                        Index Cond: (t1.tag_id =
"outer".tag_id)
                      ->  Index Scan using object_data_pkey on
object_data o  (cost=0.00..3.02 rows=1 width=4) (actual
time=0.009..0.010 rows=0 loops=446)
                            Index Cond: (o.subject = "outer".object_id)
                            Filter: ("type" = 179)
Total runtime: 13.411 ms
(16 rows)

triple_store=#


Sample 2:

triple_store=# explain analyze SELECT DISTINCT O.subject AS oid FROM
object_data O, object_tags T1, tags T2 WHERE O.type = 93 AND
O.subject = T1.object_id AND T1.tag_id = T2.tag_id AND T2.tag =
'current' LIMIT 1000;

         QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-------------------------
Limit  (cost=1241.88..1241.88 rows=1 width=4) (actual
time=6411.409..6411.409 rows=0 loops=1)
    ->  Unique  (cost=1241.88..1241.88 rows=1 width=4) (actual
time=6411.405..6411.405 rows=0 loops=1)
          ->  Sort  (cost=1241.88..1241.88 rows=1 width=4) (actual
time=6411.400..6411.400 rows=0 loops=1)
                Sort Key: o.subject
                ->  Nested Loop  (cost=2.82..1241.87 rows=1 width=4)
(actual time=6411.386..6411.386 rows=0 loops=1)
                      ->  Nested Loop  (cost=2.82..678.57 rows=186
width=4) (actual time=46.045..2229.978 rows=446 loops=1)
                            ->  Index Scan using tags_tag_key on tags
t2  (cost=0.00..5.01 rows=1 width=4) (actual time=11.798..11.802
rows=1 loops=1)
                                  Index Cond: (tag = 'current'::text)
                            ->  Bitmap Heap Scan on object_tags t1
(cost=2.82..670.65 rows=233 width=8) (actual time=34.222..2216.321
rows=446 loops=1)
                                  Recheck Cond: (t1.tag_id =
"outer".tag_id)
                                  ->  Bitmap Index Scan on
object_tags_tag_id_object_id  (cost=0.00..2.82 rows=233 width=0)
(actual time=25.523..25.523 rows=446 loops=1)
                                        Index Cond: (t1.tag_id =
"outer".tag_id)
                      ->  Index Scan using object_data_pkey on
object_data o  (cost=0.00..3.02 rows=1 width=4) (actual
time=9.370..9.370 rows=0 loops=446)
                            Index Cond: (o.subject = "outer".object_id)
                            Filter: ("type" = 93)
Total runtime: 6411.516 ms
(16 rows)

triple_store=# explain analyze SELECT DISTINCT O.subject AS oid FROM
object_data O, object_tags T1, tags T2 WHERE O.type = 93 AND
O.subject = T1.object_id AND T1.tag_id = T2.tag_id AND T2.tag =
'current' LIMIT 1000;

        QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-----------------------
Limit  (cost=1241.88..1241.88 rows=1 width=4) (actual
time=9.437..9.437 rows=0 loops=1)
    ->  Unique  (cost=1241.88..1241.88 rows=1 width=4) (actual
time=9.431..9.431 rows=0 loops=1)
          ->  Sort  (cost=1241.88..1241.88 rows=1 width=4) (actual
time=9.426..9.426 rows=0 loops=1)
                Sort Key: o.subject
                ->  Nested Loop  (cost=2.82..1241.87 rows=1 width=4)
(actual time=9.414..9.414 rows=0 loops=1)
                      ->  Nested Loop  (cost=2.82..678.57 rows=186
width=4) (actual time=0.347..3.477 rows=446 loops=1)
                            ->  Index Scan using tags_tag_key on tags
t2  (cost=0.00..5.01 rows=1 width=4) (actual time=0.039..0.042 rows=1
loops=1)
                                  Index Cond: (tag = 'current'::text)
                            ->  Bitmap Heap Scan on object_tags t1
(cost=2.82..670.65 rows=233 width=8) (actual time=0.297..1.688
rows=446 loops=1)
                                  Recheck Cond: (t1.tag_id =
"outer".tag_id)
                                  ->  Bitmap Index Scan on
object_tags_tag_id_object_id  (cost=0.00..2.82 rows=233 width=0)
(actual time=0.185..0.185 rows=446 loops=1)
                                        Index Cond: (t1.tag_id =
"outer".tag_id)
                      ->  Index Scan using object_data_pkey on
object_data o  (cost=0.00..3.02 rows=1 width=4) (actual
time=0.009..0.009 rows=0 loops=446)
                            Index Cond: (o.subject = "outer".object_id)
                            Filter: ("type" = 93)
Total runtime: 9.538 ms
(16 rows)

triple_store=#


Schema:

triple_store=# \d object_data
                Table "public.object_data"
     Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
subject       | integer                     | not null
type          | integer                     | not null
owned_by      | integer                     | not null
created_by    | integer                     | not null
created       | timestamp without time zone | not null
last_modified | timestamp without time zone | not null
label         | text                        |
Indexes:
     "object_data_pkey" PRIMARY KEY, btree (subject)
     "object_data_type_created_by" btree ("type", created_by)
     "object_data_type_owned_by" btree ("type", owned_by)
Foreign-key constraints:
     "object_data_created_by_fkey" FOREIGN KEY (created_by)
REFERENCES objects(object_id) DEFERRABLE INITIALLY DEFERRED
     "object_data_owned_by_fkey" FOREIGN KEY (owned_by) REFERENCES
objects(object_id) DEFERRABLE INITIALLY DEFERRED
     "object_data_type_fkey" FOREIGN KEY ("type") REFERENCES objects
(object_id) DEFERRABLE INITIALLY DEFERRED
Tablespace: "alt_2"

triple_store=# \d object_tags
    Table "public.object_tags"
   Column   |  Type   | Modifiers
-----------+---------+-----------
object_id | integer | not null
tag_id    | integer | not null
Indexes:
     "object_tags_pkey" PRIMARY KEY, btree (object_id, tag_id)
     "object_tags_tag_id" btree (tag_id)
     "object_tags_tag_id_object_id" btree (tag_id, object_id)
Foreign-key constraints:
     "object_tags_object_id_fkey" FOREIGN KEY (object_id) REFERENCES
objects(object_id) DEFERRABLE INITIALLY DEFERRED
     "object_tags_tag_id_fkey" FOREIGN KEY (tag_id) REFERENCES tags
(tag_id) DEFERRABLE INITIALLY DEFERRED

triple_store=# \d tags
                            Table "public.tags"
 Column |  Type   |                       Modifiers
--------+---------+-------------------------------------------------------
 tag_id | integer | not null default nextval('tags_tag_id_seq'::regclass)
 tag    | text    | not null
Indexes:
     "tags_pkey" PRIMARY KEY, btree (tag_id)
     "tags_tag_key" UNIQUE, btree (tag)

--
(peter.royal|osi)@pobox.com - http://fotap.org/~osi



Re: help tuning queries on large database

From
Tom Lane
Date:
peter royal <> writes:
> So, my question is, is there anything I can do to boost performance
> with what I've got, or am I in a position where the only 'fix' is
> more faster disks? I can't think of any schema/index changes that
> would help, since everything looks pretty optimal from the 'explain
> analyze' output. I'd like to get a 10x improvement when querying from
> the 'cold' state.

I don't think you have any hope of improving the "cold" state much.
The right way to think about this is not to be in the "cold" state.
Check your kernel parameters and make sure it's not set to limit
the amount of memory used for cache (I'm not actually sure if there
is such a limit on Linux, but there definitely is on some other Unixen).
Look around and see if you can reduce the memory used by processes,
or even better, offload non-database tasks to other machines.

Basically you need to get as much of the database as you can to stay
in disk cache.
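A minimal sketch of what that could look like in practice; the tablespace path and table name below are illustrative assumptions, not part of the original advice:

```shell
# Pre-warm the OS page cache after a restart so queries do not start
# from the "cold" state. DATADIR is an assumption; point it at the
# tablespace that holds the hot tables (e.g. /opt/alt-2).
DATADIR=${DATADIR:-/opt/alt-2}
if [ -d "$DATADIR" ]; then
    # read every relation file once; the pages stay in the OS cache
    find "$DATADIR" -type f -exec cat {} + > /dev/null
fi
# From SQL, a sequential scan has a similar effect for a single table:
#   psql triple_store -c 'SELECT count(*) FROM object_data;'
```

Whether a full pre-read pays off depends on how much of the database fits in RAM; at 16GB of data against 3-4GB of RAM, only the hot subset can be kept resident.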

            regards, tom lane

Re: help tuning queries on large database

From
Harry Jackson
Date:
On 1/6/06, peter royal <> wrote:
> PostgreSQL 8.1.1
>
> shared_buffers = 10000  # (It was higher, 50k, but didn't help any,
> so brought down to free ram for disk cache)
> work_mem = 8196
> random_page_cost = 3
> effective_cache_size = 250000

I have played with both disk cache settings and shared buffers and I
found that if I increased the shared buffers above a certain value
performance would increase dramatically. Playing with the effective
cache did not have the same amount of impact. I am currently running
with

shared_buffers = 254288 # approx 2.1Gb

and this is on a smaller dataset than yours.

--
Harry
http://www.hjackson.org
http://www.uklug.co.uk

Re: help tuning queries on large database

From
David Lang
Date:
On Fri, 6 Jan 2006, Tom Lane wrote:

> peter royal <> writes:
>> So, my question is, is there anything I can do to boost performance
>> with what I've got, or am I in a position where the only 'fix' is
>> more faster disks? I can't think of any schema/index changes that
>> would help, since everything looks pretty optimal from the 'explain
>> analyze' output. I'd like to get a 10x improvement when querying from
>> the 'cold' state.
>
> I don't think you have any hope of improving the "cold" state much.
> The right way to think about this is not to be in the "cold" state.
> Check your kernel parameters and make sure it's not set to limit
> the amount of memory used for cache (I'm not actually sure if there
> is such a limit on Linux, but there definitely is on some other Unixen).

Linux doesn't have any ability to limit the amount of memory used for
caching (there are periodicly requests for such a feature)

David Lang


Re: help tuning queries on large database

From
"Luke Lonergan"
Date:
Peter,

On 1/6/06 2:59 PM, "peter royal" <> wrote:

> I have experimented with having all 8 disks in a single RAID0 set, a
> single RAID10 set, and currently 4 RAID0 sets of 2 disks each. There
> hasn't been an appreciable difference in the overall performance of
> my test suite (which randomly generates queries like the samples
> below as well as a few other types. this problem manifests itself on
> other queries in the test suite as well).

Have you tested the underlying filesystem for its performance?  Run this:
  time bash -c 'dd if=/dev/zero of=/my_file_system/bigfile bs=8k
count=<your_memory_size_in_GB * 250000> && sync'

Then run this:
  time dd if=/my_file_system/bigfile bs=8k of=/dev/null

And report the times here please.  With your 8 disks in any of the RAID0
configurations you describe, you should be getting 480MB/s.  In the RAID10
configuration you should get 240.
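The count formula above just sizes the test file at roughly twice RAM, so the read back cannot be satisfied from the page cache. Spelled out (the RAM size is an assumption; adjust it to the machine):

```shell
RAM_GB=4                       # assumption: machine has 4GB of RAM
COUNT=$((RAM_GB * 250000))     # 250000 8kB blocks per GB of RAM
FILE_MB=$((COUNT * 8 / 1000))  # resulting file size, ~2x RAM
echo "dd if=/dev/zero of=/my_file_system/bigfile bs=8k count=$COUNT  # ~${FILE_MB}MB"
```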

Note that ext3 will not go faster than about 300MB/s in our experience.  You
should use xfs, which will run *much* faster.

You should also experiment with using larger readahead, which you can
implement like this:
  blockdev --setra 16384 /dev/<my_block_device>

E.g. "blockdev --setra 16384 /dev/sda"

This will set the readahead of Linux block device reads to 16384
512-byte sectors (8MB).  Using 3Ware's newest controllers we have seen
500MB/s+ on 8 disk drives in RAID0 on CentOS 4.1 with xfs.  Note that
you will need to run the "CentOS unsupported kernel" to get xfs.
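Taken together, the xfs and readahead suggestions amount to something like the following. The device name and mount point are assumptions, and mkfs destroys the data on the device, so treat this strictly as a sketch:

```shell
# DEV=/dev/sda                   # the array's block device (assumption)
# mkfs.xfs -f "$DEV"             # reformat the array as xfs (destructive!)
# mount -o noatime "$DEV" /opt/pgdata
# blockdev --setra 16384 "$DEV"  # 16384 512-byte sectors of readahead
# The blockdev setting does not survive a reboot; re-run it from
# rc.local or an equivalent boot script.
```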

> So, my question is, is there anything I can do to boost performance
> with what I've got, or am I in a position where the only 'fix' is
> more faster disks? I can't think of any schema/index changes that
> would help, since everything looks pretty optimal from the 'explain
> analyze' output. I'd like to get a 10x improvement when querying from
> the 'cold' state.

From what you describe, one of these is likely:
- the hardware isn't configured properly, or there's a driver problem
- you need to use xfs and tune your Linux readahead

- Luke



Re: help tuning queries on large database

From
Ron
Date:
I'll second all of Luke Lonergan's comments and add these.

You should be able to increase both "cold" and "warm" performance (as
well as data integrity; read below) considerably.
Ron

At 05:59 PM 1/6/2006, peter royal wrote:
><snip>
>Hardware:
>
>CentOS 4.2 (Linux 2.6.9-22.0.1.ELsmp)

Upgrade your kernel to at least  2.6.12
There's a known issue with earlier versions of the 2.6.x kernel and
64b CPUs like the Opteron.  See kernel.org for details.

>Areca ARC-1220 8-port PCI-E controller

Make sure you have 1GB or 2GB of cache.  Get the battery backup and
set the cache for write back rather than write through.

>8 x Hitachi Deskstar 7K80 (SATA2) (7200rpm)
>2 x Opteron 242 @ 1.6ghz
>3gb RAM (should be 4gb, but separate Linux issue preventing us from
>getting it to see all of it)
>Tyan Thunder K8WE
The K8WE has 8 DIMM slots.  That should be good for 16 or 32 GB of
RAM (depending on whether the mainboard recognizes 4GB DIMMs or not;
ask Tyan about the latest K8WE firmware).  If nothing else, 1GB DIMMs
are now so cheap that you should have no problem putting 8GB on the K8WE.

A 2.6.12 or later based Linux distro should have NO problems using
more than 4GB of RAM.

Among the other tricks having lots of RAM allows:
If some of your tables are Read Only or VERY rarely written to, you
can preload them at boot time and make them RAM resident using the
/etc/tmpfs trick.

In addition there is at least one company making a cheap battery
backed PCI-X card that can hold up to 4GB of RAM and pretend to be a
small HD to the OS.  I don't remember any names at the moment, but
there have been posts here and at storage.review.com on such products.


>RAID Layout:
>
>4 2-disk RAID0 sets created
You do know that a RAID 0 set provides _worse_ data protection than a
single HD?  Don't use RAID 0 for any data you want kept reliably.

With 8 HDs, the best config is probably
1 2HD RAID 1 + 1 6HD RAID 10  or
2 4HD RAID 10's

It is certainly true that once you have done everything you can with
RAM, the next set of HW optimizations is to add HDs.  The more the
better, up to the limits of your available PCI-X bandwidth.

In short, a 2nd fully populated RAID controller is not unreasonable.



Re: help tuning queries on large database

From
Kelly Burkhart
Date:
On 1/8/06, Ron <> wrote:
<snip>
Among the other tricks having lots of RAM allows:
If some of your tables are Read Only or VERY rarely written to, you
can preload them at boot time and make them RAM resident using the
/etc/tmpfs trick.

What is the /etc/tmpfs trick?

-K


Re: help tuning queries on large database

From
Harry Jackson
Date:
On 1/9/06, Kelly Burkhart <> wrote:
> On 1/8/06, Ron <> wrote:
> > <snip>
> > Among the other tricks having lots of RAM allows:
> > If some of your tables are Read Only or VERY rarely written to, you
> > can preload them at boot time and make them RAM resident using the
> > /etc/tmpfs trick.
>
>  What is the /etc/tmpfs trick?

I think he means you can create a directory that mounts an area of
RAM. If you put the tables on it then it will be very fast. I would
not recommend it for anything you cannot afford to lose.

I have also tried it and found that it did not produce as good
performance as I expected.
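For reference, the trick itself is roughly the following; the mount point, size, and the idea of pointing a tablespace at it are my illustration, not Ron's exact recipe:

```shell
# mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk   # RAM-backed filesystem
# chown postgres:postgres /mnt/ramdisk
# psql -c "CREATE TABLESPACE ram_ts LOCATION '/mnt/ramdisk'"
# then: ALTER TABLE some_readonly_table SET TABLESPACE ram_ts;
# Everything on tmpfs disappears at reboot or power loss, so this is
# only sensible for read-only or easily rebuilt data.
```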

--
Harry
http://www.hjackson.org
http://www.uklug.co.uk

Re: help tuning queries on large database

From
peter royal
Date:
On Jan 8, 2006, at 1:42 PM, Luke Lonergan wrote:
> Have you tested the underlying filesystem for its performance?
> Run this:
>   time bash -c 'dd if=/dev/zero of=/my_file_system/bigfile bs=8k
> count=<your_memory_size_in_GB * 250000> && sync'

This is a 2-disk RAID0

[root@bigboy /opt/alt-2]# time bash -c 'dd if=/dev/zero of=/opt/alt-2/
bigfile bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    1m27.143s
user    0m0.276s
sys     0m37.338s

'iostat -x' showed writes peaking at ~100MB/s


> Then run this:
>   time dd if=/my_file_system/bigfile bs=8k of=/dev/null

[root@bigboy /opt/alt-2]# time dd if=/opt/alt-2/bigfile bs=8k of=/dev/
null
1000000+0 records in
1000000+0 records out

real    1m9.846s
user    0m0.189s
sys     0m11.099s

'iostat -x' showed reads peaking at ~116MB/s


Again with kernel 2.6.15:

[root@bigboy ~]# time bash -c 'dd if=/dev/zero of=/opt/alt-2/bigfile
bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    1m29.144s
user    0m0.204s
sys     0m48.415s

[root@bigboy ~]# time dd if=/opt/alt-2/bigfile bs=8k of=/dev/null
1000000+0 records in
1000000+0 records out

real    1m9.701s
user    0m0.168s
sys     0m11.933s


> And report the times here please.  With your 8 disks in any of the
> RAID0
> configurations you describe, you should be getting 480MB/s.  In the
> RAID10
> configuration you should get 240.

Not anywhere near that. I'm scouring the 'net looking to see what
needs to be tuned at the HW level.

> You should also experiment with using larger readahead, which you can
> implement like this:
>   blockdev --setra 16384 /dev/<my_block_device>
>
> E.g. "blockdev --setra 16384 /dev/sda"

wow, this helped nicely. Without using the updated kernel, it took
28% off my testcase time.

> From what you describe, one of these is likely:
> - hardware isn't configured properly or a driver problem.

Using the latest Areca driver, looking to see if there is some
configuration that was missed.

> - you need to use xfs and tune your Linux readahead

Will try XFS soon, concentrating on the 'dd' speed issue first.


On Jan 8, 2006, at 4:35 PM, Ron wrote:
>> Areca ARC-1220 8-port PCI-E controller
>
> Make sure you have 1GB or 2GB of cache.  Get the battery backup and
> set the cache for write back rather than write through.

The card we've got doesn't have a SODIMM socket, since it's only an
8-port card.  My understanding was that the cache is used when writing?

> A 2.6.12 or later based Linux distro should have NO problems using
> more than 4GB or RAM.

Upgraded the kernel to 2.6.15, then we were able to set the BIOS
option for the 'Memory Hole' to 'Software' and it saw all 4G (under
2.6.11 we got a kernel panic with that set)

>> RAID Layout:
>>
>> 4 2-disk RAID0 sets created
> You do know that a RAID 0 set provides _worse_ data protection than
> a single HD?  Don't use RAID 0 for any data you want kept reliably.

yup, aware of that. I was planning on RAID10 for production, but just
broke it out into RAID0 sets for testing (from what I read, I
gathered that the read performance of RAID0 and RAID10 was comparable)


thanks for all the suggestions, I'll report back as I continue testing.

-pete

--
(peter.royal|osi)@pobox.com - http://fotap.org/~osi



Re: help tuning queries on large database

From
"Luke Lonergan"
Date:
Peter,

On 1/9/06 9:23 AM, "peter royal" <> wrote:

> This is a 2-disk RAID0

Your 2-disk results look fine - what about your 8-disk results?

Given that you want to run in production with RAID10, the most you should
expect is 2x the 2-disk results using all 8 of your disks.  If you want the
best rate for production while preserving data integrity, I recommend
running your Areca in RAID5, in which case you should expect 3.5x your
2-disk results (7 drives).  You can assume you'll get that if you use XFS +
readahead.  OTOH - I'd like to see your test results anyway :-)

- Luke



Re: help tuning queries on large database

From
peter royal
Date:
On Jan 9, 2006, at 2:01 PM, Luke Lonergan wrote:
> Peter,
>
> On 1/9/06 9:23 AM, "peter royal" <> wrote:
>
>> This is a 2-disk RAID0
>
> Your 2-disk results look fine - what about your 8-disk results?

after some further research the 2-disk RAID0 numbers are not bad.

I have a single drive of the same type hooked up to the SATA2 port on
the motherboard to boot from, and its performance numbers are (linux
2.6.15, ext3):

[root@bigboy ~]# time bash -c 'dd if=/dev/zero of=/tmp/bigfile bs=8k
count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    4m55.032s
user    0m0.256s
sys     0m47.299s
[root@bigboy ~]# time dd if=/tmp/bigfile bs=8k of=/dev/null
1000000+0 records in
1000000+0 records out

real    3m27.229s
user    0m0.156s
sys     0m13.377s

so, there is a clear advantage to RAID over a single drive.


now, some stats in a 8-disk configuration:

8-disk RAID0, ext3, 16k read-ahead

[root@bigboy /opt/pgdata]# time bash -c 'dd if=/dev/zero of=/opt/
pgdata/bigfile bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    0m53.030s
user    0m0.204s
sys     0m42.015s

[root@bigboy /opt/pgdata]# time dd if=/opt/pgdata/bigfile bs=8k of=/
dev/null
1000000+0 records in
1000000+0 records out

real    0m23.232s
user    0m0.144s
sys     0m13.213s


8-disk RAID0, xfs, 16k read-ahead

[root@bigboy /opt/pgdata]# time bash -c 'dd if=/dev/zero of=/opt/
pgdata/bigfile bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    0m32.177s
user    0m0.212s
sys     0m21.277s

[root@bigboy /opt/pgdata]# time dd if=/opt/pgdata/bigfile bs=8k of=/
dev/null
1000000+0 records in
1000000+0 records out

real    0m21.814s
user    0m0.172s
sys     0m13.881s


... WOW... highly impressed with the XFS write speed! Going to stick
with that!

Overall, I got a 50% boost in the overall speed of my test suite by
using XFS and the 16k read-ahead.

> Given that you want to run in production with RAID10, the most you
> should
> expect is 2x the 2-disk results using all 8 of your disks.  If you
> want the
> best rate for production while preserving data integrity, I recommend
> running your Areca in RAID5, in which case you should expect 3.5x your
> 2-disk results (7 drives).  You can assume you'll get that if you
> use XFS +
> readahead.  OTOH - I'd like to see your test results anyway :-)

I've been avoiding RAID5 after reading how performance drops when a
drive is out/rebuilding. The performance benefit will outweigh the
cost I think.

Thanks for the help!
-pete

--
(peter.royal|osi)@pobox.com - http://fotap.org/~osi



Re: help tuning queries on large database

From
"Luke Lonergan"
Date:
Peter,

On 1/9/06 12:59 PM, "peter royal" <> wrote:


> Overall, I got a 50% boost in the overall speed of my test suite by
> using XFS and the 16k read-ahead.

Yes, it all looks pretty good for your config, though it looks like you
might be adapter limited with the Areca - you should have seen a read time
with XFS of about 17 seconds.

OTOH - with RAID5, you are probably about balanced, you should see a read
time of about 19 seconds and instead you'll get your 22 which isn't too big
of a deal.

> Thanks for the help!

Sure - no problem!

BTW - I'm running tests right now with the 3Ware 9550SX controllers.  Two of
them on one machine running simultaneously with 16 drives, and we're getting
800MB/s sustained reads.  That's a 32GB file read in 40 seconds (!!)

At that rate, we're only about 3x slower than memory access (practically
limited at around 2GB/s even though the system bus peak is 10GB/s). So, the
point is, if you want to get close to your "warm" speed, you need to get
your disk I/O as close to main memory speed as you can.  With parallel I/O
you can do that (see Bizgres MPP for more).

- Luke



Re: help tuning queries on large database

From
Ron
Date:
At 12:23 PM 1/9/2006, peter royal wrote:

>On Jan 8, 2006, at 4:35 PM, Ron wrote:
>>>Areca ARC-1220 8-port PCI-E controller
>>
>>Make sure you have 1GB or 2GB of cache.  Get the battery backup and
>>set the cache for write back rather than write through.
>
>The card we've got doesn't have a SODIMM socket, since it's only an
>8-port card.  My understanding was that the cache is used when writing?
Trade in your 8 port ARC-1220 that doesn't support 1-2GB of cache for
a 12, 16, or 24 port Areca one that does.  It's that important.

Present generation SATA2 HDs should average ~50MBps raw ASTR.  The
Intel IOP333 DSP on the ARC's is limited to 800MBps, so that's your
limit per card.  That's 16 SATA2 HD's operating in parallel (16HD
RAID 0, 17 HD RAID 5, 32 HD RAID 10).

Next generation 2.5" form factor 10Krpm SAS HD's due to retail in
2006 are supposed to average ~90MBps raw ASTR.  8 such HDs in
parallel per ARC-12xx will be the limit.

Side Note: the PCI-Ex8 bus on the 12xx cards is good for ~1.6GBps
RWPB, so I expect Areca is going to be upgrading this controller to
at least 2x, if not 4x (would require replacing the x8 bus with a x16
bus), the bandwidth at some point.

A PCI-Ex16 bus is good for ~3.2GBps RWPB, so if you have the slots 4
such populated ARC cards will max out a PCI-Ex16 bus.

In your shoes, I think I would recommend replacing your 8 port
ARC-1220 with a 12 port ARC-1230 with 1-2GB of battery backed cache
and planning to get more of them as need arises.


>>A 2.6.12 or later based Linux distro should have NO problems using
>>more than 4GB or RAM.
>
>Upgraded the kernel to 2.6.15, then we were able to set the BIOS
>option for the 'Memory Hole' to 'Software' and it saw all 4G (under
>2.6.11 we got a kernel panic with that set)
There are some other kernel tuning params that should help memory and
physical IO performance.  Talk to a Linux kernel guru to get the
correct advice specific to your installation and application.


It should be noted that there are indications of some major
inefficiencies in pg's IO layer that make it compute bound under some
circumstances before it becomes IO bound.  These may or may not cause
trouble for you as you keep pushing the envelope for maximum IO performance.


With the kind of work you are doing and we are describing, I'm sure
you can have a _very_ zippy system.

Ron



Re: help tuning queries on large database

From
Mark Lewis
Date:
Ron,

A few days back you mentioned:

> Upgrade your kernel to at least  2.6.12
> There's a known issue with earlier versions of the 2.6.x kernel and
> 64b CPUs like the Opteron.  See kernel.org for details.
>

I did some searching and couldn't find any obvious mention of this issue
(I gave up after searching through the first few hundred instances of
"64" in the 2.6.12 changelog).

Would you mind being a little more specific about which issue you're
talking about?  We're about to deploy some new 16GB RAM Opteron DB
servers and I'd like to check and make sure RH backported whatever the
fix was to their current RHEL4 kernel.

Thanks,
Mark Lewis

Re: help tuning queries on large database

From
Ron
Date:
At 07:28 PM 1/10/2006, Mark Lewis wrote:
>Ron,
>
>A few days back you mentioned:
>
> > Upgrade your kernel to at least  2.6.12
> > There's a known issue with earlier versions of the 2.6.x kernel and
> > 64b CPUs like the Opteron.  See kernel.org for details.
> >
>
>I did some searching and couldn't find any obvious mention of this issue
>(I gave up after searching through the first few hundred instances of
>"64" in the 2.6.12 changelog).
>
>Would you mind being a little more specific about which issue you're
>talking about?  We're about to deploy some new 16GB RAM Opteron DB
>servers and I'd like to check and make sure RH backported whatever the
>fix was to their current RHEL4 kernel.
There are 3 issues I know about in general:
1= As Peter Royal noted on this list, pre-12 versions of 2.6.x have
problems with RAM of >= 4GB.

2= Pre-12 versions of 2.6.x, when running A64 or Xeon 64b SMP, seem to
be susceptible to "context switch storms".

3= Physical and memory IO is considerably improved in the later
versions of 2.6.x compared to 2.6.11 or earlier.

Talk to a real Linux kernel guru (I am not) for details and specifics.
Ron