Thread: help tuning queries on large database
Howdy.

I'm running into scaling problems when testing with a 16GB (data + indexes) database.

I can run a query and it returns in a few seconds. If I run it again, it returns in a few milliseconds. I realize this is because during subsequent runs the necessary disk pages have been cached by the OS.

I have experimented with having all 8 disks in a single RAID0 set, a single RAID10 set, and currently 4 RAID0 sets of 2 disks each. There hasn't been an appreciable difference in the overall performance of my test suite (which randomly generates queries like the samples below as well as a few other types; the problem shows up on the other query types too).

So, my question is: is there anything I can do to boost performance with what I've got, or am I in a position where the only 'fix' is more, faster disks? I can't think of any schema/index changes that would help, since everything looks pretty optimal from the 'explain analyze' output. I'd like to get a 10x improvement when querying from the 'cold' state.

Thanks for any assistance. The advice from reading this list to get to where I am now has been invaluable.
-peter


Configuration:

PostgreSQL 8.1.1

shared_buffers = 10000   # (was higher, 50k, but that didn't help any, so brought down to free RAM for the disk cache)
work_mem = 8196
random_page_cost = 3
effective_cache_size = 250000


Hardware:

CentOS 4.2 (Linux 2.6.9-22.0.1.ELsmp)
Areca ARC-1220 8-port PCI-E controller
8 x Hitachi Deskstar 7K80 (SATA2, 7200rpm)
2 x Opteron 242 @ 1.6GHz
3GB RAM (should be 4GB, but a separate Linux issue is preventing us from getting it to see all of it)
Tyan Thunder K8WE


RAID Layout:

4 2-disk RAID0 sets created. Each RAID set is a tablespace, formatted ext3. The majority of the database is in the primary tablespace, and the popular object_data table is in its own tablespace.
Sample 1:

triple_store=# explain analyze SELECT DISTINCT O.subject AS oid FROM object_data O, object_tags T1, tags T2 WHERE O.type = 179 AND O.subject = T1.object_id AND T1.tag_id = T2.tag_id AND T2.tag = 'transmitter\'s' LIMIT 1000;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Limit  (cost=1245.07..1245.55 rows=97 width=4) (actual time=3702.697..3704.665 rows=206 loops=1)
   ->  Unique  (cost=1245.07..1245.55 rows=97 width=4) (actual time=3702.691..3703.900 rows=206 loops=1)
         ->  Sort  (cost=1245.07..1245.31 rows=97 width=4) (actual time=3702.686..3703.056 rows=206 loops=1)
               Sort Key: o.subject
               ->  Nested Loop  (cost=2.82..1241.87 rows=97 width=4) (actual time=97.166..3701.970 rows=206 loops=1)
                     ->  Nested Loop  (cost=2.82..678.57 rows=186 width=4) (actual time=59.903..1213.170 rows=446 loops=1)
                           ->  Index Scan using tags_tag_key on tags t2  (cost=0.00..5.01 rows=1 width=4) (actual time=13.139..13.143 rows=1 loops=1)
                                 Index Cond: (tag = 'transmitter''s'::text)
                           ->  Bitmap Heap Scan on object_tags t1  (cost=2.82..670.65 rows=233 width=8) (actual time=46.751..1198.198 rows=446 loops=1)
                                 Recheck Cond: (t1.tag_id = "outer".tag_id)
                                 ->  Bitmap Index Scan on object_tags_tag_id_object_id  (cost=0.00..2.82 rows=233 width=0) (actual time=31.571..31.571 rows=446 loops=1)
                                       Index Cond: (t1.tag_id = "outer".tag_id)
                     ->  Index Scan using object_data_pkey on object_data o  (cost=0.00..3.02 rows=1 width=4) (actual time=5.573..5.574 rows=0 loops=446)
                           Index Cond: (o.subject = "outer".object_id)
                           Filter: ("type" = 179)
 Total runtime: 3705.166 ms
(16 rows)

triple_store=# explain analyze SELECT DISTINCT O.subject AS oid FROM object_data O, object_tags T1, tags T2 WHERE O.type = 179 AND O.subject = T1.object_id AND T1.tag_id = T2.tag_id AND T2.tag = 'transmitter\'s' LIMIT 1000;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Limit  (cost=1245.07..1245.55 rows=97 width=4) (actual time=11.037..12.923 rows=206 loops=1)
   ->  Unique  (cost=1245.07..1245.55 rows=97 width=4) (actual time=11.031..12.190 rows=206 loops=1)
         ->  Sort  (cost=1245.07..1245.31 rows=97 width=4) (actual time=11.027..11.396 rows=206 loops=1)
               Sort Key: o.subject
               ->  Nested Loop  (cost=2.82..1241.87 rows=97 width=4) (actual time=0.430..10.461 rows=206 loops=1)
                     ->  Nested Loop  (cost=2.82..678.57 rows=186 width=4) (actual time=0.381..3.479 rows=446 loops=1)
                           ->  Index Scan using tags_tag_key on tags t2  (cost=0.00..5.01 rows=1 width=4) (actual time=0.058..0.061 rows=1 loops=1)
                                 Index Cond: (tag = 'transmitter''s'::text)
                           ->  Bitmap Heap Scan on object_tags t1  (cost=2.82..670.65 rows=233 width=8) (actual time=0.310..1.730 rows=446 loops=1)
                                 Recheck Cond: (t1.tag_id = "outer".tag_id)
                                 ->  Bitmap Index Scan on object_tags_tag_id_object_id  (cost=0.00..2.82 rows=233 width=0) (actual time=0.199..0.199 rows=446 loops=1)
                                       Index Cond: (t1.tag_id = "outer".tag_id)
                     ->  Index Scan using object_data_pkey on object_data o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.010 rows=0 loops=446)
                           Index Cond: (o.subject = "outer".object_id)
                           Filter: ("type" = 179)
 Total runtime: 13.411 ms
(16 rows)

triple_store=#

Sample 2:

triple_store=# explain analyze SELECT DISTINCT O.subject AS oid FROM object_data O, object_tags T1, tags T2 WHERE O.type = 93 AND O.subject = T1.object_id AND T1.tag_id = T2.tag_id AND T2.tag = 'current' LIMIT 1000;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Limit  (cost=1241.88..1241.88 rows=1 width=4) (actual time=6411.409..6411.409 rows=0 loops=1)
   ->  Unique  (cost=1241.88..1241.88 rows=1 width=4) (actual time=6411.405..6411.405 rows=0 loops=1)
         ->  Sort  (cost=1241.88..1241.88 rows=1 width=4) (actual time=6411.400..6411.400 rows=0 loops=1)
               Sort Key: o.subject
               ->  Nested Loop  (cost=2.82..1241.87 rows=1 width=4) (actual time=6411.386..6411.386 rows=0 loops=1)
                     ->  Nested Loop  (cost=2.82..678.57 rows=186 width=4) (actual time=46.045..2229.978 rows=446 loops=1)
                           ->  Index Scan using tags_tag_key on tags t2  (cost=0.00..5.01 rows=1 width=4) (actual time=11.798..11.802 rows=1 loops=1)
                                 Index Cond: (tag = 'current'::text)
                           ->  Bitmap Heap Scan on object_tags t1  (cost=2.82..670.65 rows=233 width=8) (actual time=34.222..2216.321 rows=446 loops=1)
                                 Recheck Cond: (t1.tag_id = "outer".tag_id)
                                 ->  Bitmap Index Scan on object_tags_tag_id_object_id  (cost=0.00..2.82 rows=233 width=0) (actual time=25.523..25.523 rows=446 loops=1)
                                       Index Cond: (t1.tag_id = "outer".tag_id)
                     ->  Index Scan using object_data_pkey on object_data o  (cost=0.00..3.02 rows=1 width=4) (actual time=9.370..9.370 rows=0 loops=446)
                           Index Cond: (o.subject = "outer".object_id)
                           Filter: ("type" = 93)
 Total runtime: 6411.516 ms
(16 rows)

triple_store=# explain analyze SELECT DISTINCT O.subject AS oid FROM object_data O, object_tags T1, tags T2 WHERE O.type = 93 AND O.subject = T1.object_id AND T1.tag_id = T2.tag_id AND T2.tag = 'current' LIMIT 1000;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Limit  (cost=1241.88..1241.88 rows=1 width=4) (actual time=9.437..9.437 rows=0 loops=1)
   ->  Unique  (cost=1241.88..1241.88 rows=1 width=4) (actual time=9.431..9.431 rows=0 loops=1)
         ->  Sort  (cost=1241.88..1241.88 rows=1 width=4) (actual time=9.426..9.426 rows=0 loops=1)
               Sort Key: o.subject
               ->  Nested Loop  (cost=2.82..1241.87 rows=1 width=4) (actual time=9.414..9.414 rows=0 loops=1)
                     ->  Nested Loop  (cost=2.82..678.57 rows=186 width=4) (actual time=0.347..3.477 rows=446 loops=1)
                           ->  Index Scan using tags_tag_key on tags t2  (cost=0.00..5.01 rows=1 width=4) (actual time=0.039..0.042 rows=1 loops=1)
                                 Index Cond: (tag = 'current'::text)
                           ->  Bitmap Heap Scan on object_tags t1  (cost=2.82..670.65 rows=233 width=8) (actual time=0.297..1.688 rows=446 loops=1)
                                 Recheck Cond: (t1.tag_id = "outer".tag_id)
                                 ->  Bitmap Index Scan on object_tags_tag_id_object_id  (cost=0.00..2.82 rows=233 width=0) (actual time=0.185..0.185 rows=446 loops=1)
                                       Index Cond: (t1.tag_id = "outer".tag_id)
                     ->  Index Scan using object_data_pkey on object_data o  (cost=0.00..3.02 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=446)
                           Index Cond: (o.subject = "outer".object_id)
                           Filter: ("type" = 93)
 Total runtime: 9.538 ms
(16 rows)

triple_store=#

Schema:

triple_store=# \d object_data
             Table "public.object_data"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 subject       | integer                     | not null
 type          | integer                     | not null
 owned_by      | integer                     | not null
 created_by    | integer                     | not null
 created       | timestamp without time zone | not null
 last_modified | timestamp without time zone | not null
 label         | text                        |
Indexes:
    "object_data_pkey" PRIMARY KEY, btree (subject)
    "object_data_type_created_by" btree ("type", created_by)
    "object_data_type_owned_by" btree ("type", owned_by)
Foreign-key constraints:
    "object_data_created_by_fkey" FOREIGN KEY (created_by) REFERENCES objects(object_id) DEFERRABLE INITIALLY DEFERRED
    "object_data_owned_by_fkey" FOREIGN KEY (owned_by) REFERENCES objects(object_id) DEFERRABLE INITIALLY DEFERRED
    "object_data_type_fkey" FOREIGN KEY ("type") REFERENCES objects(object_id) DEFERRABLE INITIALLY DEFERRED
Tablespace: "alt_2"

triple_store=# \d object_tags
   Table "public.object_tags"
  Column   |  Type   | Modifiers
-----------+---------+-----------
 object_id | integer | not null
 tag_id    | integer | not null
Indexes:
    "object_tags_pkey" PRIMARY KEY, btree (object_id, tag_id)
    "object_tags_tag_id" btree (tag_id)
    "object_tags_tag_id_object_id" btree (tag_id, object_id)
Foreign-key constraints:
    "object_tags_object_id_fkey" FOREIGN KEY (object_id) REFERENCES objects(object_id) DEFERRABLE INITIALLY DEFERRED
    "object_tags_tag_id_fkey" FOREIGN KEY (tag_id) REFERENCES tags(tag_id) DEFERRABLE INITIALLY DEFERRED

triple_store=# \d tags
                           Table "public.tags"
 Column |  Type   |                       Modifiers
--------+---------+--------------------------------------------------------
 tag_id | integer | not null default nextval('tags_tag_id_seq'::regclass)
 tag    | text    | not null
Indexes:
    "tags_pkey" PRIMARY KEY, btree (tag_id)
    "tags_tag_key" UNIQUE, btree (tag)

--
(peter.royal|osi)@pobox.com - http://fotap.org/~osi
peter royal <peter.royal@pobox.com> writes:
> So, my question is: is there anything I can do to boost performance
> with what I've got, or am I in a position where the only 'fix' is
> more, faster disks? I can't think of any schema/index changes that
> would help, since everything looks pretty optimal from the 'explain
> analyze' output. I'd like to get a 10x improvement when querying from
> the 'cold' state.

I don't think you have any hope of improving the "cold" state much. The right way to think about this is not to be in the "cold" state.

Check your kernel parameters and make sure it's not set to limit the amount of memory used for cache (I'm not actually sure if there is such a limit on Linux, but there definitely is on some other Unixen). Look around and see if you can reduce the memory used by processes, or even better, offload non-database tasks to other machines.

Basically you need to get as much of the database as you can to stay in disk cache.

			regards, tom lane
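To see how much RAM is actually going to the page cache versus other processes, a quick check along the lines Tom suggests (standard procps tools, nothing PostgreSQL-specific; a minimal sketch only):

  free -m                          # the "cached" column is the OS page cache holding relation data
  grep -E '^(MemFree|Cached)' /proc/meminfo
  ps aux --sort=-rss | head -15    # largest resident processes; candidates to trim or move off-box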
On 1/6/06, peter royal <peter.royal@pobox.com> wrote:
> PostgreSQL 8.1.1
>
> shared_buffers = 10000 # (It was higher, 50k, but didn't help any,
> so brought down to free ram for disk cache)
> work_mem = 8196
> random_page_cost = 3
> effective_cache_size = 250000

I have played with both the disk cache settings and shared buffers, and I found that if I increased shared_buffers above a certain value, performance would increase dramatically. Playing with effective_cache_size did not have the same impact.

I am currently running with

shared_buffers = 254288 # approx 2.1Gb

and this is on a smaller dataset than yours.

--
Harry
http://www.hjackson.org
http://www.uklug.co.uk
On Fri, 6 Jan 2006, Tom Lane wrote:

> peter royal <peter.royal@pobox.com> writes:
>> So, my question is: is there anything I can do to boost performance
>> with what I've got, or am I in a position where the only 'fix' is
>> more, faster disks? I can't think of any schema/index changes that
>> would help, since everything looks pretty optimal from the 'explain
>> analyze' output. I'd like to get a 10x improvement when querying from
>> the 'cold' state.
>
> I don't think you have any hope of improving the "cold" state much.
> The right way to think about this is not to be in the "cold" state.
> Check your kernel parameters and make sure it's not set to limit
> the amount of memory used for cache (I'm not actually sure if there
> is such a limit on Linux, but there definitely is on some other Unixen).

Linux doesn't have any ability to limit the amount of memory used for caching (there are periodic requests for such a feature).

David Lang

> Look around and see if you can reduce the memory used by processes,
> or even better, offload non-database tasks to other machines.
>
> Basically you need to get as much of the database as you can to stay
> in disk cache.
>
> regards, tom lane
Peter,

On 1/6/06 2:59 PM, "peter royal" <peter.royal@pobox.com> wrote:

> I have experimented with having all 8 disks in a single RAID0 set, a
> single RAID10 set, and currently 4 RAID0 sets of 2 disks each. There
> hasn't been an appreciable difference in the overall performance of
> my test suite (which randomly generates queries like the samples
> below as well as a few other types; the problem shows up on the
> other query types too).

Have you tested the underlying filesystem for its performance? Run this:

  time bash -c 'dd if=/dev/zero of=/my_file_system/bigfile bs=8k count=<your_memory_size_in_GB * 250000> && sync'

Then run this:

  time dd if=/my_file_system/bigfile bs=8k of=/dev/null

And report the times here please. With your 8 disks in any of the RAID0 configurations you describe, you should be getting 480MB/s. In the RAID10 configuration you should get 240.

Note that ext3 will not go faster than about 300MB/s in our experience. You should use xfs, which will run *much* faster.

You should also experiment with using larger readahead, which you can implement like this:

  blockdev --setra 16384 /dev/<my_block_device>

E.g. "blockdev --setra 16384 /dev/sda"

This will set the readahead of Linux block device reads to 16384 512-byte sectors (8MB).

Using 3Ware's newest controllers we have seen 500MB/s+ on 8 disk drives in RAID0 on CentOS 4.1 with xfs. Note that you will need to run the "CentOS unsupported kernel" to get xfs.

> So, my question is: is there anything I can do to boost performance
> with what I've got, or am I in a position where the only 'fix' is
> more, faster disks? I can't think of any schema/index changes that
> would help, since everything looks pretty optimal from the 'explain
> analyze' output. I'd like to get a 10x improvement when querying from
> the 'cold' state.

From what you describe, one of these is likely:
- hardware isn't configured properly or a driver problem
- you need to use xfs and tune your Linux readahead

- Luke
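If the larger readahead helps, note that it has to be reapplied after every reboot; one common way to verify and persist it (device names here are examples, use whatever block devices back your tablespaces):

  /sbin/blockdev --getra /dev/sda            # current readahead, reported in 512-byte sectors
  /sbin/blockdev --setra 16384 /dev/sda      # 16384 sectors = 8MB of readahead
  echo '/sbin/blockdev --setra 16384 /dev/sda' >> /etc/rc.d/rc.local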
I'll second all of Luke Lonergan's comments and add these. You should be able to increase both "cold" and "warm" performance (as well as data integrity; read below) considerably.

Ron

At 05:59 PM 1/6/2006, peter royal wrote:
>Howdy.
>
>I'm running into scaling problems when testing with a 16GB (data
>+indexes) database.
>
>I can run a query and it returns in a few seconds. If I run it
>again, it returns in a few milliseconds. I realize this is because
>during subsequent runs the necessary disk pages have been cached by
>the OS.
>
>I have experimented with having all 8 disks in a single RAID0 set, a
>single RAID10 set, and currently 4 RAID0 sets of 2 disks each. There
>hasn't been an appreciable difference in the overall performance of
>my test suite.
>
>So, my question is: is there anything I can do to boost performance
>with what I've got, or am I in a position where the only 'fix' is
>more, faster disks?
>
>Thanks for any assistance.
>-peter
>
>
>Configuration:
>
>PostgreSQL 8.1.1
>
>shared_buffers = 10000   # (was higher, 50k, but that didn't help any,
>                         #  so brought down to free RAM for the disk cache)
>work_mem = 8196
>random_page_cost = 3
>effective_cache_size = 250000
>
>
>Hardware:
>
>CentOS 4.2 (Linux 2.6.9-22.0.1.ELsmp)

Upgrade your kernel to at least 2.6.12. There's a known issue with earlier versions of the 2.6.x kernel and 64b CPUs like the Opteron. See kernel.org for details.

>Areca ARC-1220 8-port PCI-E controller

Make sure you have 1GB or 2GB of cache. Get the battery backup and set the cache for write back rather than write through.

>8 x Hitachi Deskstar 7K80 (SATA2, 7200rpm)
>2 x Opteron 242 @ 1.6GHz
>3GB RAM (should be 4GB, but a separate Linux issue is preventing us
>from getting it to see all of it)
>Tyan Thunder K8WE

The K8WE has 8 DIMM slots. That should be good for 16 or 32 GB of RAM (depending on whether the mainboard recognizes 4GB DIMMs or not; ask Tyan about the latest K8WE firmware). If nothing else, 1GB DIMMs are now so cheap that you should have no problem putting 8GB on the K8WE.

A 2.6.12 or later based Linux distro should have NO problems using more than 4GB of RAM.

Among the other tricks having lots of RAM allows: If some of your tables are Read Only or VERY rarely written to, you can preload them at boot time and make them RAM resident using the /etc/tmpfs trick.

In addition there is at least one company making a cheap battery-backed PCI-X card that can hold up to 4GB of RAM and pretend to be a small HD to the OS. I don't remember any names at the moment, but there have been posts here and at storage.review.com on such products.

>RAID Layout:
>
>4 2-disk RAID0 sets created

You do know that a RAID 0 set provides _worse_ data protection than a single HD? Don't use RAID 0 for any data you want kept reliably.

With 8 HDs, the best config is probably
  1 2HD RAID 1 + 1 6HD RAID 10, or
  2 4HD RAID 10's

It is certainly true that once you have done everything you can with RAM, the next set of HW optimizations is to add HDs. The more the better, up to the limits of your available PCI-X bandwidth. In short, a second fully populated RAID controller is not unreasonable.
On 1/8/06, Ron <rjpeace@earthlink.net> wrote:
<snip>
> Among the other tricks having lots of RAM allows:
> If some of your tables are Read Only or VERY rarely written to, you
> can preload them at boot time and make them RAM resident using the
> /etc/tmpfs trick.

What is the /etc/tmpfs trick?

-K
On 1/9/06, Kelly Burkhart <kelly.burkhart@gmail.com> wrote:
> On 1/8/06, Ron <rjpeace@earthlink.net> wrote:
> <snip>
> > Among the other tricks having lots of RAM allows:
> > If some of your tables are Read Only or VERY rarely written to, you
> > can preload them at boot time and make them RAM resident using the
> > /etc/tmpfs trick.
>
> What is the /etc/tmpfs trick?

I think he means you can create a directory that mounts an area of RAM. If you put the tables on it then they will be very fast. I would not recommend it for anything you cannot afford to lose. I have also tried it and found that it did not produce as good performance as I expected.

--
Harry
http://www.hjackson.org
http://www.uklug.co.uk
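As an illustration of what that might look like in practice (this is one reading of the trick, not a recipe from the thread; the mount point, tablespace name and table name are made up, and anything on tmpfs disappears at reboot, so it must be reloadable from elsewhere):

  # mount a RAM-backed filesystem and hand it to PostgreSQL as a tablespace
  mkdir -p /mnt/pg_ram
  mount -t tmpfs -o size=2g tmpfs /mnt/pg_ram
  chown postgres:postgres /mnt/pg_ram
  psql -d triple_store -c "CREATE TABLESPACE ram_ts LOCATION '/mnt/pg_ram'"
  # move a read-only (or easily re-creatable) table onto it
  psql -d triple_store -c "ALTER TABLE lookup_table SET TABLESPACE ram_ts"

The caveat Harry raises applies doubly here: after a reboot the tablespace directory is empty, and the server will error on any access to those relations until they are rebuilt.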
On Jan 8, 2006, at 1:42 PM, Luke Lonergan wrote:
> Have you tested the underlying filesystem for its performance?
> Run this:
>   time bash -c 'dd if=/dev/zero of=/my_file_system/bigfile bs=8k count=<your_memory_size_in_GB * 250000> && sync'

This is a 2-disk RAID0:

[root@bigboy /opt/alt-2]# time bash -c 'dd if=/dev/zero of=/opt/alt-2/bigfile bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    1m27.143s
user    0m0.276s
sys     0m37.338s

'iostat -x' showed writes peaking at ~100MB/s

> Then run this:
>   time dd if=/my_file_system/bigfile bs=8k of=/dev/null

[root@bigboy /opt/alt-2]# time dd if=/opt/alt-2/bigfile bs=8k of=/dev/null
1000000+0 records in
1000000+0 records out

real    1m9.846s
user    0m0.189s
sys     0m11.099s

'iostat -x' showed reads peaking at ~116MB/s

Again with kernel 2.6.15:

[root@bigboy ~]# time bash -c 'dd if=/dev/zero of=/opt/alt-2/bigfile bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    1m29.144s
user    0m0.204s
sys     0m48.415s

[root@bigboy ~]# time dd if=/opt/alt-2/bigfile bs=8k of=/dev/null
1000000+0 records in
1000000+0 records out

real    1m9.701s
user    0m0.168s
sys     0m11.933s

> And report the times here please. With your 8 disks in any of the RAID0
> configurations you describe, you should be getting 480MB/s. In the
> RAID10 configuration you should get 240.

Not anywhere near that. I'm scouring the 'net looking to see what needs to be tuned at the HW level.

> You should also experiment with using larger readahead, which you can
> implement like this:
>   blockdev --setra 16384 /dev/<my_block_device>
>
> E.g. "blockdev --setra 16384 /dev/sda"

Wow, this helped nicely. Without using the updated kernel, it took 28% off my testcase time.

> From what you describe, one of these is likely:
> - hardware isn't configured properly or a driver problem

Using the latest Areca driver, looking to see if there is some configuration that was missed.

> - you need to use xfs and tune your Linux readahead

Will try XFS soon, concentrating on the 'dd' speed issue first.

On Jan 8, 2006, at 4:35 PM, Ron wrote:
>> Areca ARC-1220 8-port PCI-E controller
>
> Make sure you have 1GB or 2GB of cache. Get the battery backup and
> set the cache for write back rather than write through.

The card we've got doesn't have a SODIMM socket, since it's only an 8-port card. My understanding was that the cache is used when writing?

> A 2.6.12 or later based Linux distro should have NO problems using
> more than 4GB of RAM.

Upgraded the kernel to 2.6.15, then we were able to set the BIOS option for the 'Memory Hole' to 'Software' and it saw all 4GB (under 2.6.11 we got a kernel panic with that set).

>> RAID Layout:
>>
>> 4 2-disk RAID0 sets created
>
> You do know that a RAID 0 set provides _worse_ data protection than
> a single HD? Don't use RAID 0 for any data you want kept reliably.

Yup, aware of that. Was planning on RAID10 for production, but just broke it out into RAID0 sets for testing (from what I read, I gathered that the read performance of RAID0 and RAID10 are comparable).

Thanks for all the suggestions, I'll report back as I continue testing.
-pete

--
(peter.royal|osi)@pobox.com - http://fotap.org/~osi
Peter,

On 1/9/06 9:23 AM, "peter royal" <peter.royal@pobox.com> wrote:

> This is a 2-disk RAID0

Your 2-disk results look fine - what about your 8-disk results?

Given that you want to run in production with RAID10, the most you should expect is 2x the 2-disk results using all 8 of your disks. If you want the best rate for production while preserving data integrity, I recommend running your Areca in RAID5, in which case you should expect 3.5x your 2-disk results (7 drives). You can assume you'll get that if you use XFS + readahead. OTOH - I'd like to see your test results anyway :-)

- Luke
On Jan 9, 2006, at 2:01 PM, Luke Lonergan wrote:
> Peter,
>
> On 1/9/06 9:23 AM, "peter royal" <peter.royal@pobox.com> wrote:
>
>> This is a 2-disk RAID0
>
> Your 2-disk results look fine - what about your 8-disk results?

After some further research, the 2-disk RAID0 numbers are not bad. I have a single drive of the same type hooked up to the SATA2 port on the motherboard to boot from, and its performance numbers are (Linux 2.6.15, ext3):

[root@bigboy ~]# time bash -c 'dd if=/dev/zero of=/tmp/bigfile bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    4m55.032s
user    0m0.256s
sys     0m47.299s

[root@bigboy ~]# time dd if=/tmp/bigfile bs=8k of=/dev/null
1000000+0 records in
1000000+0 records out

real    3m27.229s
user    0m0.156s
sys     0m13.377s

So there is a clear advantage to RAID over a single drive. Now, some stats in an 8-disk configuration:

8-disk RAID0, ext3, 16k read-ahead

[root@bigboy /opt/pgdata]# time bash -c 'dd if=/dev/zero of=/opt/pgdata/bigfile bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    0m53.030s
user    0m0.204s
sys     0m42.015s

[root@bigboy /opt/pgdata]# time dd if=/opt/pgdata/bigfile bs=8k of=/dev/null
1000000+0 records in
1000000+0 records out

real    0m23.232s
user    0m0.144s
sys     0m13.213s

8-disk RAID0, xfs, 16k read-ahead

[root@bigboy /opt/pgdata]# time bash -c 'dd if=/dev/zero of=/opt/pgdata/bigfile bs=8k count=1000000 && sync'
1000000+0 records in
1000000+0 records out

real    0m32.177s
user    0m0.212s
sys     0m21.277s

[root@bigboy /opt/pgdata]# time dd if=/opt/pgdata/bigfile bs=8k of=/dev/null
1000000+0 records in
1000000+0 records out

real    0m21.814s
user    0m0.172s
sys     0m13.881s

... WOW. Highly impressed with the XFS write speed! Going to stick with that!

Overall, I got a 50% boost in the overall speed of my test suite by using XFS and the 16k read-ahead.

> Given that you want to run in production with RAID10, the most you should
> expect is 2x the 2-disk results using all 8 of your disks. If you want the
> best rate for production while preserving data integrity, I recommend
> running your Areca in RAID5, in which case you should expect 3.5x your
> 2-disk results (7 drives). You can assume you'll get that if you use XFS +
> readahead. OTOH - I'd like to see your test results anyway :-)

I've been avoiding RAID5 after reading how performance drops when a drive is out or rebuilding. I think the performance benefit will outweigh that cost, though.

Thanks for the help!
-pete

--
(peter.royal|osi)@pobox.com - http://fotap.org/~osi
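For anyone wanting to reproduce the XFS setup, the steps are roughly as follows (the device name is an example, and on CentOS 4 this also requires xfsprogs plus the unsupported kernel mentioned earlier):

  mkfs.xfs -f /dev/sdb                            # destroys any existing data on the device
  mount -t xfs -o noatime /dev/sdb /opt/pgdata
  blockdev --setra 16384 /dev/sdb                 # keep the larger readahead on the new device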
Peter,

On 1/9/06 12:59 PM, "peter royal" <peter.royal@pobox.com> wrote:

> Overall, I got a 50% boost in the overall speed of my test suite by
> using XFS and the 16k read-ahead.

Yes, it all looks pretty good for your config, though it looks like you might be adapter-limited with the Areca - you should have seen a read time with XFS of about 17 seconds. OTOH - with RAID5, you are probably about balanced; you should see a read time of about 19 seconds, and instead you'll get your 22, which isn't too big of a deal.

> Thanks for the help!

Sure - no problem!

BTW - I'm running tests right now with the 3Ware 9550SX controllers. Two of them on one machine running simultaneously with 16 drives, and we're getting 800MB/s sustained reads. That's a 32GB file read in 40 seconds (!!)

At that rate, we're only about 3x slower than memory access (practically limited at around 2GB/s even though the system bus peak is 10GB/s). So, the point is, if you want to get close to your "warm" speed, you need to get your disk I/O as close to main memory speed as you can. With parallel I/O you can do that (see Bizgres MPP for more).

- Luke
At 12:23 PM 1/9/2006, peter royal wrote:
>On Jan 8, 2006, at 4:35 PM, Ron wrote:
>>>Areca ARC-1220 8-port PCI-E controller
>>
>>Make sure you have 1GB or 2GB of cache. Get the battery backup and
>>set the cache for write back rather than write through.
>
>The card we've got doesn't have a SODIMM socket, since it's only an
>8-port card. My understanding was that the cache is used when writing?

Trade in your 8-port ARC-1220 that doesn't support 1-2GB of cache for a 12, 16, or 24 port Areca one that does. It's that important.

Present generation SATA2 HDs should average ~50MBps raw ASTR. The Intel IOP333 DSP on the ARCs is limited to 800MBps, so that's your limit per card. That's 16 SATA2 HDs operating in parallel (16 HD RAID 0, 17 HD RAID 5, 32 HD RAID 10).

Next generation 2.5" form factor 10Krpm SAS HDs, due to retail in 2006, are supposed to average ~90MBps raw ASTR. 8 such HDs in parallel per ARC-12xx will be the limit.

Side note: the PCI-E x8 bus on the 12xx cards is good for ~1.6GBps RWPB, so I expect Areca is going to be upgrading this controller to at least 2x, if not 4x (which would require replacing the x8 bus with an x16 bus), the bandwidth at some point. A PCI-E x16 bus is good for ~3.2GBps RWPB, so if you have the slots, 4 such populated ARC cards will max out a PCI-E x16 bus.

In your shoes, I would recommend replacing your 8-port ARC-1220 with a 12-port ARC-1230 with 1-2GB of battery-backed cache, and planning to get more of them as need arises.

>>A 2.6.12 or later based Linux distro should have NO problems using
>>more than 4GB of RAM.
>
>Upgraded the kernel to 2.6.15, then we were able to set the BIOS
>option for the 'Memory Hole' to 'Software' and it saw all 4GB (under
>2.6.11 we got a kernel panic with that set)

There are some other kernel tuning params that should help memory and physical IO performance. Talk to a Linux kernel guru to get the correct advice specific to your installation and application.

It should be noted that there are indications of some major inefficiencies in pg's IO layer that make it compute bound under some circumstances before it becomes IO bound. These may or may not cause trouble for you as you keep pushing the envelope for maximum IO performance.

With the kind of work you are doing and we are describing, I'm sure you can have a _very_ zippy system.

Ron
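Ron doesn't name specific parameters; the knobs most often discussed for this kind of workload are the VM swappiness and dirty-writeback sysctls. The values below are placeholders to experiment with, not recommendations:

  sysctl -w vm.swappiness=10              # be less aggressive about swapping processes out
  sysctl -w vm.dirty_background_ratio=5   # start background writeback of dirty pages sooner
  sysctl -w vm.dirty_ratio=40             # cap on dirty memory before writers are blocked
  # persist across reboots by adding the same settings to /etc/sysctl.conf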
Ron,

A few days back you mentioned:

> Upgrade your kernel to at least 2.6.12.
> There's a known issue with earlier versions of the 2.6.x kernel and
> 64b CPUs like the Opteron. See kernel.org for details.

I did some searching and couldn't find any obvious mention of this issue (I gave up after searching through the first few hundred instances of "64" in the 2.6.12 changelog).

Would you mind being a little more specific about which issue you're talking about? We're about to deploy some new 16GB RAM Opteron DB servers and I'd like to check and make sure RH backported whatever the fix was to their current RHEL4 kernel.

Thanks,
Mark Lewis
At 07:28 PM 1/10/2006, Mark Lewis wrote:
>Ron,
>
>A few days back you mentioned:
>
> > Upgrade your kernel to at least 2.6.12.
> > There's a known issue with earlier versions of the 2.6.x kernel and
> > 64b CPUs like the Opteron. See kernel.org for details.
>
>I did some searching and couldn't find any obvious mention of this issue
>(I gave up after searching through the first few hundred instances of
>"64" in the 2.6.12 changelog).
>
>Would you mind being a little more specific about which issue you're
>talking about? We're about to deploy some new 16GB RAM Opteron DB
>servers and I'd like to check and make sure RH backported whatever the
>fix was to their current RHEL4 kernel.

There are three issues I know about in general:

1= As Peter Royal noted on this list, pre-12 versions of 2.6.x have problems with RAM of >= 4GB.

2= Pre-12 versions of 2.6.x, when running A64 or Xeon 64b SMP, seem to be susceptible to "context switch storms".

3= Physical and memory IO is considerably improved in the later versions of 2.6.x compared to 2.6.11 or earlier.

Talk to a real Linux kernel guru (I am not) for details and specifics.

Ron