Thread: Propagating outer join conditions
The following left outer join plan puzzles me:

EXPLAIN ANALYZE SELECT * from t28 LEFT OUTER JOIN (t1 JOIN t11 ON
  (t11.o = '<http://example.org>' AND t11.s = t1.o)) ON t28.s = t1.s
  WHERE t28.o = '"spec"';

t28, t1, and t11 all have indexed columns named 's' and 'o' that contain 'text'.

 Nested Loop Left Join  (cost=794249.26..3289704.61 rows=1 width=301) (actual time=581293.390..581293.492 rows=1 loops=1)
   Join Filter: (t28.s = t1.s)
   ->  Index Scan using t28_o on t28  (cost=0.00..9.22 rows=1 width=89) (actual time=0.073..0.077 rows=1 loops=1)
         Index Cond: (o = '"spec"'::text)
   ->  Merge Join  (cost=794249.26..3267020.66 rows=1813979 width=212) (actual time=230365.522..577078.266 rows=1894969 loops=1)
         Merge Cond: (t1.o = t11.s)
         ->  Index Scan using t1_o on t1  (cost=0.00..2390242.10 rows=22225696 width=109) (actual time=0.209..162586.801 rows=22223925 loops=1)
         ->  Sort  (cost=794249.26..798784.21 rows=1813979 width=103) (actual time=230365.175..237409.474 rows=1894969 loops=1)
               Sort Key: t11.s
               ->  Bitmap Heap Scan on t11  (cost=78450.82..605679.55 rows=1813979 width=103) (actual time=3252.103..22782.271 rows=1894969 loops=1)
                     Recheck Cond: (o = '<http://example.org>'::text)
                     ->  Bitmap Index Scan on t11_o  (cost=0.00..78450.82 rows=1813979 width=0) (actual time=2445.422..2445.422 rows=1894969 loops=1)
                           Index Cond: (o = '<http://example.org>'::text)

It seems to me that this plan is not very desirable, since the outer part of the
nested loop left join (the merge join node) is very expensive.  Is it possible to
generate a plan that looks like this:

 Nested Loop Left Join  (cost=???)
   ->  Index Scan using t28_o on t28  (cost=0.00..9.11 rows=1 width=89)
         Index Cond: (o = '"spec"'::text)
   ->  Nested Loop  (cost=???)
         ->  Index Scan using t1_s on t1  (cost=???)
               Index Cond: (s = t28.s)
         ->  Bitmap Heap Scan on t11  (cost=???)
               Recheck Cond: (t11.s = t1.o)
               Filter: (o = '<http://example.org>'::text)
               ->  Bitmap Index Scan on t11_s  (cost=???)
                     Index Cond: (t11.s = t1.o)

I *think* this plan is equivalent to the above if I'm assuming the behaviour of
the 'nested loop left join' node correctly.  So far, I have been tweaking the
statistics, cost estimates, and enabling/disabling certain plans to see if I can
get it to propagate the join condition t1.s = t28.s to the outer node of the left
join, but so far I cannot.  So, my questions are:

1) Is my 'desired' query plan logic correct?
2) Can the executor execute a plan such as my 'desired' plan?
3) If (1) and (2) are 'yes', then how may I get the planner to generate such a
   plan, or do I just need to look harder into tweaking the statistics and cost
   estimates?

   -Aaron
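For reference, "tweaking the statistics, cost estimates, and enabling/disabling
certain plans" above means knobs along these lines; the specific columns, targets,
and values here are only illustrative, not a recommendation:

  -- raise the per-column statistics targets on the join columns, then re-analyze
  ALTER TABLE t1  ALTER COLUMN s SET STATISTICS 1000;
  ALTER TABLE t11 ALTER COLUMN o SET STATISTICS 1000;
  ANALYZE t1;
  ANALYZE t11;

  -- per-session cost-estimate and plan-type switches
  SET random_page_cost = 2.0;    -- default is 4.0
  SET enable_mergejoin = off;
  SET enable_hashjoin  = off;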
How about trying:

  Select *
  From (Select * from t28 where t28.o = '"spec"') t28a
  Left outer join (t1 JOIN t11 ON
    (t11.o = '<http://example.org>' AND t11.s = t1.o)) ON t28a.s = t1.s

In this way, I think, the WHERE clause on t28 would be performed before the join
rather than after.

Jonathan Blitz
On 12/3/06, Jonathan Blitz <jb@anykey.co.il> wrote:
> How about trying:
>   Select * From (Select * from t28 where t28.o = '"spec"') t28a
>   Left outer join (t1 JOIN t11 ON
>     (t11.o = '<http://example.org>' AND t11.s = t1.o)) ON t28a.s = t1.s
> In this way, I think, the WHERE clause on t28 would be performed before the
> join rather than after.

First, I forgot to mention: this was on 8.2 RC1.

The suggested change produces an identical 'bad' query plan.  The main issue (I
think) is that the query node that processes "t1 JOIN t11 ON ..." is not aware of
the join condition t28.s = t1.s, even though the value of t28.s (as determined by
the index scan where t28.o = '"spec"') could (should?) theoretically be known to
it.  If it did, then I imagine it would realize that a nested loop join starting
with t1.s = t28.s (which is very selective) would be much cheaper than doing the
big merge join.

   -Aaron
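Concretely, the point about the value of t28.s being knowable amounts to doing the
substitution by hand in two steps; a sketch (the literal in the second statement is
hypothetical, standing in for whatever the first query returns):

  SELECT s FROM t28 WHERE o = '"spec"';
  -- suppose this returns '<http://example.org/some-subject>'  (hypothetical value)

  EXPLAIN ANALYZE
  SELECT * FROM t1 JOIN t11 ON (t11.o = '<http://example.org>' AND t11.s = t1.o)
  WHERE t1.s = '<http://example.org/some-subject>';   -- hypothetical literal from step 1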
Thread: Hardware advice

Hello..

I'm waiting for my new system, and meanwhile I have some questions.

First, here are the specs.  The server will have kernel 2.1.19 and it will be
used only as a postgresql server (nothing else... no named, dhcp, web, mail,
etc).  Postgresql version will be 8.2.  It will be heavily used on inserts,
triggers on each insert/update, and occasionally selects.

System:    SuperMicro 7045B-3
CPU:       1 Dual Core Woodcrest, 2.66 GHz, 4 MB cache, FSB 1333 MHz
RAM:       4 GB (4 x 1 GB modules) at 667 MHz
RAID CTRL: LSI MegaRAID SAS 8408E
DISKS:     8 x SATA II 7200 rpm, NCQ, 8 MB cache, with Supermicro 8 SAS enclosure

Based on the needs, i'm planning an upgrade of the drives to 15,000 rpm SAS
(pretty expensive now).

Question 1:
        The RAID layout should be:
        a) 2 hdd in raid1 for system and pg_xlog, and 6 hdd in raid10 for data?
        b) 8 hdd in raid10 for all?
        c) 2 hdd in raid1 for system, 2 hdd in raid1 for pg_xlog, and 4 hdd in
           raid10 for data?
        Obs: I'm going for setup a), but i want to hear your thoughts as well.

Question 2: (Don't want to start a flame here..... but here it goes)
        What filesystem should i run for data?  ext3 or xfs?
        The tables have ~15,000 rel_pages each.  The biggest table now has over
        30,000 pages.

Question 3:
        The block size in postgresql is 8 kB.  The stripe size in the raid ctrl
        is 64 kB.  Should i increase the pgsql block size to 16, 32 or even 64 kB?

As soon as the system is delivered, i'm planning some benchmarks.

Regards,
Alex
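For context, figures like the rel_pages counts above can be read from pg_class,
and the compiled-in block size can be checked from SQL; a quick sketch (the LIMIT
is arbitrary, and relpages is only an estimate maintained by VACUUM/ANALYZE):

  SHOW block_size;     -- 8192 on a stock build

  SELECT relname, relpages,
         pg_size_pretty(pg_relation_size(oid)) AS on_disk
    FROM pg_class
   WHERE relkind = 'r'
   ORDER BY relpages DESC
   LIMIT 10;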
"Aaron Birkland" <birkie@gmail.com> writes: > ... Is is possible to generate a plan that looks like this: > Nested Loop Left Join (cost=???) > -> Index Scan using t28_o on t28 (cost=0.00..9.11 rows=1 width=89) > Index Cond: (o = '"spec"'::text) > -> Nested Loop (cost=???) > -> Index Scan using t1_s on t1 (cost=???) > Index Cond: (s = t28.s) > -> Bitmap Heap Scan on t11 (cost=???) > Recheck Cond: (t11.s = t1.o) > Filter: (o = '<http://example.org>'::text) > -> Bitmap Index Scan on t11_s (cost=??? ) > Index Cond: (t11.s = t1.o) No. It'd take massive work in both the planner and executor to support something like that --- they are designed around the assumption that nestloop with inner index scan is exactly that, just one indexscan on the inside of the loop. (As of 8.2 there's some klugery to handle an Append of indexscans too, but it won't generalize readily.) It might be something to work on for the future, but I can see a couple of risks to trying to support this type of plan: * exponential growth in the number of plans the planner has to consider; * greatly increased runtime penalty if the planner has underestimated the number of rows produced by the outer side of the nestloop. regards, tom lane
Alexandru,

> The server will have kernel 2.1.19 and it will be used only as a postgresql

Assuming you're talking Linux, I think you mean 2.6.19?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Hello..

Yes, sorry for the mistype..

Regards,
Alex

----- Original Message -----
From: "Josh Berkus" <josh@agliodbs.com>
To: <pgsql-performance@postgresql.org>
Cc: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
Sent: Sunday, December 03, 2006 10:11 PM
Subject: Re: [PERFORM] Hardware advice

> Assuming you're talking Linux, I think you mean 2.6.19?
Hi Alexandru,

Alexandru Coseru schrieb:
> [...]
> Question 1:
>         The RAID layout should be:
>         a) 2 hdd in raid1 for system and pg_xlog, and 6 hdd in raid10 for data?
>         b) 8 hdd in raid10 for all?
>         c) 2 hdd in raid1 for system, 2 hdd in raid1 for pg_xlog, and 4 hdd in
>            raid10 for data?
>         Obs: I'm going for setup a), but i want to hear your thoughts as well.

This depends on your data size.  I think options a and c are good.  The potential
bottleneck may be the RAID 1 for pg_xlog if you have a huge amount of updates and
inserts.  What about another setup:

4 hdd in RAID 10 for system and pg_xlog - system partitions are normally not in
heavy use, and pg_xlog should be fast for writing.
4 hdd in RAID 10 for data.

> Question 2: (Don't want to start a flame here..... but here it goes)
>         What filesystem should i run for data?  ext3 or xfs?
>         The tables have ~15,000 rel_pages each.  The biggest table now has over
>         30,000 pages.

We have a database running with 60,000+ tables.  The table sizes are between a few
kByte for the small tables and up to 30 GB for the largest one.  We had no issue
with ext3 in the past.

> Question 3:
>         The block size in postgresql is 8 kB.  The stripe size in the raid ctrl
>         is 64 kB.  Should i increase the pgsql block size to 16, 32 or even 64 kB?

You should keep in mind that the file system also has a block size.  Ext3 has a
maximum of 4 kB.  I would set up the partitions aligned to the stripe size to
prevent unaligned reads.  I guess you can imagine that a larger block size in
postgresql may also end up in unaligned reads, because the file system has a
smaller block size.

RAID volume and file system setup:

1. Make all partitions aligned to the RAID stripe size.
   The first partition should start at 128 kByte.
   You can do this with fdisk: after you have created the partition, switch
   to the expert mode (type x) and modify the beginning of the partition
   (type b).  You should change this value to 128 (default is 63).
   All other partitions should also start on a multiple of 128 kByte.

2. Give the file system a hint that you work with larger block sizes.
   Ext3: mke2fs -b 4096 -j -R stride=2 /dev/sda1 -L LABEL
   I made an I/O test with PostgreSQL on a RAID system with a stripe size
   of 64 kByte and a block size of 8 kByte in the RAID system.
   Stride=2 was the best value.

PS: You should have a second XEON in your budget plan.

Sven.
Hello..

Thanks for the advice..

Actually, i'm waiting for the clovertown to show up on the market...

Regards,
Alex

----- Original Message -----
From: "Sven Geisler" <sgeisler@aeccom.com>
To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
Cc: <pgsql-performance@postgresql.org>
Sent: Tuesday, December 05, 2006 11:57 AM
Subject: Re: [PERFORM] Hardware advice
The test that I did, which was somewhat limited, showed no benefit from splitting
disks into separate partitions for large bulk loads.

The program read from one very large file and wrote the input out to two other
large files.

The total throughput on a single partition was close to the maximum theoretical
for that logical drive, even though the process was reading and writing to three
separate places on the disk.  I don't know what this means for postgresql setups
directly, but I would postulate that the benefit from splitting pg_xlog onto a
separate spindle is not as great as it might once have been for large bulk
transactions.  I am therefore going to go with a single 6 drive RAID 5 for my data
warehouse application because I want the read speed to be available.  I can
benefit from fast reads when I want to do large data scans at the expense of
slightly slower insert speed.

Alex.
Hi Alex,

Please check out <http://www.powerpostgresql.com/PerfList> before you use RAID 5
for PostgreSQL.

Anyhow, on a larger scale you end up at the response time of the I/O system for a
read or write.  In modern RAID and SAN environments, the read is the part where
you have to focus when you want to tune your system, because most RAID and SAN
systems can buffer writes.  PostgreSQL does use the Linux file system cache, which
is normally much larger than the RAID or SAN cache, for reading.  This means that
whenever a PostgreSQL read goes to the RAID or SAN subsystem, the response time of
the hard disk becomes interesting.  I guess you can imagine that multiple reads to
the same spindles cause a delay in the response time.

Alexandru, you should have two XEONs, whatever your core count is.  This would use
the full benefit of the memory architecture - you know, two FSBs and two memory
channels.

Cheers
Sven

Alex Turner schrieb:
> [...] I am therefore going to go with a single 6 drive RAID 5 for my data
> warehouse application because I want the read speed to be available.

Sven Geisler <sgeisler@aeccom.com>          Tel +49.30.5362.1627  Fax .1638
Senior Developer, AEC/communications GmbH Berlin, Germany
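A rough way to see how many reads actually leave PostgreSQL's own buffer cache is
the block-level statistics views; a sketch (in 8.2 this needs stats_block_level =
on, and heap_blks_read counts blocks not found in shared_buffers, so it cannot
distinguish Linux file system cache hits from physical disk reads):

  SELECT sum(heap_blks_hit)  AS hit_in_shared_buffers,
         sum(heap_blks_read) AS read_from_os_cache_or_disk
    FROM pg_statio_user_tables;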
If your data is valuable I'd recommend against RAID5; see
<http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt>.

Performance aside, I'd advise against RAID5 in almost all circumstances.  Why take
chances?

Greg Williamson
DBA
GlobeXplorer LLC