Thread: Propagating outer join conditions
The following left outer join plan puzzles me:

EXPLAIN ANALYZE SELECT * from t28 LEFT OUTER JOIN (t1 JOIN t11 ON
  (t11.o = '<http://example.org>' AND t11.s = t1.o)) ON t28.s = t1.s
  WHERE t28.o = '"spec"';

t28, t1, and t11 all have indexed columns named 's' and 'o' that contain 'text'.

 Nested Loop Left Join  (cost=794249.26..3289704.61 rows=1 width=301) (actual time=581293.390..581293.492 rows=1 loops=1)
   Join Filter: (t28.s = t1.s)
   ->  Index Scan using t28_o on t28  (cost=0.00..9.22 rows=1 width=89) (actual time=0.073..0.077 rows=1 loops=1)
         Index Cond: (o = '"spec"'::text)
   ->  Merge Join  (cost=794249.26..3267020.66 rows=1813979 width=212) (actual time=230365.522..577078.266 rows=1894969 loops=1)
         Merge Cond: (t1.o = t11.s)
         ->  Index Scan using t1_o on t1  (cost=0.00..2390242.10 rows=22225696 width=109) (actual time=0.209..162586.801 rows=22223925 loops=1)
         ->  Sort  (cost=794249.26..798784.21 rows=1813979 width=103) (actual time=230365.175..237409.474 rows=1894969 loops=1)
               Sort Key: t11.s
               ->  Bitmap Heap Scan on t11  (cost=78450.82..605679.55 rows=1813979 width=103) (actual time=3252.103..22782.271 rows=1894969 loops=1)
                     Recheck Cond: (o = '<http://example.org>'::text)
                     ->  Bitmap Index Scan on t11_o  (cost=0.00..78450.82 rows=1813979 width=0) (actual time=2445.422..2445.422 rows=1894969 loops=1)
                           Index Cond: (o = '<http://example.org>'::text)

It seems to me that this plan is not very desirable, since the outer part of the
nested loop left join (the merge join node) is very expensive.  Is it possible to
generate a plan that looks like this:

 Nested Loop Left Join  (cost=???)
   ->  Index Scan using t28_o on t28  (cost=0.00..9.11 rows=1 width=89)
         Index Cond: (o = '"spec"'::text)
   ->  Nested Loop  (cost=???)
         ->  Index Scan using t1_s on t1  (cost=???)
               Index Cond: (s = t28.s)
         ->  Bitmap Heap Scan on t11  (cost=???)
               Recheck Cond: (t11.s = t1.o)
               Filter: (o = '<http://example.org>'::text)
               ->  Bitmap Index Scan on t11_s  (cost=???)
                     Index Cond: (t11.s = t1.o)

I *think* this plan is equivalent to the above if I'm assuming the behaviour of
the 'nested loop left join' node correctly.  So far, I have been tweaking the
statistics, cost estimates, and enabling/disabling certain plans to see if I can
get it to propagate the join condition t1.s = t28.s to the outer node of the left
join, but so far I cannot.  So, my questions are:

1) Is my 'desired' query plan logic correct?
2) Can the executor execute a plan such as my 'desired' plan?
3) If (1) and (2) are 'yes', then how may I get the planner to generate such a
   plan, or do I just need to look harder into tweaking the statistics and cost
   estimates?

   -Aaron
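For reference, "tweaking the statistics, cost estimates, and enabling/disabling
certain plans" above means knobs along these lines; the specific columns, targets,
and values here are only illustrative, not a recommendation:

  -- raise the per-column statistics targets on the join columns, then re-analyze
  ALTER TABLE t1  ALTER COLUMN s SET STATISTICS 1000;
  ALTER TABLE t11 ALTER COLUMN o SET STATISTICS 1000;
  ANALYZE t1;
  ANALYZE t11;

  -- per-session cost-estimate and plan-type switches
  SET random_page_cost = 2.0;    -- default is 4.0
  SET enable_mergejoin = off;
  SET enable_hashjoin  = off;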
How about trying:

  Select *
  From (Select * from t28 where t28.o = '"spec"') t28a
  Left outer join (t1 JOIN t11 ON
    (t11.o = '<http://example.org>' AND t11.s = t1.o)) ON t28a.s = t1.s

In this way, I think, the WHERE clause on t28 would be performed before the join
rather than after.

Jonathan Blitz
On 12/3/06, Jonathan Blitz <jb@anykey.co.il> wrote:
> How about trying:
>   Select * From (Select * from t28 where t28.o = '"spec"') t28a
>   Left outer join (t1 JOIN t11 ON
>     (t11.o = '<http://example.org>' AND t11.s = t1.o)) ON t28a.s = t1.s
> In this way, I think, the WHERE clause on t28 would be performed before the
> join rather than after.

First, I forgot to mention: this was on 8.2 RC1.

The suggested change produces an identical 'bad' query plan.  The main issue (I
think) is that the query node that processes "t1 JOIN t11 ON ..." is not aware of
the join condition t28.s = t1.s, even though the value of t28.s (as determined by
the index scan where t28.o = '"spec"') could (should?) theoretically be known to
it.  If it did, then I imagine it would realize that a nested loop join starting
with t1.s = t28.s (which is very selective) would be much cheaper than doing the
big merge join.

   -Aaron
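Concretely, the point about the value of t28.s being knowable amounts to doing the
substitution by hand in two steps; a sketch (the literal in the second statement is
hypothetical, standing in for whatever the first query returns):

  SELECT s FROM t28 WHERE o = '"spec"';
  -- suppose this returns '<http://example.org/some-subject>'  (hypothetical value)

  EXPLAIN ANALYZE
  SELECT * FROM t1 JOIN t11 ON (t11.o = '<http://example.org>' AND t11.s = t1.o)
  WHERE t1.s = '<http://example.org/some-subject>';   -- hypothetical literal from step 1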
Thread: Hardware advice

Hello..

I'm waiting for my new system, and meanwhile I have some questions.

First, here are the specs.  The server will have kernel 2.1.19 and it will be
used only as a postgresql server (nothing else... no named, dhcp, web, mail,
etc).  Postgresql version will be 8.2.  It will be heavily used on inserts,
triggers on each insert/update, and occasionally selects.

System:    SuperMicro 7045B-3
CPU:       1 Dual Core Woodcrest, 2.66 GHz, 4 MB cache, FSB 1333 MHz
RAM:       4 GB (4 x 1 GB modules) at 667 MHz
RAID CTRL: LSI MegaRAID SAS 8408E
DISKS:     8 x SATA II 7200 rpm, NCQ, 8 MB cache, with Supermicro 8 SAS enclosure

Based on the needs, i'm planning an upgrade of the drives to 15,000 rpm SAS
(pretty expensive now).

Question 1:
        The RAID layout should be:
        a) 2 hdd in raid1 for system and pg_xlog, and 6 hdd in raid10 for data?
        b) 8 hdd in raid10 for all?
        c) 2 hdd in raid1 for system, 2 hdd in raid1 for pg_xlog, and 4 hdd in
           raid10 for data?
        Obs: I'm going for setup a), but i want to hear your thoughts as well.

Question 2: (Don't want to start a flame here..... but here it goes)
        What filesystem should i run for data?  ext3 or xfs?
        The tables have ~15,000 rel_pages each.  The biggest table now has over
        30,000 pages.

Question 3:
        The block size in postgresql is 8 kB.  The stripe size in the raid ctrl
        is 64 kB.  Should i increase the pgsql block size to 16, 32 or even 64 kB?

As soon as the system is delivered, i'm planning some benchmarks.

Regards,
Alex
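For context, figures like the rel_pages counts above can be read from pg_class,
and the compiled-in block size can be checked from SQL; a quick sketch (the LIMIT
is arbitrary, and relpages is only an estimate maintained by VACUUM/ANALYZE):

  SHOW block_size;     -- 8192 on a stock build

  SELECT relname, relpages,
         pg_size_pretty(pg_relation_size(oid)) AS on_disk
    FROM pg_class
   WHERE relkind = 'r'
   ORDER BY relpages DESC
   LIMIT 10;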
"Aaron Birkland" <birkie@gmail.com> writes: > ... Is is possible to generate a plan that looks like this: > Nested Loop Left Join (cost=???) > -> Index Scan using t28_o on t28 (cost=0.00..9.11 rows=1 width=89) > Index Cond: (o = '"spec"'::text) > -> Nested Loop (cost=???) > -> Index Scan using t1_s on t1 (cost=???) > Index Cond: (s = t28.s) > -> Bitmap Heap Scan on t11 (cost=???) > Recheck Cond: (t11.s = t1.o) > Filter: (o = '<http://example.org>'::text) > -> Bitmap Index Scan on t11_s (cost=??? ) > Index Cond: (t11.s = t1.o) No. It'd take massive work in both the planner and executor to support something like that --- they are designed around the assumption that nestloop with inner index scan is exactly that, just one indexscan on the inside of the loop. (As of 8.2 there's some klugery to handle an Append of indexscans too, but it won't generalize readily.) It might be something to work on for the future, but I can see a couple of risks to trying to support this type of plan: * exponential growth in the number of plans the planner has to consider; * greatly increased runtime penalty if the planner has underestimated the number of rows produced by the outer side of the nestloop. regards, tom lane
Alexandru,

> The server will have kernel 2.1.19 and it will be used only as a postgresql

Assuming you're talking Linux, I think you mean 2.6.19?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Hello..

Yes, sorry for the mistype..

Regards,
Alex

----- Original Message -----
From: "Josh Berkus" <josh@agliodbs.com>
To: <pgsql-performance@postgresql.org>
Cc: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
Sent: Sunday, December 03, 2006 10:11 PM
Subject: Re: [PERFORM] Hardware advice

> Assuming you're talking Linux, I think you mean 2.6.19?
Hi Alexandru,

Alexandru Coseru schrieb:
> [...]
> Question 1:
>         The RAID layout should be:
>         a) 2 hdd in raid1 for system and pg_xlog, and 6 hdd in raid10 for data?
>         b) 8 hdd in raid10 for all?
>         c) 2 hdd in raid1 for system, 2 hdd in raid1 for pg_xlog, and 4 hdd in
>            raid10 for data?
>         Obs: I'm going for setup a), but i want to hear your thoughts as well.

This depends on your data size.  I think options a and c are good.  The potential
bottleneck may be the RAID 1 for pg_xlog if you have a huge amount of updates and
inserts.  What about another setup:

4 hdd in RAID 10 for system and pg_xlog - system partitions are normally not in
heavy use, and pg_xlog should be fast for writing.
4 hdd in RAID 10 for data.

> Question 2: (Don't want to start a flame here..... but here it goes)
>         What filesystem should i run for data?  ext3 or xfs?
>         The tables have ~15,000 rel_pages each.  The biggest table now has over
>         30,000 pages.

We have a database running with 60,000+ tables.  The table sizes are between a few
kByte for the small tables and up to 30 GB for the largest one.  We had no issue
with ext3 in the past.

> Question 3:
>         The block size in postgresql is 8 kB.  The stripe size in the raid ctrl
>         is 64 kB.  Should i increase the pgsql block size to 16, 32 or even 64 kB?

You should keep in mind that the file system also has a block size.  Ext3 has a
maximum of 4 kB.  I would set up the partitions aligned to the stripe size to
prevent unaligned reads.  I guess you can imagine that a larger block size in
postgresql may also end up in unaligned reads, because the file system has a
smaller block size.

RAID volume and file system setup:

1. Make all partitions aligned to the RAID stripe size.
   The first partition should start at 128 kByte.
   You can do this with fdisk: after you have created the partition, switch
   to the expert mode (type x) and modify the beginning of the partition
   (type b).  You should change this value to 128 (default is 63).
   All other partitions should also start on a multiple of 128 kByte.

2. Give the file system a hint that you work with larger block sizes.
   Ext3: mke2fs -b 4096 -j -R stride=2 /dev/sda1 -L LABEL
   I made an I/O test with PostgreSQL on a RAID system with a stripe size
   of 64 kByte and a block size of 8 kByte in the RAID system.
   Stride=2 was the best value.

PS: You should have a second XEON in your budget plan.

Sven.
Hello..

Thanks for the advice..

Actually, i'm waiting for the clovertown to show up on the market...

Regards,
Alex

----- Original Message -----
From: "Sven Geisler" <sgeisler@aeccom.com>
To: "Alexandru Coseru" <alexandru.coseru@totaltelecom.ro>
Cc: <pgsql-performance@postgresql.org>
Sent: Tuesday, December 05, 2006 11:57 AM
Subject: Re: [PERFORM] Hardware advice
The test that I did, which was somewhat limited, showed no benefit from splitting
disks into separate partitions for large bulk loads.

The program read from one very large file and wrote the input out to two other
large files.

The total throughput on a single partition was close to the maximum theoretical
for that logical drive, even though the process was reading and writing to three
separate places on the disk.  I don't know what this means for postgresql setups
directly, but I would postulate that the benefit from splitting pg_xlog onto a
separate spindle is not as great as it might once have been for large bulk
transactions.  I am therefore going to go with a single 6 drive RAID 5 for my data
warehouse application because I want the read speed to be available.  I can
benefit from fast reads when I want to do large data scans at the expense of
slightly slower insert speed.

Alex.
Hi Alex,

Please check out <http://www.powerpostgresql.com/PerfList> before you use RAID 5
for PostgreSQL.

Anyhow, on a larger scale you end up at the response time of the I/O system for a
read or write.  In modern RAID and SAN environments, the read is the part where
you have to focus when you want to tune your system, because most RAID and SAN
systems can buffer writes.  PostgreSQL does use the Linux file system cache, which
is normally much larger than the RAID or SAN cache, for reading.  This means that
whenever a PostgreSQL read goes to the RAID or SAN subsystem, the response time of
the hard disk becomes interesting.  I guess you can imagine that multiple reads to
the same spindles cause a delay in the response time.

Alexandru, you should have two XEONs, whatever your core count is.  This would use
the full benefit of the memory architecture - you know, two FSBs and two memory
channels.

Cheers
Sven

Alex Turner schrieb:
> [...] I am therefore going to go with a single 6 drive RAID 5 for my data
> warehouse application because I want the read speed to be available.

Sven Geisler <sgeisler@aeccom.com>          Tel +49.30.5362.1627  Fax .1638
Senior Developer, AEC/communications GmbH Berlin, Germany
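A rough way to see how many reads actually leave PostgreSQL's own buffer cache is
the block-level statistics views; a sketch (in 8.2 this needs stats_block_level =
on, and heap_blks_read counts blocks not found in shared_buffers, so it cannot
distinguish Linux file system cache hits from physical disk reads):

  SELECT sum(heap_blks_hit)  AS hit_in_shared_buffers,
         sum(heap_blks_read) AS read_from_os_cache_or_disk
    FROM pg_statio_user_tables;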
If your data is valuable I'd recommend against RAID5; see
<http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt>.

Performance aside, I'd advise against RAID5 in almost all circumstances.  Why take
chances?

Greg Williamson
DBA
GlobeXplorer LLC