Thread: RAID 0 not as fast as expected
I'm experimenting with RAID, looking for an inexpensive way to boost performance. I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG dual-port SATA cards. (NB: I don't plan to run RAID 0 in production, probably RAID 10, so no need to comment on the failure rate of RAID 0.)

I used this raw serial-speed test:

time sh -c "dd if=/dev/zero of=./bigfile bs=8k count=1000000 && sync"
(unmount/remount)
time sh -c "dd if=./bigfile of=/dev/null bs=8k count=1000000 && sync"

This showed that the RAID 0 4-disk array was almost exactly twice as fast as each disk individually. I expected 4X performance for a 4-disk RAID 0. My suspicion is that each of these budget SATA cards is bandwidth limited; they can't actually handle two disks simultaneously, and I'd need four separate SATA cards to get 4X performance (or a more expensive card such as the Areca someone mentioned the other day).

On the other hand, it "feels like" (using our application) the seek performance is quite a bit better, which I'd expect given my hypothesis about the SIIG cards. I don't have concrete benchmarks on seek speed.

Thanks,
Craig
Craig A. James wrote: > I'm experiment with RAID, looking for an inexpensive way to boost > performance. I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG > dual-port SATA cards. (NB: I don't plan to run RAID 0 in production, > probably RAID 10, so no need to comment on the failure rate of RAID 0.) > > I used this raw serial-speed test: > > time sh -c "dd if=/dev/zero of=./bigfile bs=8k count=1000000 && sync" > (unmount/remount) > time sh -c "dd if=./bigfile of=/dev/null bs=8k count=1000000 && sync" > > Which showed that the RAID 0 4-disk array was almost exactly twice as > fast as each disk individually. I expected 4X performance for a 4-disk > RAID 0. My suspicion is that each of these budget SATA cards is I am assuming linux here, Linux software raid 0 is known not to be super duper. Secondly remember that there is overhead involved with using raid. The direct correlation doesn't work. Joshua D. Drake > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/
On Thursday 14 September 2006 11:05, "Craig A. James" <cjames@modgraph-usa.com> wrote: > I'm experiment with RAID, looking for an inexpensive way to boost > performance. I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG > dual-port SATA cards. (NB: I don't plan to run RAID 0 in production, > probably RAID 10, so no need to comment on the failure rate of RAID 0.) > Are those PCI cards? If yes, it's just a bus bandwidth limit. -- Alan
Alan Hodgson wrote: > On Thursday 14 September 2006 11:05, "Craig A. James" > <cjames@modgraph-usa.com> wrote: >> I'm experiment with RAID, looking for an inexpensive way to boost >> performance. I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG >> dual-port SATA cards. (NB: I don't plan to run RAID 0 in production, >> probably RAID 10, so no need to comment on the failure rate of RAID 0.) >> > > Are those PCI cards? If yes, it's just a bus bandwidth limit.

Ok, that makes sense.

One SATA disk = 52 MB/sec
4-disk RAID0  = 106 MB/sec
PCI at 33 MHz x 32 bits (4 bytes) = 132 MB/sec

I guess getting to 80% of the theoretical speed is as much as I should expect.

Thanks,
Craig
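For anyone who wants to redo that arithmetic for other bus types, here is a tiny sketch in C; the 52 and 106 MB/sec figures are just the measurements quoted above, and the clock/width numbers are the standard parallel PCI variants:

#include <stdio.h>

/* Theoretical burst bandwidth of a parallel PCI bus:
   clock (MHz) * width (bytes) = MB/s. */
static double pci_mb_per_sec(double clock_mhz, int width_bits)
{
    return clock_mhz * (width_bits / 8.0);
}

int main(void)
{
    double single_disk = 52.0;   /* measured single-disk rate above, MB/s  */
    double raid0       = 106.0;  /* measured 4-disk RAID0 rate above, MB/s */
    double pci33_32    = pci_mb_per_sec(33.0, 32);

    printf("PCI 32-bit/33MHz theoretical:  %.0f MB/s\n", pci33_32);
    printf("4-disk RAID0 achieved:         %.0f MB/s (%.0f%% of the bus)\n",
           raid0, 100.0 * raid0 / pci33_32);
    printf("4 disks without the bus limit: %.0f MB/s\n", 4.0 * single_disk);
    printf("PCI 64-bit/66MHz:              %.0f MB/s\n", pci_mb_per_sec(66.0, 64));
    printf("PCI-X 64-bit/133MHz:           %.0f MB/s\n", pci_mb_per_sec(133.0, 64));
    return 0;
}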
On Thu, 2006-09-14 at 16:35, Craig A. James wrote: > Alan Hodgson wrote: > > On Thursday 14 September 2006 11:05, "Craig A. James" > > <cjames@modgraph-usa.com> wrote: > >> I'm experiment with RAID, looking for an inexpensive way to boost > >> performance. I bought 4 Seagate 7200.9 120 GB SATA drives and two SIIG > >> dual-port SATA cards. (NB: I don't plan to run RAID 0 in production, > >> probably RAID 10, so no need to comment on the failure rate of RAID 0.) > >> > > > > Are those PCI cards? If yes, it's just a bus bandwidth limit. > > Ok, that makes sense. > > One SATA disk = 52 MB/sec > 4-disk RAID0 = 106 MB/sec > > PCI at 33 MHz x 32 bits (4 bytes) = 132 MB/sec. > > I guess getting to 80% of the theoretical speed is as much as I should expect. Note that many mid to high end motherboards have multiple PCI busses / channels, and you could put a card in each one and get > 132MByte/sec on them. But for a database, sequential throughput is almost never the real problem. It's usually random access that counts, and for that a RAID 10 is a pretty good choice.
Josh, On 9/14/06 11:49 AM, "Joshua D. Drake" <jd@commandprompt.com> wrote: > I am assuming linux here, Linux software raid 0 is known not to be super > duper. I've obtained 1,950 MB/s using Linux software RAID on SATA drives. - Luke
Luke Lonergan wrote: > Josh, > > On 9/14/06 11:49 AM, "Joshua D. Drake" <jd@commandprompt.com> wrote: > >> I am assuming linux here, Linux software raid 0 is known not to be super >> duper. > > I've obtained 1,950 MB/s using Linux software RAID on SATA drives. With what? :) > > - Luke > > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/
Luke Lonergan wrote: > Josh, > > On 9/14/06 8:47 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote: > >>> I've obtained 1,950 MB/s using Linux software RAID on SATA drives. >> With what? :) > > Sun X4500 (aka Thumper) running stock RedHat 4.3 (actually CentOS 4.3) with > XFS and the linux md driver without lvm. Here is a summary of the results: > Good god! > > Read Test > RAID Level Max Readahead (KB) RAID Chunksize Max Readahead on Disks (KB) > Max Time (s) Read Bandwidth (MB/s) > 0 65536 64 256 16.689 1,917.43 > 0 4096 64 256 21.269 1,504.54 > 0 65536 256 256 17.967 1,781.04 > 0 2816 256 256 18.835 1,698.96 > 0 65536 1024 256 18.538 1,726.18 > 0 65536 64 512 18.295 1,749.11 > 0 65536 64 256 18.931 1,690.35 > 0 65536 64 256 18.873 1,695.54 > 0 64768 64 256 18.545 1,725.53 > 0 131172 64 256 18.548 1,725.25 > 0 131172 64 65536 19.046 1,680.14 > 0 131172 64 524288 18.125 1,765.52 > 0 131172 64 1048576 18.701 1,711.14 > 5 2560 64 256 39.933 801.34 > 5 16777216 64 256 37.76 847.46 > 5 524288 64 256 53.497 598.16 > 5 65536 32 256 38.472 831.77 > 5 65536 32 256 38.004 842.02 > 5 65536 32 256 37.884 844.68 > 5 2560 16 256 41.39 773.13 > 5 65536 16 256 48.902 654.37 > 10 65536 64 256 83.256 384.36 > 1+0 65536 64 256 19.394 1,649.99 > 1+0 65536 64 256 19.047 1,680.05 > 1+0 65536 64 256 19.195 1,667.10 > 1+0 65536 64 256 18.806 1,701.58 > 1+0 65536 64 256 18.848 1,697.79 > 1+0 65536 64 256 18.371 1,741.88 > 1+0 65536 64 256 21.446 1,492.12 > 1+0 65536 64 256 20.254 1,579.93 > > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/
That's an all PCI-X box which makes sense. There are 6 SATA controllers in that little beastie also. You can always count on Sun to provide over engineered boxes. > -----Original Message----- > From: pgsql-performance-owner@postgresql.org > [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of > Joshua D. Drake > Sent: Friday, September 15, 2006 12:01 AM > To: Luke Lonergan > Cc: Craig A. James; pgsql-performance@postgresql.org > Subject: Re: [PERFORM] RAID 0 not as fast as expected > > Luke Lonergan wrote: > > Josh, > > > > On 9/14/06 8:47 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote: > > > >>> I've obtained 1,950 MB/s using Linux software RAID on SATA drives. > >> With what? :) > > > > Sun X4500 (aka Thumper) running stock RedHat 4.3 (actually > CentOS 4.3) > > with XFS and the linux md driver without lvm. Here is a > summary of the results: > > > > > Good god! > > > > > Read Test > > RAID Level Max Readahead (KB) RAID Chunksize Max Readahead > on Disks > > (KB) Max Time (s) Read Bandwidth (MB/s) 0 65536 64 256 16.689 > > 1,917.43 0 4096 64 256 21.269 1,504.54 0 65536 256 256 17.967 > > 1,781.04 0 2816 256 256 18.835 1,698.96 0 65536 1024 256 18.538 > > 1,726.18 0 65536 64 512 18.295 1,749.11 0 65536 64 256 18.931 > > 1,690.35 0 65536 64 256 18.873 1,695.54 0 64768 64 256 18.545 > > 1,725.53 0 131172 64 256 18.548 1,725.25 0 131172 64 > 65536 19.046 > > 1,680.14 0 131172 64 524288 18.125 1,765.52 0 131172 64 1048576 > > 18.701 1,711.14 > > 5 2560 64 256 39.933 801.34 > > 5 16777216 64 256 37.76 847.46 > > 5 524288 64 256 53.497 598.16 > > 5 65536 32 256 38.472 831.77 > > 5 65536 32 256 38.004 842.02 > > 5 65536 32 256 37.884 844.68 > > 5 2560 16 256 41.39 773.13 > > 5 65536 16 256 48.902 654.37 > > 10 65536 64 256 83.256 384.36 > > 1+0 65536 64 256 19.394 1,649.99 > > 1+0 65536 64 256 19.047 1,680.05 > > 1+0 65536 64 256 19.195 1,667.10 > > 1+0 65536 64 256 18.806 1,701.58 > > 1+0 65536 64 256 18.848 1,697.79 > > 1+0 65536 64 256 18.371 1,741.88 > > 1+0 65536 64 256 21.446 1,492.12 > > 1+0 65536 64 256 20.254 1,579.93 > > > > > > > -- > > === The PostgreSQL Company: Command Prompt, Inc. === > Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 > Providing the most comprehensive PostgreSQL solutions since 1997 > http://www.commandprompt.com/ > > > > ---------------------------(end of > broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster >
Greg, Josh, Something I found out while doing this - lvm (and lvm2) slows the block stream down dramatically. At first I was using it for convenience sake to implement partitions on top of the md devices, but I found I was stuck at about 700 MB/s. Removing lvm2 from the picture allowed me to get within chucking distance of 2GB/s. When we first started working with Solaris ZFS, we were getting about 400-600 MB/s, and after working with the Solaris Engineering team we now get rates approaching 2GB/s. The updates needed to Solaris are part of the Solaris 10 U3 available in October (and already in Solaris Express, aka Solaris 11). - Luke On 9/15/06 5:43 AM, "Spiegelberg, Greg" <gspiegelberg@cranel.com> wrote: > That's an all PCI-X box which makes sense. There are 6 SATA controllers > in that little beastie also. You can always count on Sun to provide > over engineered boxes. > > > >> -----Original Message----- >> From: pgsql-performance-owner@postgresql.org >> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of >> Joshua D. Drake >> Sent: Friday, September 15, 2006 12:01 AM >> To: Luke Lonergan >> Cc: Craig A. James; pgsql-performance@postgresql.org >> Subject: Re: [PERFORM] RAID 0 not as fast as expected >> >> Luke Lonergan wrote: >>> Josh, >>> >>> On 9/14/06 8:47 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote: >>> >>>>> I've obtained 1,950 MB/s using Linux software RAID on SATA drives. >>>> With what? :) >>> >>> Sun X4500 (aka Thumper) running stock RedHat 4.3 (actually >> CentOS 4.3) >>> with XFS and the linux md driver without lvm. Here is a >> summary of the results: >>> >> >> >> Good god! >> >>> >>> Read Test >>> RAID Level Max Readahead (KB) RAID Chunksize Max Readahead >> on Disks >>> (KB) Max Time (s) Read Bandwidth (MB/s) 0 65536 64 256 16.689 >>> 1,917.43 0 4096 64 256 21.269 1,504.54 0 65536 256 256 17.967 >>> 1,781.04 0 2816 256 256 18.835 1,698.96 0 65536 1024 256 18.538 >>> 1,726.18 0 65536 64 512 18.295 1,749.11 0 65536 64 256 18.931 >>> 1,690.35 0 65536 64 256 18.873 1,695.54 0 64768 64 256 18.545 >>> 1,725.53 0 131172 64 256 18.548 1,725.25 0 131172 64 >> 65536 19.046 >>> 1,680.14 0 131172 64 524288 18.125 1,765.52 0 131172 64 1048576 >>> 18.701 1,711.14 >>> 5 2560 64 256 39.933 801.34 >>> 5 16777216 64 256 37.76 847.46 >>> 5 524288 64 256 53.497 598.16 >>> 5 65536 32 256 38.472 831.77 >>> 5 65536 32 256 38.004 842.02 >>> 5 65536 32 256 37.884 844.68 >>> 5 2560 16 256 41.39 773.13 >>> 5 65536 16 256 48.902 654.37 >>> 10 65536 64 256 83.256 384.36 >>> 1+0 65536 64 256 19.394 1,649.99 >>> 1+0 65536 64 256 19.047 1,680.05 >>> 1+0 65536 64 256 19.195 1,667.10 >>> 1+0 65536 64 256 18.806 1,701.58 >>> 1+0 65536 64 256 18.848 1,697.79 >>> 1+0 65536 64 256 18.371 1,741.88 >>> 1+0 65536 64 256 21.446 1,492.12 >>> 1+0 65536 64 256 20.254 1,579.93 >>> >>> >> >> >> -- >> >> === The PostgreSQL Company: Command Prompt, Inc. === >> Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 >> Providing the most comprehensive PostgreSQL solutions since 1997 >> http://www.commandprompt.com/ >> >> >> >> ---------------------------(end of >> broadcast)--------------------------- >> TIP 2: Don't 'kill -9' the postmaster >> >
>When we first started working with Solaris ZFS, we were getting about >400-600 MB/s, and after working with the Solaris Engineering team we now >get >rates approaching 2GB/s. The updates needed to Solaris are part of the >Solaris 10 U3 available in October (and already in Solaris Express, aka >Solaris 11). Luke, What other file systems have you had good success with? Solaris would be nice, but it looks like I'm stuck running on FreeBSD (6.1, amd64) so UFS2 would be the default. Not sure about XFS on BSD, and I'm not sure at the moment that ext2/3 provide enough benefit over UFS to spend much time on. Also, has anyone had any experience with gmirror (good or bad)? I'm thinking of trying to use it to stripe two hardware mirrored sets since HW RAID10 wasn't doing as well as I had hoped (Dell Perc5/I controller). For a 4 disk RAID 10 (10k rpm SAS/SCSI disks) what would be a good target performance number? Right now, dd shows 224 MB/s. And lastly, for a more OLAP style database, would I be correct in assuming that sequential access speed would be more important than is normally the case? (I have a relatively small number of connections, but each running on pretty large data sets). Thanks, Bucky
Josh,

On 9/14/06 8:47 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote: >> I've obtained 1,950 MB/s using Linux software RAID on SATA drives. > > With what? :)

Sun X4500 (aka Thumper) running stock RedHat 4.3 (actually CentOS 4.3) with XFS and the linux md driver without lvm. Here is a summary of the results:

Read Test

RAID Level  Max Readahead (KB)  RAID Chunksize  Max Readahead on Disks (KB)  Max Time (s)  Read Bandwidth (MB/s)
0           65536               64              256                          16.689        1,917.43
0           4096                64              256                          21.269        1,504.54
0           65536               256             256                          17.967        1,781.04
0           2816                256             256                          18.835        1,698.96
0           65536               1024            256                          18.538        1,726.18
0           65536               64              512                          18.295        1,749.11
0           65536               64              256                          18.931        1,690.35
0           65536               64              256                          18.873        1,695.54
0           64768               64              256                          18.545        1,725.53
0           131172              64              256                          18.548        1,725.25
0           131172              64              65536                        19.046        1,680.14
0           131172              64              524288                       18.125        1,765.52
0           131172              64              1048576                      18.701        1,711.14
5           2560                64              256                          39.933        801.34
5           16777216            64              256                          37.76         847.46
5           524288              64              256                          53.497        598.16
5           65536               32              256                          38.472        831.77
5           65536               32              256                          38.004        842.02
5           65536               32              256                          37.884        844.68
5           2560                16              256                          41.39         773.13
5           65536               16              256                          48.902        654.37
10          65536               64              256                          83.256        384.36
1+0         65536               64              256                          19.394        1,649.99
1+0         65536               64              256                          19.047        1,680.05
1+0         65536               64              256                          19.195        1,667.10
1+0         65536               64              256                          18.806        1,701.58
1+0         65536               64              256                          18.848        1,697.79
1+0         65536               64              256                          18.371        1,741.88
1+0         65536               64              256                          21.446        1,492.12
1+0         65536               64              256                          20.254        1,579.93
Bucky, On 9/15/06 11:28 AM, "Bucky Jordan" <bjordan@lumeta.com> wrote: > What other file systems have you had good success with? Solaris would be > nice, but it looks like I'm stuck running on FreeBSD (6.1, amd64) so > UFS2 would be the default. Not sure about XFS on BSD, and I'm not sure > at the moment that ext2/3 provide enough benefit over UFS to spend much > time on. It won't matter much between UFS2 or others until you get past about 350 MB/s. > Also, has anyone had any experience with gmirror (good or bad)? I'm > thinking of trying to use it to stripe two hardware mirrored sets since > HW RAID10 wasn't doing as well as I had hoped (Dell Perc5/I controller). > For a 4 disk RAID 10 (10k rpm SAS/SCSI disks) what would be a good > target performance number? Right now, dd shows 224 MB/s. Each disk should sustain somewhere between 60-80 MB/s (see http://www.storagereview.com/ for a profile of your disk). Your dd test sounds suspiciously too fast unless you were running two simultaneous dd processes. Did you read from a file that was at least twice the size of RAM? A single dd stream would run between 120 and 160 MB/s on a RAID10, two streams would be between 240 and 320 MB/s. > And lastly, for a more OLAP style database, would I be correct in > assuming that sequential access speed would be more important than is > normally the case? (I have a relatively small number of connections, but > each running on pretty large data sets). Yes. What's pretty large? We've had to redefine large recently, now we're talking about systems with between 100TB and 1,000TB. - Luke
On Sat, Sep 16, 2006 at 04:46:04PM -0700, Luke Lonergan wrote: > Yes. What's pretty large? We've had to redefine large recently, now we're > talking about systems with between 100TB and 1,000TB. Do you actually have PostgreSQL databases in that size range? /* Steinar */ -- Homepage: http://www.sesse.net/
Steinar H. Gunderson wrote: > On Sat, Sep 16, 2006 at 04:46:04PM -0700, Luke Lonergan wrote: >> Yes. What's pretty large? We've had to redefine large recently, now we're >> talking about systems with between 100TB and 1,000TB. > > Do you actually have PostgreSQL databases in that size range? No, they have databases in MPP that are that large :) Joshua D. Drake > > /* Steinar */ -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/
>Yes. What's pretty large? We've had to redefine large recently, now we're >talking about systems with between 100TB and 1,000TB. > >- Luke Well, I said large, not gargantuan :) - Largest would probably be around a few TB, but the problem I'm having to deal with at the moment is large numbers (potentially > 1 billion) of small records (hopefully I can get it down to a few int4's and a int2 or so) in a single table. Currently we're testing for and targeting in the 500M records range, but the design needs to scale to 2-3 times that at least. I read one of your presentations on very large databases in PG, and saw mention of some tables over a billion rows, so that was encouraging. The new table partitioning in 8.x will be very useful. What's the largest DB you've seen to date on PG (in terms of total disk storage, and records in largest table(s) )? My question is at what point do I have to get fancy with those big tables? From your presentation, it looks like PG can handle 1.2 billion records or so as long as you write intelligent queries. (And normal PG should be able to handle that, correct?) Also, does anyone know if/when any of the MPP stuff will be ported to Postgres, or is the plan to keep that separate? Thanks, Bucky
On 9/18/06, Bucky Jordan <bjordan@lumeta.com> wrote: > My question is at what point do I have to get fancy with those big > tables? From your presentation, it looks like PG can handle 1.2 billion > records or so as long as you write intelligent queries. (And normal PG > should be able to handle that, correct?)

I would rephrase that: large databases are less forgiving of unintelligent queries, particularly of the kind produced by your average stupid database-abstracting middleware :-). seek times on a 1gb database are going to be zero all the time, not so on a 1tb+ database.

good normalization skills are really important for large databases, along with materialization strategies for 'denormalized sets'.

regarding the number of rows, there is no limit to how much pg can handle per se, just some practical limitations, especially vacuum and reindex times. these are important because they are required to keep a handle on mvcc bloat, and it's very nice to be able to vacuum bits of your database at a time.

just another fyi, if you have a really big database, you can forget about doing pg_dump for backups (unless you really don't care about being a day or more behind)...you simply have to do some type of replication/failover strategy. i would start with pitr.

merlin
On Monday 18 September 2006 13:56, "Merlin Moncure" <mmoncure@gmail.com> wrote: > just another fyi, if you have a really big database, you can forget > about doing pg_dump for backups (unless you really don't care about > being a day or more behind)...you simply have to do some type of > replication/failover strategy. i would start with pitr.

And, of course, the biggest problem of all: upgrades.

-- Eat right. Exercise regularly. Die anyway.
> good normalization skills are really important for large databases, > along with materialization strategies for 'denormalized sets'. Good points- thanks. I'm especially curious what others have done for the materialization. The matview project on gborg appears dead, and I've only found a smattering of references on google. My guess is, you roll your own for optimal performance... > regarding the number of rows, there is no limit to how much pg can > handle per se, just some practical limitations, especially vacuum and > reindex times. these are important because they are required to keep > a handle on mvcc bloat and its very nice to be able to vaccum bits of > your database at a time. I was hoping for some actual numbers on "practical". Hardware isn't too much of an issue (within reason- we're not talking an amazon or google here... the SunFire X4500 looks interesting... )- if a customer wants to store that much data, and pay for it, we'll figure out how to do it. I'd just rather not have to re-design the database. Say the requirement is to keep 12 months of data accessible, each "scan" produces 100M records, and I run one per month. What happens if the customer wants to run it once a week? I was more trying to figure out at what point (ballpark) I'm going to have to look into archive tables and things of that nature (or at Bizgres/MPP). It's easier for us to add more/bigger hardware, but not so easy to redesign/add history tables... > > just another fyi, if you have a really big database, you can forget > about doing pg_dump for backups (unless you really don't care about > being x day or days behind)...you simply have to due some type of > replication/failover strategy. i would start with pitr. > > merlin I was originally thinking replication, but I did notice some nice pitr features in 8.x - I'll have to look into that some more. Thanks for the pointers though... - Bucky
Do the basic math:
If you have a table with 100 million records, each of which is 200 bytes long, that gives you roughly 20 gig of data (assuming it was all written neatly and hasn't been updated much). If you have to do a full table scan, then it will take roughly 400 seconds with a single 10k RPM SCSI drive with an average read speed of 50MB/sec. If you are going to read indexes, figure out how big your index is and how many blocks will be returned, figure out how many blocks this will require transferring from the main table, make an estimate of the seeks, add in the transfer total, and you have a time to get your data. A big array with a good controller can pass 1000MB/sec transfer on the right bus if you buy the right technologies. But be warned: if you buy the wrong ones, your big array can end up being slower than a single drive for sequential transfer. At 1000MB/sec your scan would take 20 seconds.
Be warned, the tech specs page:
http://www.sun.com/servers/x64/x4500/specs.xml#anchor3
doesn't mention RAID 10 as a possibility, and this is probably what most would recommend for fast data access if you are doing both read and write operations. If you are doing mostly reads, then RAID 5 is passable, but its redundancy with large numbers of drives is not so great.
Alex.
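To make the back-of-the-envelope arithmetic above easy to rerun with different numbers, here is a minimal sketch in C; the 50 MB/sec and 1000 MB/sec figures are just the ones used in the estimate above:

#include <stdio.h>

/* Seconds for a full sequential scan: rows * bytes per row / read rate. */
static double scan_seconds(double rows, double bytes_per_row, double mb_per_sec)
{
    return rows * bytes_per_row / (1024.0 * 1024.0) / mb_per_sec;
}

int main(void)
{
    double rows = 100e6, bytes_per_row = 200.0;   /* the ~20 GB table above */

    printf("single 10k RPM disk @ 50 MB/s:  %.0f seconds\n",
           scan_seconds(rows, bytes_per_row, 50.0));
    printf("well-chosen array @ 1000 MB/s:  %.0f seconds\n",
           scan_seconds(rows, bytes_per_row, 1000.0));
    return 0;
}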
On Mon, Sep 18, 2006 at 07:14:56PM -0400, Alex Turner wrote: >If you have a table with 100million records, each of which is 200bytes long, >that gives you roughtly 20 gig of data (assuming it was all written neatly >and hasn't been updated much). If you're in that range it doesn't even count as big or challenging--you can keep it memory resident for not all that much money. Mike Stone
Bucky, On 9/18/06 7:37 AM, "Bucky Jordan" <bjordan@lumeta.com> wrote: > My question is at what point do I have to get fancy with those big > tables? From your presentation, it looks like PG can handle 1.2 billion > records or so as long as you write intelligent queries. (And normal PG > should be able to handle that, correct?) PG has limitations that will confront you at sizes beyond about a couple hundred GB of table size, as will Oracle and others. You should be careful to implement very good disk hardware and leverage Postgres 8.1 partitioning and indexes intelligently as you go beyond 100GB per instance. Also be sure to set the random_page_cost parameter in postgresql.conf to 100 or even higher when you use indexes, as the actual seek rate for random access ranges between 50 and 300 for modern disk hardware. If this parameter is left at the default of 4, indexes will often be used inappropriately. > Also, does anyone know if/when any of the MPP stuff will be ported to > Postgres, or is the plan to keep that separate? The plan is to keep that separate for now, though we're contributing technology like partitioning, faster sorting, bitmap index, adaptive nested loop, and hybrid hash aggregation to make big databases work better in Postgres. - Luke
Alex,

On 9/18/06 4:14 PM, "Alex Turner" <armtuk@gmail.com> wrote: > Be warned, the tech specs page: > http://www.sun.com/servers/x64/x4500/specs.xml#anchor3 > doesn't mention RAID 10 as a possibility, and this is probably what most would > recommend for fast data access if you are doing both read and write > operations. If you are doing mostly reads, then RAID 5 is passable, but its > redundancy with large numbers of drives is not so great.

RAID10 works great on the X4500; we get 1.6GB/s+ per X4500 using RAID10 in ZFS. We worked with the Sun Solaris kernel team to make that happen, and the patches are part of Solaris 10 Update 3 due out in November.

- Luke
Sweet - that's good - RAID 10 support seems like an odd thing to leave out.
Alex
Yep, Solaris ZFS kicks butt. It does RAID10/5/6, etc. and implements most of the high-end features available on high-end SANs...

- Luke

On 9/18/06 8:40 PM, "Alex Turner" <armtuk@gmail.com> wrote: > Sweet - that's good - RAID 10 support seems like an odd thing to leave out. > > Alex > > On 9/18/06, Luke Lonergan <llonergan@greenplum.com> wrote: >> Alex, >> >> On 9/18/06 4:14 PM, "Alex Turner" <armtuk@gmail.com> wrote: >> >>> Be warned, the tech specs page: >>> http://www.sun.com/servers/x64/x4500/specs.xml#anchor3 >>> doesn't mention RAID 10 as a possibility, and this is probably what most would >>> recommend for fast data access if you are doing both read and write >>> operations. If you are doing mostly reads, then RAID 5 is passable, but its >>> redundancy with large numbers of drives is not so great. >> >> RAID10 works great on the X4500; we get 1.6GB/s+ per X4500 using RAID10 in >> ZFS. We worked with the Sun Solaris kernel team to make that happen, and the >> patches are part of Solaris 10 Update 3 due out in November. >> >> - Luke
On Mon, Sep 18, 2006 at 06:10:13PM -0700, Luke Lonergan wrote: > Also be sure to set the random_page_cost parameter in > postgresql.conf to 100 or even higher when you use indexes, as the actual > seek rate for random access ranges between 50 and 300 for modern disk > hardware. If this parameter is left at the default of 4, indexes will often > be used inappropriately. Does a tool exist yet to time this for a particular configuration? Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Mark,

On 9/18/06 8:45 PM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote: > Does a tool exist yet to time this for a particular configuration?

We're considering building this into ANALYZE on a per-table basis. The basic approach measures sequential access in pages per second, then random seeks in pages per second, and takes the ratio of the two.

Since PG's heap scan is single threaded, the seek rate is equivalent to a single disk (even though RAID arrays may have many spindles); the typical random seek rates are around 100-200 seeks per second from within the backend. That means that as sequential scan performance increases, such as happens when using large RAID arrays, the random_page_cost will range from 50 to 300, increasing linearly as the size of the RAID array increases.

- Luke
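For anyone who wants to try this by hand before it lands in ANALYZE, here is a rough standalone sketch of the measurement being described (my own illustration, not the actual ANALYZE code): read a file sequentially in 8 KB pages, then read the same number of pages at random offsets, and take the ratio of the two page rates. The file needs to be much larger than RAM, with the OS caches dropped beforehand, for the ratio to mean anything, and the real random_page_cost also folds in caching effects, so treat the output only as a ballpark.

/* Rough sketch: estimate a random_page_cost-style ratio for one file.
   usage: ./pagecost /path/to/file_much_larger_than_RAM
   build: cc -std=c99 -O2 pagecost.c -lrt -o pagecost */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define PAGE   8192
#define NPAGES 4096            /* pages sampled in each phase */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
    char buf[PAGE];

    if (argc < 2) { fprintf(stderr, "usage: %s bigfile\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t pages_in_file = lseek(fd, 0, SEEK_END) / PAGE;

    /* sequential phase: read NPAGES consecutive pages from the start */
    lseek(fd, 0, SEEK_SET);
    double t0 = now_sec();
    for (int i = 0; i < NPAGES; i++)
        if (read(fd, buf, PAGE) != PAGE) break;
    double seq_rate = NPAGES / (now_sec() - t0);

    /* random phase: read NPAGES pages at random offsets */
    srand(12345);
    double t1 = now_sec();
    for (int i = 0; i < NPAGES; i++) {
        off_t page = (off_t)(((double)rand() / RAND_MAX) * (pages_in_file - 1));
        if (pread(fd, buf, PAGE, page * PAGE) != PAGE)
            perror("pread");
    }
    double rnd_rate = NPAGES / (now_sec() - t1);

    printf("sequential: %.0f pages/s, random: %.0f pages/s, ratio: %.1f\n",
           seq_rate, rnd_rate, seq_rate / rnd_rate);
    close(fd);
    return 0;
}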
Mike,

> On Mon, Sep 18, 2006 at 07:14:56PM -0400, Alex Turner wrote: > >If you have a table with 100million records, each of which is 200bytes > long, > >that gives you roughly 20 gig of data (assuming it was all written > neatly > >and hasn't been updated much). >

I'll keep that in mind (minimizing updates during loads). My plan is that updates will actually be implemented as inserts to a summary/history table followed by deletion of the old records. The OLTP part of this will be limited to a particular set of tables that I anticipate will not be nearly as large.

> If you're in that range it doesn't even count as big or challenging--you > can keep it memory resident for not all that much money. > > Mike Stone >

I'm aware of that; however, *each* scan could be 100m records, and we need to keep a minimum of 12, and possibly 50 or more. So sure, 100m records total would fit, but not 500m or 1b... According to Alex's calculations, that'd be 100G for 500m records (just that one table, not including indexes). From what Luke was saying, there are some issues once you get over a couple hundred GB in a single table, so in the case of 12 scans, it looks like I can squeeze it in given sufficient hardware, but more than that and I'll have to look at history tables or some other solution. I'd also think doing some sort of summary table/materialized view for count/sum operations would be a necessity at this point.

I'm not sure that this is a good topic for the list, but in the interest of sharing info I'll ask, and if someone feels it warrants a private response, we can discuss off list. Would Bizgres be able to handle tables > 200GB or so, or is it still quite similar to Postgres (single-threaded-per-query type issues)? And also, does switching from Postgres to Bizgres or Bizgres MPP require any application changes?

Thanks for all the help,

Bucky
Hi, Luke,

Luke Lonergan wrote: > Since PG's heap scan is single threaded, the seek rate is equivalent to a > single disk (even though RAID arrays may have many spindles), the typical > random seek rates are around 100-200 seeks per second from within the > backend. That means that as sequential scan performance increases, such as > happens when using large RAID arrays, the random_page_cost will range from > 50 to 300 linearly as the size of the RAID array increases.

Do you think that adding some posix_fadvise() calls to the backend to pre-fetch some blocks into the OS cache asynchronously could improve that situation?

I could imagine that, e.g., bitmap index scans could profit in the heap-fetching stage by fadvise()ing the next few blocks.

Maybe asynchronous I/O could be used for the same benefit, but posix_fadvise() is less intrusive, and can easily be #define'd out on platforms that don't support it.

Combine this with the Linux kernel I/O scheduler patches (readahead improvements) that were discussed here in summer...

Regards, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
Markus,

On 9/20/06 1:09 AM, "Markus Schaber" <schabi@logix-tt.com> wrote: > Do you think that adding some posix_fadvise() calls to the backend to > pre-fetch some blocks into the OS cache asynchronously could improve > that situation?

Nope - this requires true multi-threading of the I/O; there need to be multiple seek operations running simultaneously. The current executor blocks on each page request, waiting for the I/O to happen before requesting the next page. The OS can't predict what random page is to be requested next.

We can implement multiple scanners (already present in MPP), or we could implement AIO and fire off a number of simultaneous I/O requests for fulfillment.

- Luke
Hi, Luke,

Luke Lonergan wrote: >> Do you think that adding some posix_fadvise() calls to the backend to >> pre-fetch some blocks into the OS cache asynchronously could improve >> that situation? > > Nope - this requires true multi-threading of the I/O; there need to be > multiple seek operations running simultaneously. The current executor > blocks on each page request, waiting for the I/O to happen before requesting > the next page. The OS can't predict what random page is to be requested > next.

I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly meant for this purpose?

My idea was that the executor could posix_fadvise() the blocks it will need in the near future, and later, when it actually issues the blocking read, the block is there already. This could even give speedups in the single-spindle case, as the I/O scheduler could already fetch the next blocks while the executor processes the current one.

But there must be some details in the executor that prevent this.

> We can implement multiple scanners (already present in MPP), or we could > implement AIO and fire off a number of simultaneous I/O requests for > fulfillment.

AIO is much more intrusive to implement, so I'd prefer to first look at whether posix_fadvise() could improve the situation.

Thanks, Markus
IMHO, AIO is the architecturally cleaner and more elegant solution. We in fact have a project on the boards to do this but funding (as yet) has not been found. My $.02, Ron At 02:02 PM 9/20/2006, Markus Schaber wrote: >Hi, Luke, > >Luke Lonergan wrote: > > >> Do you think that adding some posix_fadvise() calls to the backend to > >> pre-fetch some blocks into the OS cache asynchroneously could improve > >> that situation? > > > > Nope - this requires true multi-threading of the I/O, there need to be > > multiple seek operations running simultaneously. The current executor > > blocks on each page request, waiting for the I/O to happen before > requesting > > the next page. The OS can't predict what random page is to be requested > > next. > >I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly >meant for this purpose? > >My idea was that the executor could posix_fadvise() the blocks it will >need in the near future, and later, when it actually issues the blocking >read, the block is there already. This could even give speedups in the >single-spindle case, as the I/O scheduler could already fetch the next >blocks while the executor processes the current one. > >But there must be some details in the executor that prevent this. > > > We can implement multiple scanners (already present in MPP), or we could > > implement AIO and fire off a number of simultaneous I/O requests for > > fulfillment. > >AIO is much more intrusive to implement, so I'd preferrably look >whether posix_fadvise() could improve the situation. > >Thanks, >Markus > >---------------------------(end of broadcast)--------------------------- >TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq
Markus, On 9/20/06 11:02 AM, "Markus Schaber" <schabi@logix-tt.com> wrote: > I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly > meant for this purpose? This is a good idea - I wasn't aware that this was possible. We'll do some testing and see if it works as advertised on Linux and Solaris. - Luke
Hi, Luke,

Luke Lonergan wrote: >> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly >> meant for this purpose? > > This is a good idea - I wasn't aware that this was possible.

This possibility was the reason for me to propose it. :-)

> We'll do some testing and see if it works as advertised on Linux and > Solaris.

Fine, I'm looking forward to the results. According to my small test, it works at least on Linux 2.6.17.4.

Btw, posix_fadvise() could even give a small improvement for multi-threaded backends, given that the I/O subsystem is smart enough to cope intelligently with large numbers of outstanding requests.

HTH, Markus
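For reference, here is a minimal sketch of the kind of small test being described (my own illustration, not Markus's actual test program, and the block numbers are made up): issue POSIX_FADV_WILLNEED for a handful of blocks you know you will need shortly, do something else, then perform the blocking reads; on kernels that honor the hint, the reads should mostly find the pages already in the OS cache.

/* Sketch: prefetch a batch of 8 KB blocks with posix_fadvise(POSIX_FADV_WILLNEED)
   before reading them.  build: cc -std=c99 fadv_willneed.c -o fadv_willneed */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(int argc, char **argv)
{
    /* hypothetical "blocks the scan will need next" */
    off_t next_blocks[] = { 100, 5003, 22, 78111, 4096 };
    int nblocks = sizeof(next_blocks) / sizeof(next_blocks[0]);
    char buf[BLCKSZ];

    if (argc < 2) { fprintf(stderr, "usage: %s datafile\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* 1. tell the kernel what we will need soon; this returns immediately */
    for (int i = 0; i < nblocks; i++)
        posix_fadvise(fd, next_blocks[i] * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);

    /* 2. ... process the current block here while the kernel reads ahead ... */

    /* 3. the blocking reads should now mostly hit the OS page cache */
    for (int i = 0; i < nblocks; i++)
        if (pread(fd, buf, BLCKSZ, next_blocks[i] * BLCKSZ) != BLCKSZ)
            perror("pread");

    close(fd);
    return 0;
}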
> > Do you think that adding some posix_fadvise() calls to the backend to > > pre-fetch some blocks into the OS cache asynchroneously could improve > > that situation? > > Nope - this requires true multi-threading of the I/O, there need to be > multiple seek operations running simultaneously. The current executor > blocks on each page request, waiting for the I/O to happen before > requesting > the next page. The OS can't predict what random page is to be requested > next. > > We can implement multiple scanners (already present in MPP), or we could > implement AIO and fire off a number of simultaneous I/O requests for > fulfillment. So this might be a dumb question, but the above statements apply to the cluster (e.g. postmaster) as a whole, not per postgres process/transaction correct? So each transaction is blocked waiting for the main postmaster to retrieve the data in the order it was requested (i.e. not multiple scanners/aio)? In this case, the only way to take full advantage of larger hardware using normal postgres would be to run multiple instances? (Which might not be a bad idea since it would set your application up to be able to deal with databases distributed on multiple servers...) - Bucky
> So this might be a dumb question, but the above statements apply to the > cluster (e.g. postmaster) as a whole, not per postgres > process/transaction correct? So each transaction is blocked waiting for > the main postmaster to retrieve the data in the order it was requested > (i.e. not multiple scanners/aio)? Each connection runs its own separate back-end process, so these statements apply per PG connection (=process).
Hi, Bucky,

Bucky Jordan wrote: >> We can implement multiple scanners (already present in MPP), or we > could >> implement AIO and fire off a number of simultaneous I/O requests for >> fulfillment. > > So this might be a dumb question, but the above statements apply to the > cluster (e.g. postmaster) as a whole, not per postgres > process/transaction correct? So each transaction is blocked waiting for > the main postmaster to retrieve the data in the order it was requested > (i.e. not multiple scanners/aio)?

No, that's a wrong assumption.

It applies per active backend. When connecting, the Postmaster forks a new backend process. Each backend process has its own scanner and executor. The main postmaster is only for coordination (forking, config reload etc.); all the work is done in the forked per-connection backends.

Furthermore, the PostgreSQL MVCC system ensures that readers are never blocked by, and never block, other backends. Writers can block each other due to the ACID transaction semantics, but MVCC limits that to a minimum.

> In this case, the only way to take full advantage of larger hardware > using normal postgres would be to run multiple instances? (Which might > not be a bad idea since it would set your application up to be able to > deal with databases distributed on multiple servers...)

Typical OLTP applications (Web UIs, booking systems, etc.) have multiple connections, and those run fully in parallel. So if your application is of this type, it will take full advantage of larger hardware. In the list archive, you should find some links to benchmarks that prove this statement; PostgreSQL scales linearly, up to 8 CPUs and 32 "hyperthreads" in those benchmarks.

Our discussion is about a different type of application, where you have a single application issuing a single query at a time dealing with a large amount (several gigs up to teras) of data.

Now, when such a query generates sequential disk access, the I/O scheduler of the underlying OS can easily recognize that pattern and prefetch the data, thus giving the full speed benefit of the underlying RAID.

The discussed problem arises when such large queries generate random (non-contiguous) disk access (e.g. index scans). Here, the underlying RAID cannot effectively prefetch data, as it does not know what the application will need next. This effectively limits the speed to that of a single disk, regardless of the details of the underlying RAID, as it can only process one request at a time and has to wait for the application for the next one.

Now, Bizgres MPP goes the way of having multiple threads per backend, each one processing a fraction of the data. So there are always several outstanding read requests that can be scheduled to the disks.

My proposal was to use posix_fadvise() in the single-threaded scanner, so it can tell the OS "I will need those blocks in the near future". That way, the OS can pre-fetch those blocks into the cache while PostgreSQL still processes the previous block of data.

Another proposal would be to use so-called asynchronous I/O. This is definitely an interesting and promising idea, but it needs many more changes to the code, compared to posix_fadvise().

I hope that this lengthy mail is enlightening; if not, don't hesitate to ask.

Thanks for your patience, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
Markus,

First, thanks - your email was very enlightening. But it does bring up a few additional questions, so thanks for your patience also - I've listed them below.

> It applies per active backend. When connecting, the Postmaster forks a > new backend process. Each backend process has its own scanner and > executor. The main postmaster is only for coordination (forking, config > reload etc.); all the work is done in the forked per-connection backends.

Each postgres process also uses shared memory (aka the buffer cache) so as to not fetch data that another process has already requested, correct?

> Our discussion is about a different type of application, where you > have a single application issuing a single query at a time dealing with > a large amount (several gigs up to teras) of data.

Commonly these are referred to as OLAP applications, correct? Which is where I believe my application is more focused (it may be handling some transactions in the future, but at the moment, it follows the "load lots of data, then analyze it" pattern).

> The discussed problem arises when such large queries generate random > (non-contiguous) disk access (e.g. index scans). Here, the underlying > RAID cannot effectively prefetch data, as it does not know what the > application will need next. This effectively limits the speed to that of > a single disk, regardless of the details of the underlying RAID, as it > can only process one request at a time and has to wait for the > application for the next one.

Does this have anything to do with postgres indexes not storing data, as some previous posts to this list have mentioned? (In other words, having the index in memory doesn't help? Or are we talking about indexes that are too large to fit in RAM?)

So this issue would be only on a per-query basis? Could it be alleviated somewhat if I ran multiple smaller queries? For example, say I want to calculate a summary table on 500m records - fire off 5 queries that count 100m records each and update the summary table, leaving MVCC to handle update contention?

Actually, now that I think about it - that would only work if the sections I mentioned above were on different disks, right? So I would actually have to do table partitioning with tablespaces on different spindles for that to be beneficial? (Which is basically not feasible with RAID, since I don't get to pick which disks the data goes on...) Are there any other workarounds for current postgres?

Thanks again,

Bucky
Hi, Bucky,

Bucky Jordan wrote: > Each postgres process also uses shared memory (aka the buffer cache) so > as to not fetch data that another process has already requested, > correct?

Yes. Additionally, the OS caches disk blocks. Most unixoid systems like Linux use (nearly) all unused memory for this block cache; I don't know about Windows.

> Commonly these are referred to as OLAP applications, correct? Which is > where I believe my application is more focused (it may be handling some > transactions in the future, but at the moment, it follows the "load lots > of data, then analyze it" pattern).

Yes, most OLAP apps fall into this category. But I also think that most OLAP apps mainly generate sequential data access (sequential scans), for which the OS prefetching of data works fine. Btw, some weeks ago, there was a patch mentioned here that improves the Linux kernel I/O scheduler with respect to those prefetching capabilities.

> Does this have anything to do with postgres indexes not storing data, as > some previous posts to this list have mentioned? (In other words, having > the index in memory doesn't help? Or are we talking about indexes that > are too large to fit in RAM?)

Yes, it does, but only for the cases where your query fetches only columns in that index. In cases where you fetch other columns, PostgreSQL has to access the heap nevertheless to fetch those. The overhead for checking outdated row versions (those that were updated or deleted, but not yet vacuumed) is zero, as those "load bulk, then analyze" applications typically don't create invalid rows, so every row fetched from the heap is valid. This is very different in OLTP applications.

> So this issue would be only on a per-query basis? Could it be alleviated > somewhat if I ran multiple smaller queries? For example, say I want to > calculate a summary table on 500m records - fire off 5 queries that count > 100m records each and update the summary table, leaving MVCC to handle > update contention?

Yes, you could do that, but only if you're CPU bound and have a multi-core machine. And you need table partitioning, as LIMIT/OFFSET is expensive. Btw, the Bizgres people do exactly this under their hood, so it may be worth a look.

If you're I/O bound, and your query is a full table scan or something else that results in (nearly) sequential disk access, the OS prefetch algorithm will work. You can use some I/O monitoring tools to compare the actual speed at which the data comes in when PostgreSQL does the sequential scan, and compare it to dd'ing the database table files. For simple aggregates like sum(), you usually get near the "raw" speed, and the real bottlenecks are the disk I/O rate, bad RAID implementations, or PCI bus contention.

> Actually, now that I think about it - that would only work if the > sections I mentioned above were on different disks, right? So I would > actually have to do table partitioning with tablespaces on different > spindles for that to be beneficial? (Which is basically not feasible > with RAID, since I don't get to pick which disks the data goes on...)

If you really need that much throughput, you can always put the different partitions on different RAIDs. But hardware gets very expensive in those dimensions, and it may be better to partition the data across different machines altogether. AFAIK, Bizgres MPP does exactly that.

> Are there any other workarounds for current postgres?

Are your questions of a theoretical nature, or do you have a concrete problem?
In the latter case, you could post your details here, and we'll see whether we can help. Btw, I'm not affiliated with Bizgres in any way. :-)

HTH, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
Bucky,

On 9/21/06 2:16 PM, "Bucky Jordan" <bjordan@lumeta.com> wrote: > Does this have anything to do with postgres indexes not storing data, as > some previous posts to this list have mentioned? (In other words, having > the index in memory doesn't help? Or are we talking about indexes that > are too large to fit in RAM?)

Yes, if the index could be scanned without needing to scan the heap to satisfy a query, that query would benefit from sequential access. This is true whether the index fits in RAM or not.

> So this issue would be only on a per-query basis? Could it be alleviated > somewhat if I ran multiple smaller queries? For example, say I want to > calculate a summary table on 500m records - fire off 5 queries that count > 100m records each and update the summary table, leaving MVCC to handle > update contention?

A clever, functional, and very painful way to do it, but yes, you would get 5 disks' worth of seeking. My goal is to provide for as many disks seeking at the same time as are available to the RAID. Note that the Sun Appliance (X4500 based) has 11 disk drives available per CPU core. Later that will drop to 5-6 disks per core with the introduction of quad-core CPUs, which is more the norm for now. Bizgres MPP will achieve one or two concurrent heap scanners per CPU for a given query in the default configurations, so we're missing out on lots of potential speedup for index scans in many cases. With both MPP and stock Postgres you get more seek rate as you add users, but it would take 44 users to use all of the drives in random seeking for Postgres, whereas for MPP it would take more like 5.

> Actually, now that I think about it - that would only work if the > sections I mentioned above were on different disks, right? So I would > actually have to do table partitioning with tablespaces on different > spindles for that to be beneficial? (Which is basically not feasible > with RAID, since I don't get to pick which disks the data goes on...)

On average, for random seeking we can assume that RAID will distribute the data evenly. The I/Os should balance out.

- Luke
> >> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly > >> meant for this purpose? > > > > This is a good idea - I wasn't aware that this was possible. > > This possibility was the reason for me to propose it. :-)

posix_fadvise() features in the TODO list already; I'm not sure if any work on it has been done for pg8.2.

Anyway, I understand that POSIX_FADV_DONTNEED on a Linux 2.6 kernel allows pages to be discarded from memory earlier than usual. This is useful, since it means you can prevent your seqscan from nuking the OS cache.

Of course you could argue the OS should be able to detect this and prevent it from occurring anyway. I don't know anything about Linux's behaviour in this area.

.Guy
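To make that concrete, here is a minimal sketch (my own illustration, not anything from the TODO item) of dropping pages behind a large sequential read so the scan doesn't evict more useful cached data - with the caveat, discussed below, that this only makes sense if nobody else wants those pages:

/* Sketch: stream through a big file and drop the just-read pages from the
   OS cache with POSIX_FADV_DONTNEED as we go.
   build: cc -std=c99 fadv_dontneed.c -o fadv_dontneed */
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (1 << 20)        /* advise in 1 MB steps */

static char buf[CHUNK];

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s bigfile\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t done = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0) {
        /* ... process the chunk here ... */

        /* these pages will not be touched again: let the kernel reclaim
           them instead of evicting somebody else's hot pages */
        posix_fadvise(fd, done, n, POSIX_FADV_DONTNEED);
        done += n;
    }
    close(fd);
    return 0;
}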
Guy Thornley wrote: > > >> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly > > >> meant for this purpose? > > > > > > This is a good idea - I wasn't aware that this was possible. > > > > This possibility was the reason for me to propose it. :-) > > posix_fadvise() features in the TODO list already; I'm not sure if any work > on it has been done for pg8.2. > > Anyway, I understand that POSIX_FADV_DONTNEED on a linux 2.6 kernel allows > pages to be discarded from memory earlier than usual. This is useful, since > it means you can prevent your seqscan from nuking the OS cache. > > Of course you could argue the OS should be able to detect this, and prevent > it occuring anyway. I don't know anything about linux's behaviour in this > area. We tried posix_fadvise() during the 8.2 development cycle, but had problems as outlined in a comment in xlog.c: /* * posix_fadvise is problematic on many platforms: on older x86 Linux * it just dumps core, and there are reports of problems on PPC platforms * as well. The following is therefore disabled for the time being. * We could consider some kind of configure test to see if it's safe to * use, but since we lack hard evidence that there's any useful performance * gain to be had, spending time on that seems unprofitable for now. */ -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, Sep 22, 2006 at 02:52:09PM +1200, Guy Thornley wrote: > > >> I thought that posix_fadvise() with POSIX_FADV_WILLNEED was exactly > > >> meant for this purpose? > > > This is a good idea - I wasn't aware that this was possible. > > This possibility was the reason for me to propose it. :-) > posix_fadvise() features in the TODO list already; I'm not sure if any work > on it has been done for pg8.2. > > Anyway, I understand that POSIX_FADV_DONTNEED on a linux 2.6 kernel allows > pages to be discarded from memory earlier than usual. This is useful, since > it means you can prevent your seqscan from nuking the OS cache. > > Of course you could argue the OS should be able to detect this, and prevent > it occuring anyway. I don't know anything about linux's behaviour in this > area. I recall either monitoring or participating in the discussion when this call was added to Linux. I don't believe the kernel can auto-detect that you do not need a page any longer. It can only prioritize pages to keep when memory is fully in use and a new page must be loaded. This is often some sort of LRU scheme. If the page is truly useless, only the application can know. I'm not convinced that PostgreSQL can know this. The case where it is useful is if a single process is sequentially scanning a large file (much larger than memory). As soon as it is more than one process, or if it is not a sequential scan, or if it is not a large file, this call hurts more than it gains. Just because I'm done with the page does not mean that *you* are done with the page. I'd advise against using this call unless it can be shown that the page will not be used in the future, or at least, that the page is less useful than all other pages currently in memory. This is what the call really means. It means, "There is no value to keeping this page in memory". Perhaps certain PostgreSQL loads fit this pattern. None of my uses fit this pattern, and I have trouble believing that a majority of PostgreSQL loads fits this pattern. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Mark,

On 9/21/06 8:40 PM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote:

> I'd advise against using this call unless it can be shown that the page
> will not be used in the future, or at least, that the page is less useful
> than all other pages currently in memory. This is what the call really means.
> It means, "There is no value to keeping this page in memory".

Yes, it's a bit subtle.

I think the topic is similar to "cache bypass", used in cache capable vector
processors (Cray, Convex, Multiflow, etc) in the 90's. When you are
scanning through something larger than the cache, it should be marked
"non-cacheable" and bypass caching altogether. This avoids a copy, and
keeps the cache available for things that can benefit from it.

WRT the PG buffer cache, the rule would have to be: "if the heap scan is
going to be larger than "effective_cache_size", then issue the
posix_fadvise(BLOCK_NOT_NEEDED) call". It doesn't sound very efficient to
do this in block/extent increments though, and it would possibly mess with
subsets of the block space that would be re-used for other queries.

- Luke
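[A sketch of the decision rule described above, again not an actual patch: the names maybe_discard_after_scan, scan_size_blocks, and cache_size_blocks are invented for illustration. The only point is the gate itself: discard behind the scan only when the relation being scanned is too big to fit in the cache anyway.]

    #define _XOPEN_SOURCE 600
    #include <sys/types.h>
    #include <fcntl.h>

    /*
     * Hypothetical helper: after reading one range of a heap scan, drop it
     * from the OS cache only if the whole scan is larger than what the
     * cache could hold (the effective_cache_size idea above).
     */
    static void
    maybe_discard_after_scan(int fd, off_t start, off_t len,
                             double scan_size_blocks, double cache_size_blocks)
    {
        if (scan_size_blocks > cache_size_blocks)
        {
            /* Caching this scan would only evict hotter pages. */
            (void) posix_fadvise(fd, start, len, POSIX_FADV_DONTNEED);
        }
        /* else: let the OS cache it; a repeat scan, or another backend, may benefit. */
    }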
Hi, Guy,

Guy Thornley wrote:
> Of course you could argue the OS should be able to detect this, and prevent
> it occurring anyway. I don't know anything about linux's behaviour in this
> area.

Yes, one can argue that way. But a generic algorithm in the OS can never be
as smart as the application, which has more information about semantics and
algorithms. Everything else would need a crystal ball device. :-)

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org
On Thu, Sep 21, 2006 at 08:46:41PM -0700, Luke Lonergan wrote:
> Mark,
>
> On 9/21/06 8:40 PM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote:
>
> > I'd advise against using this call unless it can be shown that the page
> > will not be used in the future, or at least, that the page is less useful
> > than all other pages currently in memory. This is what the call really means.
> > It means, "There is no value to keeping this page in memory".
>
> Yes, it's a bit subtle.
>
> I think the topic is similar to "cache bypass", used in cache capable vector
> processors (Cray, Convex, Multiflow, etc) in the 90's. When you are
> scanning through something larger than the cache, it should be marked
> "non-cacheable" and bypass caching altogether. This avoids a copy, and
> keeps the cache available for things that can benefit from it.
>
> WRT the PG buffer cache, the rule would have to be: "if the heap scan is
> going to be larger than "effective_cache_size", then issue the
> posix_fadvise(BLOCK_NOT_NEEDED) call". It doesn't sound very efficient to
> do this in block/extent increments though, and it would possibly mess with
> subsets of the block space that would be re-used for other queries.

Another issue is that if you start two large seqscans on the same table at
about the same time, right now you should only be issuing one set of reads
for both requests, because one of them will just pull the blocks back out
of cache. If we weren't caching, then each query would have to physically
read (which would be horrid).

There's been talk of adding code that would have a seqscan detect if
another seqscan is happening on the table at the same time, and if it is,
to start its seqscan wherever the other seqscan is currently running. That
would probably ensure that we weren't reading from the table in 2 different
places, even if we weren't caching.

--
Jim Nasby                               jim@nasby.net
EnterpriseDB    http://enterprisedb.com    512.569.9461 (cell)
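[A toy sketch of that "start where the other scan already is" idea; it is not the actual implementation, and ScanHint, lookup_scan_hint(), and choose_start_block() are invented names. The idea, as described above, is that each running seqscan would publish the block it has reached, and a scan that starts later on the same table would begin there and wrap around, so both scans stay close together and each block is read from disk only once.]

    /* Sketch only: choosing a start block for a new seqscan (invented API). */
    #include <stddef.h>

    typedef struct ScanHint
    {
        unsigned int rel_id;        /* table being scanned */
        long         cur_block;     /* block the active scan has reached, or -1 */
    } ScanHint;

    /* Stand-in for a shared-memory hint table; real code would keep one slot
     * per recently scanned relation, updated by the scan as it advances. */
    static ScanHint scan_hints[1] = {{0, -1L}};

    static ScanHint *
    lookup_scan_hint(unsigned int rel_id)
    {
        return (scan_hints[0].rel_id == rel_id) ? &scan_hints[0] : NULL;
    }

    /* A later scan on the same table begins where the running scan already
     * is (wrapping around at nblocks), so the blocks are fetched once. */
    static long
    choose_start_block(unsigned int rel_id, long nblocks)
    {
        ScanHint *hint = lookup_scan_hint(rel_id);

        if (hint != NULL && hint->cur_block >= 0 && hint->cur_block < nblocks)
            return hint->cur_block;     /* piggy-back on the scan in progress */

        return 0;                       /* no concurrent scan: start at block 0 */
    }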
On Thu, Sep 21, 2006 at 11:05:39PM -0400, Bruce Momjian wrote:
> We tried posix_fadvise() during the 8.2 development cycle, but had
> problems as outlined in a comment in xlog.c:
>
>     /*
>      * posix_fadvise is problematic on many platforms: on older x86 Linux
>      * it just dumps core, and there are reports of problems on PPC platforms
>      * as well. The following is therefore disabled for the time being.
>      * We could consider some kind of configure test to see if it's safe to
>      * use, but since we lack hard evidence that there's any useful performance
>      * gain to be had, spending time on that seems unprofitable for now.
>      */

In case it's not clear, that's a call for someone to do some performance
testing. :)

Bruce, you happen to have a URL for a patch to put fadvise in?

--
Jim Nasby                               jim@nasby.net
EnterpriseDB    http://enterprisedb.com    512.569.9461 (cell)
Luke Lonergan wrote:
>
> I think the topic is similar to "cache bypass", used in cache capable vector
> processors (Cray, Convex, Multiflow, etc) in the 90's. When you are
> scanning through something larger than the cache, it should be marked
> "non-cacheable" and bypass caching altogether. This avoids a copy, and
> keeps the cache available for things that can benefit from it.

And 'course some file systems do this automatically when they detect a
sequential scan[1] though it can have unexpected (to some) negative side
effects[2]. For file systems that support freebehind as a configurable
parameter, it might be easier to experiment with the idea there.

[1] http://www.ediaudit.com/doc_sol10/Solaris_10_Doc/common/SUNWaadm/reloc/sun_docs/C/solaris_10/SUNWaadm/SOLTUNEPARAMREF/p18.html
[2] http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6207772
Jim,

On 9/22/06 7:01 AM, "Jim C. Nasby" <jim@nasby.net> wrote:

> There's been talk of adding code that would have a seqscan detect if
> another seqscan is happening on the table at the same time, and if it
> is, to start its seqscan wherever the other seqscan is currently
> running. That would probably ensure that we weren't reading from the
> table in 2 different places, even if we weren't caching.

Right, aka "SyncScan".

The optimization you point out that we miss when bypassing the cache is a
pretty unlikely event in real-world apps, though it makes poorly designed
benchmarks go really fast. It's much more likely that the second seqscan
will start after the block cache is exhausted, which will cause actuator
thrashing (depending on the readahead that the OS uses). SyncScan fixes
that.

- Luke