Thread: Huge Data sets, simple queries
Does anyone have any experience with extremely large data sets? I mean hundreds of millions of rows.

The queries I need to run on my 200 million transactions are relatively simple:

   select month, count(distinct(cardnum)), count(*), sum(amount) from
   transactions group by month;

This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3 (2.4 kernel) with RAID-10 (15K drives) and 12 GB RAM. I was expecting it to take about 4 hours - based on some experience with a similar dataset on a different machine (RH9, PG 7.3, Dual Xeon, 4GB RAM, RAID-5 10K drives).

This machine is COMPLETELY devoted to running these relatively simple queries one at a time. (No multi-user support needed!) I've been tooling with the various performance settings: effective_cache at 5GB, shared_buffers at 2 GB, workmem and sortmem at 1 GB each. (Shared buffers puzzles me a bit - my instinct says to set it as high as possible, but everything I read says that "too high" can hurt performance.)

Any ideas for performance tweaking in this kind of application would be greatly appreciated. We've got indexes on the fields being grouped, and always vacuum analyze after building them. It's difficult to just "try" various ideas because each attempt takes a full day to test. Real experience is needed here!

Thanks much,

Mike
On Fri, 2006-01-27 at 20:23 -0500, Mike Biamonte wrote:
>
> Does anyone have any experience with extremely large data sets?
> I mean hundreds of millions of rows.

Sure, I think more than a few of us do. Just today I built a summary table from a 25GB primary table with ~430 million rows. This took about 45 minutes.

> The queries I need to run on my 200 million transactions are relatively
> simple:
>
> select month, count(distinct(cardnum)), count(*), sum(amount) from
> transactions group by month;
>
> This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3, (2.4 Kernel) with
> RAID-10 (15K drives) and 12 GB Ram. I was expecting it to take about
> 4 hours - based on some experience with a similar dataset on a different
> machine (RH9, PG7.3 Dual Xeon, 4GB RAM, Raid-5 10K drives)

Possibly the latter machine has a faster I/O subsystem. How large is the table on disk?

> This machine is COMPLETELY devoted to running these relatively simple
> queries one at a time. (No multi-user support needed!) I've been tooling
> with the various performance settings: effective_cache at 5GB,
> shared_buffers at 2 GB, workmem, sortmem at 1 GB each.
> (Shared buffers puzzles me a bit - my instinct says to set it as high as
> possible, but everything I read says that "too high" can hurt performance.)
>
> Any ideas for performance tweaking in this kind of application would be
> greatly appreciated. We've got indexes on the fields being grouped,
> and always vacuum analyze after building them.

Probably vacuum makes no difference.

> It's difficult to just "try" various ideas because each attempt takes a
> full day to test. Real experience is needed here!

Can you send us an EXPLAIN of the query? I believe what you're seeing here is probably:

Aggregate
  +-Sort
    +-Sequential Scan

or perhaps:

Aggregate
  +-Index Scan

I have a feeling that the latter will be much faster. If your table has been created over time, then it is probably naturally ordered by date, and therefore also ordered by month. You might expect a Sequential Scan to be the fastest, but the Sort step will be a killer. On the other hand, if your table is badly disordered by date, the Index Scan could also be very slow.

Anyway, send us the query plan and also perhaps a sample of vmstat during the query.

For what it's worth, I have:

effective_cache_size | 700000
cpu_tuple_cost       | 0.01
cpu_index_tuple_cost | 0.001
random_page_cost     | 3
shared_buffers       | 50000
temp_buffers         | 1000
work_mem             | 1048576  <= for this query only

And here's a few lines from vmstat during the query:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache  si  so    bi   bo    in   cs us sy id wa
 2  1     76  43476  94916 7655148   0   0 78800    0  1662  788 68 12  0 20
 1  1     76  45060  91196 7658088   0   0 78028    0  1639  712 71 11  0 19
 2  0     76  44668  87624 7662960   0   0 78924   52  1650  736 69 12  0 19
 2  0     76  45300  83672 7667432   0   0 83536   16  1688  768 71 12  0 18
 1  1     76  45744  80652 7670712   0   0 84052    0  1691  796 70 12  0 17

That's about 80MB/sec sequential input, for comparison purposes.

-jwb
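To answer the "how large is the table on disk" question without guessing, 8.1 should (if memory serves) ship with the built-in size functions; the sketch below assumes they exist on your exact release, so double-check the docs - on older versions contrib/dbsize gave roughly the same information:

   -- heap alone, then heap plus indexes and toast
   select pg_size_pretty(pg_relation_size('transactions')),
          pg_size_pretty(pg_total_relation_size('transactions'));

Knowing the on-disk size lets you sanity-check the 18-hour runtime against a sequential-scan rate like the ~80MB/sec shown above.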
"Mike Biamonte" <mike@dbeat.com> writes: > The queries I need to run on my 200 million transactions are relatively > simple: > select month, count(distinct(cardnum)) count(*), sum(amount) from > transactions group by month; count(distinct) is not "relatively simple", and the current implementation isn't especially efficient. Can you avoid that construct? Assuming that "month" means what it sounds like, the above would result in running twelve parallel sort/uniq operations, one for each month grouping, to eliminate duplicates before counting. You've got sortmem set high enough to blow out RAM in that scenario ... regards, tom lane
On Sat, 2006-01-28 at 10:55 -0500, Tom Lane wrote:
>
> Assuming that "month" means what it sounds like, the above would result
> in running twelve parallel sort/uniq operations, one for each month
> grouping, to eliminate duplicates before counting. You've got sortmem
> set high enough to blow out RAM in that scenario ...

Hrmm, why is it that with a similar query I get a far simpler plan than you describe, and relatively snappy runtime?

select date
     , count(1) as nads
     , sum(case when premium then 1 else 0 end) as npremium
     , count(distinct(keyword)) as nwords
     , count(distinct(advertiser)) as nadvertisers
  from data
 group by date
 order by date asc

                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..14452743.09 rows=721 width=13)
   ->  Index Scan using data_date_idx on data  (cost=0.00..9075144.27 rows=430206752 width=13)
(2 rows)

=# show server_version;
 server_version
----------------
 8.1.2
(1 row)

-jwb
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > On Sat, 2006-01-28 at 10:55 -0500, Tom Lane wrote: >> Assuming that "month" means what it sounds like, the above would result >> in running twelve parallel sort/uniq operations, one for each month >> grouping, to eliminate duplicates before counting. You've got sortmem >> set high enough to blow out RAM in that scenario ... > Hrmm, why is it that with a similar query I get a far simpler plan than > you describe, and relatively snappy runtime? You can't see the sort operations in the plan, because they're invoked implicitly by the GroupAggregate node. But they're there. Also, a plan involving GroupAggregate is going to run the "distinct" sorts sequentially, because it's dealing with only one grouping value at a time. In the original case, the planner probably realizes there are only 12 groups and therefore prefers a HashAggregate, which will try to run all the sorts in parallel. Your "group by date" isn't a good approximation of the original conditions because there will be a lot more groups. (We might need to tweak the planner to discourage selecting HashAggregate in the presence of DISTINCT aggregates --- I don't remember whether it accounts for the sortmem usage in deciding whether the hash will fit in memory or not ...) regards, tom lane
Sounds like you are running into the limits of your disk subsystem. You are scanning all of the data in the transactions table, so you will be limited by the disk bandwidth you have – and using RAID-10, you should divide the number of disk drives by 2 and multiply by their individual bandwidth (around 60MB/s) and that’s what you can expect in terms of performance. So, if you have 8 drives, you should expect to get 4 x 60 MB/s = 240 MB/s in bandwidth. That means that if you are dealing with 24,000 MB of data in the “transactions” table, then you will scan it in 100 seconds.
With a workload like this, you are in the realm of business intelligence / data warehousing I think. You should check your disk performance, I would expect you’ll find it lacking, partly because you are running RAID10, but mostly because I expect you are using a hardware RAID adapter.
- Luke
On 1/27/06 5:23 PM, "Mike Biamonte" <mike@dbeat.com> wrote:
Does anyone have any experience with extremely large data sets?
I mean hundreds of millions of rows.
The queries I need to run on my 200 million transactions are relatively
simple:
select month, count(distinct(cardnum)), count(*), sum(amount) from
transactions group by month;
This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3, (2.4 Kernel) with
RAID-10 (15K drives)
and 12 GB Ram. I was expecting it to take about 4 hours - based on some
experience with a
similar dataset on a different machine (RH9, PG7.3 Dual Xeon, 4GB RAM,
Raid-5 10K drives)
This machine is COMPLETELY devoted to running these relatively simple
queries one at a
time. (No multi-user support needed!) I've been tooling with the various
performance settings:
effective_cache at 5GB, shared_buffers at 2 GB, workmem, sortmem at 1 GB
each.
( Shared buffers puzzles me a bit - my instinct says to set it as high as
possible,
but everything I read says that "too high" can hurt performance.)
Any ideas for performance tweaking in this kind of application would be
greatly appreciated.
We've got indexes on the fields being grouped, and always vacuum analyze
after building them.
It's difficult to just "try" various ideas because each attempt takes a
full day to test. Real
experience is needed here!
Thanks much,
Mike
I wrote:
> (We might need to tweak the planner to discourage selecting
> HashAggregate in the presence of DISTINCT aggregates --- I don't
> remember whether it accounts for the sortmem usage in deciding
> whether the hash will fit in memory or not ...)

Ah, I take that all back after checking the code: we don't use HashAggregate at all when there are DISTINCT aggregates, precisely because of this memory-blow-out problem.

For both your group-by-date query and the original group-by-month query, the plan of attack is going to be to read the original input in grouping order (either via sort or indexscan, with sorting probably preferred unless the table is pretty well correlated with the index) and then sort/uniq on the DISTINCT value within each group. The OP is probably losing on that step compared to your test because it's over much larger groups than yours, forcing some spill to disk. And most likely he's not got an index on month, so the first sort is in fact a sort and not an indexscan.

Bottom line is that he's probably doing a ton of on-disk sorting where you're not doing any. This makes me think Luke's theory about inadequate disk horsepower may be on the money.

			regards, tom lane
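Two concrete experiments fall out of this analysis. Both are only sketches, with a made-up index name, not tested against an 8.1 planner: first, give the planner an index on month so it at least has the option of reading the table in grouping order (Tom notes it will only prefer that if the table is well correlated with the index); second, get the distinct-card count from an ordinary two-level GROUP BY instead of a DISTINCT aggregate, since the inner grouping collapses each (month, cardnum) pair to one row and the outer count(*) then counts distinct cards. Plain EXPLAIN shows which plan you would get without spending another 18 hours:

   -- experiment 1: hypothetical index so grouping can come from an indexscan
   create index transactions_month_idx on transactions (month);
   analyze transactions;

   -- experiment 2: same answers without count(distinct ...)
   explain
   select month,
          count(*)      as distinct_cards,   -- one inner row per (month, cardnum)
          sum(tx_count) as transactions,
          sum(tx_sum)   as amount
     from (select month, cardnum,
                  count(*)    as tx_count,
                  sum(amount) as tx_sum
             from transactions
            group by month, cardnum) as per_card
    group by month;

On 8.1, work_mem is an integer number of kilobytes, so something like SET work_mem = 262144 (256MB) per session is the way to give the sort steps more memory without dedicating it permanently.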
On 1/28/06, Luke Lonergan <llonergan@greenplum.com> wrote: > You should check your disk performance, I would > expect you'll find it lacking, partly because you are running RAID10, but > mostly because I expect you are using a hardware RAID adapter. hmm .. do i understand correctly that you're suggesting that using raid 10 and/or hardware raid adapter might hurt disc subsystem performance? could you elaborate on the reasons, please? it's not that i'm against the idea - i'm just curious as this is very "against-common-sense". and i always found it interesting when somebody states something that uncommon... best regards depesz
On Sun, Jan 29, 2006 at 12:25:23PM +0100, hubert depesz lubaczewski wrote: >hmm .. do i understand correctly that you're suggesting that using >raid 10 and/or hardware raid adapter might hurt disc subsystem >performance? could you elaborate on the reasons, please? I think it's been fairly well beaten to death that the low-end hardware raid adapters have lousy performance. It's not until you get into the range of battery-backed disk caches with 512M+ and multiple I/O channels that hardware raid becomes competitive with software raid. Mike Stone
Depesz,

> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> hubert depesz lubaczewski
> Sent: Sunday, January 29, 2006 3:25 AM
>
> hmm .. do i understand correctly that you're suggesting that
> using raid 10 and/or hardware raid adapter might hurt disc
> subsystem performance? could you elaborate on the reasons,
> please? it's not that i'm against the idea - i'm just curious
> as this is very "against-common-sense". and i always found it
> interesting when somebody states something that uncommon...

See previous postings on this list - often when someone is reporting a performance problem with large data, the answer comes back that their I/O setup is not performing well. Most times, people are trusting that when they buy a hardware RAID adapter and set it up, that the performance will be what they expect and what is theoretically correct for the number of disk drives.

In fact, in our testing of various host-based SCSI RAID adapters (LSI, Dell PERC, Adaptec, HP SmartArray), we find that *all* of them underperform, most of them severely. Some produce results slower than a single disk drive. We've found that some external SCSI RAID adapters, those built into the disk chassis, often perform better. I think this might be due to the better drivers and perhaps a different marketplace for the higher end solutions driving performance validation.

The important lesson we've learned is to always test the I/O subsystem performance - you can do so with a simple test like:

time bash -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
time dd if=bigfile of=/dev/null bs=8k

If the answer isn't something close to the theoretical rate, you are likely limited by your RAID setup. You might be shocked to find a severe performance problem. If either is true, switching to software RAID using a simple SCSI adapter will fix the problem.

BTW - we've had very good experiences with the host-based SATA adapters from 3Ware. The Areca controllers are also respected.

Oh - and about RAID 10 - for large data work it's more often a waste of disk performance-wise compared to RAID 5 these days. RAID5 will almost double the performance on a reasonable number of drives.

- Luke
On Sun, 2006-01-29 at 13:44 -0500, Luke Lonergan wrote: > Depesz, > > > [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of > > hubert depesz lubaczewski > > Sent: Sunday, January 29, 2006 3:25 AM > > > > hmm .. do i understand correctly that you're suggesting that > > using raid 10 and/or hardware raid adapter might hurt disc > > subsystem performance? could you elaborate on the reasons, > > please? it's not that i'm against the idea - i'm just curious > > as this is very "against-common-sense". and i always found it > > interesting when somebody states something that uncommon... > Oh - and about RAID 10 - for large data work it's more often a waste of > disk performance-wise compared to RAID 5 these days. RAID5 will almost > double the performance on a reasonable number of drives. I think you might want to be more specific here. I would agree with you for data warehousing, decision support, data mining, and similar read-mostly non-transactional loads. For transactional loads RAID-5 is, generally speaking, a disaster due to the read-before-write problem. While we're on the topic, I just installed another one of those Areca ARC-1130 controllers with 1GB cache. It's ludicrously fast: 250MB/sec burst writes, CPU-limited reads. I can't recommend them highly enough. -jwb PS: Could you look into fixing your mailer? Your messages sometimes don't contain In-Reply-To headers, and therefore don't thread properly.
On Fri, Jan 27, 2006 at 08:23:55PM -0500, Mike Biamonte wrote: > This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3, (2.4 > Kernel) with RAID-10 (15K drives) and 12 GB Ram. I was expecting it > to take about 4 hours - based on some experience with a similar > dataset on a different machine (RH9, PG7.3 Dual Xeon, 4GB RAM, > Raid-5 10K drives) > > It's difficult to just "try" various ideas because each attempt > takes a full day to test. Real experience is needed here! It seems like you are changing multiple variables at the same time. I think you need to first compare the query plans with EXPLAIN SELECT to see if they are significantly different. Your upgrade from 7.3 to 8.1 may have resulted in a less optimal plan. Second, you should monitor your IO performance during the query execution and test it independent of postgres. Then compare the stats between the two systems. As a side note, if you have many disks and you are concerned about bottlenecks on read operations, RAID 5 may perform better than RAID 10. -Mike
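One more way to shorten the test cycle this advice implies: iterate against a random sample of the table first, since gross plan differences (sort versus indexscan, disk spill versus in-memory) usually show up there too. A rough sketch with a made-up table name; the sample's plan will not match the full table's exactly, so treat it only as a guide to which full-size runs are worth a day:

   -- roughly a 1% random sample (~2 million rows)
   create table transactions_sample as
     select * from transactions where random() < 0.01;
   analyze transactions_sample;

   explain analyze
   select month, count(distinct(cardnum)), count(*), sum(amount)
     from transactions_sample
    group by month;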
Mike Biamonte wrote:
> Does anyone have any experience with extremely large data sets?
> I mean hundreds of millions of rows.
>
> The queries I need to run on my 200 million transactions are relatively
> simple:
>
> select month, count(distinct(cardnum)), count(*), sum(amount) from
> transactions group by month;

This may be heretical to post to a relational-database group, but sometimes a problem can be better solved OUTSIDE of the relational system.

I had a similar problem recently: I have a set of about 100,000 distinct values, each of which occurs one to several million times in the database, with an aggregate total of several hundred million occurrences in the database.

Sorting this into distinct lists ("Which rows contain this value?") proved quite time consuming (just like your case), but on reflection, I realized that it was dumb to expect a general-purpose sorting algorithm to sort a list about which I had specialized knowledge. General-purpose sorting usually takes O(N*log(N)), but if you have a small number of distinct values, you can use "bucket sorting" and sort in O(N) time, a huge improvement. In my case, it was even more specialized -- there was a very small number of the lists that contained thousands or millions of items, but about 95% of the lists only had a few items.

Armed with this knowledge, it took me a couple of weeks to write a highly-specialized sorting system that used a combination of Postgres, in-memory and disk caching, and algorithms dredged up from Knuth. The final result ran in about four hours.

The thing to remember about relational databases is that the designers are constrained by the need for generality, reliability and SQL standards. Given any particular well-defined task where you have specialized knowledge about the data, and/or you don't care about transactional correctness, and/or you're not concerned about data loss, a good programmer can always write a faster solution.

Of course, there's a huge penalty. You lose support, lose generality, the application takes on complexity that should be in the database, and on and on. A hand-crafted solution should be avoided unless there's simply no other way.

A relational database is a tool. Although powerful, like any tool it has limitations. Use the tool where it's useful, and use other tools when necessary.

Craig
On Sun, 29 Jan 2006, Luke Lonergan wrote:

> In fact, in our testing of various host-based SCSI RAID adapters (LSI,
> Dell PERC, Adaptec, HP SmartArray), we find that *all* of them
> underperform, most of them severely.
[snip]
> The important lesson we've learned is to always test the I/O subsystem
> performance - you can do so with a simple test like:
> time bash -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
> time dd if=bigfile of=/dev/null bs=8k

I'm curious about this since we're shopping around for something new... I do want to get some kind of baseline to compare new products to. Areca sent me stats on their SCSI->SATA controller and it looks like it maxes out around 10,000 IOPS.

I'd like to see how our existing stuff compares to this. I'd especially like to see it in graph form such as the docs Areca sent (IOPS on one axis, block size on the other, etc.). Looking at the venerable Bonnie, it doesn't really seem to focus so much on the number of read/write operations per second, but on big bulky transfers.

What are you folks using to measure your arrays?

I've been considering using some of our data and just basically benchmarking postgres on various hardware with that, but I cannot compare that to any manufacturer tests.

Sorry to meander a bit off topic, but I've been getting frustrated with this little endeavour...

Thanks,

Charles

> - Luke
Charles, On 1/29/06 9:35 PM, "Charles Sprickman" <spork@bway.net> wrote: > What are you folks using to measure your arrays? Bonnie++ measures random I/Os, numbers we find are typically in the 500/s range, the best I've seen is 1500/s on a large Fibre Channel RAID0 (at http://www.wlug.org.nz/HarddiskBenchmarks). - Luke
On 1/29/06, Luke Lonergan <LLonergan@greenplum.com> wrote: > Oh - and about RAID 10 - for large data work it's more often a waste of > disk performance-wise compared to RAID 5 these days. RAID5 will almost > double the performance on a reasonable number of drives. how many is reasonable? depesz
Depesz,

On 1/30/06 9:53 AM, "hubert depesz lubaczewski" <depesz@gmail.com> wrote:

>> double the performance on a reasonable number of drives.
>
> how many is reasonable?

What I mean by that is: given a set of disks N, the read performance of RAID will be equal to the drive read rate A times the number of drives used for reading by the RAID algorithm. In the case of RAID5, that number is (N-1), so the read rate is A x (N-1). In the case of RAID10, that number is N/2, so the read rate is A x (N/2).

So, the ratio of read performance RAID5/RAID10 is (N-1)/(N/2) = 2 x (N-1)/N. For numbers of drives, this ratio looks like this:

 N   RAID5/RAID10
 3   1.33
 6   1.67
 8   1.75
14   1.86

So - I think reasonable would be 6-8, which are common disk configurations.

- Luke
On Fri, Jan 27, 2006 at 07:05:04PM -0800, Luke Lonergan wrote:
> Sounds like you are running into the limits of your disk subsystem. You are
> scanning all of the data in the transactions table, so you will be limited
> by the disk bandwidth you have - and using RAID-10, you should divide the
> number of disk drives by 2 and multiply by their individual bandwidth
> (around 60MB/s) and that's what you can expect in terms of performance. So,
> if you have 8 drives, you should expect to get 4 x 60 MB/s = 240 MB/s in
> bandwidth. That means that if you are dealing with 24,000 MB of data in the
> "transactions" table, then you will scan it in 100 seconds.

Why divide by 2? A good raid controller should be able to send read requests to both drives out of the mirrored set to fully utilize the bandwidth. Of course, that probably won't come into play unless the OS decides that it's going to read-ahead fairly large chunks of the table at a time...

Also, some vmstat output would certainly help clarify where the bottleneck is...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Jim, On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote: > Why divide by 2? A good raid controller should be able to send read > requests to both drives out of the mirrored set to fully utilize the > bandwidth. Of course, that probably won't come into play unless the OS > decides that it's going to read-ahead fairly large chunks of the table > at a time... I've not seen one that does, nor would it work in the general case IMO. In RAID1 writes are duplicated and reads come from one of the copies. You could alternate read service requests to minimize rotational latency, but you can't improve bandwidth. - Luke
Luke Lonergan wrote:
> Jim,
>
> On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>
>> Why divide by 2? A good raid controller should be able to send read
>> requests to both drives out of the mirrored set to fully utilize the
>> bandwidth. Of course, that probably won't come into play unless the OS
>> decides that it's going to read-ahead fairly large chunks of the table
>> at a time...
>
> I've not seen one that does, nor would it work in the general case IMO. In
> RAID1 writes are duplicated and reads come from one of the copies. You
> could alternate read service requests to minimize rotational latency, but
> you can't improve bandwidth.
>
> - Luke

For Solaris's software raid, the default settings for raid-1 sets are round-robin read, parallel write. I assumed this meant it would give similar read performance to raid-0, but I've never benchmarked it.

-Kevin
On Tue, Jan 31, 2006 at 09:00:30AM -0800, Luke Lonergan wrote: > Jim, > > On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote: > > > Why divide by 2? A good raid controller should be able to send read > > requests to both drives out of the mirrored set to fully utilize the > > bandwidth. Of course, that probably won't come into play unless the OS > > decides that it's going to read-ahead fairly large chunks of the table > > at a time... > > I've not seen one that does, nor would it work in the general case IMO. In > RAID1 writes are duplicated and reads come from one of the copies. You > could alternate read service requests to minimize rotational latency, but > you can't improve bandwidth. (BTW, I did some testing that seems to confirm this) Why couldn't you double the bandwidth? If you're doing a largish read you should be able to do something like have drive a read the first track, drive b the second, etc. Of course that means that the controller or OS would have to be able to stitch things back together. As for software raid, I'm wondering how well that works if you can't use a BBU to allow write caching/re-ordering... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, 2006-01-31 at 09:00 -0800, Luke Lonergan wrote: > Jim, > > On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote: > > > Why divide by 2? A good raid controller should be able to send read > > requests to both drives out of the mirrored set to fully utilize the > > bandwidth. Of course, that probably won't come into play unless the OS > > decides that it's going to read-ahead fairly large chunks of the table > > at a time... > > I've not seen one that does, nor would it work in the general case IMO. In > RAID1 writes are duplicated and reads come from one of the copies. You > could alternate read service requests to minimize rotational latency, but > you can't improve bandwidth. Then you've not seen Linux. Linux does balanced reads on software mirrors. I'm not sure why you think this can't improve bandwidth. It does improve streaming bandwidth as long as the platter STR is more than the bus STR. -jwb
Jeffrey, On 1/31/06 12:03 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote: > Then you've not seen Linux. :-D > Linux does balanced reads on software > mirrors. I'm not sure why you think this can't improve bandwidth. It > does improve streaming bandwidth as long as the platter STR is more than > the bus STR. ... Prove it. - Luke
Jim,

On 1/31/06 11:21 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> (BTW, I did some testing that seems to confirm this)
>
> Why couldn't you double the bandwidth? If you're doing a largish read
> you should be able to do something like have drive a read the first
> track, drive b the second, etc. Of course that means that the controller
> or OS would have to be able to stitch things back together.

It's because your alternating reads are skipping in chunks across the platter. Disks work at their max internal rate when reading sequential data, and the cache is often built to buffer a track-at-a-time, so alternating pieces that are not contiguous has the effect of halving the max internal sustained bandwidth of each drive - the total is equal to one drive's sustained internal bandwidth.

This works differently for RAID0, where the chunks are allocated to each drive and laid down contiguously on each, so that when they're read back, each drive runs at its sustained sequential throughput.

The alternating technique in mirroring might improve rotational latency for random seeking - a trick that Tandem exploited, but it won't improve bandwidth.

> As for software raid, I'm wondering how well that works if you can't use
> a BBU to allow write caching/re-ordering...

Works great with standard OS write caching.

- Luke
>> Linux does balanced reads on software >> mirrors. I'm not sure why you think this can't improve bandwidth. It >> does improve streaming bandwidth as long as the platter STR is more than >> the bus STR. > > ... Prove it. > (I have a software RAID1 on this desktop machine) It's a lot faster than a single disk for random reads when more than 1 thread hits the disk, because it distributes reads to both disks. Thus, applications start faster, and the machine is more reactive even when the disk is thrashing. Cron starting a "updatedb" is less painful. It's cool for desktop use (and of course it's more reliable). However large reads (dd-style) are just the same speed as 1 drive. I guess you'd need a humongous readahead in order to read from both disks.
On Tue, Jan 31, 2006 at 02:52:57PM -0800, Luke Lonergan wrote: > It's because your alternating reads are skipping in chunks across the > platter. Disks work at their max internal rate when reading sequential > data, and the cache is often built to buffer a track-at-a-time, so > alternating pieces that are not contiguous has the effect of halving the max > internal sustained bandwidth of each drive - the total is equal to one > drive's sustained internal bandwidth. > > This works differently for RAID0, where the chunks are allocated to each > drive and laid down contiguously on each, so that when they're read back, > each drive runs at it's sustained sequential throughput. > > The alternating technique in mirroring might improve rotational latency for > random seeking - a trick that Tandem exploited, but it won't improve > bandwidth. Or just work in multiples of tracks, which would greatly reduce the impact of delays from seeking. > > As for software raid, I'm wondering how well that works if you can't use > > a BBU to allow write caching/re-ordering... > > Works great with standard OS write caching. Well, the only problem with that is if the machine crashes for any reason you risk having the database corrupted (or at best losing some committed transactions). -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
PFC, On 1/31/06 3:11 PM, "PFC" <lists@peufeu.com> wrote: >> ... Prove it. >> > > (I have a software RAID1 on this desktop machine) > > It's a lot faster than a single disk for random reads when more than 1 > thread hits the disk, because it distributes reads to both disks. Thus, > applications start faster, and the machine is more reactive even when the > disk is thrashing. Cron starting a "updatedb" is less painful. It's cool > for desktop use (and of course it's more reliable). Exactly - improved your random seeks. > However large reads (dd-style) are just the same speed as 1 drive. I > guess you'd need a humongous readahead in order to read from both disks. Nope - won't help. - Luke
Jim, On 1/31/06 3:12 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote: >> The alternating technique in mirroring might improve rotational latency for >> random seeking - a trick that Tandem exploited, but it won't improve >> bandwidth. > > Or just work in multiples of tracks, which would greatly reduce the > impact of delays from seeking. So, having rediscovered the facts underlying the age-old RAID10 versus RAID5 debate we're back to the earlier points. RAID10 is/was the best option when latency / random seek was the predominant problem to be solved, RAID5/50 is best where read bandwidth is needed. Modern developments in fast CPUs for write checksumming have made RAID5/50 a viable alternative to RAID10 even when there is moderate write / random seek workloads and fast read is needed. >> >> Works great with standard OS write caching. > > Well, the only problem with that is if the machine crashes for any > reason you risk having the database corrupted (or at best losing some > committed transactions). So, do you routinely turn off Linux write caching? If not, then there's no difference. - Luke
On Tue, Jan 31, 2006 at 12:47:10PM -0800, Luke Lonergan wrote:
>> Linux does balanced reads on software
>> mirrors. I'm not sure why you think this can't improve bandwidth. It
>> does improve streaming bandwidth as long as the platter STR is more than
>> the bus STR.
> ... Prove it.

FWIW, this is on Ultra160 disks (Seagate 10000rpm) on a dual Opteron running Linux 2.6.14.3:

cassarossa:~# grep md1 /proc/mdstat
md1 : active raid1 sdf6[1] sda6[0]
cassarossa:~# dd if=/dev/sda6 of=/dev/null bs=8k count=400000
[system at about 35% wait for I/O and 15% system, according to top]
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 54,488154 seconds (60137842 bytes/sec)
[system at about 45% wait for I/O and 7% system -- whoa?]
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 52,523771 seconds (62386990 bytes/sec)

I'm not sure if it _refutes_ the assertion that the Linux RAID-1 driver can do balancing of sequential reads, but it certainly doesn't present very much evidence in that direction. BTW, sda and sdf are on different channels of a dual-channel (onboard, connected via PCI-X) Adaptec board, so I doubt the bus is the limiting factor.

/* Steinar */
--
Homepage: http://www.sesse.net/
Steinar, On 1/31/06 5:26 PM, "Steinar H. Gunderson" <sgunderson@bigfoot.com> wrote: > cassarossa:~# grep md1 /proc/mdstat > md1 : active raid1 sdf6[1] sda6[0] > cassarossa:~# dd if=/dev/sda6 of=/dev/null bs=8k count=400000 > [system at about 35% wait for I/O and 15% system, according to top] > 400000+0 records in > 400000+0 records out > 3276800000 bytes transferred in 54,488154 seconds (60137842 bytes/sec) > [system at about 45% wait for I/O and 7% system -- whoa?] > 400000+0 records in > 400000+0 records out > 3276800000 bytes transferred in 52,523771 seconds (62386990 bytes/sec) > > I'm not sure if it _refutes_ the assertion that the Linux RAID-1 driver can > do balancing of sequential reads, but it certainly doesn't present very much > evidence in that direction. BTW, sda and sdf are on different channels of a > dual-channel (onboard, connected via PCI-X) Adaptec board, so I doubt the bus > is the limiting factor. Yep - 2MB/s is noise. Run a RAID0, you should get 120MB/s. Incidentally, before this thread took a turn to RAID10 vs. RAID5, the question of HW RAID adapter versus SW RAID was the focus. I routinely see numbers like 20MB/s coming from HW RAID adapters on Linux, so it's nice to see someone post a decent number using SW RAID. We're very happy with the 3Ware HW RAID adapters, but so far they're the only ones (I have two Arecas but I mistakenly ordered PCI-E so I can't test them :-( - Luke
On Tue, Jan 31, 2006 at 03:19:38PM -0800, Luke Lonergan wrote: > > Well, the only problem with that is if the machine crashes for any > > reason you risk having the database corrupted (or at best losing some > > committed transactions). > > So, do you routinely turn off Linux write caching? If not, then there's no > difference. My thought was about fsync on WAL; if you're doing much writing then a good raid write cache with BBU will improve performance. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, 2006-01-31 at 12:47 -0800, Luke Lonergan wrote:
> Jeffrey,
>
> On 1/31/06 12:03 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:
> > Linux does balanced reads on software
> > mirrors. I'm not sure why you think this can't improve bandwidth. It
> > does improve streaming bandwidth as long as the platter STR is more than
> > the bus STR.
>
> ... Prove it.

It's clear that Linux software RAID1, and by extension RAID10, does balanced reads, and that these balanced reads double the bandwidth. A quick glance at the kernel source code, and a trivial test, proves the point.

In this test, sdf and sdg are Seagate 15k.3 disks on a single channel of an Adaptec 39320, but the enclosure, and therefore the bus, is capable of only Ultra160 operation.

# grep md0 /proc/mdstat
md0 : active raid1 sdf1[0] sdg1[1]

# dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=0 & dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=400000
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 48.243362 seconds (67922298 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 48.375897 seconds (67736211 bytes/sec)

That's 136MB/sec, for those following along at home. With only two disks in a RAID1, you can nearly max out the SCSI bus.

# dd if=/dev/sdf1 of=/dev/null bs=8k count=400000 skip=0 & dd if=/dev/sdf1 of=/dev/null bs=8k count=400000 skip=400000
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 190.413286 seconds (17208883 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 192.096232 seconds (17058117 bytes/sec)

That, on the other hand, is only 34MB/sec. With two threads, the RAID1 is 296% faster.

# dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=0 & dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=400000 & dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=800000 & dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=1200000 &
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 174.276585 seconds (18802296 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 181.581893 seconds (18045852 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 183.724243 seconds (17835425 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 184.209018 seconds (17788489 bytes/sec)

That's 71MB/sec with 4 threads...

# dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=0 & dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=400000 & dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=800000 & dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=1200000 &
100000+0 records in
100000+0 records out
819200000 bytes transferred in 77.489210 seconds (10571794 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 87.628000 seconds (9348610 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 88.912989 seconds (9213502 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 90.238705 seconds (9078144 bytes/sec)

Only 36MB/sec for the single disk. 96% advantage for the RAID1.
# dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=0 & dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=400000 & dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=800000 & dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=1200000 & dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=1600000 & dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2000000 & dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2400000 & dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2800000 &
50000+0 records in
50000+0 records out
409600000 bytes transferred in 35.289648 seconds (11606803 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 42.653475 seconds (9602969 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 43.524714 seconds (9410745 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 45.151705 seconds (9071640 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 47.741845 seconds (8579476 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 48.600533 seconds (8427891 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 48.758726 seconds (8400548 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 49.679275 seconds (8244887 bytes/sec)

66MB/s with 8 threads.

# dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=0 & dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=400000 & dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=800000 & dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=1200000 & dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=1600000 & dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=2000000 & dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=2400000 & dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=2800000 &
50000+0 records in
50000+0 records out
409600000 bytes transferred in 73.873911 seconds (5544583 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 75.613093 seconds (5417051 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 79.988303 seconds (5120749 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 79.996440 seconds (5120228 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 84.885172 seconds (4825342 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 92.995892 seconds (4404496 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 99.180337 seconds (4129851 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 100.144752 seconds (4090080 bytes/sec)

33MB/s. RAID1 gives a 100% advantage at 8 threads.

I think I've proved my point. Software RAID1 read balancing provides 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads, respectively. In the presence of random I/O, the results are even better.

Anyone who thinks they have a single-threaded workload has not yet encountered the autovacuum daemon.

-Jeff
Jeffrey, On 1/31/06 8:09 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote: >> ... Prove it. > I think I've proved my point. Software RAID1 read balancing provides > 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads, > respectively. In the presence of random I/O, the results are even > better. > Anyone who thinks they have a single-threaded workload has not yet > encountered the autovacuum daemon. Good data - interesting case. I presume from your results that you had to make the I/Os non-overlapping (the "skip" option to dd) in order to get the concurrent access to work. Why the particular choice of offset - 3.2GB in this case? So - the bandwidth doubles in specific circumstances under concurrent workloads - not relevant to "Huge Data sets, simple queries", but possibly helpful for certain kinds of OLTP applications. - Luke
On Tue, 2006-01-31 at 21:53 -0800, Luke Lonergan wrote: > Jeffrey, > > On 1/31/06 8:09 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote: > >> ... Prove it. > > I think I've proved my point. Software RAID1 read balancing provides > > 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads, > > respectively. In the presence of random I/O, the results are even > > better. > > Anyone who thinks they have a single-threaded workload has not yet > > encountered the autovacuum daemon. > > Good data - interesting case. I presume from your results that you had to > make the I/Os non-overlapping (the "skip" option to dd) in order to get the > concurrent access to work. Why the particular choice of offset - 3.2GB in > this case? No particular reason. 8k x 100000 is what the last guy used upthread. > > So - the bandwidth doubles in specific circumstances under concurrent > workloads - not relevant to "Huge Data sets, simple queries", but possibly > helpful for certain kinds of OLTP applications. Ah, but someday Pg will be able to concurrently read from two datastreams to complete a single query. And that day will be glorious and fine, and you'll want as much disk concurrency as you can get your hands on. -jwb
I did a little test on soft raid1 :

I have two 800 Mbytes files, say A and B. (RAM is 512Mbytes).

Test 1 :
1- Read A, then read B : 19 seconds per file
2- Read A and B simultaneously using two threads : 22 seconds total (reads were paralleled by the RAID)
3- Read one block of A, then one block of B, then one block of A, etc. Essentially this is the same as the threaded case, except there's only one thread. 53 seconds total (with heavy seeking noise from the hdd).

I was half expecting 3 to take the same as 2. It simulates, for instance, scanning a table and its index, or scanning 2 sort bins. Well, maybe one day...

It would be nice if the Kernel had an API for applications to tell it "I'm gonna need these blocks in the next seconds, can you read them in the order you like (fastest), from whatever disk you like, and cache them for me please; so that I can read them in the order I like, but very fast ?"

On Wed, 01 Feb 2006 09:25:13 +0100, Jeffrey W. Baker <jwbaker@acm.org> wrote:

> On Tue, 2006-01-31 at 21:53 -0800, Luke Lonergan wrote:
>> Jeffrey,
>>
>> On 1/31/06 8:09 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:
>> >> ... Prove it.
>> > I think I've proved my point. Software RAID1 read balancing provides
>> > 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
>> > respectively. In the presence of random I/O, the results are even
>> > better.
>> > Anyone who thinks they have a single-threaded workload has not yet
>> > encountered the autovacuum daemon.
>>
>> Good data - interesting case. I presume from your results that you had to
>> make the I/Os non-overlapping (the "skip" option to dd) in order to get the
>> concurrent access to work. Why the particular choice of offset - 3.2GB in
>> this case?
>
> No particular reason. 8k x 100000 is what the last guy used upthread.
>
>> So - the bandwidth doubles in specific circumstances under concurrent
>> workloads - not relevant to "Huge Data sets, simple queries", but possibly
>> helpful for certain kinds of OLTP applications.
>
> Ah, but someday Pg will be able to concurrently read from two
> datastreams to complete a single query. And that day will be glorious
> and fine, and you'll want as much disk concurrency as you can get your
> hands on.
>
> -jwb
On Tue, Jan 31, 2006 at 08:09:40PM -0800, Jeffrey W. Baker wrote: >I think I've proved my point. Software RAID1 read balancing provides >0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads, >respectively. In the presence of random I/O, the results are even >better. Umm, the point *was* about single stream performance. I guess you did a good job of proving it. >Anyone who thinks they have a single-threaded workload has not yet >encountered the autovacuum daemon. On tables where my single stream performance matters you'd better believe that the autovacuum daemon isn't running. Mike Stone
PFC, On 2/1/06 1:01 AM, "PFC" <lists@peufeu.com> wrote: > 3- Read one block of A, then one block of B, then one block of A, etc. > Essentially this is the same as the threaded case, except there's only one > thread. > 53 seconds total (with heavy seeking noise from the hdd). > > I was half expecting 3 to take the same as 2. It simulates, for > instance, > scanning a table and its index, or scanning 2 sort bins. Well, maybe one > day... This is actually interesting overall - I think what this might be showing is that the Linux SW RAID1 is alternating I/Os to the mirror disks from different processes (LWP or HWP both maybe?), but not within one process. > It would be nice if the Kernel had an API for applications to tell it > "I'm gonna need these blocks in the next seconds, can you read them in the > order you like (fastest), from whatever disk you like, and cache them for > me please; so that I can read them in the order I like, but very fast ?" More control is always good IMO, but for now there's I/O reordering in the SCSI layer and readahead tuning. There is POSIX fadvise() also to tell the underlying I/O layer what the access pattern looks like. - Luke
On Wed, Feb 01, 2006 at 09:42:12AM -0800, Luke Lonergan wrote: > This is actually interesting overall - I think what this might be showing is > that the Linux SW RAID1 is alternating I/Os to the mirror disks from > different processes (LWP or HWP both maybe?), but not within one process. Having read the code, I'm fairly certain it doesn't really care what process anything is coming from. /* Steinar */ -- Homepage: http://www.sesse.net/
On 2/1/06, Luke Lonergan <llonergan@greenplum.com> wrote: [snip] > This is actually interesting overall - I think what this might be showing is > that the Linux SW RAID1 is alternating I/Os to the mirror disks from > different processes (LWP or HWP both maybe?), but not within one process. I can confirm this behavior after looking at my multipathed fibre channel SAN. To the best of my knowledge, the multipathing code uses the same underlying I/O code as the Linux SW RAID logic. -- Mike Rylander mrylander@gmail.com GPLS -- PINES Development Database Developer http://open-ils.org
Jeffrey,

On 2/1/06 12:25 AM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:

> Ah, but someday Pg will be able to concurrently read from two
> datastreams to complete a single query. And that day will be glorious
> and fine, and you'll want as much disk concurrency as you can get your
> hands on.

Well - so happens that we have one of those multi-processing postgres' handy, so we'll test this theory out in the next couple of days. We've a customer who ordered 3 machines with 6 drives each (Dell 2850s) on two U320 SCSI busses, and we're going to try configuring them all in a single RAID10 and run two Bizgres MPP segments on that (along with two mirrors). We'll try the RAID10 config and if we get full parallelism, we'll use it (if the customer likes it). Otherwise, we'll use two 3-disk RAID5 sets.

I'll post the results here.

Thanks Jeffrey,

- Luke
Jeffrey W. Baker wrote: > On Tue, 2006-01-31 at 09:00 -0800, Luke Lonergan wrote: > >> Jim, >> >> On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote: >> >> >>> Why divide by 2? A good raid controller should be able to send read >>> requests to both drives out of the mirrored set to fully utilize the >>> bandwidth. Of course, that probably won't come into play unless the OS >>> decides that it's going to read-ahead fairly large chunks of the table >>> at a time... >>> >> I've not seen one that does, nor would it work in the general case IMO. In >> RAID1 writes are duplicated and reads come from one of the copies. You >> could alternate read service requests to minimize rotational latency, but >> you can't improve bandwidth. >> > > Then you've not seen Linux. Linux does balanced reads on software > mirrors. I'm not sure why you think this can't improve bandwidth. It > does improve streaming bandwidth as long as the platter STR is more than > the bus STR. > FYI: so does the Solaris Volume Manager (by default) on Solaris. One can choose alternate access methods like "First" (if the other mirrors are slower than the first) or "Geometric". It's been doing this for a good 10 years now (back when it was called DiskSuite), so it's nothing new. -- Alan