Thread: Huge Data sets, simple queries

Huge Data sets, simple queries

From
"Mike Biamonte"
Date:

Does anyone have any experience with extremely large data sets?
I mean hundreds of millions of rows.

The queries I need to run on my 200 million transactions are relatively
simple:

   select month, count(distinct(cardnum)), count(*), sum(amount) from
transactions group by month;

This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3, (2.4 Kernel) with
RAID-10 (15K drives)
and 12 GB Ram.  I was expecting it to take about 4 hours - based on some
experience with a
similar dataset on a different machine (RH9, PG7.3 Dual Xeon, 4GB RAM,
Raid-5 10K drives)

  This machine is COMPLETELY devoted to running these relatively simple
queries one at a
time. (No multi-user support needed!)    I've been tooling with the various
performance settings:
effective_cache at 5GB, shared_buffers at 2 GB, workmem, sortmem at 1 GB
each.
( Shared buffers puzzles me a bit - my instinct says to set it as high as
possible,
but everything I read says that "too high" can hurt performance.)

   Any ideas for performance tweaking in this kind of application would be
greatly appreciated.
We've got indexes on the fields being grouped, and always vacuum analyze
after building them.

   It's difficult to just "try" various ideas because each attempt takes a
full day to test.  Real
experience is needed here!

Thanks much,

Mike


Re: Huge Data sets, simple queries

From
"Jeffrey W. Baker"
Date:
On Fri, 2006-01-27 at 20:23 -0500, Mike Biamonte wrote:
>
> Does anyone have any experience with extremely large data sets?
> I mean hundreds of millions of rows.

Sure, I think more than a few of us do.  Just today I built a summary
table from a 25GB primary table with ~430 million rows.  This took about
45 minutes.

> The queries I need to run on my 200 million transactions are relatively
> simple:
>
>    select month, count(distinct(cardnum)), count(*), sum(amount) from
> transactions group by month;
>
> This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3, (2.4 Kernel) with
> RAID-10 (15K drives)
> and 12 GB Ram.  I was expecting it to take about 4 hours - based on some
> experience with a
> similar dataset on a different machine (RH9, PG7.3 Dual Xeon, 4GB RAM,
> Raid-5 10K drives)

Possibly the latter machine has a faster I/O subsystem.  How large is
the table on disk?

>   This machine is COMPLETELY devoted to running these relatively simple
> queries one at a
> time. (No multi-user support needed!)    I've been tooling with the various
> performance settings:
> effective_cache at 5GB, shared_buffers at 2 GB, workmem, sortmem at 1 GB
> each.
> ( Shared buffers puzzles me a bit - my instinct says to set it as high as
> possible,
> but everything I read says that "too high" can hurt performance.)
>
>    Any ideas for performance tweaking in this kind of application would be
> greatly appreciated.
> We've got indexes on the fields being grouped,
> and always vacuum analyze
> after building them.

Probably vacuum makes no difference.

>    It's difficult to just "try" various ideas because each attempt takes a
> full day to test.  Real
> experience is needed here!

Can you send us an EXPLAIN of the query?  I believe what you're seeing
here is probably:

Aggregate
+-Sort
  +-Sequential Scan

or perhaps:

Aggregate
+-Index Scan

I have a feeling that the latter will be much faster.  If your table has
been created over time, then it is probably naturally ordered by date,
and therefore also ordered by month.  You might expect a Sequential Scan
to be the fastest, but the Sort step will be a killer.  On the other
hand, if your table is badly disordered by date, the Index Scan could
also be very slow.

Anyway, send us the query plan and also perhaps a sample of vmstat
during the query.
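
Something like the following would do -- it's just EXPLAIN prefixed to the
query from the original post, so the table and column names are assumed from
there:

   explain
   select month, count(distinct(cardnum)), count(*), sum(amount)
     from transactions
    group by month;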

For what it's worth, I have:

effective_cache_size            | 700000
cpu_tuple_cost                  | 0.01
cpu_index_tuple_cost            | 0.001
random_page_cost                | 3
shared_buffers                  | 50000
temp_buffers                    | 1000
work_mem                        | 1048576 <= for this query only
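
For reference, on 8.1 work_mem is an integer number of kilobytes, so the
value above can be set for just the one big query at the session level rather
than in postgresql.conf -- a minimal sketch:

   set work_mem = 1048576;   -- 1 GB, for this session only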

And here's a few lines from vmstat during the query:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  1     76  43476  94916 7655148    0    0 78800     0 1662   788 68 12  0 20
 1  1     76  45060  91196 7658088    0    0 78028     0 1639   712 71 11  0 19
 2  0     76  44668  87624 7662960    0    0 78924    52 1650   736 69 12  0 19
 2  0     76  45300  83672 7667432    0    0 83536    16 1688   768 71 12  0 18
 1  1     76  45744  80652 7670712    0    0 84052     0 1691   796 70 12  0 17

That's about 80MB/sec sequential input, for comparison purposes.

-jwb


Re: Huge Data sets, simple queries

From
Tom Lane
Date:
"Mike Biamonte" <mike@dbeat.com> writes:
> The queries I need to run on my 200 million transactions are relatively
> simple:

>    select month, count(distinct(cardnum)), count(*), sum(amount) from
> transactions group by month;

count(distinct) is not "relatively simple", and the current
implementation isn't especially efficient.  Can you avoid that
construct?
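
One possible rewrite (just a sketch, reusing the table and column names from
your query) is to collapse the data to distinct (month, cardnum) pairs in a
subquery first, so the outer aggregation needs no DISTINCT:

   select month,
          count(*)     as distinct_cards,
          sum(n_trans) as num_transactions,
          sum(amount)  as total_amount
     from (select month, cardnum, count(*) as n_trans, sum(amount) as amount
             from transactions
            group by month, cardnum) as per_card
    group by month;

Whether that is actually faster depends on how many (month, cardnum) pairs
there are, but it avoids the per-group sort/uniq described below.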

Assuming that "month" means what it sounds like, the above would result
in running twelve parallel sort/uniq operations, one for each month
grouping, to eliminate duplicates before counting.  You've got sortmem
set high enough to blow out RAM in that scenario ...

            regards, tom lane

Re: Huge Data sets, simple queries

From
"Jeffrey W. Baker"
Date:
On Sat, 2006-01-28 at 10:55 -0500, Tom Lane wrote:
>
> Assuming that "month" means what it sounds like, the above would
> result
> in running twelve parallel sort/uniq operations, one for each month
> grouping, to eliminate duplicates before counting.  You've got sortmem
> set high enough to blow out RAM in that scenario ...

Hrmm, why is it that with a similar query I get a far simpler plan than
you describe, and relatively snappy runtime?

  select date
       , count(1) as nads
       , sum(case when premium then 1 else 0 end) as npremium
       , count(distinct(keyword)) as nwords
       , count(distinct(advertiser)) as nadvertisers
    from data
group by date
order by date asc

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..14452743.09 rows=721 width=13)
   ->  Index Scan using data_date_idx on data  (cost=0.00..9075144.27 rows=430206752 width=13)
(2 rows)

=# show server_version;
 server_version
----------------
 8.1.2
(1 row)

-jwb


Re: Huge Data sets, simple queries

From
Tom Lane
Date:
"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> On Sat, 2006-01-28 at 10:55 -0500, Tom Lane wrote:
>> Assuming that "month" means what it sounds like, the above would result
>> in running twelve parallel sort/uniq operations, one for each month
>> grouping, to eliminate duplicates before counting.  You've got sortmem
>> set high enough to blow out RAM in that scenario ...

> Hrmm, why is it that with a similar query I get a far simpler plan than
> you describe, and relatively snappy runtime?

You can't see the sort operations in the plan, because they're invoked
implicitly by the GroupAggregate node.  But they're there.

Also, a plan involving GroupAggregate is going to run the "distinct"
sorts sequentially, because it's dealing with only one grouping value at
a time.  In the original case, the planner probably realizes there are
only 12 groups and therefore prefers a HashAggregate, which will try
to run all the sorts in parallel.  Your "group by date" isn't a good
approximation of the original conditions because there will be a lot
more groups.

(We might need to tweak the planner to discourage selecting
HashAggregate in the presence of DISTINCT aggregates --- I don't
remember whether it accounts for the sortmem usage in deciding
whether the hash will fit in memory or not ...)

            regards, tom lane

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Sounds like you are running into the limits of your disk subsystem.  You are scanning all of the data in the transactions table, so you will be limited by the disk bandwidth you have – and using RAID-10, you should divide the number of disk drives by 2 and multiply by their individual bandwidth (around 60MB/s) and that’s what you can expect in terms of performance.  So, if you have 8 drives, you should expect to get 4 x 60 MB/s = 240 MB/s in bandwidth.  That means that if you are dealing with 24,000 MB of data in the “transactions” table, then you will scan it in 100 seconds.

With a workload like this, you are in the realm of business intelligence / data warehousing, I think.  You should check your disk performance; I expect you’ll find it lacking, partly because you are running RAID10, but mostly because I expect you are using a hardware RAID adapter.

- Luke





Re: Huge Data sets, simple queries

From
Tom Lane
Date:
I wrote:
> (We might need to tweak the planner to discourage selecting
> HashAggregate in the presence of DISTINCT aggregates --- I don't
> remember whether it accounts for the sortmem usage in deciding
> whether the hash will fit in memory or not ...)

Ah, I take that all back after checking the code: we don't use
HashAggregate at all when there are DISTINCT aggregates, precisely
because of this memory-blow-out problem.

For both your group-by-date query and the original group-by-month query,
the plan of attack is going to be to read the original input in grouping
order (either via sort or indexscan, with sorting probably preferred
unless the table is pretty well correlated with the index) and then
sort/uniq on the DISTINCT value within each group.  The OP is probably
losing on that step compared to your test because it's over much larger
groups than yours, forcing some spill to disk.  And most likely he's not
got an index on month, so the first sort is in fact a sort and not an
indexscan.
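
If there is in fact no index on month, creating one would at least give the
planner the option of reading the table in month order instead of sorting it
(whether that wins depends on how well the table is physically correlated
with month, as above).  A hedged sketch, using the table and column names
from the original post:

   create index transactions_month_idx on transactions (month);  -- index name arbitrary
   analyze transactions;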

Bottom line is that he's probably doing a ton of on-disk sorting
where you're not doing any.  This makes me think Luke's theory about
inadequate disk horsepower may be on the money.

            regards, tom lane

Re: Huge Data sets, simple queries

From
hubert depesz lubaczewski
Date:
On 1/28/06, Luke Lonergan <llonergan@greenplum.com> wrote:
> You should check your disk performance, I would
> expect you'll find it lacking, partly because you are running RAID10, but
> mostly because I expect you are using a hardware RAID adapter.

hmm .. do i understand correctly that you're suggesting that using
raid 10 and/or hardware raid adapter might hurt disc subsystem
performance? could you elaborate on the reasons, please? it's not that
i'm against the idea - i'm just curious as this is very
"against-common-sense". and i always found it interesting when
somebody states something that uncommon...

best regards

depesz

Re: Huge Data sets, simple queries

From
Michael Stone
Date:
On Sun, Jan 29, 2006 at 12:25:23PM +0100, hubert depesz lubaczewski wrote:
>hmm .. do i understand correctly that you're suggesting that using
>raid 10 and/or hardware raid adapter might hurt disc subsystem
>performance? could you elaborate on the reasons, please?

I think it's been fairly well beaten to death that the low-end hardware
raid adapters have lousy performance. It's not until you get into the
range of battery-backed disk caches with 512M+ and multiple I/O channels
that hardware raid becomes competitive with software raid.

Mike Stone

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Depesz,

> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> hubert depesz lubaczewski
> Sent: Sunday, January 29, 2006 3:25 AM
>
> hmm .. do i understand correctly that you're suggesting that
> using raid 10 and/or hardware raid adapter might hurt disc
> subsystem performance? could you elaborate on the reasons,
> please? it's not that i'm against the idea - i'm just curious
> as this is very "against-common-sense". and i always found it
> interesting when somebody states something that uncommon...

See previous postings on this list - often when someone is reporting a
performance problem with large data, the answer comes back that their
I/O setup is not performing well.  Most times, people are trusting that
when they buy a hardware RAID adapter and set it up, that the
performance will be what they expect and what is theoretically correct
for the number of disk drives.

In fact, in our testing of various host-based SCSI RAID adapters (LSI,
Dell PERC, Adaptec, HP SmartArray), we find that *all* of them
underperform, most of them severely.  Some produce results slower than a
single disk drive.  We've found that some external SCSI RAID adapters,
those built into the disk chassis, often perform better.  I think this
might be due to the better drivers and perhaps a different marketplace
for the higher end solutions driving performance validation.

The important lesson we've learned is to always test the I/O subsystem
performance - you can do so with a simple test like:
  time bash -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
  time dd if=bigfile of=/dev/null bs=8k
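  # note: bs=8k x count=4000000 writes and reads back a file of roughly 32 GB;
  # keep it larger than RAM so the read test hits the disks, not the page cache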

If the answer isn't something close to the theoretical rate, you are
likely limited by your RAID setup, and you might be shocked to find a
severe performance problem.  If so, switching to software RAID using a
simple SCSI adapter will fix the problem.

BTW - we've had very good experiences with the host-based SATA adapters
from 3Ware.  The Areca controllers are also respected.

Oh - and about RAID 10 - for large data work it's more often a waste of
disk performance-wise compared to RAID 5 these days.  RAID5 will almost
double the performance on a reasonable number of drives.

- Luke


Re: Huge Data sets, simple queries

From
"Jeffrey W. Baker"
Date:
On Sun, 2006-01-29 at 13:44 -0500, Luke Lonergan wrote:
> Depesz,
>
> > [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of
> > hubert depesz lubaczewski
> > Sent: Sunday, January 29, 2006 3:25 AM
> >
> > hmm .. do i understand correctly that you're suggesting that
> > using raid 10 and/or hardware raid adapter might hurt disc
> > subsystem performance? could you elaborate on the reasons,
> > please? it's not that i'm against the idea - i'm just curious
> > as this is very "against-common-sense". and i always found it
> > interesting when somebody states something that uncommon...

> Oh - and about RAID 10 - for large data work it's more often a waste of
> disk performance-wise compared to RAID 5 these days.  RAID5 will almost
> double the performance on a reasonable number of drives.

I think you might want to be more specific here.  I would agree with you
for data warehousing, decision support, data mining, and similar
read-mostly non-transactional loads.  For transactional loads RAID-5 is,
generally speaking, a disaster due to the read-before-write problem.

While we're on the topic, I just installed another one of those Areca
ARC-1130 controllers with 1GB cache.  It's ludicrously fast: 250MB/sec
burst writes, CPU-limited reads.  I can't recommend them highly enough.

-jwb

PS: Could you look into fixing your mailer?  Your messages sometimes
don't contain In-Reply-To headers, and therefore don't thread properly.


Re: Huge Data sets, simple queries

From
Michael Adler
Date:
On Fri, Jan 27, 2006 at 08:23:55PM -0500, Mike Biamonte wrote:

> This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3, (2.4
> Kernel) with RAID-10 (15K drives) and 12 GB Ram.  I was expecting it
> to take about 4 hours - based on some experience with a similar
> dataset on a different machine (RH9, PG7.3 Dual Xeon, 4GB RAM,
> Raid-5 10K drives)
>
>    It's difficult to just "try" various ideas because each attempt
> takes a full day to test.  Real experience is needed here!

It seems like you are changing multiple variables at the same time.

I think you need to first compare the query plans with EXPLAIN SELECT
to see if they are significantly different. Your upgrade from 7.3 to
8.1 may have resulted in a less optimal plan.

Second, you should monitor your IO performance during the query
execution and test it independent of postgres. Then compare the stats
between the two systems.

As a side note, if you have many disks and you are concerned about
bottlenecks on read operations, RAID 5 may perform better than
RAID 10.

 -Mike

Re: Huge Data sets, simple queries

From
"Craig A. James"
Date:
Mike Biamonte wrote:
> Does anyone have any experience with extremely large data sets?
> I mean hundreds of millions of rows.
>
> The queries I need to run on my 200 million transactions are relatively
> simple:
>
>    select month, count(distinct(cardnum)), count(*), sum(amount) from
> transactions group by month;

This may be heretical to post to a relational-database group, but sometimes a
problem can be better solved OUTSIDE of the relational system.

I had a similar problem recently: I have a set of about 100,000 distinct
values, each of which occurs one to several million times in the database,
with an aggregate total of several hundred million occurrences in the
database.

Sorting this into distinct lists ("Which rows contain this value?") proved
quite time consuming (just like your case), but on reflection, I realized
that it was dumb to expect a general-purpose sorting algorithm to sort a list
about which I had specialized knowledge.  General-purpose sorting usually
takes O(N*log(N)), but if you have a small number of distinct values, you can
use "bucket sorting" and sort in O(N) time, a huge improvement.  In my case,
it was even more specialized -- there was a very small number of the lists
that contained thousands or millions of items, but about 95% of the lists
only had a few items.

Armed with this knowledge, it took me a couple of weeks to write a
highly-specialized sorting system that used a combination of Postgres,
in-memory and disk caching, and algorithms dredged up from Knuth.  The final
result ran in about four hours.

The thing to remember about relational databases is that the designers are
constrained by the need for generality, reliability and SQL standards.  Given
any particular well-defined task where you have specialized knowledge about
the data, and/or you don't care about transactional correctness, and/or
you're not concerned about data loss, a good programmer can always write a
faster solution.

Of course, there's a huge penalty.  You lose support, lose generality, the
application takes on complexity that should be in the database, and on and
on.  A hand-crafted solution should be avoided unless there's simply no other
way.

A relational database is a tool.  Although powerful, like any tool it has
limitations.  Use the tool where it's useful, and use other tools when
necessary.

Craig

Re: Huge Data sets, simple queries

From
Charles Sprickman
Date:
On Sun, 29 Jan 2006, Luke Lonergan wrote:

> In fact, in our testing of various host-based SCSI RAID adapters (LSI,
> Dell PERC, Adaptec, HP SmartArray), we find that *all* of them
> underperform, most of them severely.

[snip]

> The important lesson we've learned is to always test the I/O subsystem
> performance - you can do so with a simple test like:
>  time bash -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
>  time dd if=bigfile of=/dev/null bs=8k

I'm curious about this since we're shopping around for something new...  I
do want to get some kind of baseline to compare new products to.  Areca
sent me stats on their SCSI->SATA controller and it looks like it maxes
out around 10,000 IOPS.

I'd like to see how our existing stuff compares to this.  I'd especially
like to see it in graph form such as the docs Areca sent (IOPS on one
axis, block size on the other, etc.).  Looking at the venerable Bonnie, it
doesn't really seem to focus so much on the number of read/write
operations per second, but on big bulky transfers.

What are you folks using to measure your arrays?

I've been considering using some of our data and just basically
benchmarking postgres on various hardware with that, but I cannot compare
that to any manufacturer tests.

Sorry to meander a bit off topic, but I've been getting frustrated with
this little endeavour...

Thanks,

Charles

> - Luke

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Charles,

On 1/29/06 9:35 PM, "Charles Sprickman" <spork@bway.net> wrote:

> What are you folks using to measure your arrays?

Bonnie++ measures random I/Os; the numbers we find are typically in the 500/s
range, and the best I've seen is 1500/s on a large Fibre Channel RAID0 (at
http://www.wlug.org.nz/HarddiskBenchmarks).

- Luke



Re: Huge Data sets, simple queries

From
hubert depesz lubaczewski
Date:
On 1/29/06, Luke Lonergan <LLonergan@greenplum.com> wrote:
> Oh - and about RAID 10 - for large data work it's more often a waste of
> disk performance-wise compared to RAID 5 these days.  RAID5 will almost
> double the performance on a reasonable number of drives.

how many is reasonable?

depesz

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Depesz,

On 1/30/06 9:53 AM, "hubert depesz lubaczewski" <depesz@gmail.com> wrote:

>> double the performance on a reasonable number of drives.
>
> how many is reasonable?

What I mean by that is: given a set of disks N, the read performance of RAID
will be equal to the drive read rate A times the number of drives used for
reading by the RAID algorithm.  In the case of RAID5, that number is (N-1),
so the read rate is A x (N-1).  In the case of RAID10, that number is N/2,
so the read rate is A x (N/2). So, the ratio of read performance
RAID5/RAID10 is (N-1)/(N/2) = 2 x (N-1)/N.  For numbers of drives, this
ratio looks like this:
N       RAID5/RAID10
3       1.33
6       1.67
8       1.75
14      1.86

So - I think reasonable would be 6-8, which are common disk configurations.
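
For other drive counts, the same ratio is easy to compute in psql -- a
throwaway sketch:

   select n, round(2.0 * (n - 1) / n, 2) as raid5_vs_raid10
     from generate_series(3, 16) as g(n);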

- Luke



Re: Huge Data sets, simple queries

From
"Jim C. Nasby"
Date:
On Fri, Jan 27, 2006 at 07:05:04PM -0800, Luke Lonergan wrote:
> Sounds like you are running into the limits of your disk subsystem.  You are
> scanning all of the data in the transactions table, so you will be limited
> by the disk bandwidth you have – and using RAID-10, you should divide the
> number of disk drives by 2 and multiply by their individual bandwidth
> (around 60MB/s) and that’s what you can expect in terms of performance.  So,
> if you have 8 drives, you should expect to get 4 x 60 MB/s = 240 MB/s in
> bandwidth.  That means that if you are dealing with 24,000 MB of data in the
> “transactions” table, then you will scan it in 100 seconds.

Why divide by 2? A good raid controller should be able to send read
requests to both drives out of the mirrored set to fully utilize the
bandwidth. Of course, that probably won't come into play unless the OS
decides that it's going to read-ahead fairly large chunks of the table
at a time...

Also, some vmstat output would certainly help clarify where the
bottleneck is...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Jim,

On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> Why divide by 2? A good raid controller should be able to send read
> requests to both drives out of the mirrored set to fully utilize the
> bandwidth. Of course, that probably won't come into play unless the OS
> decides that it's going to read-ahead fairly large chunks of the table
> at a time...

I've not seen one that does, nor would it work in the general case IMO.  In
RAID1 writes are duplicated and reads come from one of the copies.  You
could alternate read service requests to minimize rotational latency, but
you can't improve bandwidth.

- Luke



Re: Huge Data sets, simple queries

From
Kevin
Date:
Luke Lonergan wrote:
> Jim,
>
> On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>
>
>> Why divide by 2? A good raid controller should be able to send read
>> requests to both drives out of the mirrored set to fully utilize the
>> bandwidth. Of course, that probably won't come into play unless the OS
>> decides that it's going to read-ahead fairly large chunks of the table
>> at a time...
>>
>
> I've not seen one that does, nor would it work in the general case IMO.  In
> RAID1 writes are duplicated and reads come from one of the copies.  You
> could alternate read service requests to minimize rotational latency, but
> you can't improve bandwidth.
>
> - Luke
>
>
For Solaris's software raid, the default setting for raid-1 sets is:
round-robin read, parallel write.   I assumed this meant it would give
similar read performance to raid-0, but I've never benchmarked it.

-Kevin

Re: Huge Data sets, simple queries

From
"Jim C. Nasby"
Date:
On Tue, Jan 31, 2006 at 09:00:30AM -0800, Luke Lonergan wrote:
> Jim,
>
> On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>
> > Why divide by 2? A good raid controller should be able to send read
> > requests to both drives out of the mirrored set to fully utilize the
> > bandwidth. Of course, that probably won't come into play unless the OS
> > decides that it's going to read-ahead fairly large chunks of the table
> > at a time...
>
> I've not seen one that does, nor would it work in the general case IMO.  In
> RAID1 writes are duplicated and reads come from one of the copies.  You
> could alternate read service requests to minimize rotational latency, but
> you can't improve bandwidth.

(BTW, I did some testing that seems to confirm this)

Why couldn't you double the bandwidth? If you're doing a largish read
you should be able to do something like have drive a read the first
track, drive b the second, etc. Of course that means that the controller
or OS would have to be able to stitch things back together.

As for software raid, I'm wondering how well that works if you can't use
a BBU to allow write caching/re-ordering...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Huge Data sets, simple queries

From
"Jeffrey W. Baker"
Date:
On Tue, 2006-01-31 at 09:00 -0800, Luke Lonergan wrote:
> Jim,
>
> On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>
> > Why divide by 2? A good raid controller should be able to send read
> > requests to both drives out of the mirrored set to fully utilize the
> > bandwidth. Of course, that probably won't come into play unless the OS
> > decides that it's going to read-ahead fairly large chunks of the table
> > at a time...
>
> I've not seen one that does, nor would it work in the general case IMO.  In
> RAID1 writes are duplicated and reads come from one of the copies.  You
> could alternate read service requests to minimize rotational latency, but
> you can't improve bandwidth.

Then you've not seen Linux.  Linux does balanced reads on software
mirrors.  I'm not sure why you think this can't improve bandwidth.  It
does improve streaming bandwidth as long as the platter STR is more than
the bus STR.

-jwb

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Jeffrey,

On 1/31/06 12:03 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:

> Then you've not seen Linux.

:-D

> Linux does balanced reads on software
> mirrors.  I'm not sure why you think this can't improve bandwidth.  It
> does improve streaming bandwidth as long as the platter STR is more than
> the bus STR.

... Prove it.

- Luke



Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Jim,

On 1/31/06 11:21 AM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> (BTW, I did some testing that seems to confirm this)
>
> Why couldn't you double the bandwidth? If you're doing a largish read
> you should be able to do something like have drive a read the first
> track, drive b the second, etc. Of course that means that the controller
> or OS would have to be able to stitch things back together.

It's because your alternating reads are skipping in chunks across the
platter.  Disks work at their max internal rate when reading sequential
data, and the cache is often built to buffer a track-at-a-time, so
alternating pieces that are not contiguous has the effect of halving the max
internal sustained bandwidth of each drive - the total is equal to one
drive's sustained internal bandwidth.

This works differently for RAID0, where the chunks are allocated to each
drive and laid down contiguously on each, so that when they're read back,
each drive runs at its sustained sequential throughput.

The alternating technique in mirroring might improve rotational latency for
random seeking - a trick that Tandem exploited, but it won't improve
bandwidth.

> As for software raid, I'm wondering how well that works if you can't use
> a BBU to allow write caching/re-ordering...

Works great with standard OS write caching.

- Luke



Re: Huge Data sets, simple queries

From
PFC
Date:
>> Linux does balanced reads on software
>> mirrors.  I'm not sure why you think this can't improve bandwidth.  It
>> does improve streaming bandwidth as long as the platter STR is more than
>> the bus STR.
>
> ... Prove it.
>

    (I have a software RAID1 on this desktop machine)

    It's a lot faster than a single disk for random reads when more than 1
thread hits the disk, because it distributes reads to both disks. Thus,
applications start faster, and the machine is more reactive even when the
disk is thrashing. Cron starting a "updatedb" is less painful. It's cool
for desktop use (and of course it's more reliable).

    However large reads (dd-style) are just the same speed as 1 drive. I
guess you'd need a humongous readahead in order to read from both disks.

Re: Huge Data sets, simple queries

From
"Jim C. Nasby"
Date:
On Tue, Jan 31, 2006 at 02:52:57PM -0800, Luke Lonergan wrote:
> It's because your alternating reads are skipping in chunks across the
> platter.  Disks work at their max internal rate when reading sequential
> data, and the cache is often built to buffer a track-at-a-time, so
> alternating pieces that are not contiguous has the effect of halving the max
> internal sustained bandwidth of each drive - the total is equal to one
> drive's sustained internal bandwidth.
>
> This works differently for RAID0, where the chunks are allocated to each
> drive and laid down contiguously on each, so that when they're read back,
> each drive runs at its sustained sequential throughput.
>
> The alternating technique in mirroring might improve rotational latency for
> random seeking - a trick that Tandem exploited, but it won't improve
> bandwidth.

Or just work in multiples of tracks, which would greatly reduce the
impact of delays from seeking.

> > As for software raid, I'm wondering how well that works if you can't use
> > a BBU to allow write caching/re-ordering...
>
> Works great with standard OS write caching.

Well, the only problem with that is if the machine crashes for any
reason you risk having the database corrupted (or at best losing some
committed transactions).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
PFC,

On 1/31/06 3:11 PM, "PFC" <lists@peufeu.com> wrote:

>> ... Prove it.
>>
>
> (I have a software RAID1 on this desktop machine)
>
> It's a lot faster than a single disk for random reads when more than 1
> thread hits the disk, because it distributes reads to both disks. Thus,
> applications start faster, and the machine is more reactive even when the
> disk is thrashing. Cron starting a "updatedb" is less painful. It's cool
> for desktop use (and of course it's more reliable).

Exactly - improved your random seeks.

> However large reads (dd-style) are just the same speed as 1 drive. I
> guess you'd need a humongous readahead in order to read from both disks.

Nope - won't help.

- Luke



Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Jim,

On 1/31/06 3:12 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

>> The alternating technique in mirroring might improve rotational latency for
>> random seeking - a trick that Tandem exploited, but it won't improve
>> bandwidth.
>
> Or just work in multiples of tracks, which would greatly reduce the
> impact of delays from seeking.

So, having rediscovered the facts underlying the age-old RAID10 versus RAID5
debate we're back to the earlier points.

RAID10 is/was the best option when latency / random seek was the predominant
problem to be solved, RAID5/50 is best where read bandwidth is needed.
Modern developments in fast CPUs for write checksumming have made RAID5/50 a
viable alternative to RAID10 even when there is moderate write / random seek
workloads and fast read is needed.

>>
>> Works great with standard OS write caching.
>
> Well, the only problem with that is if the machine crashes for any
> reason you risk having the database corrupted (or at best losing some
> committed transactions).

So, do you routinely turn off Linux write caching?  If not, then there's no
difference.

- Luke



Re: Huge Data sets, simple queries

From
"Steinar H. Gunderson"
Date:
On Tue, Jan 31, 2006 at 12:47:10PM -0800, Luke Lonergan wrote:
>> Linux does balanced reads on software
>> mirrors.  I'm not sure why you think this can't improve bandwidth.  It
>> does improve streaming bandwidth as long as the platter STR is more than
>> the bus STR.
> ... Prove it.

FWIW, this is on Ultra160 disks (Seagate 10000rpm) on a dual Opteron running
Linux 2.6.14.3:

cassarossa:~# grep md1 /proc/mdstat
md1 : active raid1 sdf6[1] sda6[0]
cassarossa:~# dd if=/dev/sda6 of=/dev/null bs=8k count=400000
[system at about 35% wait for I/O and 15% system, according to top]
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 54,488154 seconds (60137842 bytes/sec)
[system at about 45% wait for I/O and 7% system -- whoa?]
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 52,523771 seconds (62386990 bytes/sec)

I'm not sure if it _refutes_ the assertion that the Linux RAID-1 driver can
do balancing of sequential reads, but it certainly doesn't present very much
evidence in that direction. BTW, sda and sdf are on different channels of a
dual-channel (onboard, connected via PCI-X) Adaptec board, so I doubt the bus
is the limiting factor.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Steinar,

On 1/31/06 5:26 PM, "Steinar H. Gunderson" <sgunderson@bigfoot.com> wrote:

> cassarossa:~# grep md1 /proc/mdstat
> md1 : active raid1 sdf6[1] sda6[0]
> cassarossa:~# dd if=/dev/sda6 of=/dev/null bs=8k count=400000
> [system at about 35% wait for I/O and 15% system, according to top]
> 400000+0 records in
> 400000+0 records out
> 3276800000 bytes transferred in 54,488154 seconds (60137842 bytes/sec)
> [system at about 45% wait for I/O and 7% system -- whoa?]
> 400000+0 records in
> 400000+0 records out
> 3276800000 bytes transferred in 52,523771 seconds (62386990 bytes/sec)
>
> I'm not sure if it _refutes_ the assertion that the Linux RAID-1 driver can
> do balancing of sequential reads, but it certainly doesn't present very much
> evidence in that direction. BTW, sda and sdf are on different channels of a
> dual-channel (onboard, connected via PCI-X) Adaptec board, so I doubt the bus
> is the limiting factor.

Yep - 2MB/s is noise.  Run a RAID0, you should get 120MB/s.

Incidentally, before this thread took a turn to RAID10 vs. RAID5, the
question of HW RAID adapter versus SW RAID was the focus.  I routinely see
numbers like 20MB/s coming from HW RAID adapters on Linux, so it's nice to
see someone post a decent number using SW RAID.

We're very happy with the 3Ware HW RAID adapters, but so far they're the
only ones (I have two Arecas but I mistakenly ordered PCI-E so I can't test
them :-(

- Luke



Re: Huge Data sets, simple queries

From
"Jim C. Nasby"
Date:
On Tue, Jan 31, 2006 at 03:19:38PM -0800, Luke Lonergan wrote:
> > Well, the only problem with that is if the machine crashes for any
> > reason you risk having the database corrupted (or at best losing some
> > committed transactions).
>
> So, do you routinely turn off Linux write caching?  If not, then there's no
> difference.

My thought was about fsync on WAL; if you're doing much writing then
a good raid write cache with BBU will improve performance.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Huge Data sets, simple queries

From
"Jeffrey W. Baker"
Date:
On Tue, 2006-01-31 at 12:47 -0800, Luke Lonergan wrote:
> Jeffrey,
>
> On 1/31/06 12:03 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:
> > Linux does balanced reads on software
> > mirrors.  I'm not sure why you think this can't improve bandwidth.  It
> > does improve streaming bandwidth as long as the platter STR is more than
> > the bus STR.
>
> ... Prove it.

It's clear that Linux software RAID1, and by extension RAID10, does
balanced reads, and that these balanced reads double the bandwidth.  A
quick glance at the kernel source code, and a trivial test, proves the
point.

In this test, sdf and sdg are Seagate 15k.3 disks on a single channel of
an Adaptec 39320, but the enclosure, and therefore the bus, is capable
of only Ultra160 operation.

# grep md0 /proc/mdstat
md0 : active raid1 sdf1[0] sdg1[1]

# dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=0      &
  dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=400000
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 48.243362 seconds (67922298 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 48.375897 seconds (67736211 bytes/sec)

That's 136MB/sec, for those following along at home.  With only two
disks in a RAID1, you can nearly max out the SCSI bus.

# dd if=/dev/sdf1 of=/dev/null bs=8k count=400000 skip=0      &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=400000 skip=400000
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 190.413286 seconds (17208883 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 192.096232 seconds (17058117 bytes/sec)

That, on the other hand, is only 34MB/sec.  With two threads, the RAID1
is 296% faster.

# dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=0       &
  dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=400000  &
  dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=800000  &
  dd if=/dev/md0 of=/dev/null bs=8k count=400000 skip=1200000 &
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 174.276585 seconds (18802296 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 181.581893 seconds (18045852 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 183.724243 seconds (17835425 bytes/sec)
400000+0 records in
400000+0 records out
3276800000 bytes transferred in 184.209018 seconds (17788489 bytes/sec)

That's 71MB/sec with 4 threads...

# dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=0       &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=400000  &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=800000  &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=100000 skip=1200000 &
100000+0 records in
100000+0 records out
819200000 bytes transferred in 77.489210 seconds (10571794 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 87.628000 seconds (9348610 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 88.912989 seconds (9213502 bytes/sec)
100000+0 records in
100000+0 records out
819200000 bytes transferred in 90.238705 seconds (9078144 bytes/sec)

Only 36MB/sec for the single disk.  96% advantage for the RAID1.

# dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=0 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=400000  &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=800000  &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=1200000 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=1600000 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2000000 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2400000 &
  dd if=/dev/md0 of=/dev/null bs=8k count=50000 skip=2800000 &
50000+0 records in
50000+0 records out
409600000 bytes transferred in 35.289648 seconds (11606803 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 42.653475 seconds (9602969 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 43.524714 seconds (9410745 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 45.151705 seconds (9071640 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 47.741845 seconds (8579476 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 48.600533 seconds (8427891 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 48.758726 seconds (8400548 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 49.679275 seconds (8244887 bytes/sec)

66MB/s with 8 threads.

# dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=0 &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=400000  &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=800000  &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=1200000 &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=1600000 &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=2000000 &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=2400000 &
  dd if=/dev/sdf1 of=/dev/null bs=8k count=50000 skip=2800000 &
50000+0 records in
50000+0 records out
409600000 bytes transferred in 73.873911 seconds (5544583 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 75.613093 seconds (5417051 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 79.988303 seconds (5120749 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 79.996440 seconds (5120228 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 84.885172 seconds (4825342 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 92.995892 seconds (4404496 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 99.180337 seconds (4129851 bytes/sec)
50000+0 records in
50000+0 records out
409600000 bytes transferred in 100.144752 seconds (4090080 bytes/sec)

33MB/s.  RAID1 gives a 100% advantage at 8 threads.

I think I've proved my point.  Software RAID1 read balancing provides
0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
respectively.  In the presence of random I/O, the results are even
better.

Anyone who thinks they have a single-threaded workload has not yet
encountered the autovacuum daemon.

-Jeff


Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Jeffrey,

On 1/31/06 8:09 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:
>> ... Prove it.
> I think I've proved my point.  Software RAID1 read balancing provides
> 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
> respectively.  In the presence of random I/O, the results are even
> better.
> Anyone who thinks they have a single-threaded workload has not yet
> encountered the autovacuum daemon.

Good data - interesting case.  I presume from your results that you had to
make the I/Os non-overlapping (the "skip" option to dd) in order to get the
concurrent access to work.  Why the particular choice of offset - 3.2GB in
this case?

So - the bandwidth doubles in specific circumstances under concurrent
workloads - not relevant to "Huge Data sets, simple queries", but possibly
helpful for certain kinds of OLTP applications.

- Luke



Re: Huge Data sets, simple queries

From
"Jeffrey W. Baker"
Date:
On Tue, 2006-01-31 at 21:53 -0800, Luke Lonergan wrote:
> Jeffrey,
>
> On 1/31/06 8:09 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:
> >> ... Prove it.
> > I think I've proved my point.  Software RAID1 read balancing provides
> > 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
> > respectively.  In the presence of random I/O, the results are even
> > better.
> > Anyone who thinks they have a single-threaded workload has not yet
> > encountered the autovacuum daemon.
>
> Good data - interesting case.  I presume from your results that you had to
> make the I/Os non-overlapping (the "skip" option to dd) in order to get the
> concurrent access to work.  Why the particular choice of offset - 3.2GB in
> this case?

No particular reason.  8k x 100000 is what the last guy used upthread.
>
> So - the bandwidth doubles in specific circumstances under concurrent
> workloads - not relevant to "Huge Data sets, simple queries", but possibly
> helpful for certain kinds of OLTP applications.

Ah, but someday Pg will be able to concurrently read from two
datastreams to complete a single query.  And that day will be glorious
and fine, and you'll want as much disk concurrency as you can get your
hands on.

-jwb


Re: Huge Data sets, simple queries

From
PFC
Date:
    I did a little test on soft raid1 :

    I have two 800 Mbytes files, say A and B. (RAM is 512Mbytes).

    Test 1 :
    1- Read A, then read B :
        19 seconds per file

    2- Read A and B simultaneously using two threads :
        22 seconds total (reads were paralleled by the RAID)

    3- Read one block of A, then one block of B, then one block of A, etc.
Essentially this is the same as the threaded case, except there's only one
thread.
        53 seconds total (with heavy seeking noise from the hdd).

    I was half expecting 3 to take the same as 2. It simulates, for instance,
scanning a table and its index, or scanning 2 sort bins. Well, maybe one
day...

    It would be nice if the Kernel had an API for applications to tell it
"I'm gonna need these blocks in the next seconds, can you read them in the
order you like (fastest), from whatever disk you like, and cache them for
me please; so that I can read them in the order I like, but very fast ?"


On Wed, 01 Feb 2006 09:25:13 +0100, Jeffrey W. Baker <jwbaker@acm.org>
wrote:

> On Tue, 2006-01-31 at 21:53 -0800, Luke Lonergan wrote:
>> Jeffrey,
>>
>> On 1/31/06 8:09 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:
>> >> ... Prove it.
>> > I think I've proved my point.  Software RAID1 read balancing provides
>> > 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
>> > respectively.  In the presence of random I/O, the results are even
>> > better.
>> > Anyone who thinks they have a single-threaded workload has not yet
>> > encountered the autovacuum daemon.
>>
>> Good data - interesting case.  I presume from your results that you had
>> to
>> make the I/Os non-overlapping (the "skip" option to dd) in order to get
>> the
>> concurrent access to work.  Why the particular choice of offset - 3.2GB
>> in
>> this case?
>
> No particular reason.  8k x 100000 is what the last guy used upthread.
>>
>> So - the bandwidth doubles in specific circumstances under concurrent
>> workloads - not relevant to "Huge Data sets, simple queries", but
>> possibly
>> helpful for certain kinds of OLTP applications.
>
> Ah, but someday Pg will be able to concurrently read from two
> datastreams to complete a single query.  And that day will be glorious
> and fine, and you'll want as much disk concurrency as you can get your
> hands on.
>
> -jwb



Re: Huge Data sets, simple queries

From
Michael Stone
Date:
On Tue, Jan 31, 2006 at 08:09:40PM -0800, Jeffrey W. Baker wrote:
>I think I've proved my point.  Software RAID1 read balancing provides
>0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
>respectively.  In the presence of random I/O, the results are even
>better.

Umm, the point *was* about single stream performance. I guess you did a
good job of proving it.

>Anyone who thinks they have a single-threaded workload has not yet
>encountered the autovacuum daemon.

On tables where my single stream performance matters you'd better
believe that the autovacuum daemon isn't running.

Mike Stone

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
PFC,


On 2/1/06 1:01 AM, "PFC" <lists@peufeu.com> wrote:

>         3- Read one block of A, then one block of B, then one block of A, etc.
> Essentially this is the same as the threaded case, except there's only one
> thread.
>                 53 seconds total (with heavy seeking noise from the hdd).
>
>         I was half expecting 3 to take the same as 2. It simulates, for
> instance,
> scanning a table and its index, or scanning 2 sort bins. Well, maybe one
> day...

This is actually interesting overall - I think what this might be showing is
that the Linux SW RAID1 is alternating I/Os to the mirror disks from
different processes (LWP or HWP both maybe?), but not within one process.

>         It would be nice if the Kernel had an API for applications to tell it
> "I'm gonna need these blocks in the next seconds, can you read them in the
> order you like (fastest), from whatever disk you like, and cache them for
> me please; so that I can read them in the order I like, but very fast ?"

More control is always good IMO, but for now there's I/O reordering in the
SCSI layer and readahead tuning.  There is POSIX fadvise() also to tell the
underlying I/O layer what the access pattern looks like.

- Luke



Re: Huge Data sets, simple queries

From
"Steinar H. Gunderson"
Date:
On Wed, Feb 01, 2006 at 09:42:12AM -0800, Luke Lonergan wrote:
> This is actually interesting overall - I think what this might be showing is
> that the Linux SW RAID1 is alternating I/Os to the mirror disks from
> different processes (LWP or HWP both maybe?), but not within one process.

Having read the code, I'm fairly certain it doesn't really care what process
anything is coming from.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Huge Data sets, simple queries

From
Mike Rylander
Date:
On 2/1/06, Luke Lonergan <llonergan@greenplum.com> wrote:
[snip]
> This is actually interesting overall - I think what this might be showing is
> that the Linux SW RAID1 is alternating I/Os to the mirror disks from
> different processes (LWP or HWP both maybe?), but not within one process.

I can confirm this behavior after looking at my multipathed fibre
channel SAN.  To the best of my knowledge, the multipathing code uses
the same underlying I/O code as the Linux SW RAID logic.

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Re: Huge Data sets, simple queries

From
"Luke Lonergan"
Date:
Jeffrey,

On 2/1/06 12:25 AM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:

> Ah, but someday Pg will be able to concurrently read from two
> datastreams to complete a single query.  And that day will be glorious
> and fine, and you'll want as much disk concurrency as you can get your
> hands on.

Well - so happens that we have one of those multi-processing postgres'
handy, so we'll test this theory out in the next couple of days.  We've a
customer who ordered 3 machines with 6 drives each (Dell 2850s) on two U320
SCSI busses, and we're going to try configuring them all in a single RAID10
and run two Bizgres MPP segments on that (along with two mirrors).

We'll try the RAID10 config and if we get full parallelism, we'll use it (if
the customer likes it).  Otherwise, we'll use two 3-disk RAID5 sets.

I'll post the results here.

Thanks Jeffrey,

- Luke



Re: Huge Data sets, simple queries

From
Alan Stange
Date:
Jeffrey W. Baker wrote:
> On Tue, 2006-01-31 at 09:00 -0800, Luke Lonergan wrote:
>
>> Jim,
>>
>> On 1/30/06 12:25 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:
>>
>>
>>> Why divide by 2? A good raid controller should be able to send read
>>> requests to both drives out of the mirrored set to fully utilize the
>>> bandwidth. Of course, that probably won't come into play unless the OS
>>> decides that it's going to read-ahead fairly large chunks of the table
>>> at a time...
>>>
>> I've not seen one that does, nor would it work in the general case IMO.  In
>> RAID1 writes are duplicated and reads come from one of the copies.  You
>> could alternate read service requests to minimize rotational latency, but
>> you can't improve bandwidth.
>>
>
> Then you've not seen Linux.  Linux does balanced reads on software
> mirrors.  I'm not sure why you think this can't improve bandwidth.  It
> does improve streaming bandwidth as long as the platter STR is more than
> the bus STR.
>
FYI:   so does the Solaris Volume Manager (by default) on Solaris.   One
can choose alternate access methods like  "First" (if the other mirrors
are slower than the first) or "Geometric".  It's been doing this for a
good 10 years now (back when it was called DiskSuite), so it's nothing new.

-- Alan