Re: Huge Data sets, simple queries - Mailing list pgsql-performance

From PFC
Subject Re: Huge Data sets, simple queries
Msg-id op.s4ad01lhcigqcu@apollo13
In response to Re: Huge Data sets, simple queries  ("Jeffrey W. Baker" <jwbaker@acm.org>)
Responses Re: Huge Data sets, simple queries  ("Luke Lonergan" <llonergan@greenplum.com>)
List pgsql-performance
    I did a little test on software RAID1:

    I have two 800 MB files, say A and B. (RAM is 512 MB, so neither file
fits in the cache.)

    Test 1 :
    1- Read A, then read B :
        19 seconds per file

    2- Read A and B simultaneously using two threads :
        22 seconds total (the RAID parallelized the reads)

    3- Read one block of A, then one block of B, then one block of A, etc.
Essentially this is the same as the threaded case, except there's only one
thread.
        53 seconds total (with heavy seeking noise from the hdd).
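
    For anyone who wants to reproduce the three cases, here is a minimal
sketch in Python. It is not the program used for the timings above: the
function names and the 8 kB block size are assumptions made for
illustration, and the original test does not state which block size was
used.

```python
import threading
import time

BLOCK_SIZE = 8192  # assumed; the original test does not give a block size


def read_whole(path, block_size=BLOCK_SIZE):
    """Read one file front to back and return the number of bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def read_sequential(paths, block_size=BLOCK_SIZE):
    """Case 1: read each file completely, one after the other."""
    return sum(read_whole(p, block_size) for p in paths)


def read_threaded(paths, block_size=BLOCK_SIZE):
    """Case 2: one thread per file, all running at the same time."""
    totals = [0] * len(paths)

    def worker(i, path):
        totals[i] = read_whole(path, block_size)

    threads = [
        threading.Thread(target=worker, args=(i, p))
        for i, p in enumerate(paths)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(totals)


def read_interleaved(paths, block_size=BLOCK_SIZE):
    """Case 3: a single thread taking one block from each file in turn."""
    total = 0
    files = [open(p, "rb", buffering=0) for p in paths]
    try:
        active = list(files)
        while active:
            still_open = []
            for f in active:
                chunk = f.read(block_size)
                if chunk:
                    total += len(chunk)
                    still_open.append(f)
            active = still_open
    finally:
        for f in files:
            f.close()
    return total


def timed(fn, paths):
    """Run fn(paths) and return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    nbytes = fn(paths)
    return nbytes, time.perf_counter() - start
```

    Something like timed(read_interleaved, ["A", "B"]) gives the figure for
case 3. The numbers only mean something if the files are larger than RAM
(as here) or the page cache is emptied between runs; otherwise the second
run is served from memory.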

    I was half expecting case 3 to take the same time as case 2. It
simulates, for instance, scanning a table and its index, or scanning 2 sort
bins. Well, maybe one day...

    It would be nice if the kernel had an API for applications to tell it
"I'm gonna need these blocks in the next few seconds, can you read them in
the order you like (fastest), from whatever disk you like, and cache them
for me please; so that I can read them in the order I like, but very fast ?"
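
    The closest existing thing I know of is posix_fadvise() with
POSIX_FADV_WILLNEED, which asks the kernel to start pulling a byte range
into the page cache without blocking the caller. It is only advisory: the
kernel may ignore it, and it promises nothing about read order or which
mirror gets used. A minimal sketch using Python's os.posix_fadvise (Unix
only); the helper name prefetch_hint is made up for illustration:

```python
import os


def prefetch_hint(path, offset=0, length=0):
    """Advise the kernel that a byte range of `path` will be needed soon.

    length=0 means "from offset to the end of the file".
    Returns True if the hint was issued, False if this platform does not
    provide os.posix_fadvise. The hint is advisory only.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
    return True
```

    An application would call prefetch_hint() on each file (or block range)
it expects to touch, then read them in whatever order it likes. Whether
that actually brings case 3 close to case 2 would have to be measured.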


On Wed, 01 Feb 2006 09:25:13 +0100, Jeffrey W. Baker <jwbaker@acm.org>
wrote:

> On Tue, 2006-01-31 at 21:53 -0800, Luke Lonergan wrote:
>> Jeffrey,
>>
>> On 1/31/06 8:09 PM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:
>> >> ... Prove it.
>> > I think I've proved my point.  Software RAID1 read balancing provides
>> > 0%, 300%, 100%, and 100% speedup on 1, 2, 4, and 8 threads,
>> > respectively.  In the presence of random I/O, the results are even
>> > better.
>> > Anyone who thinks they have a single-threaded workload has not yet
>> > encountered the autovacuum daemon.
>>
>> Good data - interesting case.  I presume from your results that you had to
>> make the I/Os non-overlapping (the "skip" option to dd) in order to get the
>> concurrent access to work.  Why the particular choice of offset - 3.2GB in
>> this case?
>
> No particular reason.  8k x 100000 is what the last guy used upthread.
>>
>> So - the bandwidth doubles in specific circumstances under concurrent
>> workloads - not relevant to "Huge Data sets, simple queries", but possibly
>> helpful for certain kinds of OLTP applications.
>
> Ah, but someday Pg will be able to concurrently read from two
> datastreams to complete a single query.  And that day will be glorious
> and fine, and you'll want as much disk concurrency as you can get your
> hands on.
>
> -jwb


