Thread: Re: [HACKERS] sort on huge table

Re: [HACKERS] sort on huge table

From: Mike Mascari
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > RedHat Linux 6.0 (kernel 2.2.5-smp)
> > Pentium III 500MHz x 2
> > RAM: 512MB
> > Disk: Ultra Wide SCSI 9GB x 4 + Hardware RAID (RAID 5).
> 
> OK, no problem with inadequate hardware anyway ;-).  Bruce's concern
> about simplistic read-ahead algorithm in Linux may apply though.
> 
> > Also, I could provide testing scripts to reproduce my tests.
> 
> Please.  That would be very handy so that we can make sure we are all
> comparing the same thing.  I assume the scripts can be tweaked to vary
> the amount of disk space used?  I can't scare up more than a couple
> hundred meg at the moment.  (The natural state of a disk drive is
> "full" ...)
> 
> > I think it depends on the disk space available. Ideally it should be
> > able to choose the sort algorithm.
> 
> I was hoping to avoid that, because of the extra difficulty of testing
> and maintenance.  But it may be the only answer.
> 
>             regards, tom lane

I know this is a VERY long shot, but... what were the READ/WRITE ratios
between the old version and the new version? Perhaps the computation
of the parity blocks under RAID5 caused the unexpected behavior.
With RAID 5 increasing read performance but decreasing write performance,
one might expect a new algorithm which, say, halves reads but slightly
increases writes not to realize the same benefits as it would under a
normal disk system or a RAID 1 (or, better yet, a RAID 0+1) array.

Like I said...a VERY long shot theory.

Mike Mascari
(mascarim@yahoo.com)
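
A back-of-the-envelope sketch (Python) of the effect Mike is proposing; the
four-disk-operations-per-write cost and the read/write counts below are
assumptions for illustration, not measurements from this thread:

# Illustrative I/O cost model only, not a model of any real controller.
# A RAID 5 small write typically needs a read-modify-write cycle: read
# old data, read old parity, write new data, write new parity.

def single_disk_ops(reads, writes):
    return reads + writes

def raid5_ops(reads, writes):
    return reads + 4 * writes       # each logical write costs ~4 disk ops

old = (1000, 1000)                  # logical reads/writes, old sort code
new = (500, 1200)                   # halves reads, 20% more writes (assumed)

for label, cost in (("single disk", single_disk_ops), ("RAID 5", raid5_ops)):
    print(f"{label}: old={cost(*old)} new={cost(*new)} disk ops")
# single disk: old=2000 new=1700  -> the new algorithm wins
# RAID 5:      old=5000 new=5300  -> the extra writes eat the saving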

Re: [HACKERS] sort on huge table

From: Don Baccus
At 06:24 PM 11/1/99 -0800, Mike Mascari wrote:

>I know this is a VERY long shot, but... what were the READ/WRITE ratios
>between the old version and the new version? Perhaps the computation
>of the parity blocks under RAID5 caused the unexpected behavior.
>With RAID 5 increasing read performance but decreasing write performance,
>one might expect a new algorithm which, say, halves reads but slightly
>increases writes not to realize the same benefits as it would under a
>normal disk system or a RAID 1 (or, better yet, a RAID 0+1) array.

RAID 5, not the operating system, might be getting in the way...it
would be interesting to test this on a Linux 2.2 kernel without
the RAID 5 complication.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
  Service and other goodies at http://donb.photo.net.


Re: [HACKERS] sort on huge table

From: Tom Lane
> At 06:24 PM 11/1/99 -0800, Mike Mascari wrote:
>> I know this is a VERY long shot, but... what were the READ/WRITE ratios
>> between the old version and the new version? Perhaps the computation
>> of the parity blocks under RAID5 caused the unexpected behavior.

Good try but no cigar --- we're dealing with a merge algorithm here,
and it's inherently the same amount of data in and out.  You write
a block once, you read the same block once later on.  But...

Don Baccus <dhogaza@pacifier.com> writes:
> RAID 5, not the operating system, might be getting in the way...it
> would be interesting to test this on a Linux 2.2 kernel without
> the RAID 5 complication.

... I agree this'd be worth trying.  There could be some subtle effect
somewhere in RAID5 that's tripping things up.  It'd also be useful if
someone could try it on similar RAID hardware with a non-Linux kernel.
        regards, tom lane
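
To make Tom's counting argument concrete, here is a minimal sketch (Python,
illustrative numbers) of the I/O accounting for an external merge sort: every
block written to a temp file is read back exactly once per merge pass, so
reads and writes stay balanced no matter how the pass structure changes:

import math

def merge_io(total_blocks, run_blocks, fan_in):
    """Blocks read/written over run formation plus all merge passes."""
    runs = math.ceil(total_blocks / run_blocks)
    passes = math.ceil(math.log(runs, fan_in)) if runs > 1 else 0
    reads = writes = total_blocks      # run formation: one read, one write
    reads += passes * total_blocks     # each pass reads every block once
    writes += passes * total_blocks    # ... and writes every block once
    return reads, writes

r, w = merge_io(total_blocks=300_000, run_blocks=10_000, fan_in=6)
print(r, w, r / w)   # reads == writes -> ratio 1.0, whatever the pass count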


Re: [HACKERS] sort on huge table

From: Tatsuo Ishii
>> At 06:24 PM 11/1/99 -0800, Mike Mascari wrote:
>>> I know this is a VERY long shot, but... what were the READ/WRITE ratios
>>> between the old version and the new version? Perhaps the computation
>>> of the parity blocks under RAID5 caused the unexpected behavior.
>
>Good try but no cigar --- we're dealing with a merge algorithm here,
>and it's inherently the same amount of data in and out.  You write
>a block once, you read the same block once later on.  But...
>
>Don Baccus <dhogaza@pacifier.com> writes:
>> RAID 5, not the operating system, might be getting in the way...it
>> would be interesting to test this on a Linux 2.2 kernel without
>> the RAID 5 complication.
>
>... I agree this'd be worth trying.  There could be some subtle effect
>somewhere in RAID5 that's tripping things up.  It'd also be useful if
>someone could try it on similar RAID hardware with a non-Linux kernel.

I have compared current with 6.5 using a 1000000-tuple table (243MB) (I
wanted to try a 2GB+ table, but 6.5 does not work in this case). The
result was strange in that current is *faster* than 6.5!

RAID5
    current    2:29
    6.5.2      3:15

non-RAID
    current    1:50
    6.5.2      2:13

Seems my previous testing was done in the wrong way, or the behavior of
sorting might be different when the table size is changed?

Anyway, here is my test script.
First, edit the Makefile to set DB and the number of tuples. Then type
make. That's all.
--
Tatsuo Ishii
--------------------------------------------------------------------
[Attachment: sort.tar.gz (uuencoded tarball, omitted)]
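
The script itself is inside the tarball; as a rough idea of what such a
harness does, here is a hypothetical sketch (Python). The database name,
table layout, and file paths are invented for illustration, and this is not
the attached script:

import random
import subprocess
import time

DB = "sorttest"             # assumed database name
TUPLES = 1_000_000          # ~243MB at this row count, per the figures above

def psql(sql):
    subprocess.run(["psql", "-d", DB, "-c", sql],
                   check=True, stdout=subprocess.DEVNULL)

psql("CREATE TABLE t (k int4, filler text);")
with open("/tmp/t.dat", "w") as f:
    for _ in range(TUPLES):
        f.write(f"{random.randint(0, 10**9)}\tabcdefghijklmnop\n")
psql("COPY t FROM '/tmp/t.dat';")     # server-side COPY: needs superuser

start = time.time()
psql("SELECT * FROM t ORDER BY k;")   # large enough to force an external sort
print(f"sort took {time.time() - start:.1f}s")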


Re: [HACKERS] sort on huge table

From: Tom Lane
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I have compared current with 6.5 using a 1000000-tuple table (243MB) (I
> wanted to try a 2GB+ table, but 6.5 does not work in this case). The
> result was strange in that current is *faster* than 6.5!

> RAID5
>     current    2:29
>     6.5.2    3:15

> non-RAID
>     current    1:50
>     6.5.2    2:13

> Seems my previous testing was done in the wrong way, or the behavior of
> sorting might be different when the table size is changed?

Well, I feel better now, anyway ;-).  I thought that my first cut
ought to have been about the same speed as 6.5, and after I added
the code to slurp up multiple tuples in sequence, it should've been
faster than 6.5.  The above numbers seem to be in line with that
theory.  Next question: is there some additional effect that comes
into play once the table size gets really huge?  I am thinking maybe
there's some glitch affecting performance once the temp file size
goes past one segment (1Gb).  Tatsuo, can you try sorts of say
0.9 and 1.1 Gb to see if something bad happens at 1Gb?  I could
try rebuilding here with a small RELSEG_SIZE, but right at the
moment I'm not certain I'd see the same behavior you do...
        regards, tom lane
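
For readers unfamiliar with the segmenting Tom mentions: large relation and
temp files are split into RELSEG_SIZE-block segments (file, file.1, file.2,
...), 1GB each with the usual settings. A sketch of the arithmetic, assuming
8KB blocks and a hypothetical file name:

BLCKSZ = 8192
RELSEG_SIZE = 131072            # 131072 blocks * 8KB = 1GB per segment

def segment_for(block):
    """Map a logical block number to (segment file name, byte offset)."""
    seg = block // RELSEG_SIZE
    offset = (block % RELSEG_SIZE) * BLCKSZ
    return ("tempfile" if seg == 0 else f"tempfile.{seg}"), offset

print(segment_for(130000))      # ('tempfile', ...)   below the 1GB boundary
print(segment_for(140000))      # ('tempfile.1', ...) past the 1GB boundary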


Re: [HACKERS] sort on huge table

From: Hannu Krosing
Tatsuo Ishii wrote:
> 
> >... I agree this'd be worth trying.  There could be some subtle effect
> >somewhere in RAID5 that's tripping things up.  It'd also be useful if
> >someone could try it on similar RAID hardware with a non-Linux kernel.
> 
> I have compared current with 6.5 using a 1000000-tuple table (243MB) (I
> wanted to try a 2GB+ table, but 6.5 does not work in this case). The
> result was strange in that current is *faster* than 6.5!
> 
> RAID5
>         current 2:29
>         6.5.2   3:15
> 
> non-RAID
>         current 1:50
>         6.5.2   2:13
> 
> Seems my previous testing was done in the wrong way, or the behavior of
> sorting might be different when the table size is changed?

Or the behaviour of RAID5 changes at some size. 

I have set up an IBM Netfinity server with specs similar to yours, except 
that it has 1G memory and 5x9GB disks. The RAID controller is IBM ServeRAID.

It seems that when I try to write over 60 MB sequentially, the write 
performance drops from over 50MB/s to under 2MB/s.

Maybe such behaviour suggests that building an index and traversing it
could be faster than a full sort?

The same tests on my single Celeron 450 produced ~10MB/s writes 
whatever the size.
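
A minimal sketch (Python) of the kind of sequential-write test being
described; the path and sizes are assumptions, and the fsync matters,
since without it you mostly measure the page cache:

import os
import time

def write_mb(path, mb, chunk=1 << 20):
    """Write mb megabytes sequentially, fsync, and return MB/s."""
    buf = b"\0" * chunk
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(mb):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())    # force data to disk, not just the cache
    elapsed = time.time() - start
    os.unlink(path)
    return mb / elapsed

for size in (30, 60, 120, 240):  # bracket the reported 60MB knee
    print(f"{size}MB: {write_mb('/tmp/seqtest', size):.1f} MB/s")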

> Anyway, here is my test script.
> First, edit Makefile to set DB and number of tuples. Then run type
> make. That's all.

I'll try to run it tonight (in GMT+2 tz). 

Can't run it earlier, as it is a production site and a highly
visible web server.

If I have the time I'll even try my index theory.

--------
Hannu


Re: [HACKERS] sort on huge table

From: Tatsuo Ishii
>
>Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I have compared current with 6.5 using a 1000000-tuple table (243MB) (I
>> wanted to try a 2GB+ table, but 6.5 does not work in this case). The
>> result was strange in that current is *faster* than 6.5!
>
>> RAID5
>>     current    2:29
>>     6.5.2    3:15
>
>> non-RAID
>>     current    1:50
>>     6.5.2    2:13
>
>> Seems my previous testing was done in the wrong way, or the behavior of
>> sorting might be different when the table size is changed?
>
>Well, I feel better now, anyway ;-).  I thought that my first cut
>ought to have been about the same speed as 6.5, and after I added
>the code to slurp up multiple tuples in sequence, it should've been
>faster than 6.5.  The above numbers seem to be in line with that
>theory.  Next question: is there some additional effect that comes
>into play once the table size gets really huge?  I am thinking maybe
>there's some glitch affecting performance once the temp file size
>goes past one segment (1Gb).  Tatsuo, can you try sorts of say
>0.9 and 1.1 Gb to see if something bad happens at 1Gb?  I could
>try rebuilding here with a small RELSEG_SIZE, but right at the
>moment I'm not certain I'd see the same behavior you do...

Ok. I have run some tests with various amounts of data.

RedHat Linux 6.0
Kernel 2.2.5-smp
512MB RAM
Sort mem: 80MB
RAID5

1000000 tuples    1:31
2000000           4:24
3000000           7:27
4000000          11:11  <-- 970MB
5000000          14:01  <-- 1.1GB (segmented files)
6000000          18:31
7000000          22:24
8000000          24:36
9000000          28:12
10000000         32:14

I didn't see any bad thing at 1.1GB (5000000 tuples).
--
Tatsuo Ishii
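
Taking the figures above at roughly 243MB per million tuples, a quick
consistency check (Python, times transcribed from the table) shows the
cost per million tuples growing gradually, with no jump where the temp
file crosses the 1GB segment boundary around 5000000 tuples:

times = {1: "1:31", 2: "4:24", 3: "7:27", 4: "11:11", 5: "14:01",
         6: "18:31", 7: "22:24", 8: "24:36", 9: "28:12", 10: "32:14"}

def seconds(mmss):
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

for millions, t in times.items():
    print(f"{millions}M tuples (~{millions * 243}MB): "
          f"{seconds(t) / millions:.0f}s per million")
# Output rises from 91s toward ~190s per million tuples, consistent with
# more merge passes at larger sizes and no discontinuity at the 1GB mark.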