Thread: Re: [PATCHES] Try again: S_LOCK reduced contention
Once upon a time Bruce Momjian wrote: > OK, here is my argument for inlining tas(). > > This output is for startup, a single query returning a single row result > on an indexed column, and shutdown. Guess who is at the top of the > list, tas(). Mcount() is a gprof internal counter, so it doesn't > "count". tas() is showing 0.01 cummulative seconds. Psql shows > wallclock time for the query at 0.13 seconds. That 0.01 is pretty > significant. ... > % cumulative self self total > time seconds seconds calls ms/call ms/call name > 20.0 0.02 0.02 mcount (463) > 10.0 0.03 0.01 5288 0.00 0.00 _tas [31] As promised, I did a little testing to see what part of this overhead is measurement error due to the nature of gprof and to see what the real overhead of various spinlock implementations are. Here is what I learned. Section 1. Summary 1.1. Spinlocks are pretty cheap. Spinlocks are relatively cheap operations. The worst implementation I tested took 0.34 microseconds for a spinlock roundtrip (take and release). The current (in CVS as of May 98) code takes 0.28 microseconds. The best hand tuned asm implementation took only 0.14 microseconds. This is fast enough that I had to use a huge iteration count (100,000,000) to get reasonably large run times. Table 1.1 Overheads of spinlock roundtrip in microseconds Test Case Time (usec) Notes Original 0.14 S_LOCK in 6.3.2 (no backoff, asm) InlineTas 0.15 Patch to be submitted (backoff, _inline_) TasFunction2 0.20 Refined S_LOCK patch TAS as function. MacroSLOCK 0.28 May 98 S_LOCK patch as in CVS 1.2. gprof doesn't work for small quick functions. gprof introduces severe experimental error for small quick functions. According to the gprof profile done by Bruce, 5288 tas calls took 0.1 second. That would require the spinlock roundtrips to take almost 19 microseconds each, not 0.28 microseconds. So in reality the 5288 calls took only 0.0015 seconds, not 0.1 seconds. So gprof has overestimated the cost of the spinlock by 68 to 1. Perhaps the spinlock function is so short and quick compared to the mcount overhead added to the function prolog that the overhead dominates the measurement. gprof remains a useful tool for larger functions with longer runtimes, but must be considered very suspect for tiny functions. 1.3 Function calls are pretty cheap or Macros may not save all that much. The difference between the current (late May) macro version and the same code removed to a separate function and called with three arguments was only 0.06 microseconds. That is 60 nanoseconds for the argument passing, the call, and the return. I suspect that on the x86 "architecture" the limited number of registers means that inline code has to save results to memory as it goes along which may offset to some extent the overhead of the register saves for a function call. 1.4 There are mysteries. Beware. In some of the test cases there was significant timing variation from run to run even though the test conditions were apparently identical. Even more strangely, the variation in time was not random but appeared to represent two different modes. And, the variation was itself repeatable. Here are the raw times in CPU seconds from two different experiments each run six consecutive times: case 1: 49.81, 49.43, 40.68, 49.51, 40.68, 40.69 clusters about 40.7 and 49.5 seconds case 2: 39.34, 29.09, 28.65, 40.34, 28.64, 28.64 clusters about 28.9 and 39.6 seconds Note that the testrun times have a bimodal distribution with one group of fast runs clustered tightly about one time and then a much slower group clustered tightly about the second time. The difference between groups is huge (about 25%) while the diffence within a group is very small (probably within the measurement error. I have no explanation for this variation. Possibly it is some interaction of where the program is loaded and the state of the memory heirarchy, but even this is hard to sustain. I would be very curious to hear of any plausible explainations. 1.5 Timing very small functions in isolation with large counts is effective. Notwithstanding the strange behavior in section 1.4, it is possible to time differences in functions that amount to the addition or deletion of one or two instructions. For example, the TasFunction and TasFunction2 cases below are almost but not quite identical yet show noticably different runtimes. 1.6. New patch to follow. The current S_LOCK and TAS() implementations (my patch of late May) are slower than they need to be and cause more code bloat than they need to. The bloat is caused by using a macro to inline a relatively complex bit of code that is only used in the blocked lock case. I suspect the slowness is caused at least partly by the macro as it requires more registers. I have developed a new patch that separates out the lock available case from the busywaiting case and that uses the GCC _inline_ facilty to make the asm interface still look as clean as a function while not costing anything. For a preview, see Section 2. Test Procedure My test takes and releases a spinlock 100,000,000 times and measures the total CPU time. I ran this test with many variations of spinlock implementation. I also ran a baseline test that has just the loop and call overheads with no actual spinlock so that we can separate out just the S_LOCK time. The test harness code appears below and the variant spinlock implementations, generated assembler output and raw result timings appear later in this message. Testing was done on "leslie" (P133 HX chipset 512K L2) running Linux 2.0.33. The system was up and running X but no other workloads. I avoided typing or touching the mouse during the test. Each variation was run three times and the results averaged. For some tests there was significant variation in times for the three iterations. In this case another set of three was run and the average of six runs used. Section 2.1 Test Harness Code /* * Test s_lock timing with variations */ typedef unsigned char slock_t; volatile slock_t the_lock; int main(void) { int i = 0; the_lock = 0; while (i < 100000000) { /* 100 million iterations */ i = tryit(&the_lock, &i); /* take and release spinlock */ } return i & 1; } /* * Take and release lock */ int tryit(volatile slock_t *lock, int *p) { int i; S_LOCK(lock); i = ++(*p); /* just to make it more realistic */ S_UNLOCK(lock); return *p; } Section 3.0 Test Case Summary Case Tag Per Iteration Comments ============= ============= ============================================ TZero: 0.14 usec compare lock to 0, do nothing TZeroNoCall 0.17 usec compare lock to 0, call s_lock if not TZeroCall 0.30 usec call function that compares lock to 0 TasFunction 0.45 usec lock spinlock in separate tas() function TasFunction2 0.37 usec improved separate tas function SlockAsmMacro 0.31 usec Inline xchgb in S_LOCK macro, call s_lock if needed. Note strange variation in recorded times. I have no explaination. Original 0.31 usec The original S_LOCK from 6.3.2 MacroSLOCK 0.45 usec Current in CVS as of 5/30/98. Tends to bloat. AllFunctions 0.51 usec Call s_lock() function. tas() as function InlineTas 0.32 usec Use function inlining. Patch to follow. Section 3.1 Test Cases and Raw Timing Results. Case Tag Per Iteration Comments ============= ============= ============================================ TZero: 0.14 usec compare lock to 0, do nothing #define TAS(lock) (*lock != 0) #define S_LOCK(lock) TAS(lock) tryit: pushl %ebp movl %esp,%ebp movl 8(%ebp),%eax movl 12(%ebp),%edx movb (%eax),%cl incl (%edx) movb $0,(%eax) movl (%edx),%eax leave ret CPU times: 14.32, 14.32, 14.32 Case Tag Per Iteration Comments ============= ============= ============================================ TZeroNoCall 0.17 usec compare lock to 0, call s_lock if not #define TAS(lock) (*lock != 0) #define S_LOCK(lock) if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else tryit: pushl %ebp movl %esp,%ebp pushl %esi pushl %ebx movl 8(%ebp),%ebx movl 12(%ebp),%esi movb (%ebx),%al testb %al,%al je .L14 pushl $141 pushl $.LC1 pushl %ebx call s_lock .L14: incl (%esi) movb $0,(%ebx) movl (%esi),%eax leal -8(%ebp),%esp popl %ebx popl %esi leave ret CPU times: 17.33, 17.35, 17.31 Case Tag Per Iteration Comments ============= ============= ============================================ TZeroCall 0.30 usec call function that compares lock to 0 #define TAS(lock) tas_test(lock) #define S_LOCK(lock) if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else int tas_test(volatile slock_t *lock) { return *lock == 0; } tas_test: pushl %ebp movl %esp,%ebp movl 8(%ebp),%eax movb (%eax),%al testb %al,%al setne %al andl $255,%eax leave ret tryit: pushl %ebp movl %esp,%ebp pushl %esi pushl %ebx movl 8(%ebp),%ebx movl 12(%ebp),%esi pushl %ebx call tas_test addl $4,%esp testl %eax,%eax je .L13 pushl $141 pushl $.LC1 pushl %ebx call s_lock .L13: incl (%esi) movb $0,(%ebx) movl (%esi),%eax leal -8(%ebp),%esp popl %ebx popl %esi leave ret CPU times: 30.16, 30.15, 30.14 Case Tag Per Iteration Comments ============= ============= ============================================ TasFunction 0.45 usec lock spinlock in separate tas() function #define TAS(lock) tas(lock) #define S_LOCK(lock) if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else int tas(volatile slock_t *lock) { slock_t _res = 1; __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock):"0"(0x1)); return _res != 0; } tas: pushl %ebp movl %esp,%ebp movl 8(%ebp),%eax movl $1,%edx #APP lock; xchgb %dl,(%eax) #NO_APP testb %dl,%dl setne %al andl $255,%eax leave ret CPU times: 46.13, 47.48, 45.01, 39.51, 45.79, 45.86 Case Tag Per Iteration Comments ============= ============= ============================================ TasFunction2 0.37 usec. improved separate tas function #define TAS(lock) tas2(lock) #define S_LOCK(lock) if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else int tas2(volatile slock_t *lock) { slock_t _res = 1; __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock):"0"(0x1)); return (int) _res; } tas2: pushl %ebp movl %esp,%ebp movl 8(%ebp),%edx movl $1,%eax #APP lock; xchgb %al,(%edx) #NO_APP andl $255,%eax leave ret CPU times: 37.67, 37.67, 37.68, 37.57, 37.12, 36.91 Case Tag Per Iteration Comments ============= ============= ============================================ SlockAsmMacro 0.31 usec Inline xchgb in S_LOCK macro, call s_lock if needed. Note strange variation in recorded times. I have no explaination. #define TAS(lock) tas2(lock) #define S_LOCK(lock) if (1) { \ slock_t _res = 1; \ __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock):"0"(0x1)); \ if (_res) \ s_lock(lock, __FILE__, __LINE__); \ } else tryit: pushl %ebp movl %esp,%ebp pushl %esi pushl %ebx movl 8(%ebp),%ebx movl 12(%ebp),%esi movl $1,%eax #APP lock; xchgb %al,(%ebx) #NO_APP testb %al,%al je .L14 pushl $141 pushl $.LC1 pushl %ebx call s_lock .L14: incl (%esi) movb $0,(%ebx) movl (%esi),%eax leal -8(%ebp),%esp popl %ebx popl %esi leave ret CPU times: 40.53, 30.14, 30.13, 40.44, 30.12, 40.50, 28.65, 28.63, 28.62 Case Tag Per Iteration Comments ============= ============= ============================================ Original 0.31 usec The original S_LOCK from 6.3.2 #define S_LOCK(lock) do { \ slock_t _res = 1; \ do { \ __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock):"0"(0x1)); \ } while (_res !=0); \ } while (0) tryit: pushl %ebp movl %esp,%ebp movl 8(%ebp),%edx movl 12(%ebp),%ecx .align 4 .align 4 .L15: movl $1,%eax #APP lock; xchgb %al,(%edx) #NO_APP testb %al,%al jne .L15 incl (%ecx) movb $0,(%edx) movl (%ecx),%eax leave ret CPU times: 28.55, 33.31, 31.40 Case Tag Per Iteration Comments ============= ============= ============================================ MacroSLOCK 0.45 usec Current in CVS as of 5/30/98. Tends to bloat. #define TAS(lock) tas(lock) #define S_LOCK(lock) if (1) { \ int spins = 0; \ while (TAS(lock)) { \ struct timeval delay; \ delay.tv_sec = 0; \ delay.tv_usec = s_spincycle[spins++ % S_NSPINCYCLE]; \ (void) select(0, NULL, NULL, NULL, &delay); \ if (spins > S_MAX_BUSY) { \ /* It's been well over a minute... */ \ s_lock_stuck(lock, __FILE__, __LINE__); \ } \ } \ } else tryit: pushl %ebp movl %esp,%ebp subl $8,%esp pushl %edi pushl %esi pushl %ebx movl 8(%ebp),%esi movl 12(%ebp),%edi xorl %ebx,%ebx .align 4 .L13: pushl %esi call tas addl $4,%esp testl %eax,%eax je .L18 movl $0,-8(%ebp) movl %ebx,%edx movl %ebx,%eax incl %ebx testl %edx,%edx jge .L16 leal 15(%edx),%eax .L16: andb $240,%al subl %eax,%edx movl %edx,%eax movl s_spincycle(,%eax,4),%eax movl %eax,-4(%ebp) leal -8(%ebp),%eax pushl %eax pushl $0 pushl $0 pushl $0 pushl $0 call select addl $20,%esp cmpl $16000,%ebx jle .L13 pushl $141 pushl $.LC1 pushl %esi call s_lock_stuck addl $12,%esp jmp .L13 .align 4 .L18: incl (%edi) movb $0,(%esi) movl (%edi),%eax leal -20(%ebp),%esp popl %ebx popl %esi popl %edi leave ret CPU times: 49.81, 49.43, 40.68, 49.51, 40.68, 40.69 Case Tag Per Iteration Comments ============= ============= ============================================ AllFunction 0.51 usec Call s_lock() function. tas() as function #define TAS(lock) tas2(lock) #define S_LOCK(lock) s_lock(lock, __FILE__, __LINE__) void s_lock(volatile slock_t *lock, char *file, int line) { int spins = 0; while (TAS(lock)) { struct timeval delay; delay.tv_sec = 0; delay.tv_usec = s_spincycle[spins++ % S_NSPINCYCLE]; (void) select(0, NULL, NULL, NULL, &delay); if (spins > S_MAX_BUSY) { /* It's been well over a minute... */ s_lock_stuck(lock, file, line); } } } s_lock: pushl %ebp movl %esp,%ebp subl $8,%esp pushl %edi pushl %esi pushl %ebx movl 8(%ebp),%esi movl 16(%ebp),%edi xorl %ebx,%ebx .align 4 .L5: pushl %esi call tas2 addl $4,%esp testl %eax,%eax je .L6 movl $0,-8(%ebp) movl %ebx,%edx movl %ebx,%eax incl %ebx testl %edx,%edx jge .L8 leal 15(%edx),%eax .L8: andb $240,%al subl %eax,%edx movl %edx,%eax movl s_spincycle(,%eax,4),%eax movl %eax,-4(%ebp) leal -8(%ebp),%eax pushl %eax pushl $0 pushl $0 pushl $0 pushl $0 call select addl $20,%esp cmpl $16000,%ebx jle .L5 pushl %edi pushl 12(%ebp) pushl %esi call s_lock_stuck addl $12,%esp jmp .L5 .align 4 .L6: leal -20(%ebp),%esp popl %ebx popl %esi popl %edi leave ret tryit: pushl %ebp movl %esp,%ebp pushl %esi pushl %ebx movl 8(%ebp),%esi movl 12(%ebp),%ebx pushl $141 pushl $.LC1 pushl %esi call s_lock incl (%ebx) movb $0,(%esi) movl (%ebx),%eax leal -8(%ebp),%esp popl %ebx popl %esi leave ret CPU times: 51.23, 51.23, 51.23 Case Tag Per Iteration Comments ============= ============= ============================================ InlineTas 0.32 Use function inlining. Patch to follow. #define TAS(lock) tas_i(lock) #define S_LOCK(lock) if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else static __inline__ int tas_i(volatile slock_t *lock) { slock_t _res = 1; __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock): "0"(_res)); return (int) _res; } tryit: pushl %ebp movl %esp,%ebp pushl %esi pushl %ebx movl 8(%ebp),%ebx movl 12(%ebp),%esi movb $1,%al #APP lock; xchgb %al,(%ebx) #NO_APP testb %al,%al je .L16 pushl $156 pushl $.LC1 pushl %ebx call s_lock .L16: incl (%esi) movb $0,(%ebx) movl (%esi),%eax leal -8(%ebp),%esp popl %ebx popl %esi leave ret CPU times: 39.34, 29.09, 28.65, 40.34, 28.64, 28.64 ---------------------------------------------------------------------------- Section 4.0 Test Suite begin 644 s_lock_test.tar.gz M'XL(`&B<?34``^T;:W/;.*Y?I5_!39N>G;J*9$MVDFXZD_.T.Y[+IITFW;N] MZXY'MIA875GR67*:3B?__0!2E*B'GW&2]E9H8TD@2((D`((@&?:]8/AG/Z)A MM/_DGH"8>L>RR!-"C"9_XAM_<M`):5NM3LNPVJT.)!I-4W]"K/MB2(99&-E3 M0IXX5XOIOHPH]1Z"H8>%4!I__-&&VZ_#T/6V:<X??[/#QK_9[K3;+;.)B:;9 M?$+T[;-2A+_X^*O[>RK9(SCT))RX/DH#F=#I)6#W537Z.J$.O20S/W2O?.J0 MX0AZ*^0R\TI5KP//CER/"A2)1I1)%*31FXA.?>+Z@)U^=:-:@7@/GPU&L3>I M0Q9\&]NN#Z2N4U>_J2@>B'3),=&!`+]%%1R%F"\C++7FDI\)R!J'.OFF*/M[ MB"!CU_/<`%@!AH"#P`^Q<4+\L&S.X'-1=(,\=^N\[%OV.Z71#-M"GA/CE7JK MJH\];MN"O/[CU[9MP!+]!\UO,OMO`H`1@,16JV55^O\0L%C]G[K^T)LYE/P< M1HX;:*/7,NIKN!^Y8XK8I8:"U],=V?X510T.X3<@$Q>J8Y4/[9!JD!ZX0QH2 M>TJ/D/[BY)R\?$T"GY+@\@@_^V>]TP9[N7AS?L'?NK^=\Y<FYCGOG[[K_D/. MQC']LS?_;(AWEB5^?_>A]TL#<Q8AINB=G?;.WB09WGX\Z[+^>0HM=J$2J+N& M+:TKR$6/OR>I/%=,D/(24^7(D#.>0MQ+4C/`C)%/JH(V$,<G9$8/,;'-2ZJ. M"94PFLZ&8'%A8*YM3W&H9W]E&?BK%EWW0SI,BDFQ,XX.^UC/\.O0H_]A-;YX M07:!N;/S][VS[N_=TS=_\(S<1I.0>G08U?0&.?MX>IK]?<X*KW-Z;`YOPFLH M[M>3?_7__O'\=\$VFNI>]+>0#"CUR1?J>22XIE-B@_'V9Q'5-(U`EW/:V&B% MT6SX9XW;ZW[_;>_T3;^/;SA8_7Y<[2W^XL\MH5Y("_V-HQ]WN!-P9L3TU)]2 M[&\#"T)I4`2!TN_;X;C?K^VPJ8[<#$=7`[*K-W:-G2.R<_S?G1KFK3?@?;Q3 M8]-<_6A'WZGI-T8]82P>0E;-3\<ZQR=HO2@;*'@QK[P+YC>^D#>1."97&;%9 M5M8K,J?ON%X4Q56H3UE';J?SL#)&A5\X-,KR1K#.Y>TH]NR'-W$[E%I<)3D& M'9&U^..9I,=*;2]N=XZJ=]:[R"J\G$\B3!5*,=H27FB&HJ`OPPQ:2BI[58JL MJG^\$BFHEF)(YWE<S#CO74(*][X\J+I>5D*L8VN6H\J6$2VVZ%J6#61=[K+8 M@L<DD1WF>HH9]C2U69*,4T$\&D#!G)D2JEY:2-\5UE=V4J'J\H:F71/3-5<E MY+S,)>8>[S*ZV`^.?="T#YDCBB6H"UB'S,*@@>H)141/^HYZJ"HQ0[']RO$S MMXONGZ$:L%!G;'&.^GUP;$`*^OUD6-SOA[N\)F>F6G*L?H.IE?]'>]`@%ON- M/^'1PE]50N`;)SIDOTV>KYU0Z^HMBAZJN;J)H<`^0J[!Z"8NB5KBD:C*MX)# M0KA#HI;Y(QMZ(^OY(O,]$4CYMKHG,L</X5T5VT&8Y&'.8:/,_=]"ACIY"3T" MW?+%C4:$X1,_G+F8TB!M8HMQJ"XG4_B^K($+3Z?3!C*^\\E_>W)Q<GHD)HI= M_>"F3NR([(9'NTXCQXE&3@;!-'+]*^V3O\-*J"6./A/GLL9+]0:SZ.'JM3%/ MK9YHUVK1!R'4+I0@>^SPB0&"%R]J+$21F\]?92SS!.M\[!7=>I!9_X.*NI$6 MCK9<Q[+XGZ&;<?ROH[<-'=?_\*]:_S\$//UI?^#Z^^%(Q=$GVKXD$"NB?C") MKT`&6?]_M?^D:$JW7<<R_6]://[?`=?$Z+0PL=-J5?K_$""-_Q'A&T`DC0,3 M(1*J\JS6[8*W\HZ\/)<)I(1B[I<!D0V%4F(\5L4]=C_]OX*L_\PQTF!A,O.B M<(MU+-;_EMXQ6KG]7],P.I7^/P1T[9"2"_M*V.+WL-+IB6TR^.X&XS'UHU`] ME@$2%GXO`?7BWW0:'$G[_YIA$K;DXS`,QA-[2IE[C_L$L*)S`N('T0C6`CSW M6="U86D6Y^XLR3U$6B[K&*B$DE1>3%((%M/2,\5@TN7,'[*NB$:P2HE+#M.B MU0L[?"MH>"FF)9?"*).M%=>'12H484>4Q9OJ205R0<V8G4RKW/%D"@M1)U-` MFOT<RS\)Q[_:PVG`LQMR]AX+@<1A"V`CWB89(WF^?Y(`[CSP*76HHY&S`-B` M!3[;U;FVIRX7&RA^2H?!%&B6%H6V/M1(CXSL:PH#0^C-Q+-=GY6DJ>^F[A5\ MB#$J-.MB1$D@:.(V74Z#,6EK+:VILMXX9]C2T>G.IE.*^\L^Z?YV3J!#@TMB M[;?T_<,#C5Q0WPEQF`=>8$>:>N)Y8HA"7IJ58::;]J(TLEH\TO)H\=&`(4^; MU91+^@A:F8@>BU[A(IB\MZ/A"!FZ##PO^*+!`O?=-9V.J.TPSF4Y&[O0="@Q M@#9@_*HH7TT=T,4.-DQ`%SNN>:`**9+Y-BPU"6]<L&U$+D4PKL`RCVLDDA&R MF,:2O<)%9PJV>T`@(XAW.RN0%A*!RT1LWP$5\"A:UV10XG+FG"B`K+]0'RO% M'@D<"GVE8EN/DMHGLW#DD5TZF"2H<7"-F'#2R*##V<`CS\P&IA2SWV2S/],; M+\T:YJ\G"1X&I02VD<FCV1X,&S%5[=0LX>TFAWHF.BK!,V/#>C'?#/NFR(GM M.%#(0;8IP_$$D(<<&I@Q2?H,(PV<I?E]S&]DB5A]+P_*F@<MOZ;)%PP3C@/; M#1=2GA_@)%BW3K`ID2C`L.>K!)4-/R6$4A0JI5P2C&)`Y@'35_0`0M:>#]SS M4]7'=`H4)7$&E,5N@%K<?V?<ENXRR7U*T@RJRL9K$Q5CZ%2"\L)E-),D)Y,T M(#6DAH2A)TG!T$.\<U//TH)N<O)LZ9PT6VU1<-7N^X]\@CT"GQJFF$;FP83C M$<>:NW!*XKZ5C7>9X[;YN.>VOI?O&?/MXHUE1%"&[E(C7&J+\I(DE2,D:8!B M8*>2A"NY`0%,!OL9;:)AYKAX!@NM/$H[[>9Q&9:D`5&QR*."#(=NJ0P/2F08 M2`LRG+'+\NP53/+S"\=(G;)$!3I:J]5@#XL_C,=6@;+EA[+:PF..%LS9`-^* M'FRZ67TL-H=%YJW;V]2H+E6%D$;0'S*&>PA-RUIJ3[]'0S!/1T5?YQRIG$_( MHG7<\\IZ4<Q@M+9O,%H_D,%HP=34;K"'Q1_F8QJ,=`&G).M89>7XP@*#<8^V M8HF9$$@!F1,7=S]S$0]J\90*<'8//A];98"W]_3D_?M4Q&36'4^X<T_/WO5E MLMA8`0'\;<5828)LMC4#9CZSHYD'\+`TW0!Y/M0L@WUU#MGCH/U]2'=326)? MFK(D[+5`JO,GQ;8IUHM.-SVL7.>.$R%O6Y!LITRR[462;3/)=DHD>W6)S=C> MCM;N-.3'`7M8_,N`14RKK1T^I@<W-^2J_*#!UGG*M%R=<H=_YYV@1KC[\5]1 MDGP(6&[E*D>!D>[V^U[>K:YU@_GS2;4.+)D-=<UJQ?X<?^#<""Y5_-5D7Q8> MG3S0VA9_M/CC,8,F(E2OI(9FR?9'_IQ[]J*#X'Z1MN9($;:GP.57(#(I^A9# M=/E)3=*\84EXNX``,;>.-M/0LGEQCH;Z3$6M@CH-RT.#3HDZ#=<+#8)<6[B^ M:8%8P</03/T1I3S=>9)6.>OMTRV<Q]:8Q@JWKA!9?O%*]%'NM#-)[E\)@K); M6(6T]>YBB>SKW<A*Y&O!O2PBP<H7M.1,J]_4$CEN$PN07MG:V`#P;;#<WI'( M[A1G[;4F\OQL+4?\TZ2;8,J+F;N))H=$2A@1P9R[Q'$.<E9+;XBY-=^#C,^\ MH8S14K',*&4:)!AP<@5\OF(<M+-3NV').QB8GO8!K!?`N#5-/1NVPZ&,VU?D MS[DI68Y+ZE-CR0VS;-5>ONV8]T"D7&*0"JAG^ET0W*-BRIL;ZZ9>MOMIL!L= MF4'@>Y^;A^WR8B>K;XXG\)(R/'T>3[)5RQ)^4/0.G7+O,.<T"M(R[["I;^H> MQAAG98?Q4#O`2,FA9G)/$=>B9AP^B;_@<?B(LZ9T-$5)CJ4HJQ](V3B.LOPB MZL87$J7C'0#R72/\+EXW0FRZL3[GWI%(+MX^*J2L<P=)9%[G)E*B$G/N(XGT MM$T(*UY/DK,LNZ<DZ/C)%'9AB6?Y_N?<]F9SKK72E-N\PYS;SC+Z\%-N.N>7 MSK@'U82[M0G7*G!:D'/A'&Y_RDTKE^1;<J<>9+)\V!#:`L^[>/SL;KY/P6\9 ME"_#R_R60<DR_/ZB6I:A-<$[D1_J(SHDR4E1C%8U%67U$ZWSSS6X][&C$T;0 M$4.RYB5Q9"G=Y-G&]@[!$!G#+][@^3Z#U0,6"H/I8]NQZG9!GZT":LU8=?L' MBE6W#K66V2#-0TT_3.+1L-A@2/@2#[6ZAEE!!1544$$%%51000455%!!!154 M4$$%%51000455%!!!1544$$%%51000455/"7@O\!JG)P!X```````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` M```````````````````````````````````````````````````````````` 9```````````````````````````````````` ` end -dg David Gould dg@illustra.com 510.628.3783 or 510.305.9468 Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612 "Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats." -- Howard Aiken
> 1.6. New patch to follow. > > The current S_LOCK and TAS() implementations (my patch of late May) are > slower than they need to be and cause more code bloat than they need to. > The bloat is caused by using a macro to inline a relatively complex bit > of code that is only used in the blocked lock case. I suspect the slowness > is caused at least partly by the macro as it requires more registers. > > I have developed a new patch that separates out the lock available case > from the busywaiting case and that uses the GCC _inline_ facilty to make > the asm interface still look as clean as a function while not costing > anything. For a preview, see Quite and analysis. I want to comment on the code more, but I just want to point out now that many of our i386 platforms are not GNU. I think we have to use macros. I can't think of any GNU-specific code in the source tree at this point, and I don't think it makes sense add it now just to make the code look a litter cleaner. -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
> > 1.6. New patch to follow. > > > > The current S_LOCK and TAS() implementations (my patch of late May) are > > slower than they need to be and cause more code bloat than they need to. > > The bloat is caused by using a macro to inline a relatively complex bit > > of code that is only used in the blocked lock case. I suspect the slowness > > is caused at least partly by the macro as it requires more registers. > > > > I have developed a new patch that separates out the lock available case > > from the busywaiting case and that uses the GCC _inline_ facilty to make > > the asm interface still look as clean as a function while not costing > > anything. For a preview, see > > Quite and analysis. I want to comment on the code more, but I just want Please do. I am very interested in reactions or followup investigations. > to point out now that many of our i386 platforms are not GNU. I think > we have to use macros. I can't think of any GNU-specific code in the > source tree at this point, and I don't think it makes sense add it now > just to make the code look a litter cleaner. Most of the original tas() __asm__() implementations are GCC specific. This includes all the Linux platforms except PPC, all the *BSD platforms, even the VAX. GCC is also fairly commonly used even on the commercial OSes. As far as I can tell, the only C coded platforms that are not GCC specific are SCO i386 and SunOS/Solaris on Sun3 and Sparc. The other non-GCC platforms have external tas.s function implementations (HP), or have system specific calls (AIX, OSF, SGI, Nextstep). Finally, the difference between a tas() function implementation and the best possible inline implementation appears to be only 0.06 microseconds on a P133. This will add 0.0003 seconds to startup. On SCO only. On Sparc this is a leaf call and possibly even cheaper. No other platforms are affected. Remember also that I am adding two features that previously did not exist, backoff, and stuck lock detection. -dg David Gould dg@illustra.com 510.628.3783 or 510.305.9468 Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612 "If you lie to the compiler, it will get its revenge." -- Henry Spencer
On Wed, 10 Jun 1998, Bruce Momjian wrote: > Quite and analysis. I want to comment on the code more, but I just want > to point out now that many of our i386 platforms are not GNU. I think > we have to use macros. I can't think of any GNU-specific code in the > source tree at this point, and I don't think it makes sense add it now > just to make the code look a litter cleaner. Indeed. Those of use who have thousand dollar SunPro compilers thank you. (can you say progressive optomizer?) /* Matthew N. Dodd | A memory retaining a love you had for life winter@jurai.net | As cruel as it seems nothing ever seems to http://www.jurai.net/~winter | go right - FLA M 3.1:53 */
Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention
From
dg@illustra.com (David Gould)
Date:
Matthew N. Dodd writes: > On Wed, 10 Jun 1998, Bruce Momjian wrote: > > Quite and analysis. I want to comment on the code more, but I just want > > to point out now that many of our i386 platforms are not GNU. I think > > we have to use macros. I can't think of any GNU-specific code in the > > source tree at this point, and I don't think it makes sense add it now > > just to make the code look a litter cleaner. > > Indeed. Those of use who have thousand dollar SunPro compilers thank you. > > (can you say progressive optomizer?) ^^^^^^^^^ uhhh, no. ;-) Hmmmm, looking at the original code, non-GCC Sparc makes a function call to the tas() routine which is coded as asm. I have not in fact changed it. As I understand your comment, you wish this to be a macro. The code is: #if defined(NEED_SPARC_TAS_ASM) /* * sparc machines not using gcc */ static void tas_dummy() /* really means: extern int tas(slock_t *lock); */ { asm(".seg \"data\""); asm(".seg \"text\""); asm("_tas:"); /* * Sparc atomic test and set (sparc calls it "atomic load-store") */ asm("ldstub [%r8], %r8"); asm("retl"); asm("nop"); } #endif /* NEED_SPARC_TAS_ASM */ I doubt there are any major performance gains to be had here, but I would be interested to learn otherwise. I don't have access to a Sparc machine that I can use for this, so if anyone cares to test this implementation and any others they can think of please post the results. But I think perhaps we are micro-optimizing here. I only bothered to do all the i386 flavors because Bruce had some gprof output that suggested we had a problem (we didn't), and then I just got kinda interested in the experiment itself. -dg David Gould dg@illustra.com 510.628.3783 or 510.305.9468 Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612 "A week of coding can sometimes save an hour of thought."
> Most of the original tas() __asm__() implementations are GCC specific. This > includes all the Linux platforms except PPC, all the *BSD platforms, even the > VAX. GCC is also fairly commonly used even on the commercial OSes. > > As far as I can tell, the only C coded platforms that are not GCC specific > are SCO i386 and SunOS/Solaris on Sun3 and Sparc. The other non-GCC platforms > have external tas.s function implementations (HP), or have system specific > calls (AIX, OSF, SGI, Nextstep). That s_lock.h file is a hornets nest of portability problems. I really don't want to have multiple functions/macros for different CPU's if I can help it. I don't even want to mix functions/macros for the same function name if I can help it. I also do not want to start playing around with isGNU/isnotGNU in a file that is already complex. Macros work and we already have tons of them, we don't use inline anywhere else, and the actual locks are 80% asm code anyway, so it looks the same whether it is in a macro or an inline function. I have made them macros before for this file, so I can do it again quite easily. As for the benefits, well, when I see lots of calls to a function, and I try and eliminate the calls if it is reasonable. In many places, the call handling is actually more instructions than the inlining. I look at the measured performance change vs. the executable size increase and make a decision. With something like s_lock, it just seems normal to make it a macro. > Finally, the difference between a tas() function implementation and the best > possible inline implementation appears to be only 0.06 microseconds on a P133. > This will add 0.0003 seconds to startup. On SCO only. On Sparc this is a leaf > call and possibly even cheaper. No other platforms are affected. > > Remember also that I am adding two features that previously did not exist, > backoff, and stuck lock detection. Yes, and good improvements. -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention
From
dg@illustra.com (David Gould)
Date:
Bruce Momjian writes: > David Gould writes: > > Most of the original tas() __asm__() implementations are GCC specific. This > > includes all the Linux platforms except PPC, all the *BSD platforms, even the > > VAX. GCC is also fairly commonly used even on the commercial OSes. > > > > As far as I can tell, the only C coded platforms that are not GCC specific > > are SCO i386 and SunOS/Solaris on Sun3 and Sparc. The other non-GCC platforms > > have external tas.s function implementations (HP), or have system specific > > calls (AIX, OSF, SGI, Nextstep). > > That s_lock.h file is a hornets nest of portability problems. I really > don't want to have multiple functions/macros for different CPU's if I > can help it. I don't even want to mix functions/macros for the same > function name if I can help it. I also do not want to start playing > around with isGNU/isnotGNU in a file that is already complex. Actually, my main motivation for this file is to reduce the portability problems. If you will look at the next patch (when I submit it, probably tonight) I think you will see that it is fairly clear what to do to port to a new platform, and how the existing platforms work. We already implicitly make a isGCC vs notGCC distinction when we use the GCC asm() syntax. I am merely intending to make it explict. > Macros work and we already have tons of them, we don't use inline > anywhere else, and the actual locks are 80% asm code anyway, so it looks > the same whether it is in a macro or an inline function. > > I have made them macros before for this file, so I can do it again quite > easily. > > As for the benefits, well, when I see lots of calls to a function, and I > try and eliminate the calls if it is reasonable. In many places, the > call handling is actually more instructions than the inlining. I look > at the measured performance change vs. the executable size increase and > make a decision. With something like s_lock, it just seems normal to > make it a macro. With the old S_LOCK this was a reasonable choice. With the new S_LOCK which is quite a bit more complex, the macro expansion generates quite a bit of code. See the generated code for the "MacroSLOCK" case in my large post. > > Finally, the difference between a tas() function implementation and the best > > possible inline implementation appears to be only 0.06 microseconds on a P133. > > This will add 0.0003 seconds to startup. On SCO only. On Sparc this is a leaf > > call and possibly even cheaper. No other platforms are affected. > > > > Remember also that I am adding two features that previously did not exist, > > backoff, and stuck lock detection. > > Yes, and good improvements. Again, please have a look at the (forthcoming) patch. It gives up nothing in either space or time performance compared to the original, is clearer imho, and incorporates the the new features. -dg David Gould dg@illustra.com 510.628.3783 or 510.305.9468 Informix Software 300 Lakeside Drive Oakland, CA 94612 - A child of five could understand this! Fetch me a child of five.
> > Bruce Momjian writes: > > David Gould writes: > > > Most of the original tas() __asm__() implementations are GCC specific. This > > > includes all the Linux platforms except PPC, all the *BSD platforms, even the > > > VAX. GCC is also fairly commonly used even on the commercial OSes. > > > > > > As far as I can tell, the only C coded platforms that are not GCC specific > > > are SCO i386 and SunOS/Solaris on Sun3 and Sparc. The other non-GCC platforms > > > have external tas.s function implementations (HP), or have system specific > > > calls (AIX, OSF, SGI, Nextstep). > > > > That s_lock.h file is a hornets nest of portability problems. I really > > don't want to have multiple functions/macros for different CPU's if I > > can help it. I don't even want to mix functions/macros for the same > > function name if I can help it. I also do not want to start playing > > around with isGNU/isnotGNU in a file that is already complex. > > Actually, my main motivation for this file is to reduce the portability > problems. If you will look at the next patch (when I submit it, probably > tonight) I think you will see that it is fairly clear what to do to port to > a new platform, and how the existing platforms work. > > We already implicitly make a isGCC vs notGCC distinction when we use the > GCC asm() syntax. I am merely intending to make it explict. Ah, I see. I wondered how other compilers were understanding the asm() stuff. I thought it was gcc-specific, but then other platforms were using it. I guess they have gcc. > > > Macros work and we already have tons of them, we don't use inline > > anywhere else, and the actual locks are 80% asm code anyway, so it looks > > the same whether it is in a macro or an inline function. > > > > I have made them macros before for this file, so I can do it again quite > > easily. > > > > As for the benefits, well, when I see lots of calls to a function, and I > > try and eliminate the calls if it is reasonable. In many places, the > > call handling is actually more instructions than the inlining. I look > > at the measured performance change vs. the executable size increase and > > make a decision. With something like s_lock, it just seems normal to > > make it a macro. > > With the old S_LOCK this was a reasonable choice. With the new S_LOCK which > is quite a bit more complex, the macro expansion generates quite a bit of > code. See the generated code for the "MacroSLOCK" case in my large post. Yes, I suspected that may be a problem. I will apply your patch as soon as I see it. > > > > Finally, the difference between a tas() function implementation and the best > > > possible inline implementation appears to be only 0.06 microseconds on a P133. > > > This will add 0.0003 seconds to startup. On SCO only. On Sparc this is a leaf > > > call and possibly even cheaper. No other platforms are affected. > > > > > > Remember also that I am adding two features that previously did not exist, > > > backoff, and stuck lock detection. > > > > Yes, and good improvements. > > Again, please have a look at the (forthcoming) patch. It gives up nothing in > either space or time performance compared to the original, is clearer imho, > and incorporates the the new features. Sounds like a plan. -- Bruce Momjian | 830 Blythe Avenue maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026 + If your life is a hard drive, | (610) 353-9879(w) + Christ can be your backup. | (610) 853-3000(h)
dg@illustra.com (David Gould) writes: > [ Much careful testing snipped ] Nice job, David! > gprof introduces severe experimental error for small quick functions. > Perhaps the spinlock function is so short and quick compared to the > mcount overhead added to the function prolog that the overhead dominates > the measurement. gprof remains a useful tool for larger functions with > longer runtimes, but must be considered very suspect for tiny functions. Right. gprof is a fine example of Heisenberg's Uncertainty Principle applied to software ;-). You can't measure something without affecting it. As Bruce Momjian just pointed out in another followup, running the test function a lot of times in a tight loop isn't a perfect answer either: you find out what happens when the function's code and referenced data are all in cache, but you can't tell much about what happens when they are not; and you can't tell whether the whole application's memory usage patterns are such that the function will remain in cache. I'm guessing that the backend uses tas() enough that it will probably stay in cache, but that is *strictly* a guess with no evidence. > In some of the test cases there was significant timing variation from > run to run even though the test conditions were apparently identical. > Even more strangely, the variation in time was not random but appeared > to represent two different modes. And, the variation was itself > repeatable. > [snip] > I have no explanation for this variation. Possibly it is some interaction > of where the program is loaded and the state of the memory heirarchy, but > even this is hard to sustain. I would be very curious to hear of any > plausible explainations. After chewing on this for a while I think that your speculation is right. You were using a Pentium, you said. The Pentium has a two-way set associative cache, which means that any given main-memory address has exactly two cache lines it could be loaded into. Main-memory addresses that are 1000H apart contend for the same pair of cache lines. Thus, if your program happens to repeatedly hit three locations that are exactly 1000H apart, it will suffer a cache miss every time. Change the address spacing, and no miss occurs. The cache miss takes forty-some clock cycles, versus one if the target location is in cache. (BTW, I'm getting this info out of Rick Booth's "Inner Loops", a fine reference book if you are into hand-optimized assembly coding for Intel processors.) So what I think you were seeing is that on some runs, the loop involved hitting three addresses that contend for the same cache line pair, while on other runs there was no cache contention. This could be explained by varying load addresses for the program, if it touched both its own addresses (variable) and some non-varying addresses --- say, C library routines executed from a shared library that remained loaded throughout. If you can link with a non-shared C library then you should get more consistent runtimes, because the offsets between all the locations touched by your loop would be fixed, and thus cache hits or misses ought to be consistent from run to run. The bottom line, however, is that this behavior is too unpredictable to be worth worrying about in a production program. A cache miss in a tight loop could be created or eliminated by unrelated changes in distant parts of the code ... and in any case the behavior will be different on different CPUs. The 486, Pentium, and Pentium Pro all have radically different cache layouts, let alone non-Intel CPUs. regards, tom lane
Tom Lane writes: > dg@illustra.com (David Gould) writes: > > [ Much careful testing snipped ] > > In some of the test cases there was significant timing variation from > > run to run even though the test conditions were apparently identical. > > Even more strangely, the variation in time was not random but appeared > > to represent two different modes. And, the variation was itself > > repeatable. > > [snip] > > I have no explanation for this variation. Possibly it is some interaction > > of where the program is loaded and the state of the memory heirarchy, but > > even this is hard to sustain. I would be very curious to hear of any > > plausible explainations. > > After chewing on this for a while I think that your speculation is > right. You were using a Pentium, you said. The Pentium has a two-way > set associative cache, which means that any given main-memory address > has exactly two cache lines it could be loaded into. Main-memory > addresses that are 1000H apart contend for the same pair of cache lines. > Thus, if your program happens to repeatedly hit three locations that are > exactly 1000H apart, it will suffer a cache miss every time. Change the > address spacing, and no miss occurs. The cache miss takes forty-some > clock cycles, versus one if the target location is in cache. > (BTW, I'm getting this info out of Rick Booth's "Inner Loops", a fine > reference book if you are into hand-optimized assembly coding for Intel > processors.) > > So what I think you were seeing is that on some runs, the loop involved > hitting three addresses that contend for the same cache line pair, while > on other runs there was no cache contention. This could be explained > by varying load addresses for the program, if it touched both its own > addresses (variable) and some non-varying addresses --- say, C library > routines executed from a shared library that remained loaded throughout. > If you can link with a non-shared C library then you should get more > consistent runtimes, because the offsets between all the locations > touched by your loop would be fixed, and thus cache hits or misses ought > to be consistent from run to run. This is in line with my own speculation. However, I am not convinced. First, the test loop and the function it calls are only about 100 bytes total taken together. And, no system calls or library calls are made during the test. This tends to rule out "locations 1000H apart" and shared library effects. Also, I would expect the system to load programs and libraries on VM page boundaries. Unless there is some cachability difference from one page to the next, I am at a loss to account for this. > The bottom line, however, is that this behavior is too unpredictable > to be worth worrying about in a production program. A cache miss in > a tight loop could be created or eliminated by unrelated changes in > distant parts of the code ... and in any case the behavior will be > different on different CPUs. The 486, Pentium, and Pentium Pro all > have radically different cache layouts, let alone non-Intel CPUs. Agreed, the reasonable thing to do is to try to be sensitive to cache effects and accept that there are mysteries. thanks -dg David Gould dg@illustra.com 510.628.3783 or 510.305.9468 Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612 "Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats." -- Howard Aiken