Thread: Re: [PATCHES] Try again: S_LOCK reduced contention

Re: [PATCHES] Try again: S_LOCK reduced contention

From
dg@illustra.com (David Gould)
Date:
Once upon a time Bruce Momjian wrote:
> OK, here is my argument for inlining tas().
>
> This output is for startup, a single query returning a single row result
> on an indexed column, and shutdown.  Guess who is at the top of the
> list, tas().  Mcount() is a gprof internal counter, so it doesn't
> "count".  tas() is showing 0.01 cummulative seconds.  Psql shows
> wallclock time for the query at 0.13 seconds.  That 0.01 is pretty
> significant.
...
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  20.0       0.02     0.02                             mcount (463)
>  10.0       0.03     0.01     5288     0.00     0.00  _tas [31]


As promised, I did a little testing to see what part of this overhead
is measurement error due to the nature of gprof and to see what the real
overhead of various spinlock implementations are. Here is what I learned.

Section 1. Summary

1.1. Spinlocks are pretty cheap.

   Spinlocks are relatively cheap operations. The worst implementation I
   tested took 0.34 microseconds for a spinlock roundtrip (take and release).
   The current (in CVS as of May 98) code takes 0.28 microseconds. The best
   hand tuned asm implementation took only 0.14 microseconds.

   This is fast enough that I had to use a huge iteration count (100,000,000)
   to get reasonably large run times.

   Table 1.1 Overheads of spinlock roundtrip in microseconds

     Test Case     Time (usec)   Notes

     Original       0.14         S_LOCK in 6.3.2 (no backoff, asm)
     InlineTas      0.15         Patch to be submitted (backoff, _inline_)
     TasFunction2   0.20         Refined S_LOCK patch TAS as function.
     MacroSLOCK     0.28         May 98 S_LOCK patch as in CVS


1.2. gprof doesn't work for small quick functions.

   gprof introduces severe experimental error for small quick functions.
   According to the gprof profile done by Bruce, 5288 tas calls took 0.1
   second. That would require the spinlock roundtrips to take almost 19
   microseconds each, not 0.28 microseconds. So in reality the 5288 calls
   took only 0.0015 seconds, not 0.1 seconds.

   So gprof has overestimated the cost of the spinlock by 68 to 1.

   Perhaps the spinlock function is so short and quick compared to the
   mcount overhead added to the function prolog that the overhead dominates
   the measurement. gprof remains a useful tool for larger functions with
   longer runtimes, but must be considered very suspect for tiny functions.


1.3 Function calls are pretty cheap or Macros may not save all that much.

   The difference between the current (late May) macro version and the same
   code removed to a separate function and called with three arguments was
   only 0.06 microseconds. That is 60 nanoseconds for the argument passing,
   the call, and the return.

   I suspect that on the x86 "architecture" the limited number of registers
   means that inline code has to save results to memory as it goes along
   which may offset to some extent the overhead of the register saves for
   a function call.


1.4 There are mysteries. Beware.

   In some of the test cases there was significant timing variation from
   run to run even though the test conditions were apparently identical.
   Even more strangely, the variation in time was not random but appeared
   to represent two different modes. And, the variation was itself repeatable.

   Here are the raw times in CPU seconds from two different experiments each
   run six consecutive times:

   case 1: 49.81, 49.43, 40.68, 49.51, 40.68, 40.69
           clusters about 40.7 and 49.5 seconds

   case 2: 39.34, 29.09, 28.65, 40.34, 28.64, 28.64
           clusters about 28.9 and 39.6 seconds

   Note that the testrun times have a bimodal distribution with one group of
   fast runs clustered tightly about one time and then a much slower group
   clustered tightly about the second time. The difference between groups is
   huge (about 25%) while the diffence within a group is very small (probably
   within the measurement error.

   I have no explanation for this variation. Possibly it is some interaction
   of where the program is loaded and the state of the memory heirarchy, but
   even this is hard to sustain. I would be very curious to hear of any
   plausible explainations.


1.5 Timing very small functions in isolation with large counts is effective.

   Notwithstanding the strange behavior in section 1.4, it is possible to
   time differences in functions that amount to the addition or deletion of
   one or two instructions. For example, the TasFunction and TasFunction2
   cases below are almost but not quite identical yet show noticably
   different runtimes.


1.6. New patch to follow.

   The current S_LOCK and TAS() implementations (my patch of late May) are
   slower than they need to be and cause more code bloat than they need to.
   The bloat is caused by using a macro to inline a relatively complex bit
   of code that is only used in the blocked lock case. I suspect the slowness
   is caused at least partly by the macro as it requires more registers.

   I have developed a new patch that separates out the lock available case
   from the busywaiting case and that uses the GCC _inline_ facilty to make
   the asm interface still look as clean as a function while not costing
   anything. For a preview, see



Section 2. Test Procedure

My test takes and releases a spinlock 100,000,000 times and measures the
total CPU time. I ran this test with many variations of spinlock
implementation. I also ran a baseline test that has just the loop and call
overheads with no actual spinlock so that we can separate out just the S_LOCK
time. The test harness code appears below and the variant spinlock
implementations, generated assembler output and raw result timings appear
later in this message.

Testing was done on "leslie" (P133 HX chipset 512K L2) running Linux 2.0.33.
The system was up and running X but no other workloads. I avoided typing
or touching the mouse during the test. Each variation was run three times
and the results averaged. For some tests there was significant variation in
times for the three iterations. In this case another set of three was run
and the average of six runs used.



Section 2.1 Test Harness Code

/*
 * Test s_lock timing with variations
 */
typedef unsigned char slock_t;
volatile slock_t the_lock;

int main(void)
{
    int i = 0;

    the_lock = 0;
    while (i < 100000000) {             /* 100 million iterations */
        i = tryit(&the_lock, &i);       /* take and release spinlock */
    }
    return i & 1;
}

/*
 * Take and release lock
 */
int tryit(volatile slock_t *lock, int *p)
{
        int     i;
        S_LOCK(lock);
        i = ++(*p);            /* just to make it more realistic */
        S_UNLOCK(lock);
    return *p;
}



Section 3.0 Test Case Summary

Case Tag        Per Iteration   Comments
=============   =============   ============================================
TZero:          0.14 usec       compare lock to 0, do nothing
TZeroNoCall     0.17 usec       compare lock to 0, call s_lock if not

TZeroCall       0.30 usec       call function that compares lock to 0
TasFunction     0.45 usec       lock spinlock in separate tas() function
TasFunction2    0.37 usec       improved separate tas function
SlockAsmMacro   0.31 usec       Inline xchgb in S_LOCK macro, call s_lock if
                                needed. Note strange variation in recorded
                                times. I have no explaination.
Original        0.31 usec       The original S_LOCK from 6.3.2
MacroSLOCK      0.45 usec       Current in CVS as of 5/30/98. Tends to bloat.
AllFunctions    0.51 usec       Call s_lock() function. tas() as function
InlineTas       0.32 usec       Use function inlining. Patch to follow.



Section 3.1 Test Cases and Raw Timing Results.


Case Tag        Per Iteration   Comments
=============   =============   ============================================
TZero:        0.14 usec    compare lock to 0, do nothing


#define TAS(lock)     (*lock != 0)
#define S_LOCK(lock)  TAS(lock)

tryit:
        pushl %ebp
        movl %esp,%ebp
        movl 8(%ebp),%eax
        movl 12(%ebp),%edx
        movb (%eax),%cl
        incl (%edx)
        movb $0,(%eax)
        movl (%edx),%eax
        leave
        ret

CPU times: 14.32, 14.32, 14.32




Case Tag        Per Iteration   Comments
=============   =============   ============================================
TZeroNoCall    0.17 usec    compare lock to 0, call s_lock if not


#define TAS(lock)     (*lock != 0)
#define S_LOCK(lock)  if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else

tryit:
        pushl %ebp
        movl %esp,%ebp
        pushl %esi
        pushl %ebx
        movl 8(%ebp),%ebx
        movl 12(%ebp),%esi
        movb (%ebx),%al
        testb %al,%al
        je .L14
        pushl $141
        pushl $.LC1
        pushl %ebx
        call s_lock
.L14:
        incl (%esi)
        movb $0,(%ebx)
        movl (%esi),%eax
        leal -8(%ebp),%esp
        popl %ebx
        popl %esi
        leave
        ret

CPU times: 17.33, 17.35, 17.31




Case Tag        Per Iteration   Comments
=============   =============   ============================================
TZeroCall       0.30 usec    call function that compares lock to 0

#define TAS(lock)     tas_test(lock)
#define S_LOCK(lock)  if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else

int tas_test(volatile slock_t *lock)
{
    return *lock == 0;
}

tas_test:
        pushl %ebp
        movl %esp,%ebp
        movl 8(%ebp),%eax
        movb (%eax),%al
        testb %al,%al
        setne %al
        andl $255,%eax
        leave
        ret

tryit:
        pushl %ebp
        movl %esp,%ebp
        pushl %esi
        pushl %ebx
        movl 8(%ebp),%ebx
        movl 12(%ebp),%esi
        pushl %ebx
        call tas_test
        addl $4,%esp
        testl %eax,%eax
        je .L13
        pushl $141
        pushl $.LC1
        pushl %ebx
        call s_lock
.L13:
        incl (%esi)
        movb $0,(%ebx)
        movl (%esi),%eax
        leal -8(%ebp),%esp
        popl %ebx
        popl %esi
        leave
        ret

CPU times: 30.16, 30.15, 30.14




Case Tag        Per Iteration   Comments
=============   =============   ============================================
TasFunction    0.45 usec    lock spinlock in separate tas() function


#define TAS(lock)     tas(lock)
#define S_LOCK(lock)  if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else

int tas(volatile slock_t *lock)
{
    slock_t         _res = 1;

    __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock):"0"(0x1));
    return _res != 0;
}

tas:
        pushl %ebp
        movl %esp,%ebp
        movl 8(%ebp),%eax
        movl $1,%edx
#APP
        lock; xchgb %dl,(%eax)
#NO_APP
        testb %dl,%dl
        setne %al
        andl $255,%eax
        leave
        ret

CPU times: 46.13, 47.48, 45.01, 39.51, 45.79, 45.86




Case Tag        Per Iteration   Comments
=============   =============   ============================================
TasFunction2    0.37 usec.    improved separate tas function


#define TAS(lock)     tas2(lock)
#define S_LOCK(lock)  if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else

int tas2(volatile slock_t *lock)
{
    slock_t         _res = 1;

    __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock):"0"(0x1));
    return (int) _res;
}

tas2:
        pushl %ebp
        movl %esp,%ebp
        movl 8(%ebp),%edx
        movl $1,%eax
#APP
        lock; xchgb %al,(%edx)
#NO_APP
        andl $255,%eax
        leave
        ret


CPU times: 37.67, 37.67, 37.68, 37.57, 37.12, 36.91




Case Tag        Per Iteration   Comments
=============   =============   ============================================
SlockAsmMacro   0.31 usec    Inline xchgb in S_LOCK macro, call s_lock if
                                needed. Note strange variation in recorded
                                times. I have no explaination.


#define TAS(lock)       tas2(lock)
#define S_LOCK(lock) if (1) { \
    slock_t _res = 1; \
        __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock):"0"(0x1)); \
        if (_res) \
            s_lock(lock, __FILE__, __LINE__); \
    } else

tryit:
        pushl %ebp
        movl %esp,%ebp
        pushl %esi
        pushl %ebx
        movl 8(%ebp),%ebx
        movl 12(%ebp),%esi
        movl $1,%eax
#APP
        lock; xchgb %al,(%ebx)
#NO_APP
        testb %al,%al
        je .L14
        pushl $141
        pushl $.LC1
        pushl %ebx
        call s_lock
.L14:
        incl (%esi)
        movb $0,(%ebx)
        movl (%esi),%eax
        leal -8(%ebp),%esp
        popl %ebx
        popl %esi
        leave
        ret

CPU times: 40.53, 30.14, 30.13, 40.44, 30.12, 40.50, 28.65, 28.63, 28.62




Case Tag        Per Iteration   Comments
=============   =============   ============================================
Original    0.31 usec    The original S_LOCK from 6.3.2


#define S_LOCK(lock) do { \
        slock_t _res = 1; \
        do { \
            __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock):"0"(0x1)); \
        } while (_res !=0); \
    } while (0)

tryit:
        pushl %ebp
        movl %esp,%ebp
        movl 8(%ebp),%edx
        movl 12(%ebp),%ecx
        .align 4
        .align 4
.L15:
        movl $1,%eax
#APP
        lock; xchgb %al,(%edx)
#NO_APP
        testb %al,%al
        jne .L15
        incl (%ecx)
        movb $0,(%edx)
        movl (%ecx),%eax
        leave
        ret

CPU times: 28.55, 33.31, 31.40




Case Tag        Per Iteration   Comments
=============   =============   ============================================
MacroSLOCK    0.45 usec    Current in CVS as of 5/30/98. Tends to bloat.


#define TAS(lock)       tas(lock)
#define S_LOCK(lock) if (1) { \
    int spins = 0; \
    while (TAS(lock)) { \
        struct timeval  delay; \
        delay.tv_sec = 0; \
        delay.tv_usec = s_spincycle[spins++ % S_NSPINCYCLE]; \
        (void) select(0, NULL, NULL, NULL, &delay); \
        if (spins > S_MAX_BUSY) { \
            /* It's been well over a minute...  */ \
            s_lock_stuck(lock, __FILE__, __LINE__); \
        } \
    } \
} else

tryit:
        pushl %ebp
        movl %esp,%ebp
        subl $8,%esp
        pushl %edi
        pushl %esi
        pushl %ebx
        movl 8(%ebp),%esi
        movl 12(%ebp),%edi
        xorl %ebx,%ebx
        .align 4
.L13:
        pushl %esi
        call tas
        addl $4,%esp
        testl %eax,%eax
        je .L18
        movl $0,-8(%ebp)
        movl %ebx,%edx
        movl %ebx,%eax
        incl %ebx
        testl %edx,%edx
        jge .L16
        leal 15(%edx),%eax
.L16:
        andb $240,%al
        subl %eax,%edx
        movl %edx,%eax
        movl s_spincycle(,%eax,4),%eax
        movl %eax,-4(%ebp)
        leal -8(%ebp),%eax
        pushl %eax
        pushl $0
        pushl $0
        pushl $0
        pushl $0
        call select
        addl $20,%esp
        cmpl $16000,%ebx
        jle .L13
        pushl $141
        pushl $.LC1
        pushl %esi
        call s_lock_stuck
        addl $12,%esp
        jmp .L13
        .align 4
.L18:
        incl (%edi)
        movb $0,(%esi)
        movl (%edi),%eax
        leal -20(%ebp),%esp
        popl %ebx
        popl %esi
        popl %edi
        leave
        ret

CPU times: 49.81, 49.43, 40.68, 49.51, 40.68, 40.69




Case Tag        Per Iteration   Comments
=============   =============   ============================================
AllFunction    0.51 usec    Call s_lock() function. tas() as function


#define TAS(lock)     tas2(lock)
#define S_LOCK(lock)  s_lock(lock, __FILE__, __LINE__)

void s_lock(volatile slock_t *lock, char *file, int line)
{
    int    spins = 0;

    while (TAS(lock))
    {
        struct timeval delay;

        delay.tv_sec = 0;
        delay.tv_usec = s_spincycle[spins++ % S_NSPINCYCLE];
        (void) select(0, NULL, NULL, NULL, &delay);
        if (spins > S_MAX_BUSY)
        {
            /* It's been well over a minute...  */
            s_lock_stuck(lock, file, line);
        }
    }
}

s_lock:
        pushl %ebp
        movl %esp,%ebp
        subl $8,%esp
        pushl %edi
        pushl %esi
        pushl %ebx
        movl 8(%ebp),%esi
        movl 16(%ebp),%edi
        xorl %ebx,%ebx
        .align 4
.L5:
        pushl %esi
        call tas2
        addl $4,%esp
        testl %eax,%eax
        je .L6
        movl $0,-8(%ebp)
        movl %ebx,%edx
        movl %ebx,%eax
        incl %ebx
        testl %edx,%edx
        jge .L8
        leal 15(%edx),%eax
.L8:
        andb $240,%al
        subl %eax,%edx
        movl %edx,%eax
        movl s_spincycle(,%eax,4),%eax
        movl %eax,-4(%ebp)
        leal -8(%ebp),%eax
        pushl %eax
        pushl $0
        pushl $0
        pushl $0
        pushl $0
        call select
        addl $20,%esp
        cmpl $16000,%ebx
        jle .L5
        pushl %edi
        pushl 12(%ebp)
        pushl %esi
        call s_lock_stuck
        addl $12,%esp
        jmp .L5
        .align 4
.L6:
        leal -20(%ebp),%esp
        popl %ebx
        popl %esi
        popl %edi
        leave
        ret

tryit:
        pushl %ebp
        movl %esp,%ebp
        pushl %esi
        pushl %ebx
        movl 8(%ebp),%esi
        movl 12(%ebp),%ebx
        pushl $141
        pushl $.LC1
        pushl %esi
        call s_lock
        incl (%ebx)
        movb $0,(%esi)
        movl (%ebx),%eax
        leal -8(%ebp),%esp
        popl %ebx
        popl %esi
        leave
        ret

CPU times: 51.23, 51.23, 51.23





Case Tag        Per Iteration   Comments
=============   =============   ============================================
InlineTas    0.32        Use function inlining. Patch to follow.


#define TAS(lock)     tas_i(lock)
#define S_LOCK(lock)  if (TAS(lock)) s_lock(lock, __FILE__, __LINE__); else

static __inline__ int tas_i(volatile slock_t *lock)
{
    slock_t   _res = 1;

    __asm__("lock; xchgb %0,%1": "=q"(_res), "=m"(*lock): "0"(_res));
    return (int) _res;
}

tryit:
        pushl %ebp
        movl %esp,%ebp
        pushl %esi
        pushl %ebx
        movl 8(%ebp),%ebx
        movl 12(%ebp),%esi
        movb $1,%al
#APP
        lock; xchgb %al,(%ebx)
#NO_APP
        testb %al,%al
        je .L16
        pushl $156
        pushl $.LC1
        pushl %ebx
        call s_lock
.L16:
        incl (%esi)
        movb $0,(%ebx)
        movl (%esi),%eax
        leal -8(%ebp),%esp
        popl %ebx
        popl %esi
        leave
        ret

CPU times: 39.34, 29.09, 28.65, 40.34, 28.64, 28.64


----------------------------------------------------------------------------
Section 4.0 Test Suite


begin 644 s_lock_test.tar.gz
M'XL(`&B<?34``^T;:W/;.*Y?I5_!39N>G;J*9$MVDFXZD_.T.Y[+IITFW;N]
MZXY'MIA875GR67*:3B?__0!2E*B'GW&2]E9H8TD@2((D`((@&?:]8/AG/Z)A
MM/_DGH"8>L>RR!-"C"9_XAM_<M`):5NM3LNPVJT.)!I-4W]"K/MB2(99&-E3
M0IXX5XOIOHPH]1Z"H8>%4!I__-&&VZ_#T/6V:<X??[/#QK_9[K3;+;.)B:;9
M?$+T[;-2A+_X^*O[>RK9(SCT))RX/DH#F=#I)6#W537Z.J$.O20S/W2O?.J0
MX0AZ*^0R\TI5KP//CER/"A2)1I1)%*31FXA.?>+Z@)U^=:-:@7@/GPU&L3>I
M0Q9\&]NN#Z2N4U>_J2@>B'3),=&!`+]%%1R%F"\C++7FDI\)R!J'.OFF*/M[
MB"!CU_/<`%@!AH"#P`^Q<4+\L&S.X'-1=(,\=^N\[%OV.Z71#-M"GA/CE7JK
MJH\];MN"O/[CU[9MP!+]!\UO,OMO`H`1@,16JV55^O\0L%C]G[K^T)LYE/P<
M1HX;:*/7,NIKN!^Y8XK8I8:"U],=V?X510T.X3<@$Q>J8Y4/[9!JD!ZX0QH2
M>TJ/D/[BY)R\?$T"GY+@\@@_^V>]TP9[N7AS?L'?NK^=\Y<FYCGOG[[K_D/.
MQC']LS?_;(AWEB5^?_>A]TL#<Q8AINB=G?;.WB09WGX\Z[+^>0HM=J$2J+N&
M+:TKR$6/OR>I/%=,D/(24^7(D#.>0MQ+4C/`C)%/JH(V$,<G9$8/,;'-2ZJ.
M"94PFLZ&8'%A8*YM3W&H9W]E&?BK%EWW0SI,BDFQ,XX.^UC/\.O0H_]A-;YX
M07:!N;/S][VS[N_=TS=_\(S<1I.0>G08U?0&.?MX>IK]?<X*KW-Z;`YOPFLH
M[M>3?_7__O'\=\$VFNI>]+>0#"CUR1?J>22XIE-B@_'V9Q'5-(U`EW/:V&B%
MT6SX9XW;ZW[_;>_T3;^/;SA8_7Y<[2W^XL\MH5Y("_V-HQ]WN!-P9L3TU)]2
M[&\#"T)I4`2!TN_;X;C?K^VPJ8[<#$=7`[*K-W:-G2.R<_S?G1KFK3?@?;Q3
M8]-<_6A'WZGI-T8]82P>0E;-3\<ZQR=HO2@;*'@QK[P+YC>^D#>1."97&;%9
M5M8K,J?ON%X4Q56H3UE';J?SL#)&A5\X-,KR1K#.Y>TH]NR'-W$[E%I<)3D&
M'9&U^..9I,=*;2]N=XZJ=]:[R"J\G$\B3!5*,=H27FB&HJ`OPPQ:2BI[58JL
MJG^\$BFHEF)(YWE<S#CO74(*][X\J+I>5D*L8VN6H\J6$2VVZ%J6#61=[K+8
M@L<DD1WF>HH9]C2U69*,4T$\&D#!G)D2JEY:2-\5UE=V4J'J\H:F71/3-5<E
MY+S,)>8>[S*ZV`^.?="T#YDCBB6H"UB'S,*@@>H)141/^HYZJ"HQ0[']RO$S
MMXONGZ$:L%!G;'&.^GUP;$`*^OUD6-SOA[N\)F>F6G*L?H.IE?]'>]`@%ON-
M/^'1PE]50N`;)SIDOTV>KYU0Z^HMBAZJN;J)H<`^0J[!Z"8NB5KBD:C*MX)#
M0KA#HI;Y(QMZ(^OY(O,]$4CYMKHG,L</X5T5VT&8Y&'.8:/,_=]"ACIY"3T"
MW?+%C4:$X1,_G+F8TB!M8HMQJ"XG4_B^K($+3Z?3!C*^\\E_>W)Q<GHD)HI=
M_>"F3NR([(9'NTXCQXE&3@;!-'+]*^V3O\-*J"6./A/GLL9+]0:SZ.'JM3%/
MK9YHUVK1!R'4+I0@>^SPB0&"%R]J+$21F\]?92SS!.M\[!7=>I!9_X.*NI$6
MCK9<Q[+XGZ&;<?ROH[<-'=?_\*]:_S\$//UI?^#Z^^%(Q=$GVKXD$"NB?C")
MKT`&6?]_M?^D:$JW7<<R_6]://[?`=?$Z+0PL=-J5?K_$""-_Q'A&T`DC0,3
M(1*J\JS6[8*W\HZ\/)<)I(1B[I<!D0V%4F(\5L4]=C_]OX*L_\PQTF!A,O.B
M<(MU+-;_EMXQ6KG]7],P.I7^/P1T[9"2"_M*V.+WL-+IB6TR^.X&XS'UHU`]
ME@$2%GXO`?7BWW0:'$G[_YIA$K;DXS`,QA-[2IE[C_L$L*)S`N('T0C6`CSW
M6="U86D6Y^XLR3U$6B[K&*B$DE1>3%((%M/2,\5@TN7,'[*NB$:P2HE+#M.B
MU0L[?"MH>"FF)9?"*).M%=>'12H484>4Q9OJ205R0<V8G4RKW/%D"@M1)U-`
MFOT<RS\)Q[_:PVG`LQMR]AX+@<1A"V`CWB89(WF^?Y(`[CSP*76HHY&S`-B`
M!3[;U;FVIRX7&RA^2H?!%&B6%H6V/M1(CXSL:PH#0^C-Q+-=GY6DJ>^F[A5\
MB#$J-.MB1$D@:.(V74Z#,6EK+:VILMXX9]C2T>G.IE.*^\L^Z?YV3J!#@TMB
M[;?T_<,#C5Q0WPEQF`=>8$>:>N)Y8HA"7IJ58::;]J(TLEH\TO)H\=&`(4^;
MU91+^@A:F8@>BU[A(IB\MZ/A"!FZ##PO^*+!`O?=-9V.J.TPSF4Y&[O0="@Q
M@#9@_*HH7TT=T,4.-DQ`%SNN>:`**9+Y-BPU"6]<L&U$+D4PKL`RCVLDDA&R
MF,:2O<)%9PJV>T`@(XAW.RN0%A*!RT1LWP$5\"A:UV10XG+FG"B`K+]0'RO%
M'@D<"GVE8EN/DMHGLW#DD5TZF"2H<7"-F'#2R*##V<`CS\P&IA2SWV2S/],;
M+\T:YJ\G"1X&I02VD<FCV1X,&S%5[=0LX>TFAWHF.BK!,V/#>C'?#/NFR(GM
M.%#(0;8IP_$$D(<<&I@Q2?H,(PV<I?E]S&]DB5A]+P_*F@<MOZ;)%PP3C@/;
M#1=2GA_@)%BW3K`ID2C`L.>K!)4-/R6$4A0JI5P2C&)`Y@'35_0`0M:>#]SS
M4]7'=`H4)7$&E,5N@%K<?V?<ENXRR7U*T@RJRL9K$Q5CZ%2"\L)E-),D)Y,T
M(#6DAH2A)TG!T$.\<U//TH)N<O)LZ9PT6VU1<-7N^X]\@CT"GQJFF$;FP83C
M$<>:NW!*XKZ5C7>9X[;YN.>VOI?O&?/MXHUE1%"&[E(C7&J+\I(DE2,D:8!B
M8*>2A"NY`0%,!OL9;:)AYKAX!@NM/$H[[>9Q&9:D`5&QR*."#(=NJ0P/2F08
M2`LRG+'+\NP53/+S"\=(G;)$!3I:J]5@#XL_C,=6@;+EA[+:PF..%LS9`-^*
M'FRZ67TL-H=%YJW;V]2H+E6%D$;0'S*&>PA-RUIJ3[]'0S!/1T5?YQRIG$_(
MHG7<\\IZ4<Q@M+9O,%H_D,%HP=34;K"'Q1_F8QJ,=`&G).M89>7XP@*#<8^V
M8HF9$$@!F1,7=S]S$0]J\90*<'8//A];98"W]_3D_?M4Q&36'4^X<T_/WO5E
MLMA8`0'\;<5828)LMC4#9CZSHYD'\+`TW0!Y/M0L@WUU#MGCH/U]2'=326)?
MFK(D[+5`JO,GQ;8IUHM.-SVL7.>.$R%O6Y!LITRR[462;3/)=DHD>W6)S=C>
MCM;N-.3'`7M8_,N`14RKK1T^I@<W-^2J_*#!UGG*M%R=<H=_YYV@1KC[\5]1
MDGP(6&[E*D>!D>[V^U[>K:YU@_GS2;4.+)D-=<UJQ?X<?^#<""Y5_-5D7Q8>
MG3S0VA9_M/CC,8,F(E2OI(9FR?9'_IQ[]J*#X'Z1MN9($;:GP.57(#(I^A9#
M=/E)3=*\84EXNX``,;>.-M/0LGEQCH;Z3$6M@CH-RT.#3HDZ#=<+#8)<6[B^
M:8%8P</03/T1I3S=>9)6.>OMTRV<Q]:8Q@JWKA!9?O%*]%'NM#-)[E\)@K);
M6(6T]>YBB>SKW<A*Y&O!O2PBP<H7M.1,J]_4$CEN$PN07MG:V`#P;;#<WI'(
M[A1G[;4F\OQL+4?\TZ2;8,J+F;N))H=$2A@1P9R[Q'$.<E9+;XBY-=^#C,^\
MH8S14K',*&4:)!AP<@5\OF(<M+-3NV').QB8GO8!K!?`N#5-/1NVPZ&,VU?D
MS[DI68Y+ZE-CR0VS;-5>ONV8]T"D7&*0"JAG^ET0W*-BRIL;ZZ9>MOMIL!L=
MF4'@>Y^;A^WR8B>K;XXG\)(R/'T>3[)5RQ)^4/0.G7+O,.<T"M(R[["I;^H>
MQAAG98?Q4#O`2,FA9G)/$=>B9AP^B;_@<?B(LZ9T-$5)CJ4HJQ](V3B.LOPB
MZL87$J7C'0#R72/\+EXW0FRZL3[GWI%(+MX^*J2L<P=)9%[G)E*B$G/N(XGT
MM$T(*UY/DK,LNZ<DZ/C)%'9AB6?Y_N?<]F9SKK72E-N\PYS;SC+Z\%-N.N>7
MSK@'U82[M0G7*G!:D'/A'&Y_RDTKE^1;<J<>9+)\V!#:`L^[>/SL;KY/P6\9
ME"_#R_R60<DR_/ZB6I:A-<$[D1_J(SHDR4E1C%8U%67U$ZWSSS6X][&C$T;0
M$4.RYB5Q9"G=Y-G&]@[!$!G#+][@^3Z#U0,6"H/I8]NQZG9!GZT":LU8=?L'
MBE6W#K66V2#-0TT_3.+1L-A@2/@2#[6ZAEE!!1544$$%%51000455%!!!154
M4$$%%51000455%!!!1544$$%%51000455/"7@O\!J&#G)P!X````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
9````````````````````````````````````
`
end


-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas.  If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken

Re: [PATCHES] Try again: S_LOCK reduced contention

From
Bruce Momjian
Date:
> 1.6. New patch to follow.
>
>    The current S_LOCK and TAS() implementations (my patch of late May) are
>    slower than they need to be and cause more code bloat than they need to.
>    The bloat is caused by using a macro to inline a relatively complex bit
>    of code that is only used in the blocked lock case. I suspect the slowness
>    is caused at least partly by the macro as it requires more registers.
>
>    I have developed a new patch that separates out the lock available case
>    from the busywaiting case and that uses the GCC _inline_ facilty to make
>    the asm interface still look as clean as a function while not costing
>    anything. For a preview, see

Quite and analysis.  I want to comment on the code more, but I just want
to point out now that many of our i386 platforms are not GNU.  I think
we have to use macros.  I can't think of any GNU-specific code in the
source tree at this point, and I don't think it makes sense add it now
just to make the code look a litter cleaner.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [PATCHES] Try again: S_LOCK reduced contention

From
dg@illustra.com (David Gould)
Date:
> > 1.6. New patch to follow.
> >
> >    The current S_LOCK and TAS() implementations (my patch of late May) are
> >    slower than they need to be and cause more code bloat than they need to.
> >    The bloat is caused by using a macro to inline a relatively complex bit
> >    of code that is only used in the blocked lock case. I suspect the slowness
> >    is caused at least partly by the macro as it requires more registers.
> >
> >    I have developed a new patch that separates out the lock available case
> >    from the busywaiting case and that uses the GCC _inline_ facilty to make
> >    the asm interface still look as clean as a function while not costing
> >    anything. For a preview, see
>
> Quite and analysis.  I want to comment on the code more, but I just want

Please do. I am very interested in reactions or followup investigations.

> to point out now that many of our i386 platforms are not GNU.  I think
> we have to use macros.  I can't think of any GNU-specific code in the
> source tree at this point, and I don't think it makes sense add it now
> just to make the code look a litter cleaner.

Most of the original tas() __asm__() implementations are GCC specific. This
includes all the Linux platforms except PPC, all the *BSD platforms, even the
VAX. GCC is also fairly commonly used even on the commercial OSes.

As far as I can tell, the only C coded platforms that are not GCC specific
are SCO i386 and SunOS/Solaris on Sun3 and Sparc. The other non-GCC platforms
have external tas.s function implementations (HP), or have system specific
calls (AIX, OSF, SGI, Nextstep).

Finally, the difference between a tas() function implementation and the best
possible inline implementation appears to be only 0.06 microseconds on a P133.
This will add 0.0003 seconds to startup. On SCO only. On Sparc this is a leaf
call and possibly even cheaper. No other platforms are affected.

Remember also that I am adding two features that previously did not exist,
backoff, and stuck lock detection.

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"If you lie to the compiler, it will get its revenge."  -- Henry Spencer

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From
"Matthew N. Dodd"
Date:
On Wed, 10 Jun 1998, Bruce Momjian wrote:
> Quite and analysis.  I want to comment on the code more, but I just want
> to point out now that many of our i386 platforms are not GNU.  I think
> we have to use macros.  I can't think of any GNU-specific code in the
> source tree at this point, and I don't think it makes sense add it now
> just to make the code look a litter cleaner.

Indeed.  Those of use who have thousand dollar SunPro compilers thank you.

(can you say progressive optomizer?)

/*
   Matthew N. Dodd        | A memory retaining a love you had for life
   winter@jurai.net        | As cruel as it seems nothing ever seems to
   http://www.jurai.net/~winter | go right - FLA M 3.1:53
*/


Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From
dg@illustra.com (David Gould)
Date:
Matthew N. Dodd writes:
> On Wed, 10 Jun 1998, Bruce Momjian wrote:
> > Quite and analysis.  I want to comment on the code more, but I just want
> > to point out now that many of our i386 platforms are not GNU.  I think
> > we have to use macros.  I can't think of any GNU-specific code in the
> > source tree at this point, and I don't think it makes sense add it now
> > just to make the code look a litter cleaner.
>
> Indeed.  Those of use who have thousand dollar SunPro compilers thank you.
>
> (can you say progressive optomizer?)
                           ^^^^^^^^^  uhhh, no. ;-)


Hmmmm, looking at the original code, non-GCC Sparc makes a function call to
the tas() routine which is coded as asm. I have not in fact changed it.
As I understand your comment, you wish this to be a macro.

The code is:

#if defined(NEED_SPARC_TAS_ASM)
/*
 * sparc machines not using gcc
 */
static void tas_dummy()     /* really means: extern int tas(slock_t *lock); */
{
    asm(".seg \"data\"");
    asm(".seg \"text\"");
    asm("_tas:");
    /*
     * Sparc atomic test and set (sparc calls it "atomic load-store")
     */
    asm("ldstub [%r8], %r8");
    asm("retl");
    asm("nop");
}
#endif /* NEED_SPARC_TAS_ASM */


I doubt there are any major performance gains to be had here, but I would
be interested to learn otherwise. I don't have access to a Sparc machine
that I can use for this, so if anyone cares to test this implementation and
any others they can think of please post the results.

But I think perhaps we are micro-optimizing here. I only bothered to do
all the i386 flavors because Bruce had some gprof output that suggested
we had a problem (we didn't), and then I just got kinda interested in the
experiment itself.

-dg


David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"A week of coding can sometimes save an hour of thought."

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From
Bruce Momjian
Date:
> Most of the original tas() __asm__() implementations are GCC specific. This
> includes all the Linux platforms except PPC, all the *BSD platforms, even the
> VAX. GCC is also fairly commonly used even on the commercial OSes.
>
> As far as I can tell, the only C coded platforms that are not GCC specific
> are SCO i386 and SunOS/Solaris on Sun3 and Sparc. The other non-GCC platforms
> have external tas.s function implementations (HP), or have system specific
> calls (AIX, OSF, SGI, Nextstep).

That s_lock.h file is a hornets nest of portability problems.  I really
don't want to have multiple functions/macros for different CPU's if I
can help it.  I don't even want to mix functions/macros for the same
function name if I can help it.  I also do not want to start playing
around with isGNU/isnotGNU in a file that is already complex.

Macros work and we already have tons of them, we don't use inline
anywhere else, and the actual locks are 80% asm code anyway, so it looks
the same whether it is in a macro or an inline function.

I have made them macros before for this file, so I can do it again quite
easily.

As for the benefits, well, when I see lots of calls to a function, and I
try and eliminate the calls if it is reasonable.  In many places, the
call handling is actually more instructions than the inlining.  I look
at the measured performance change vs. the executable size increase and
make a decision. With something like s_lock, it just seems normal to
make it a macro.

> Finally, the difference between a tas() function implementation and the best
> possible inline implementation appears to be only 0.06 microseconds on a P133.
> This will add 0.0003 seconds to startup. On SCO only. On Sparc this is a leaf
> call and possibly even cheaper. No other platforms are affected.
>
> Remember also that I am adding two features that previously did not exist,
> backoff, and stuck lock detection.

Yes, and good improvements.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From
dg@illustra.com (David Gould)
Date:
Bruce Momjian writes:
> David Gould writes:
> > Most of the original tas() __asm__() implementations are GCC specific. This
> > includes all the Linux platforms except PPC, all the *BSD platforms, even the
> > VAX. GCC is also fairly commonly used even on the commercial OSes.
> >
> > As far as I can tell, the only C coded platforms that are not GCC specific
> > are SCO i386 and SunOS/Solaris on Sun3 and Sparc. The other non-GCC platforms
> > have external tas.s function implementations (HP), or have system specific
> > calls (AIX, OSF, SGI, Nextstep).
>
> That s_lock.h file is a hornets nest of portability problems.  I really
> don't want to have multiple functions/macros for different CPU's if I
> can help it.  I don't even want to mix functions/macros for the same
> function name if I can help it.  I also do not want to start playing
> around with isGNU/isnotGNU in a file that is already complex.

Actually, my main motivation for this file is to reduce the portability
problems. If you will look at the next patch (when I submit it, probably
tonight) I think you will see that it is fairly clear what to do to port to
a new platform, and how the existing platforms work.

We already implicitly make a isGCC vs notGCC distinction when we use the
GCC asm() syntax. I am merely intending to make it explict.

> Macros work and we already have tons of them, we don't use inline
> anywhere else, and the actual locks are 80% asm code anyway, so it looks
> the same whether it is in a macro or an inline function.
>
> I have made them macros before for this file, so I can do it again quite
> easily.
>
> As for the benefits, well, when I see lots of calls to a function, and I
> try and eliminate the calls if it is reasonable.  In many places, the
> call handling is actually more instructions than the inlining.  I look
> at the measured performance change vs. the executable size increase and
> make a decision. With something like s_lock, it just seems normal to
> make it a macro.

With the old S_LOCK this was a reasonable choice. With the new S_LOCK which
is quite a bit more complex, the macro expansion generates quite a bit of
code. See the generated code for the "MacroSLOCK" case in my large post.

> > Finally, the difference between a tas() function implementation and the best
> > possible inline implementation appears to be only 0.06 microseconds on a P133.
> > This will add 0.0003 seconds to startup. On SCO only. On Sparc this is a leaf
> > call and possibly even cheaper. No other platforms are affected.
> >
> > Remember also that I am adding two features that previously did not exist,
> > backoff, and stuck lock detection.
>
> Yes, and good improvements.

Again, please have a look at the (forthcoming) patch. It gives up nothing in
either space or time performance compared to the original, is clearer imho,
and incorporates the the new features.

-dg

David Gould           dg@illustra.com            510.628.3783 or 510.305.9468
Informix Software                      300 Lakeside Drive   Oakland, CA 94612
 - A child of five could understand this!  Fetch me a child of five.


Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contention

From
Bruce Momjian
Date:
>
> Bruce Momjian writes:
> > David Gould writes:
> > > Most of the original tas() __asm__() implementations are GCC specific. This
> > > includes all the Linux platforms except PPC, all the *BSD platforms, even the
> > > VAX. GCC is also fairly commonly used even on the commercial OSes.
> > >
> > > As far as I can tell, the only C coded platforms that are not GCC specific
> > > are SCO i386 and SunOS/Solaris on Sun3 and Sparc. The other non-GCC platforms
> > > have external tas.s function implementations (HP), or have system specific
> > > calls (AIX, OSF, SGI, Nextstep).
> >
> > That s_lock.h file is a hornets nest of portability problems.  I really
> > don't want to have multiple functions/macros for different CPU's if I
> > can help it.  I don't even want to mix functions/macros for the same
> > function name if I can help it.  I also do not want to start playing
> > around with isGNU/isnotGNU in a file that is already complex.
>
> Actually, my main motivation for this file is to reduce the portability
> problems. If you will look at the next patch (when I submit it, probably
> tonight) I think you will see that it is fairly clear what to do to port to
> a new platform, and how the existing platforms work.
>
> We already implicitly make a isGCC vs notGCC distinction when we use the
> GCC asm() syntax. I am merely intending to make it explict.

Ah, I see.  I wondered how other compilers were understanding the asm()
stuff.  I thought it was gcc-specific, but then other platforms were
using it.  I guess they have gcc.

>
> > Macros work and we already have tons of them, we don't use inline
> > anywhere else, and the actual locks are 80% asm code anyway, so it looks
> > the same whether it is in a macro or an inline function.
> >
> > I have made them macros before for this file, so I can do it again quite
> > easily.
> >
> > As for the benefits, well, when I see lots of calls to a function, and I
> > try and eliminate the calls if it is reasonable.  In many places, the
> > call handling is actually more instructions than the inlining.  I look
> > at the measured performance change vs. the executable size increase and
> > make a decision. With something like s_lock, it just seems normal to
> > make it a macro.
>
> With the old S_LOCK this was a reasonable choice. With the new S_LOCK which
> is quite a bit more complex, the macro expansion generates quite a bit of
> code. See the generated code for the "MacroSLOCK" case in my large post.

Yes, I suspected that may be a problem.  I will apply your patch as soon
as I see it.

>
> > > Finally, the difference between a tas() function implementation and the best
> > > possible inline implementation appears to be only 0.06 microseconds on a P133.
> > > This will add 0.0003 seconds to startup. On SCO only. On Sparc this is a leaf
> > > call and possibly even cheaper. No other platforms are affected.
> > >
> > > Remember also that I am adding two features that previously did not exist,
> > > backoff, and stuck lock detection.
> >
> > Yes, and good improvements.
>
> Again, please have a look at the (forthcoming) patch. It gives up nothing in
> either space or time performance compared to the original, is clearer imho,
> and incorporates the the new features.

Sounds like a plan.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] S_LOCK reduced contention

From
Tom Lane
Date:
dg@illustra.com (David Gould) writes:
> [ Much careful testing snipped ]

Nice job, David!

>    gprof introduces severe experimental error for small quick functions.
>    Perhaps the spinlock function is so short and quick compared to the
>    mcount overhead added to the function prolog that the overhead dominates
>    the measurement. gprof remains a useful tool for larger functions with
>    longer runtimes, but must be considered very suspect for tiny functions.

Right.  gprof is a fine example of Heisenberg's Uncertainty Principle
applied to software ;-).  You can't measure something without affecting it.

As Bruce Momjian just pointed out in another followup, running the test
function a lot of times in a tight loop isn't a perfect answer either:
you find out what happens when the function's code and referenced data are
all in cache, but you can't tell much about what happens when they are
not; and you can't tell whether the whole application's memory usage
patterns are such that the function will remain in cache.

I'm guessing that the backend uses tas() enough that it will probably
stay in cache, but that is *strictly* a guess with no evidence.

>    In some of the test cases there was significant timing variation from
>    run to run even though the test conditions were apparently identical.
>    Even more strangely, the variation in time was not random but appeared
>    to represent two different modes. And, the variation was itself
>    repeatable.
>    [snip]
>    I have no explanation for this variation. Possibly it is some interaction
>    of where the program is loaded and the state of the memory heirarchy, but
>    even this is hard to sustain. I would be very curious to hear of any
>    plausible explainations.

After chewing on this for a while I think that your speculation is
right.  You were using a Pentium, you said.  The Pentium has a two-way
set associative cache, which means that any given main-memory address
has exactly two cache lines it could be loaded into.  Main-memory
addresses that are 1000H apart contend for the same pair of cache lines.
Thus, if your program happens to repeatedly hit three locations that are
exactly 1000H apart, it will suffer a cache miss every time.  Change the
address spacing, and no miss occurs.  The cache miss takes forty-some
clock cycles, versus one if the target location is in cache.
(BTW, I'm getting this info out of Rick Booth's "Inner Loops", a fine
reference book if you are into hand-optimized assembly coding for Intel
processors.)

So what I think you were seeing is that on some runs, the loop involved
hitting three addresses that contend for the same cache line pair, while
on other runs there was no cache contention.  This could be explained
by varying load addresses for the program, if it touched both its own
addresses (variable) and some non-varying addresses --- say, C library
routines executed from a shared library that remained loaded throughout.
If you can link with a non-shared C library then you should get more
consistent runtimes, because the offsets between all the locations
touched by your loop would be fixed, and thus cache hits or misses ought
to be consistent from run to run.

The bottom line, however, is that this behavior is too unpredictable
to be worth worrying about in a production program.  A cache miss in
a tight loop could be created or eliminated by unrelated changes in
distant parts of the code ... and in any case the behavior will be
different on different CPUs.  The 486, Pentium, and Pentium Pro all
have radically different cache layouts, let alone non-Intel CPUs.

            regards, tom lane

Re: [HACKERS] S_LOCK reduced contention

From
dg@illustra.com (David Gould)
Date:
Tom Lane writes:
> dg@illustra.com (David Gould) writes:
> > [ Much careful testing snipped ]
> >    In some of the test cases there was significant timing variation from
> >    run to run even though the test conditions were apparently identical.
> >    Even more strangely, the variation in time was not random but appeared
> >    to represent two different modes. And, the variation was itself
> >    repeatable.
> >    [snip]
> >    I have no explanation for this variation. Possibly it is some interaction
> >    of where the program is loaded and the state of the memory heirarchy, but
> >    even this is hard to sustain. I would be very curious to hear of any
> >    plausible explainations.
>
> After chewing on this for a while I think that your speculation is
> right.  You were using a Pentium, you said.  The Pentium has a two-way
> set associative cache, which means that any given main-memory address
> has exactly two cache lines it could be loaded into.  Main-memory
> addresses that are 1000H apart contend for the same pair of cache lines.
> Thus, if your program happens to repeatedly hit three locations that are
> exactly 1000H apart, it will suffer a cache miss every time.  Change the
> address spacing, and no miss occurs.  The cache miss takes forty-some
> clock cycles, versus one if the target location is in cache.
> (BTW, I'm getting this info out of Rick Booth's "Inner Loops", a fine
> reference book if you are into hand-optimized assembly coding for Intel
> processors.)
>
> So what I think you were seeing is that on some runs, the loop involved
> hitting three addresses that contend for the same cache line pair, while
> on other runs there was no cache contention.  This could be explained
> by varying load addresses for the program, if it touched both its own
> addresses (variable) and some non-varying addresses --- say, C library
> routines executed from a shared library that remained loaded throughout.
> If you can link with a non-shared C library then you should get more
> consistent runtimes, because the offsets between all the locations
> touched by your loop would be fixed, and thus cache hits or misses ought
> to be consistent from run to run.

This is in line with my own speculation. However, I am not convinced.

First, the test loop and the function it calls are only about 100 bytes
total taken together. And, no system calls or library calls are made during
the test. This tends to rule out "locations 1000H apart" and shared library
effects. Also, I would expect the system to load programs and libraries
on VM page boundaries. Unless there is some cachability difference from one
page to the next, I am at a loss to account for this.

> The bottom line, however, is that this behavior is too unpredictable
> to be worth worrying about in a production program.  A cache miss in
> a tight loop could be created or eliminated by unrelated changes in
> distant parts of the code ... and in any case the behavior will be
> different on different CPUs.  The 486, Pentium, and Pentium Pro all
> have radically different cache layouts, let alone non-Intel CPUs.

Agreed, the reasonable thing to do is to try to be sensitive to cache
effects and accept that there are mysteries.

thanks

-dg


David Gould            dg@illustra.com           510.628.3783 or 510.305.9468
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas.  If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken