Re: update i386 spinlock for hyperthreading - Mailing list pgsql-patches

From Manfred Spraul
Subject Re: update i386 spinlock for hyperthreading
Date
Msg-id 3FED6028.7020300@colorfullife.com
Whole thread Raw
In response to Re: update i386 spinlock for hyperthreading  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: update i386 spinlock for hyperthreading  (ohp@pyrenet.fr)
Re: update i386 spinlock for hyperthreading  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
Tom Lane wrote:

>Manfred Spraul <manfred@colorfullife.com> writes:
>
>
>>Tom Lane wrote:
>>
>>
>>>Don't you have to put it in a specific place in the loop to make that
>>>work?  If not, why not?
>>>
>>>
>>>
>>Rep;nop is just a short delay - that's all.
>>
>>
>
>That view seems to me to be directly contradicted by this statement:
>
>
>
>>The PAUSE instruction provides a hint to the processor that
>>the code sequence is a spin-wait loop. The processor uses this hint to
>>avoid the memory order violation in most situations, which greatly
>>improves processor performance.
>>
>>
>
>It's not apparent to me how a short delay translates into avoiding a
>memory order violation (possibly some docs on what that means exactly
>might help...).  I suspect strongly that there needs to be some near
>proximity between the PAUSE instruction and the lock-test instruction
>for this to work as advertised.  It would help if Intel were less coy
>about what the instruction really does.
>
>
My guess: Pentium 4 cpu support something like 250 uops in flight - it
will have a dozend of the spinlock loops in it's pipeline. When the
spinlock is released, it must figure out which of the loops should get
it, and gets lost. My guess is that rep;nop delays the cpu buy at least
100 cpu ticks, and thus the pipeline will be empty before it proceeds. I
don't have a Pentium 4, and the HP testdrive is down. Someone around who
could run my test app?

>
>
>>This instruction does not change the architectural state of the
>>processor (that is, it performs essentially a delaying noop
>>operation).
>>
>>
>
>This can be rephrased as "we're not telling you what this instruction
>really does, because its interesting effects are below the level of the
>instruction set architecture".  Great.  How are we supposed to know
>how to use it?
>
>
There was a w_spinlock.pdf document with reference code. google still
finds it, but the links are dead :-(

>>I think a separate function is better than adding it into TAS: if it's
>>part of tas, then it would automatically be included by every
>>SpinLockAcquire call - unnecessary .text bloat.
>>
>>
>
>Why do you think it's unnecessary?  One thing that I find particularly
>vague in the quoted documentation is the statement that the PAUSE
>instruction is needed to avoid a delay when *exiting* the spin-wait
>loop.  Doesn't this mean that a PAUSE is needed in the success path
>when the first TAS succeeds (i.e, the normal no-contention path)?
>
IIRC: No.

>If not, why not?  If so, does it go before or after the lock
>instruction?
>
>
Neither: somewhere in the failure path.

>Also, if the principal effect is a "short delay", do we really need it
>at all considering that our inner loop in s_lock is rather more than
>an "xchgb" followed by a conditional branch?  There will be time for
>the write queue to drain while we're incrementing and testing our
>spin counter (which I trust is in a register...).
>
>The reason I'm so full of questions is that I spent some time several
>days ago looking at exactly this issue, and came away with only the
>conclusion that I had to find some more-detailed documentation before
>I could figure out what we should do about the spinlocks for Xeons.
>
I'll try to find some more docs and post links.

The 2nd thing I would change is to add a nonatomic test in the slow
path: locked instructions generate lots of bus traffic, and that's a
waste of resources.

Another question: regardless of the placement of rep;nop - 10% speedup
means that the postgres spends far too much time in the spinlock code.
I've looked at the oprofile dumps, and something like 1.2% of the total
cpu time is spent it the TAS macro in LWLockAcquire. That's the hottest
instruction in the whole profile, it eats more cpu cycles than the
memcpy() calls that transfer data to/from kernel.
Is there an easy way find out which LWLock is contended?
--
    Manfred
/*
 * skel.cpp. skeleton for rdtsc benchmarks
 *
 * Copyright (C) 1999, 2001 by Manfred Spraul.
 *    All rights reserved except the rights granted by the GPL.
 *
 * Redistribution of this file is permitted under the terms of the GNU
 * General Public License (GPL) version 2 or later.
 * $Header: /pub/home/manfred/cvs-tree/timetest/rep_nop.cpp,v 1.1 2001/04/07 19:38:33 manfred Exp $
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>

// disable local interrupts during benchmark
#undef USE_CLI

// define a cache flushing function
#undef CACHE_FLUSH

#ifdef USE_CLI
#include <sys/io.h>
#define CLI    "cli\n\t"
#define STI    "sti\n\t"
#else
#define CLI
#define STI
#define iopl(a)    do { } while(0)
#endif

// Intel recommends that a serializing instruction
// should be called before and after rdtsc.
// CPUID is a serializing instruction.
// ".align 128:" P 4 L2 cache line size
#define read_rdtsc_before(time)        \
    __asm__ __volatile__(        \
        ".align 128\n\t"    \
        "xor %%eax,%%eax\n\t"    \
        CLI            \
        "cpuid\n\t"        \
        "rdtsc\n\t"        \
        "mov %%eax,(%0)\n\t"    \
        "mov %%edx,4(%0)\n\t"    \
        "xor %%eax,%%eax\n\t"    \
        "cpuid\n\t"        \
        : /* no output */    \
        : "S"(&time)        \
        : "eax", "ebx", "ecx", "edx", "memory")

#define read_rdtsc_after(time)        \
    __asm__ __volatile__(        \
        "xor %%eax,%%eax\n\t"    \
        "cpuid\n\t"        \
        "rdtsc\n\t"        \
        "mov %%eax,(%0)\n\t"    \
        "mov %%edx,4(%0)\n\t"    \
        "xor %%eax,%%eax\n\t"    \
        "cpuid\n\t"        \
        STI            \
        : /* no output */    \
        : "S"(&time)        \
        : "eax", "ebx", "ecx", "edx", "memory")

#define BUILD_TESTFNC(name, text, instructions) \
void name##_dummy(void)                \
{                        \
    __asm__ __volatile__(            \
        ".align 4096\n\t"        \
        "xor %%eax, %%eax\n\t"        \
        : : : "eax");            \
}                        \
static unsigned long name##_best = 1024*1024*1024; \
\
static void name(void) \
{ \
    unsigned long long time; \
    unsigned long long time2; \
 \
    read_rdtsc_before(time); \
    instructions; \
    read_rdtsc_after(time2); \
    if(time2-time < name##_best) { \
        printf( text ":\t%10Ld ticks; \n", \
            time2-time-zerotest_best); \
        name##_best = time2-time; \
    } \
}

void filler(void)
{
static int i = 3;
static int j;
    j = i*i;
}

#define DO_3(x) \
    do { x; x; x; } while(0)

#define DO_10(x) \
    do { x; x; x; x; x; x; x; x; x; x;} while(0)

#define DO_50(x) \
    do { DO_10(x); DO_10(x);DO_10(x); DO_10(x);DO_10(x);} while(0)


#define DO_T(y) do { \
    DO_3(filler()); \
    y; \
    DO_3(filler());} while(0)

#ifdef CACHE_FLUSH
#define DRAIN_SZ    (4*1024*1024)
int other[3*DRAIN_SZ] __attribute ((aligned (4096)));
static inline void drain_cache(void)
{
    int i;
    for(i=0;i<DRAIN_SZ;i++) other[DRAIN_SZ+i]=0;
    for(i=0;i<DRAIN_SZ;i++) if(other[DRAIN_SZ+i]!=0) break;
}
#else
static inline void drain_cache(void)
{
}
#endif

#define DO_TEST(x) \
    do { \
        int i; \
        for(i=0;i<5000;i++) \
            x; \
    } while(0)

//////////////////////////////////////////////////////////////////////////////

#define REP_NOP()    __asm__ __volatile__ ("rep;nop\n\t": : : "memory");
#define NOP()        __asm__ __volatile__ ("nop\n\t": : : "memory");

BUILD_TESTFNC(zerotest,"zerotest", DO_T((void)0));
BUILD_TESTFNC(rnop,"rep nop", DO_T(REP_NOP()));
BUILD_TESTFNC(nop,"nop", DO_T(NOP()));

//////////////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////////
int main()
{
    if(geteuid() == 0) {
        int res = nice(-20);
        if(res < 0) {
            perror("nice(-20)");
            return 1;
        }
        printf("MOVETEST, reniced to (-20).\n");
    } else
    {
        printf("MOVETEST called by non-superuser, running with normal priority.\n");
    }
    for(;;) {
        DO_TEST(zerotest());
        DO_TEST(rnop());
        DO_TEST(nop());
    }
    return 0;
}
Index: ./src/backend/storage/lmgr/s_lock.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/storage/lmgr/s_lock.c,v
retrieving revision 1.22
diff -u -r1.22 s_lock.c
--- ./src/backend/storage/lmgr/s_lock.c    23 Dec 2003 22:15:07 -0000    1.22
+++ ./src/backend/storage/lmgr/s_lock.c    27 Dec 2003 10:28:54 -0000
@@ -76,7 +76,7 @@
      * The select() delays are measured in centiseconds (0.01 sec) because 10
      * msec is a common resolution limit at the OS level.
      */
-#define SPINS_PER_DELAY        100
+#define SPINS_PER_DELAY        1000
 #define NUM_DELAYS            1000
 #define MIN_DELAY_CSEC        1
 #define MAX_DELAY_CSEC        100
@@ -86,7 +86,7 @@
     int            cur_delay = MIN_DELAY_CSEC;
     struct timeval delay;

-    while (TAS(lock))
+    while (!S_LOCK_FREE(lock) || TAS(lock))
     {
         if (++spins > SPINS_PER_DELAY)
         {
@@ -111,6 +111,7 @@

             spins = 0;
         }
+        CPU_DELAY();
     }
 }

Index: ./src/include/storage/s_lock.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/include/storage/s_lock.h,v
retrieving revision 1.123
diff -u -r1.123 s_lock.h
--- ./src/include/storage/s_lock.h    23 Dec 2003 22:15:07 -0000    1.123
+++ ./src/include/storage/s_lock.h    27 Dec 2003 10:28:55 -0000
@@ -52,6 +52,15 @@
  *    in assembly language to execute a hardware atomic-test-and-set
  *    instruction.  Equivalent OS-supplied mutex routines could be used too.
  *
+ *    Additionally, a platform specific delay function can be defined:
+ *
+ *    void CPU_DELAY(void)
+ *        Notification that the cpu is executing a busy loop.
+ *
+ *     Some platforms need such an indication. One example are platforms
+ *     that implement SMT, i.e. multiple logical threads that share
+ *     execution resources in one physical cpu.
+ *
  *    If no system-specific TAS() is available (ie, HAVE_SPINLOCKS is not
  *    defined), then we fall back on an emulation that uses SysV semaphores
  *    (see spin.c).  This emulation will be MUCH MUCH slower than a proper TAS()
@@ -115,6 +124,17 @@
     return (int) _res;
 }

+#define HAS_CPU_DELAY
+
+#define CPU_DELAY()    cpu_delay()
+
+static __inline__ void
+cpu_delay(void)
+{
+    __asm__ __volatile__(
+        " rep; nop            \n"
+        : : : "memory");
+}
 #endif     /* __i386__ || __x86_64__ */


@@ -715,6 +735,9 @@
 #define TAS(lock)        tas(lock)
 #endif     /* TAS */

+#ifndef HAS_CPU_DELAY
+#define CPU_DELAY()    do { } while(0)
+#endif

 /*
  * Platform-independent out-of-line support routines

pgsql-patches by date:

Previous
From: Claudio Natoli
Date:
Subject: Re: fork/exec patch: pre-CreateProcess finalization
Next
From: ohp@pyrenet.fr
Date:
Subject: Re: update i386 spinlock for hyperthreading