NTA access on Solaris - Mailing list pgsql-hackers

From Sherry Moore
Subject NTA access on Solaris
Date
Msg-id 20070306054114.GA259293@sun.com
Responses Re: NTA access on Solaris  (Sherry Moore <sherry.moore@sun.com>)
List pgsql-hackers
----- Forwarded message from Sherry Moore <sherry.moore@sun.com> -----

Date: Mon, 5 Mar 2007 21:34:19 -0800
From: Sherry Moore <sherry.moore@sun.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Luke Lonergan <LLonergan@greenplum.com>, Mark Kirkwood <markir@paradise.net.nz>,
    Pavan Deolasee <pavan@enterprisedb.com>, Gavin Sherry <swm@alcove.com.au>,
    PGSQL Hackers <pgsql-hackers@postgresql.org>, Doug Rady <drady@greenplum.com>,
    Sherry Moore <sherry.moore@sun.com>
 
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Tom,

Sorry about the delay.  I have been away from computers all day.

In the current Solaris release in development (code name Nevada,
available for download at http://opensolaris.org), I have implemented
non-temporal access (NTA), which bypasses L2 for most writes and for
reads larger than copyout_max_cached (patchable, defaults to 128K).  The
block size used by Postgres is 8KB.  If I patch copyout_max_cached to
4KB to trigger NTA for reads, the access times with a 16KB buffer and a
128MB buffer are very close.

I wrote readtest to simulate the access pattern of VACUUM (attached).
tread is a 4-socket dual-core Opteron box.

<81 tread >./readtest -h
Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
        -v:             Verbose mode
        -N:             Normalize results by number of reads
        -s <size>:      Working set size (may specify K,M,G suffix)
        -n iter:        Number of test iterations
        -f filename:    Name of the file to read from
        -d [+|-]delta:  Distance between subsequent reads
        -c count:       Number of reads
        -h:             Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

<82 tread >./readtest -s 16k -f boot_archive       
46445262
<83 tread >./readtest -s 128M -f boot_archive          
118294230
<84 tread >./readtest -s 16k -f boot_archive -n 100
4230210856
<85 tread >./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

<89 tread >./readtest -s 16k -f boot_archive
43606882
<90 tread >./readtest -s 128M -f boot_archive 
100547909
<91 tread >./readtest -s 16k -f boot_archive -n 100
4251823995
<92 tread >./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration count is 1 (the default), the timing difference
between the 16k buffer and the 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
the iteration count is bigger, most of the page tables would be in the
Page Descriptor Cache for the later page accesses, so the overhead of
TLB misses becomes smaller.  As you can see, when we do bypass L2, the
performance with either buffer size is comparable.

I am sure your next question is why the 128K limitation for reads.
Here are the main reasons:
   - Based on a lot of the benchmarks and workloads I traced, the
     target buffers of read operations are typically accessed again
     shortly after the read, while those of writes usually are not.
     Therefore, the default operation mode is to bypass L2 for writes,
     but not for reads.

   - The Opteron's L1 cache size is 64K.  If a read is larger than
     128KB, it would have displacement flushed itself anyway, so for
     large reads, I will also bypass L2.  I am working on dynamically
     setting copyout_max_cached based on the L1 D-cache size on the
     system.

The above heuristic should have worked well in Luke's test case.
However, because the reads were done as 16,000 8K reads rather than
one 128MB read, the NTA code was not triggered.

Since the OS code has to be general enough to handle most
workloads, we have to pick some defaults that might not work best for
some specific operations.  It is a calculated balance.

Thanks,
Sherry


On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> "Luke Lonergan" <LLonergan@greenplum.com> writes:
> > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > wrote it).
> 
> Cool.  Maybe Sherry can comment on the question whether it's possible
> for a large-scale-memcpy to not take a hit on filling a cache line
> that wasn't previously in cache?
> 
> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler which is something I've never studied :-(.
> 
>             regards, tom lane

-- 
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <errno.h>
#include <thread.h>
#include <signal.h>
#include <strings.h>
#include <libgen.h>

#define KB(a)           ((a) * 1024)
#define MB(a)           (KB(a) * 1024)

static void
usage(char *s)
{
        fprintf(stderr,
            "Usage: %s [-v] [-N] -s <size> -n iter "
            "[-d delta] [-c count]\n", s);
        fprintf(stderr,
            "\t-v:\t\tVerbose mode\n"
            "\t-N:\t\tNormalize results by number of reads\n"
            "\t-s <size>:\tWorking set size (may specify K,M,G suffix)\n"
            "\t-n iter:\tNumber of test iterations\n"
            "\t-f filename:\tName of the file to read from\n"
            "\t-d [+|-]delta:\tDistance between subsequent reads\n"
            "\t-c count:\tNumber of reads\n"
            "\t-h:\t\tPrint this help\n"
        );
        exit(1);
}

#define ABS(x) ((x) >= 0 ? (x) : -(x))

static void
format_num(size_t v, size_t *new, char *code)
{
        if (v % (1024 * 1024 * 1024) == 0) {
                *new = v / (1024 * 1024 * 1024);
                *code = 'G';
        } else if (v % (1024 * 1024) == 0) {
                *new = v / (1024 * 1024);
                *code = 'M';
        } else if (v % (1024) == 0) {
                *new = v / (1024);
                *code = 'K';
        } else {
                *new = v;
                *code = ' ';
        }
}

static size_t
parse_num(char *s)
{
        size_t v = 0;

        for (;;) {
                switch (tolower(*s)) {
                case '0':
                case '1':
                case '2':
                case '3':
                case '4':
                case '5':
                case '6':
                case '7':
                case '8':
                case '9':
                        v = v * 10 + *s - '0';
                        ++s;
                        continue;

                case 'k':
                        v *= 1024;
                        return (v);

                case 'm':
                        v *= (1024 * 1024);
                        return (v);

                case 'g':
                        v *= (1024 * 1024 * 1024);
                        return (v);

                default:
                        return (v);
                }
        }
}

/*
 * Create a memory segment with a given page size.
 */
static void *
create_memory(size_t size, size_t pagesize)
{
        caddr_t p;

        p = mmap((void *)pagesize, size, PROT_WRITE|PROT_READ,
            MAP_ALIGN|MAP_PRIVATE|MAP_ANON, -1, 0);

        if (p == MAP_FAILED) {
                char    code;
                size_t  out;

                format_num(pagesize, &out, &code);
                fprintf(stderr, "mmap(%lu%c,", out, code);

                format_num(size, &out, &code);
                fprintf(stderr, " %lu%c, ...)", out, code);

                perror("failed");
                exit(1);
        }

        return (p);
}


int
main (int argc, char **argv)
{
        hrtime_t        start, end, total = 0;
        unsigned int    i;
        unsigned int    iterations = 1;
        size_t          pagesize = getpagesize();
        size_t          size = 1024;
        longlong_t      j;
        longlong_t      k;
        char            *table;
        volatile int    value;
        int             c;
        int             verbose = 0;
        int             delta = 1;
        int             normalize = 0;
        size_t          count;
        size_t          count_requested = 0;
        double          normalized;
        char            filename[256];

        while ((c = getopt( argc, argv, "Nhvc:d:f:s:n:")) != EOF) {
                switch (c) {
                case 'n':
                        iterations = parse_num(optarg);
                        break;
                case 's':
                        size = parse_num(optarg);
                        break;
                case 'v':
                        verbose = 1;
                        break;
                case 'd':
                        delta = atoi(optarg);
                        break;
                case 'c':
                        count_requested = parse_num(optarg);
                        break;
                case 'f':
                        strcpy(filename, optarg);
                        break;

                case 'N':
                        normalize = 1;
                        break;
                case 'h':
                default:
                        usage(basename(argv[0]));
                        break;
                }
        }

        if (ABS(delta) >= size) {
                fprintf(stderr, "delta %llu is larger than size %llu\n",
                    ABS(delta), size);
                exit(1);
        }

        count = count_requested ? count_requested : size;

        if (verbose)
                printf("Creating table of %llu bytes\n", size);

        table = create_memory(size, pagesize);

        for (i = 0; i < iterations; i++) {
                int n;
                int offset = 0;
                int fd = -1;

                if ((fd = open(filename, O_RDONLY)) < 0) {
                        perror("open");
                        exit(1);
                }

                k = size - 1;
                start = gethrtime();
                while ((n = read(fd, &table[offset], KB(8))) > 0) {
                        offset += n;
                        offset %= size;
                }

                end = gethrtime();
                total += (end - start);
                normalized = (double)(end - start) / count;
                if (verbose) {
                        printf("total time: %llu, normalized time: %g\n",
                            end - start, normalized);
                } else if (normalize) {
                        printf("%g\n",
                            (double)(end - start) / count);
                }
                close(fd);
        }
        printf("%llu\n", total);
        exit(0);
}


----- End forwarded message -----

-- 
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym

