Re: NTA access on Solaris - Mailing list pgsql-hackers

From Sherry Moore
Subject Re: NTA access on Solaris
Date
Msg-id 20070306065038.GA264296@sun.com
In response to NTA access on Solaris  (Sherry Moore <sherry.moore@sun.com>)
List pgsql-hackers
On a 1P system with a 512K L2, it is more obvious why we shouldn't
bypass L2 for small reads:

The same readtest as in my previous mail, invoked as follows:
   ./readtest -s working-set-size -f /platform/i86pc/boot_archive -n 100

With copyout_max_cached being 128K:

Working set  16K    32K    64K    128K   256K   512K   1M     2M     128M
=========================================================================
Seconds      4.2    4.0    4.1    4.1    5.7    7.0    7.1    7.0    7.5

With copyout_max_cached being 8K:

Working set  16K    32K    64K    128K   256K   512K   1M     2M     128M
=========================================================================
Seconds      4.8    4.8    4.9    4.9    5.0    5.0    5.0    5.0    5.1
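
To put numbers on the two tables above, here is a minimal sketch that
computes, per working-set size, the ratio of the 8K-threshold time to the
128K-threshold time.  The timings are copied from the tables; the helper
names time_ratio and count_slower are made up for this illustration and
are not part of readtest.

```c
#include <assert.h>
#include <stddef.h>

#define NSETS 9

/* Timings in seconds, copied from the two tables above.
 * Column order: 16K 32K 64K 128K 256K 512K 1M 2M 128M. */
static const double secs_128k[NSETS] =
    { 4.2, 4.0, 4.1, 4.1, 5.7, 7.0, 7.1, 7.0, 7.5 };
static const double secs_8k[NSETS] =
    { 4.8, 4.8, 4.9, 4.9, 5.0, 5.0, 5.0, 5.0, 5.1 };

/*
 * Ratio of the 8K-threshold time to the 128K-threshold time for
 * column i.  A value above 1.0 means the lower threshold was slower.
 * (Hypothetical helper, for illustration only.)
 */
static double
time_ratio(size_t i)
{
        assert(i < NSETS);
        return (secs_8k[i] / secs_128k[i]);
}

/* Number of working-set sizes for which the 8K threshold was slower. */
static size_t
count_slower(void)
{
        size_t i, n = 0;

        for (i = 0; i < NSETS; i++) {
                if (time_ratio(i) > 1.0)
                        n++;
        }
        return (n);
}
```

Read this way, the four working sets up to 128K are each roughly 14% to
20% slower with the 8K threshold (for example 4.8 / 4.2 for 16K), while
the five larger ones are faster (5.1 / 7.5 for 128M).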


Sherry

On Mon, Mar 05, 2007 at 09:41:14PM -0800, Sherry Moore wrote:
> ----- Forwarded message from Sherry Moore <sherry.moore@sun.com> -----
> 
> Date: Mon, 5 Mar 2007 21:34:19 -0800
> From: Sherry Moore <sherry.moore@sun.com>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: Luke Lonergan <LLonergan@greenplum.com>,
>     Mark Kirkwood <markir@paradise.net.nz>,
>     Pavan Deolasee <pavan@enterprisedb.com>,
>     Gavin Sherry <swm@alcove.com.au>,
>     PGSQL Hackers <pgsql-hackers@postgresql.org>,
>     Doug Rady <drady@greenplum.com>,
>     Sherry Moore <sherry.moore@sun.com>
> Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant
> 
> Hi Tom,
> 
> Sorry about the delay.  I have been away from computers all day.
> 
> In the current Solaris release in development (Code name Nevada,
> available for download at http://opensolaris.org), I have implemented
> non-temporal access (NTA) which bypasses L2 for most writes, and reads
> larger than copyout_max_cached (patchable, defaults to 128K).  The block
> size used by Postgres is 8KB.  If I patch copyout_max_cached to 4KB to
> trigger NTA for reads, the access times with a 16KB buffer and a 128MB
> buffer are very close.
> 
> I wrote readtest to simulate the access pattern of VACUUM (attached).
> tread is a 4-socket dual-core Opteron box.
> 
> <81 tread >./readtest -h
> Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
>         -v:             Verbose mode
>         -N:             Normalize results by number of reads
>         -s <size>:      Working set size (may specify K,M,G suffix)
>         -n iter:        Number of test iterations
>         -f filename:    Name of the file to read from
>         -d [+|-]delta:  Distance between subsequent reads
>         -c count:       Number of reads
>         -h:             Print this help
> 
> With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):
> 
> <82 tread >./readtest -s 16k -f boot_archive       
> 46445262
> <83 tread >./readtest -s 128M -f boot_archive          
> 118294230
> <84 tread >./readtest -s 16k -f boot_archive -n 100
> 4230210856
> <85 tread >./readtest -s 128M -f boot_archive -n 100
> 6343619546
> 
> With copyout_max_cached at 4K (in nanoseconds, NTA triggered):
> 
> <89 tread >./readtest -s 16k -f boot_archive
> 43606882
> <90 tread >./readtest -s 128M -f boot_archive 
> 100547909
> <91 tread >./readtest -s 16k -f boot_archive -n 100
> 4251823995
> <92 tread >./readtest -s 128M -f boot_archive -n 100
> 4205491984
> 
> When the iteration count is 1 (the default), the timing difference between
> using a 16k buffer and a 128M buffer is much bigger for both
> copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
> the iteration count is bigger, most of the page tables would be in the Page
> Descriptor Cache for the later page accesses, so the overhead of TLB
> misses becomes smaller.  As you can see, when we do bypass L2, the
> performance with either buffer size is comparable.
> 
> I am sure your next question is why the 128K limitation for reads.
> Here are the main reasons:
> 
>     - Based on a lot of the benchmarks and workloads I traced, the
>       target buffers of read operations are typically accessed again
>       shortly after the read, while writes are usually not.  Therefore,
>       the default operation mode is to bypass L2 for writes, but not
>       for reads.
> 
>     - The Opteron's L1 cache size is 64K.  If a read is larger than
>       128KB, it would have displaced its own data from the cache anyway,
>       so for large reads, I will also bypass L2. I am working on dynamically
>       setting copyout_max_cached based on the L1 D-cache size on the
>       system.
> 
> The above heuristic should have worked well in Luke's test case.
> However, due to the fact that the reads were done as 16,000 8K reads
> rather than one 128MB read, the NTA code was not triggered.
> 
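
The arithmetic behind the paragraph above can be sketched as follows.
This is only a model of the rule as stated in the message ("reads larger
than copyout_max_cached" bypass L2); it is not the Solaris kernel code,
the strict ">" comparison is an assumption, and the names
read_bypasses_l2 and reads_needed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the rule described above: a read takes the
 * non-temporal (L2-bypassing) path when it is larger than
 * copyout_max_cached.  Whether the real comparison is strict is an
 * assumption made for this sketch.
 */
static int
read_bypasses_l2(size_t read_size, size_t copyout_max_cached)
{
        return (read_size > copyout_max_cached);
}

/*
 * Number of read() calls needed to transfer `total` bytes in pieces of
 * `chunk` bytes, rounding up for a final partial chunk.
 */
static size_t
reads_needed(size_t total, size_t chunk)
{
        assert(chunk > 0);
        return ((total + chunk - 1) / chunk);
}
```

With these helpers, reads_needed(128 * 1024 * 1024, 8 * 1024) gives 16384
calls (the "16,000" above).  Each 8K call is below the 128K default, so
under this model none of them bypasses L2, whereas a single 128MB read
would.  With the threshold patched down to 4K, every 8K read qualifies.
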
> Since the OS code has to be general enough to handle most
> workloads, we have to pick some defaults that might not work best for
> some specific operations.  It is a calculated balance.
> 
> Thanks,
> Sherry
> 
> 
> On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> > "Luke Lonergan" <LLonergan@greenplum.com> writes:
> > > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > > wrote it).
> > 
> > Cool.  Maybe Sherry can comment on the question whether it's possible
> > for a large-scale-memcpy to not take a hit on filling a cache line
> > that wasn't previously in cache?
> > 
> > I looked a bit at the Linux code that's being used here, but it's all
> > x86_64 assembler which is something I've never studied :-(.
> > 
> >             regards, tom lane
> 
> -- 
> Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym
> 
> #include <stdlib.h>
> #include <stdio.h>
> #include <ctype.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/param.h>
> #include <sys/time.h>
> #include <sys/mman.h>
> #include <errno.h>
> #include <thread.h>
> #include <signal.h>
> #include <string.h>
> #include <strings.h>
> #include <libgen.h>
> 
> #define KB(a)           ((a) * 1024)
> #define MB(a)           (KB(a) * 1024)
> 
> static void
> usage(char *s)
> {
>         fprintf(stderr,
>             "Usage: %s [-v] [-N] -s <size> -n iter "
>             "[-d delta] [-c count]\n", s);
>         fprintf(stderr,
>             "\t-v:\t\tVerbose mode\n"
>             "\t-N:\t\tNormalize results by number of reads\n"
>             "\t-s <size>:\tWorking set size (may specify K,M,G suffix)\n"
>             "\t-n iter:\tNumber of test iterations\n"
>             "\t-f filename:\tName of the file to read from\n"
>             "\t-d [+|-]delta:\tDistance between subsequent reads\n"
>             "\t-c count:\tNumber of reads\n"
>             "\t-h:\t\tPrint this help\n" );
>         exit(1);
> }
> 
> #define ABS(x) ((x) >= 0 ? (x) : -(x))
> 
> static void
> format_num(size_t v, size_t *new, char *code)
> {
>         if (v % (1024 * 1024 * 1024) == 0) {
>                 *new = v / (1024 * 1024 * 1024);
>                 *code = 'G';
>         } else if (v % (1024 * 1024) == 0) {
>                 *new = v / (1024 * 1024);
>                 *code = 'M';
>         } else if (v % (1024) == 0) {
>                 *new = v / (1024);
>                 *code = 'K';
>         } else {
>                 *new = v;
>                 *code = ' ';
>         }
> }
> 
> static size_t
> parse_num(char *s)
> {
>         size_t v = 0;
> 
>         for (;;) {
>                 switch (tolower(*s)) {
>                 case '0':
>                 case '1':
>                 case '2':
>                 case '3':
>                 case '4':
>                 case '5':
>                 case '6':
>                 case '7':
>                 case '8':
>                 case '9':
>                         v = v * 10 + *s - '0';
>                         ++s;
>                         continue;
> 
>                 case 'k':
>                         v *= 1024;
>                         return (v);
> 
>                 case 'm':
>                         v *= (1024 * 1024);
>                         return (v);
> 
>                 case 'g':
>                         v *= (1024 * 1024 * 1024);
>                         return (v);
> 
>                 default:
>                         return (v);
>                 }
>         }
> }
> 
> /*
>  * Create a memory segment with a given pagesize.
>  */
> static void *
> create_memory(size_t size, size_t pagesize)
> {
>         caddr_t p;
> 
>         p = mmap((void *)pagesize, size, PROT_WRITE|PROT_READ,
>             MAP_ALIGN|MAP_PRIVATE|MAP_ANON, -1, 0);
> 
>         if (p == MAP_FAILED) {
>                 char    code;
>                 size_t  out;
> 
>                 format_num(pagesize, &out, &code);
>                 fprintf(stderr, "mmap(%lu%c,", out, code);
> 
>                 format_num(size, &out, &code);
>                 fprintf(stderr, " %lu%c, ...)", out, code);
> 
>                 perror("failed");
>                 exit(1);
>         }
> 
>         return (p);
> }
> 
> 
> int
> main (int argc, char **argv)
> {
>         hrtime_t        start, end, total = 0;
>         unsigned int    i;
>         unsigned int    iterations = 1;
>         size_t          pagesize = getpagesize();
>         size_t          size = 1024;
>         longlong_t      j;
>         longlong_t      k;
>         char            *table;
>         volatile int    value;
>         int             c;
>         int             verbose = 0;
>         int             delta = 1;
>         int             normalize = 0;
>         size_t          count;
>         size_t          count_requested = 0;
>         double          normalized;
>         char            filename[256] = "";
> 
>         while ((c = getopt( argc, argv, "Nhvc:d:f:s:n:")) != EOF) {
>                 switch (c) {
>                 case 'n':
>                         iterations = parse_num(optarg);
>                         break;
>                 case 's':
>                         size = parse_num(optarg);
>                         break;
>                 case 'v':
>                         verbose = 1;
>                         break;
>                 case 'd':
>                         delta = atoi(optarg);
>                         break;
>                 case 'c':
>                         count_requested = parse_num(optarg);
>                         break;
>                 case 'f':
>                         (void) strlcpy(filename, optarg, sizeof (filename));
>                         break;
> 
>                 case 'N':
>                         normalize = 1;
>                         break;
>                 case 'h':
>                 default:
>                         usage(basename(argv[0]));
>                         break;
>                 }
>         }
> 
>         if ((size_t)ABS(delta) >= size) {
>                 fprintf(stderr, "delta %d is larger than size %lu\n",
>                     ABS(delta), (unsigned long)size);
>                 exit(1);
>         }
> 
>         count = count_requested ? count_requested : size;
> 
>         if (verbose)
>                 printf("Creating table of %lu bytes\n", (unsigned long)size);
> 
>         table = create_memory(size, pagesize);
> 
> 
>         for (i = 0; i < iterations; i++) {
>                 int n;
>                 int offset = 0;
>                 int fd = -1;
> 
>                 if ((fd = open(filename, O_RDONLY)) < 0) {
>                         perror("open");
>                         exit(1);
>                 }
> 
>                 k = size - 1;
>                 start = gethrtime();
>                 while ((n = read(fd, &table[offset], KB(8))) >0) {
>                         offset += n;
>                         offset %= size;
>                 }
> 
>                 end = gethrtime();
>                 total += (end - start);
>                 normalized = (double)(end - start) / count;
>                 if (verbose) {
>                         printf("total time: %llu, normalized time: %g\n",
>                             end - start, normalized);
>                 } else if (normalize) {
>                         printf("%g\n",
>                             (double)(end - start) / count);
>                 }
>                 close(fd);
>         }
>         printf("%llu\n", total);
>         exit(0);
> }
> 
> 
> ----- End forwarded message -----
> 
> -- 
> Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym

-- 
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym

