Re: NTA access on Solaris - Mailing list pgsql-hackers
From | Sherry Moore |
---|---|
Subject | Re: NTA access on Solaris |
Date | |
Msg-id | 20070306065038.GA264296@sun.com |
In response to | NTA access on Solaris (Sherry Moore <sherry.moore@sun.com>) |
List | pgsql-hackers |
On a 1P system with 512K L2, it is more obvious why we shouldn't
bypass L2 for small reads.

The same readtest as in my previous mail, invoked as follows:

    ./readtest -s working-set-size -f /platform/i86pc/boot_archive -n 100

With copyout_max_cached being 128K:

Working set  16K   32K   64K   128K  256K  512K  1M    2M    128M
================================================================================
Seconds      4.2   4.0   4.1   4.1   5.7   7.0   7.1   7.0   7.5

With copyout_max_cached being 8K:

Working set  16K   32K   64K   128K  256K  512K  1M    2M    128M
================================================================================
Seconds      4.8   4.8   4.9   4.9   5.0   5.0   5.0   5.0   5.1

Sherry

On Mon, Mar 05, 2007 at 09:41:14PM -0800, Sherry Moore wrote:
> ----- Forwarded message from Sherry Moore <sherry.moore@sun.com> -----
>
> Date: Mon, 5 Mar 2007 21:34:19 -0800
> From: Sherry Moore <sherry.moore@sun.com>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: Luke Lonergan <LLonergan@greenplum.com>,
>     Mark Kirkwood <markir@paradise.net.nz>,
>     Pavan Deolasee <pavan@enterprisedb.com>,
>     Gavin Sherry <swm@alcove.com.au>,
>     PGSQL Hackers <pgsql-hackers@postgresql.org>,
>     Doug Rady <drady@greenplum.com>,
>     Sherry Moore <sherry.moore@sun.com>
> Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant
>
> Hi Tom,
>
> Sorry about the delay.  I have been away from computers all day.
>
> In the current Solaris release in development (code name Nevada,
> available for download at http://opensolaris.org), I have implemented
> non-temporal access (NTA), which bypasses L2 for most writes and for
> reads larger than copyout_max_cached (patchable, defaults to 128K).  The
> block size used by Postgres is 8KB.  If I patch copyout_max_cached to 4KB
> to trigger NTA for reads, the access times with a 16KB buffer and a 128MB
> buffer are very close.
>
> I wrote readtest to simulate the access pattern of VACUUM (attached).
> tread is a 4-socket dual-core Opteron box.
>
> <81 tread >./readtest -h
> Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
>         -v:             Verbose mode
>         -N:             Normalize results by number of reads
>         -s <size>:      Working set size (may specify K,M,G suffix)
>         -n iter:        Number of test iterations
>         -f filename:    Name of the file to read from
>         -d [+|-]delta:  Distance between subsequent reads
>         -c count:       Number of reads
>         -h:             Print this help
>
> With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):
>
> <82 tread >./readtest -s 16k -f boot_archive
> 46445262
> <83 tread >./readtest -s 128M -f boot_archive
> 118294230
> <84 tread >./readtest -s 16k -f boot_archive -n 100
> 4230210856
> <85 tread >./readtest -s 128M -f boot_archive -n 100
> 6343619546
>
> With copyout_max_cached at 4K (in nanoseconds, NTA triggered):
>
> <89 tread >./readtest -s 16k -f boot_archive
> 43606882
> <90 tread >./readtest -s 128M -f boot_archive
> 100547909
> <91 tread >./readtest -s 16k -f boot_archive -n 100
> 4251823995
> <92 tread >./readtest -s 128M -f boot_archive -n 100
> 4205491984
>
> When the iteration count is 1 (the default), the timing difference
> between using a 16k buffer and a 128M buffer is much bigger for both
> copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
> the iteration count is bigger, most of the page tables would be in the
> Page Descriptor Cache for the later page accesses, so the overhead of
> TLB misses becomes smaller.  As you can see, when we do bypass L2, the
> performance with either buffer size is comparable.
>
> I am sure your next question is why the 128K limitation for reads.
> Here are the main reasons:
>
>     - Based on a lot of the benchmarks and workloads I traced, the
>       target buffers of read operations are typically accessed again
>       shortly after the read, while those of writes usually are not.
>       Therefore, the default operation mode is to bypass L2 for
>       writes, but not for reads.
>
>     - The Opteron's L1 cache size is 64K.  If reads are larger than
>       128KB, it would have displacement flushed itself anyway, so for
>       large reads, I will also bypass L2.  I am working on dynamically
>       setting copyout_max_cached based on the L1 D-cache size on the
>       system.
>
> The above heuristic should have worked well in Luke's test case.
> However, because the reads were done as 16,000 8K reads rather than
> one 128MB read, the NTA code was not triggered.
>
> Since the OS code has to be general enough to handle most workloads,
> we have to pick some defaults that might not work best for some
> specific operations.  It is a calculated balance.
>
> Thanks,
> Sherry
>
>
> On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> > "Luke Lonergan" <LLonergan@greenplum.com> writes:
> > > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > > wrote it).
> >
> > Cool.  Maybe Sherry can comment on the question whether it's possible
> > for a large-scale-memcpy to not take a hit on filling a cache line
> > that wasn't previously in cache?
> >
> > I looked a bit at the Linux code that's being used here, but it's all
> > x86_64 assembler which is something I've never studied :-(.
> >
> >             regards, tom lane
>
> --
> Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <ctype.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/param.h>
> #include <sys/time.h>
> #include <sys/mman.h>
> #include <errno.h>
> #include <thread.h>
> #include <signal.h>
> #include <strings.h>
> #include <libgen.h>
>
> #define KB(a) ((a) * 1024)
> #define MB(a) (KB(a) * 1024)
>
> static void
> usage(char *s)
> {
>     fprintf(stderr,
>         "Usage: %s [-v] [-N] -s <size> -n iter "
>         "[-d delta] [-c count]\n", s);
>     fprintf(stderr,
>         "\t-v:\t\tVerbose mode\n"
>         "\t-N:\t\tNormalize results by number of reads\n"
>         "\t-s <size>:\tWorking set size (may specify K,M,G suffix)\n"
>         "\t-n iter:\tNumber of test iterations\n"
>         "\t-f filename:\tName of the file to read from\n"
>         "\t-d [+|-]delta:\tDistance between subsequent reads\n"
>         "\t-c count:\tNumber of reads\n"
>         "\t-h:\t\tPrint this help\n"
>     );
>     exit(1);
> }
>
> #define ABS(x) ((x) >= 0 ? (x) : -(x))
>
> /* Reduce v to the largest exact unit (G, M, K) for error messages. */
> static void
> format_num(size_t v, size_t *new, char *code)
> {
>     if (v % (1024 * 1024 * 1024) == 0) {
>         *new = v / (1024 * 1024 * 1024);
>         *code = 'G';
>     } else if (v % (1024 * 1024) == 0) {
>         *new = v / (1024 * 1024);
>         *code = 'M';
>     } else if (v % (1024) == 0) {
>         *new = v / (1024);
>         *code = 'K';
>     } else {
>         *new = v;
>         *code = ' ';
>     }
> }
>
> /* Parse a decimal number with an optional K, M or G suffix. */
> static size_t
> parse_num(char *s)
> {
>     size_t v = 0;
>
>     for (;;) {
>         switch (tolower(*s)) {
>         case '0':
>         case '1':
>         case '2':
>         case '3':
>         case '4':
>         case '5':
>         case '6':
>         case '7':
>         case '8':
>         case '9':
>             v = v * 10 + *s - '0';
>             ++s;
>             continue;
>
>         case 'k':
>             v *= 1024;
>             return (v);
>
>         case 'm':
>             v *= (1024 * 1024);
>             return (v);
>
>         case 'g':
>             v *= (1024 * 1024 * 1024);
>             return (v);
>
>         default:
>             return (v);
>         }
>     }
> }
>
> /*
>  * create a memory segment with a given pagesize
>  */
> static void *
> create_memory(size_t size, size_t pagesize)
> {
>     caddr_t p;
>
>     /* With MAP_ALIGN, the first argument is the requested alignment. */
>     p = mmap((void *)pagesize, size, PROT_WRITE|PROT_READ,
>         MAP_ALIGN|MAP_PRIVATE|MAP_ANON, -1, 0);
>
>     if (p == MAP_FAILED) {
>         char code;
>         size_t out;
>
>         format_num(pagesize, &out, &code);
>         fprintf(stderr, "mmap(%zu%c,", out, code);
>
>         format_num(size, &out, &code);
>         fprintf(stderr, " %zu%c, ...)", out, code);
>
>         perror("failed");
>         exit(1);
>     }
>
>     return (p);
> }
>
>
> int
> main (int argc, char **argv)
> {
>     hrtime_t start, end, total = 0;
>     unsigned int i;
>     unsigned int iterations = 1;
>     size_t pagesize = getpagesize();
>     size_t size = 1024;
>     longlong_t j;
>     longlong_t k;
>     char *table;
>     volatile int value;
>     int c;
>     int verbose = 0;
>     int delta = 1;
>     int normalize = 0;
>     size_t count;
>     size_t count_requested = 0;
>     double normalized;
>     char filename[256] = "";
>
>     while ((c = getopt( argc, argv, "Nhvc:d:f:s:n:")) != EOF) {
>         switch (c) {
>         case 'n':
>             iterations = parse_num(optarg);
>             break;
>         case 's':
>             size = parse_num(optarg);
>             break;
>         case 'v':
>             verbose = 1;
>             break;
>         case 'd':
>             delta = atoi(optarg);
>             break;
>         case 'c':
>             count_requested = parse_num(optarg);
>             break;
>         case 'f':
>             /* bounded copy; a long path is truncated, not overrun */
>             snprintf(filename, sizeof (filename), "%s", optarg);
>             break;
>
>         case 'N':
>             normalize = 1;
>             break;
>         case 'h':
>         default:
>             usage(basename(argv[0]));
>             break;
>         }
>     }
>
>     /* -f is required; without it there is nothing to read */
>     if (filename[0] == '\0')
>         usage(basename(argv[0]));
>
>     if (ABS(delta) >= size) {
>         fprintf(stderr, "delta %d is larger than size %zu\n",
>             ABS(delta), size);
>         exit(1);
>     }
>
>     count = count_requested ? count_requested : size;
>
>     if (verbose)
>         printf("Creating table of %zu bytes\n", size);
>
>     table = create_memory(size, pagesize);
>
>
>     for (i = 0; i < iterations; i++) {
>         int n;
>         int offset = 0;
>         int fd = -1;
>
>         if ((fd = open(filename, O_RDONLY)) < 0) {
>             perror("open");
>             exit(1);
>         }
>
>         k = size - 1;
>         start = gethrtime();
>         /*
>          * Read the file in 8K chunks, wrapping around the working
>          * set.  The length is clamped so that a read never runs
>          * past the end of the table.
>          */
>         while ((n = read(fd, &table[offset],
>             (size - offset < KB(8)) ? size - offset : KB(8))) > 0) {
>             offset += n;
>             offset %= size;
>         }
>
>         end = gethrtime();
>         total += (end - start);
>         normalized = (double)(end - start) / count;
>         if (verbose) {
>             printf("total time: %lld, normalized time: %g\n",
>                 end - start, normalized);
>         } else if (normalize) {
>             printf("%g\n",
>                 (double)(end - start) / count);
>         }
>         close(fd);
>     }
>     printf("%lld\n", total);
>     exit(0);
> }
>
>
> ----- End forwarded message -----
>
> --
> Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym

--
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym
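
The read/write policy described in the forwarded message can be pictured with a small sketch. This is not the Solaris uiomove code: `use_nta` and `copy_dir` are hypothetical names, and treating every write as a bypass candidate is a simplification (the message says "most writes"). Only the rule "reads bypass L2 when they are larger than copyout_max_cached" is taken from the text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not taken from the Solaris source. */
enum copy_dir { COPY_READ, COPY_WRITE };

/*
 * Decide whether one copy request should use non-temporal access.
 * Writes: always bypass in this sketch (an assumption; the message
 * only says "most writes").
 * Reads: bypass only when this single request is larger than the
 * copyout_max_cached threshold.
 */
static bool
use_nta(enum copy_dir dir, size_t len, size_t copyout_max_cached)
{
    if (dir == COPY_WRITE)
        return true;
    return len > copyout_max_cached;
}
```

With the default threshold of 128K, `use_nta(COPY_READ, 8 * 1024, 128 * 1024)` is false, which matches the observation that a scan issued as many 8K reads never triggers NTA, however large the total. Lowering the threshold to 4K makes the same call return true.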
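
For anyone who wants to try the access pattern without a Solaris box, the timed loop in readtest can be approximated with POSIX calls only. The sketch below is a loose port under stated assumptions, not Sherry's program: `timed_read_pass` and `CHUNK` are hypothetical names, `clock_gettime(CLOCK_MONOTONIC)` stands in for `gethrtime()`, and `malloc` replaces the `MAP_ALIGN` mapping. It says nothing about whether another kernel bypasses the cache on copyout.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK 8192  /* 8KB per read, the Postgres block size from the thread */

/*
 * Read `path` front to back in CHUNK-sized reads into a buffer of `wss`
 * bytes (the working set), wrapping to the start of the buffer when the
 * end is reached.  Stores the number of bytes read in *total and returns
 * the elapsed time in nanoseconds, or -1 on error.
 */
static int64_t
timed_read_pass(const char *path, size_t wss, size_t *total)
{
    struct timespec t0, t1;
    size_t offset = 0;
    ssize_t n;
    char *buf;
    int fd;

    *total = 0;
    if (wss == 0 || (buf = malloc(wss)) == NULL)
        return -1;
    if ((fd = open(path, O_RDONLY)) < 0) {
        free(buf);
        return -1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (;;) {
        size_t want = CHUNK;

        assert(offset < wss);
        if (wss - offset < want)
            want = wss - offset;    /* never run past the buffer */
        n = read(fd, buf + offset, want);
        if (n <= 0)
            break;
        *total += (size_t)n;
        offset = (offset + (size_t)n) % wss;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    if (n < 0)
        return -1;
    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
        (t1.tv_nsec - t0.tv_nsec);
}
```

Calling it in a loop with working-set sizes from 16K to 128M, and summing the results over 100 passes, would produce figures in the same shape as the tables above. Absolute numbers depend on the machine and on whether the file is already in the page cache.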