Thread: ice-broker scan thread
I am considering adding an "ice-broker scan thread" to accelerate PostgreSQL sequential scan IO speed. The basic idea of this thread is just like the "read-ahead" method, but the difference is that this one does not read the data into the shared buffer pool directly. Instead, it reads the data into the file system cache, which makes the integration easy, and this is unique to PostgreSQL.

What happens in the original sequential scan:

    for (;;)
    {
        /*
         * a physical read may happen, depending on the current content of
         * the file system cache and on whether the kernel is smart enough
         * to understand that you want to do a sequential scan
         */
        physical or logical read a page;
        process the page;
    }

What happens in the sequential scan with ice-broker:

    for (;;)
    {
        /* since the ice-broker has read the page in already */
        logical read a page with big chance;
        process the page;
    }

I wrote a program to simulate the sequential scan in PostgreSQL with/without ice-broker. The results indicate this technique has the following characteristics:

(1) The important factor in the speedup is how much CPU time PostgreSQL uses on each data page. If PG is fast enough, then no speedup occurs; otherwise a 10% to 20% speedup is expected according to my tests.
(2) It uses more CPU - this is easy to understand, since it does more work;
(3) The benefits also depend on other factors, like how smart your file system ...
Here are the test results on my machine:

---
$#uname -a
Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
$#cat /proc/meminfo | grep MemTotal
MemTotal: 1030988 kB
$#cat /proc/cpuinfo | grep CPU
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
$#./seqscan 10 $HOME/pginstall/bin/data/base/10794/18986 50
PostgreSQL sequential scan simulator configuration:
        Memory size: 943718400
        CPU cost per page: 50
        Scan thread read unit size: 4

With scan threads off - duration: 56862.738 ms
With scan threads on - duration: 40611.101 ms
With scan threads off - duration: 46859.207 ms
With scan threads on - duration: 38598.234 ms
With scan threads off - duration: 56919.572 ms
With scan threads on - duration: 47023.606 ms
With scan threads off - duration: 52976.825 ms
With scan threads on - duration: 43056.506 ms
With scan threads off - duration: 54292.979 ms
With scan threads on - duration: 42946.526 ms
With scan threads off - duration: 51893.590 ms
With scan threads on - duration: 42137.684 ms
With scan threads off - duration: 46552.571 ms
With scan threads on - duration: 41892.628 ms
With scan threads off - duration: 45107.800 ms
With scan threads on - duration: 38329.785 ms
With scan threads off - duration: 47527.787 ms
With scan threads on - duration: 38293.581 ms
With scan threads off - duration: 48810.656 ms
With scan threads on - duration: 39018.500 ms
---

Notice that the cpu_cost=50 above might look too big (if you look into the code) - but in a concurrent situation, it is not that huge. Also, on my Windows box (PIII, 800), a cpu_cost=5 is enough to show a benefit of 10%.

So in general, it does help in some situations, but it is not rocket science, since we can't predict the performance of the file system. It is fairly easy to integrate, and we should add a GUC parameter to control it.
We need more tests; any comments and tests are welcome.

Regards,
Qingqing

---

/*
 * seqscan.c
 *      PostgreSQL sequential scan simulator with helper scan thread
 *
 * Note
 *      I wrote this simulator to see if there is any benefit for sequential
 *      scan to do read-ahead by another thread. The only thing you may want
 *      to change in the source file is MEMSZ, make it big enough to thrash
 *      your file system cache.
 *
 * Use the following command to compile:
 *      $gcc -O2 -Wall -pthread seqscan.c -o seqscan -lm
 * To use it:
 *      $./seqscan <rounds> <datafile> <cpu_cost>
 * In which rounds is how many times you want to run the test (notice each
 * round includes two disk-burn tests), datafile is the path to any file
 * (suggest size > 100M), and cpu_cost is the cost of processing each page
 * of the file. Try different cpu_cost.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <math.h>

#ifdef WIN32
#include <io.h>
#include <windows.h>
#define PG_BINARY O_BINARY
#else
#include <unistd.h>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/file.h>
#define PG_BINARY 0
#endif

typedef char bool;
#define true ((bool) 1)
#define false ((bool) 0)

#define BLCKSZ 8192
#define UNITSZ 4
#define MEMSZ (950*1024*1024)

char *data_file;
int cpu_cost;
volatile bool stop_scan;
char thread_buffer[BLCKSZ*UNITSZ];

/* touch MEMSZ bytes of memory to push the data file out of the file system cache */
static void
cleanup_cache(void)
{
    char *p;

    if (NULL == (p = (char *)malloc(MEMSZ)))
    {
        fprintf(stderr, "insufficient memory\n");
        exit(-1);
    }

    memset(p, 'a', MEMSZ);
    free(p);
}

#ifdef WIN32
bool enable_aio = false;
static const unsigned __int64 epoch = 116444736000000000L;

static int
gettimeofday(struct timeval * tp, struct timezone * tzp)
{
    FILETIME file_time;
    SYSTEMTIME system_time;
    ULARGE_INTEGER ularge;

    GetSystemTime(&system_time);
    SystemTimeToFileTime(&system_time, &file_time);
    ularge.LowPart = file_time.dwLowDateTime;
    ularge.HighPart = file_time.dwHighDateTime;

    tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
    tp->tv_usec = (long) (system_time.wMilliseconds * 1000);

    return 0;
}

static void
sleep(int secs)
{
    SleepEx(secs*1000, true);
}

static int
thread_open(void)
{
    HANDLE fd;
    SECURITY_ATTRIBUTES sa;

    sa.nLength = sizeof(sa);
    sa.bInheritHandle = TRUE;
    sa.lpSecurityDescriptor = NULL;

    fd = CreateFile(data_file, GENERIC_READ,
                    FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
                    &sa, OPEN_EXISTING,
                    FILE_ATTRIBUTE_NORMAL |
                    FILE_FLAG_SEQUENTIAL_SCAN |
                    (enable_aio?FILE_FLAG_OVERLAPPED:0),
                    NULL);
    if (fd == INVALID_HANDLE_VALUE)
    {
        int errCode;

        switch (errCode = GetLastError())
        {
            /* EMFILE, ENFILE should not occur from CreateFile. */
            case ERROR_PATH_NOT_FOUND:
            case ERROR_FILE_NOT_FOUND:
                errno = ENOENT;
                break;
            case ERROR_FILE_EXISTS:
                errno = EEXIST;
                break;
            case ERROR_ACCESS_DENIED:
                errno = EACCES;
                break;
            default:
                fprintf(stderr, "thread_open failed: %d\n", errCode);
                errno = EINVAL;
        }
        return -1;
    }

    return (int)fd;
}

static int
thread_read(int fd, int blkno, size_t nblk, char *buf)
{
    long offset = BLCKSZ*blkno;
    DWORD nbytes = 0;       /* stays 0 if the read is still pending */
    OVERLAPPED ol;

    memset(&ol, 0, sizeof(OVERLAPPED));
    ol.Offset = offset;
    ol.OffsetHigh = 0;

    if (ReadFile((HANDLE)fd, buf, BLCKSZ*nblk, &nbytes, &ol))
    {
        /* successfully done without delay */
    }
    else
    {
        int errCode;

        switch (errCode = GetLastError())
        {
            case ERROR_IO_PENDING:
                break;
            case ERROR_HANDLE_EOF:
                break;
            default:
                /* unknown error occurred */
                fprintf(stderr, "asyncread failed: %d\n", errCode);
                exit(-1);
        }
    }

    return (int)nbytes;
}

static void
thread_close(int fd)
{
    CloseHandle((HANDLE)fd);
}

#else /* non-windows platforms */

static int
thread_open(void)
{
    int fd;

    fd = open(data_file, O_RDWR | PG_BINARY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "thread_open failed: %d\n", errno);
        exit(-1);
    }

    return (int)fd;
}

static int
thread_read(int fd, int blkno, size_t nblk, char *buf)
{
    long offset = BLCKSZ*blkno;
    long nbytes;

    nbytes = lseek(fd, offset, SEEK_SET);
    nbytes = read(fd, buf, BLCKSZ*nblk);
    if (nbytes <= 0)
    {
        fprintf(stderr, "thread_read failed: %d\n", errno);
        exit(-1);
    }

    return nbytes;
}

static void
thread_close(int fd)
{
    close(fd);
}
#endif

/* the helper: read the file front to back in UNITSZ-block units, discarding the data */
#ifdef WIN32
static DWORD WINAPI
scan_thread(LPVOID args)
#else
static void *
scan_thread(void *args)
#endif
{
    int i, fd;
    int start, end;

    start = 0;
    end = (int)(size_t)args;

    fd = thread_open();
    for (i = start; i < end; i+=UNITSZ)
    {
        thread_read(fd, i, UNITSZ, (char *)thread_buffer);

        /* check if I was asked to stop */
        if (stop_scan == true)
            break;
    }
    thread_close(fd);

    return 0;
}

static int
init_scan(bool with_threads, size_t *nblocks)
{
    int fd;
    long len;

    /* open file for do_scan */
    fd = open(data_file, O_RDWR | PG_BINARY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "failed to open file %s\n", data_file);
        exit(-1);
    }

    /* check the lseek result before converting to an unsigned block count */
    len = lseek(fd, 0, SEEK_END);
    if (len < 0)
    {
        fprintf(stderr, "failed to get file length %s\n", data_file);
        exit(-1);
    }
    *nblocks = len / BLCKSZ;

    if (with_threads)
    {
#ifndef WIN32
        pthread_t thread;
#endif

        /* create scan threads */
        stop_scan = false;
#ifdef WIN32
        if (NULL == CreateThread(NULL, 0, scan_thread, (void *)(*nblocks), 0, NULL))
#else
        if (pthread_create(&thread, NULL, scan_thread, (void *)(*nblocks)))
#endif
        {
            fprintf(stderr, "failed to start scan thread");
            exit(-1);
        }
    }

    return fd;
}

static void
do_scan(int fd, size_t nblocks)
{
    size_t i;
    int j, k, nbytes;
    char buffer[BLCKSZ];

    for (i = 0; i < nblocks; i++)
    {
        nbytes = lseek(fd, (long)i*BLCKSZ, SEEK_SET);
        nbytes = read(fd, buffer, BLCKSZ);
        if (nbytes != BLCKSZ)
        {
            fprintf(stderr, "do_scan read failed\n");
            exit(-1);
        }

        /* pretend to do some CPU intensive analysis */
        for (k = 0; k < cpu_cost; k++)
        {
            for (j = (k*sizeof(int))%BLCKSZ;
                 j < BLCKSZ / (5 * sizeof(int));
                 j += sizeof(int))
            {
                int x, y;

                x = ((int *)buffer)[j];
                x = (int)pow((double)x, (double)(x+1));
                y = (int)sin((double)x*x);
                ((int *)buffer)[j] = x*y;
            }
        }
    }
}

static void
close_scan(int fd)
{
    stop_scan = true;
    close(fd);
}

int
main(int argc, char *argv[])
{
    int i, rounds, fd;
    size_t nblocks;

    if (argc != 4)
    {
        fprintf(stderr, "usage: seqscan <rounds> <datafile> <cpu_cost>\n");
        exit(-1);
    }

    rounds = atoi(argv[1]);
    data_file = argv[2];
    cpu_cost = atoi(argv[3]);

    /* open and close once, just to verify that the data file is usable */
    fd = init_scan(false, &nblocks);
    close_scan(fd);

    fprintf(stdout, "PostgreSQL sequential scan simulator configuration:\n"
                    "\tMemory size: %u\n"
                    "\tCPU cost per page: %d\n"
                    "\tScan thread read unit size: %d\n\n",
                    MEMSZ, cpu_cost, UNITSZ);

    for (i = 0; i < 2*rounds; i++)
    {
        struct timeval start_t, stop_t;
        long usecs;
        bool enable = i%2?true:false;

        /* eliminate system cached data */
        cleanup_cache();
        sleep(2);

        /* do the scan task */
        gettimeofday(&start_t, NULL);
        fd = init_scan(enable, &nblocks);
        do_scan(fd, nblocks);
        close_scan(fd);
        gettimeofday(&stop_t, NULL);

        /* measure the time */
        if (stop_t.tv_usec < start_t.tv_usec)
        {
            stop_t.tv_sec--;
            stop_t.tv_usec += 1000000;
        }
        usecs = (long) (stop_t.tv_sec - start_t.tv_sec) * 1000000
              + (long) (stop_t.tv_usec - start_t.tv_usec);
        fprintf(stdout, "With scan threads %s - duration: %ld.%03ld ms\n",
                enable?"on":"off", usecs / 1000, usecs % 1000);

        sleep(2);
    }

    exit(0);
}
Qingqing Zhou wrote:

> I am considering adding an "ice-broker scan thread" to accelerate PostgreSQL
> sequential scan IO speed. The basic idea of this thread is just like the
> "read-ahead" method, but the difference is this one does not read the data
> into shared buffer pool directly, instead, it reads the data into file
> system cache, which makes the integration easy and this is unique to
> PostgreSQL.

Interesting, and I wondered about this too. But for my taste the demonstrated benefit really isn't large enough to make it worthwhile.

BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where it attempts to detect the application's access pattern, including whether it is reading sequentially and even whether there is a 'stride' to the accesses when they're not contiguous. I would imagine that other filesystems attempt similar tricks. So one might expect a simple linear prefetch to not help much in the presence of such a filesystem.

Were you worried about the icebreaker thread getting too far ahead of the scan? If it did, it might page out the data you're about to read, I think. Of course this could be fixed by having the read-ahead thread periodically check the current location being read by the query thread and pause if it has got too far ahead.

Anyway, the recent performance thread has been interesting to me because in all my career I've never seen a database that scanned scads of data from disk to process a query. Typically the problems I work on arrange to read the entire database into memory. I think I need to get out more... ;)
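The pausing idea in the message above could be sketched roughly as follows. This is only an illustration and not part of the posted simulator: `readahead_may_proceed` and `MAX_LEAD` are hypothetical names, and the limit of 2048 blocks is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit: how many blocks the helper may run ahead of the scan. */
#define MAX_LEAD 2048

/*
 * Return 1 if the read-ahead helper may fetch block next_blk while the
 * query thread is processing block scan_blk; return 0 if it should pause
 * so that it does not push not-yet-consumed pages out of the cache.
 */
static int
readahead_may_proceed(size_t scan_blk, size_t next_blk, size_t max_lead)
{
    assert(max_lead > 0);

    if (next_blk <= scan_blk)
        return 1;               /* helper is level with or behind the scan */
    return (next_blk - scan_blk) < max_lead;
}
```

In a simulator like seqscan.c, the helper loop would call something like `readahead_may_proceed(current_scan_block, i, MAX_LEAD)` before each read and sleep briefly when it returns 0; the scan position would have to be published by the query thread in a shared variable.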
On Mon, 28 Nov 2005, Qingqing Zhou wrote:

> I am considering adding an "ice-broker scan thread" to accelerate PostgreSQL
> sequential scan IO speed. The basic idea of this thread is just like the
> "read-ahead" method, but the difference is this one does not read the data
> into shared buffer pool directly, instead, it reads the data into file
> system cache, which makes the integration easy and this is unique to
> PostgreSQL.

MySQL, Oracle and others implement read-ahead threads to simulate async IO 'pre-fetching'. I've been experimenting with two ideas.

The first is to increase the readahead when we're doing sequential scans (see prototype patch using posix fadvise attached). I've not got any hardware at the moment which I can test this patch on, but I am waiting on some dbt-3 results which should indicate whether fadvise is a good idea or a bad one.

The second idea is using posix async IO at key points within the system to better parallelise CPU and IO work. The areas where I think we could use async IO are: during sequential scans, use async IO to do pre-fetching of blocks; inside WAL, begin flushing WAL buffers to disk before we commit; and, inside the background writer/checkpoint process, asynchronously write out pages and, potentially, asynchronously build new checkpoint segments.

The motivation for using async IO is twofold: first, the results of this paper[1] are compelling; second, modern OSs support async IO. I know that Linux[2], Solaris[3], AIX and Windows all have async IO and I presume that all their rivals have it as well.

The fundamental premise of the paper mentioned above is that if the database is busy, IO should be busy. With our current block-at-a-time processing, this isn't always the case. This is why Qingqing's read-ahead thread makes sense. My reason for mailing is, however, that the async IO results are more compelling than the read-ahead thread.
I haven't had time to prototype whether we can easily implement async IO but I am planning to work on it in December. The two main goals will be to a) integrate and utilise async IO, at least within the executor context, and b) build a primitive kind of scheduler so that we stop prefetching when we know that there are a certain number of outstanding IOs for a given device. Thanks, Gavin [1] http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf [2] http://lse.sourceforge.net/io/aionotes.txt [3] http://developers.sun.com/solaris/articles/event_completion.html - I'm fairly sure they have a posix AIO wrapper around these routines, but I cannot see it documented anywhere :-(
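For readers unfamiliar with the first idea, a minimal sketch of read-ahead hints via `posix_fadvise` might look like the following. It is not the prototype patch mentioned above (which is not included in this thread); the helper names `hint_sequential` and `hint_willneed` are made up for illustration, and a POSIX system that provides `posix_fadvise` (for example Linux with glibc) is assumed.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Tell the kernel that fd will be read sequentially from start to end.
 * Returns 0 on success or an error number (posix_fadvise returns the
 * error directly instead of setting errno).
 */
static int
hint_sequential(int fd)
{
    /* offset 0 with length 0 means "the whole file" */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/*
 * Ask the kernel to start pulling nblk blocks, beginning at block blkno,
 * into the file system cache without blocking the caller.
 */
static int
hint_willneed(int fd, off_t blkno, off_t nblk, off_t blcksz)
{
    assert(blcksz > 0);
    return posix_fadvise(fd, blkno * blcksz, nblk * blcksz,
                         POSIX_FADV_WILLNEED);
}
```

Whether the kernel acts on these hints, and how aggressively, is up to the operating system, so any benefit would have to be measured rather than assumed.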
Qingqing,

>> I am considering adding an "ice-broker scan thread" to accelerate PostgreSQL
>> sequential scan IO speed. The basic idea of this thread is just like the
>> "read-ahead" method, but the difference is this one does not read the
>> data into shared buffer pool directly, instead, it reads the data into file
>> system cache, which makes the integration easy and this is unique to
>> PostgreSQL.

You probably mean "ice-breaker" by the way :)

Chris
Gavin Sherry <swm@linuxworld.com.au> writes: > I haven't had time to prototype whether we can easily implement async IO Just as with any suggestion to depend on threads, you are going to have to show results that border on astounding to have any chance of getting this in. Otherwise the portability issues are just going to make it not worth the trouble. regards, tom lane
Gavin Sherry wrote: > MySQL, Oracle and others implement read-ahead threads to simulate async IO I always believed that Oracle used async file I/O. Not that I've seen their code, but I'm fairly sure they funded the addition of kernel aio to Linux a few years back. But....Oracle comes from a time long ago when threads and decent filesystems didn't exist, so some of the things they do may not be appropriate to add to a product that doesn't have them today. Now...network async I/O...that'd be really useful in my world...
On Mon, 28 Nov 2005, David Boreham wrote: > Gavin Sherry wrote: > > > MySQL, Oracle and others implement read-ahead threads to simulate async IO > > I always believed that Oracle used async file I/O. Not that I've seen their > code, but I'm fairly sure they funded the addition of kernel aio to Linux > a few years back. That's right. > > But....Oracle comes from a time long ago when threads and decent > filesystems didn't exist, so some of the things they do may not be > appropriate > to add to a product that doesn't have them today. The paper I linked to seemed to suggest that they weren't using async IO in 9.2 -- which is fairly old. I'm not sure why the authors didn't test 10g. Gavin
Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
>
>> I haven't had time to prototype whether we can easily implement async IO
>
> Just as with any suggestion to depend on threads, you are going to have
> to show results that border on astounding to have any chance of getting
> this in. Otherwise the portability issues are just going to make it not
> worth the trouble.

Do these ideas require threads in principle? ISTM that there could be (additional) process(es) waiting to perform pre-fetching or async io, and we could use the usual IPC machinery to talk between them...

cheers

Mark
> The paper I linked to seemed to suggest that they weren't using async IO
> in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
> 10g.

...<reads paper>... ok, interesting. Did they say that Oracle isn't using aio? I can't see that. They say that Oracle has no more than one outstanding I/O operation in flight per concurrent query, and they appear to think that's a bad thing. I'm not seeing that myself. Perhaps once I sleep on it, it'll become clear what they're getting at.

One theory for the lack of aio in Oracle as tested in that paper would be that they were testing on Linux. Since aio is relatively new in Linux, I wouldn't be surprised if Oracle didn't actually use it until it's known to be widely deployed in the field and to have proven reliability. Perhaps we've reached that state around now, and so Oracle may not yet have released an aio-capable Linux version of their RDBMS. Just a theory...someone from those tubular towers lurking here could tell us for sure I guess...
On Mon, 28 Nov 2005, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> > I haven't had time to prototype whether we can easily implement async IO
>
> Just as with any suggestion to depend on threads, you are going to have
> to show results that border on astounding to have any chance of getting
> this in. Otherwise the portability issues are just going to make it not
> worth the trouble.

The architecture I am looking at would not rely on threads.

I didn't want to jump on list and wave my hands until I had something to show, but since Qingqing is looking at the issue I thought I better raise it.

Gavin
Gavin Sherry wrote:
>
> The paper I linked to seemed to suggest that they weren't using async IO
> in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
> 10g.
>

There have been async io type parameters in Oracle's init.ora files from (at least) 8i (disk_async_io=true IIRC) - on Solaris anyway. Whether this enabled real or simulated async io is probably a good question - I recall during testing turning it off and seeing kio()? or similar type calls become write()/read() in truss output.

regards

Mark
On Mon, 28 Nov 2005, Mark Kirkwood wrote:

> Do these ideas require threads in principle? ISTM that there could be
> (additional) process(es) waiting to perform pre-fetching or async io,
> and we could use the usual IPC machinery to talk between them...

Right. I used threads because that makes it easy to write the simulation program :-)

Regards,
Qingqing
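To illustrate that a process can play the same role as the simulator's thread, here is a rough sketch of a forked helper that only warms the file system cache. It is an assumption-laden illustration for Unix-like systems, not a proposal for how PostgreSQL would do it: `start_icebreaker` and `stop_icebreaker` are hypothetical names, and the only "IPC" shown is a signal.

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLCKSZ 8192
#define UNITSZ 4

/*
 * Fork a helper process that reads path front to back into a scratch
 * buffer, purely to pull the data into the file system cache.
 * Returns the helper's pid, or -1 if fork() failed.
 */
static pid_t
start_icebreaker(const char *path)
{
    pid_t pid;

    assert(path != NULL);
    pid = fork();
    if (pid == 0)
    {
        /* child: read and discard, then exit without running atexit hooks */
        static char scratch[BLCKSZ * UNITSZ];
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            _exit(1);
        while (read(fd, scratch, sizeof(scratch)) > 0)
            ;                   /* the data itself is not needed here */
        close(fd);
        _exit(0);
    }
    return pid;
}

/*
 * Stop the helper if it is still running and reap it.
 * Returns 0 on success, -1 on failure.
 */
static int
stop_icebreaker(pid_t pid)
{
    int status;

    kill(pid, SIGTERM);
    return waitpid(pid, &status, 0) == pid ? 0 : -1;
}
```

The scanning process would call `start_icebreaker()` before its scan and `stop_icebreaker()` afterwards, mirroring what init_scan() and close_scan() do with the thread in seqscan.c.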
FYI, I've personally used Oracle 9.2.0.4's async IO on Linux and have seen several installations which make use of it also.
On 11/28/05, Gavin Sherry <swm@linuxworld.com.au> wrote:
On Mon, 28 Nov 2005, Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au > writes:
> > I haven't had time to prototype whether we can easily implement async IO
>
> Just as with any suggestion to depend on threads, you are going to have
> to show results that border on astounding to have any chance of getting
> this in. Otherwise the portability issues are just going to make it not
> worth the trouble.
The architecture I am looking at would not rely on threads.
I didn't want to jump on list and wave my hands until I had something to
show, but since Qingqing is looking at the issue I thought I better raise
it.
Gavin
On Mon, 28 Nov 2005, Gavin Sherry wrote:

> MySQL, Oracle and others implement read-ahead threads to simulate async IO
> 'pre-fetching'.

According to my tests on Windows (using the attached program with enable_aio changed to true), it seems aio doesn't help compared with a separate thread - but maybe that is because my usage is wrong ...

Regards,
Qingqing
On Mon, 28 Nov 2005, Gavin Sherry wrote:

> I didn't want to jump on list and wave my hands until I had something to
> show, but since Qingqing is looking at the issue I thought I better raise
> it.

Don't worry :-) I separated the logic into a standalone program so that more people can help on this issue.

Regards,
Qingqing
"David Boreham" <david_list@boreham.org> wrote

> BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where
> it attempts to detect the application's access pattern including if it is
> reading sequentially and even if there is a 'stride' to the accesses when
> they're not contiguous. I would imagine that other filesystems attempt
> similar tricks. So one might expect a simple linear prefetch
> to not help much in the presence of such a filesystem.

So we need more tests. I understand how smart current file systems are, and it seems that this depends on the interval at which you send the next file block read request (decided by the cpu_cost parameter in my program). I imagine that on a multi-way machine with a strong IO device, the ice-breaker could do much better ...

> Were you worried about the icebreaker thread getting too far ahead of the
> scan ? If it did it might page out the data you're about to read, I think.
> Of course this could be fixed by having the read ahead thread periodically
> check the current location being read by the query thread and pausing if
> it's got too far ahead.

Right.

Regards,
Qingqing
Qingqing Zhou wrote:

> On Mon, 28 Nov 2005, Gavin Sherry wrote:
>
> > MySQL, Oracle and others implement read-ahead threads to simulate async IO
> > 'pre-fetching'.
>
> Due to my tests on Windows (using the attached program and change
> enable_aio=true), seems aio doesn't help as a separate thread - but maybe
> because my usage is wrong ...

I don't think your NT overlapped I/O code is quite right. At least I think it will issue reads at a high rate without waiting for any of them to complete. Beyond some point that has to give the kernel gut-rot.

But anyway, I wouldn't expect the use of aio to make any significant difference in an already threaded test program. The point of aio is to allow I/O concurrency _without_ the use of threads or multiple processes. You could re-write your program to have a single thread but use aio. In that case it should show the same read ahead benefit that you see with the thread.
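The single-threaded rewrite suggested above could look roughly like the sketch below, which uses POSIX AIO rather than NT overlapped I/O. It is a hedged illustration only: `aio_scan` and `start_read` are hypothetical names, at most one request is kept in flight, and on older glibc versions the program has to be linked with -lrt.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Queue an asynchronous read of one block at offset into buf. */
static int
start_read(struct aiocb *cb, int fd, char *buf, off_t offset)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = BLCKSZ;
    cb->aio_offset = offset;
    return aio_read(cb);
}

/*
 * Scan path in a single thread. While block N is being processed, the
 * read for block N+1 is already in flight into the other buffer.
 * Returns the number of bytes processed, or -1 on error.
 */
static long
aio_scan(const char *path)
{
    char bufs[2][BLCKSZ];
    struct aiocb cb;
    const struct aiocb *waitlist[1];
    long total = 0;
    off_t offset = 0;
    int cur = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (start_read(&cb, fd, bufs[cur], offset) != 0)
    {
        close(fd);
        return -1;
    }

    for (;;)
    {
        ssize_t n;
        int err;

        /* wait until the outstanding read has finished */
        waitlist[0] = &cb;
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(waitlist, 1, NULL);
        err = aio_error(&cb);
        n = aio_return(&cb);
        if (err != 0 || n < 0)
        {
            total = -1;
            break;
        }
        if (n == 0)
            break;              /* end of file */

        /* queue the next block before working on the current one */
        offset += n;
        cur = 1 - cur;
        if (start_read(&cb, fd, bufs[cur], offset) != 0)
        {
            total = -1;
            break;
        }

        /* the page in bufs[1 - cur] would be processed here */
        total += n;
    }

    close(fd);
    return total;
}
```

A real comparison against the threaded simulator would replace the "processed here" comment with the same CPU-burning loop used in do_scan().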
On Mon, 28 Nov 2005, Qingqing Zhou wrote: > > > On Mon, 28 Nov 2005, Gavin Sherry wrote: > > > > MySQL, Oracle and others implement read-ahead threads to simulate async IO > > 'pre-fetching'. > > Due to my tests on Windows (using the attached program and change > enable_aio=true), seems aio doesn't help as a separate thread - but maybe > because my usage is wrong ... Right, I would imagine that it's very close. I intend to use kernel based async IO so that we can have the prefetch effect of your sample program without the need for threads. Thanks, Gavin
"David Boreham" <david_list@boreham.org> wrote

> I don't think your NT overlapped I/O code is quite right. At least
> I think it will issue reads at a high rate without waiting for any of them
> to complete. Beyond some point that has to give the kernel gut-rot.

[also in reply to Gavin] I looked up "gut-rot" in the dictionary, got it ...

Uh, this behavior is intended - I try to push enough requests to the kernel in a short time so that it understands that I am doing a sequential scan, and so it pulls the data from disk into the file system cache more efficiently. Some file systems may have a "free-behind" mechanism, but our main thread (which really processes the query) should be fast enough to get there before the data vanishes.

> You could re-write your program to have a single thread but use aio.
> In that case it should show the same read ahead benefit that you see
> with the thread.

I guess this is also Gavin's point - I understand that these will be two different methodologies to handle "read-ahead". If no other thread/process is involved, then the main thread will be responsible for grabbing a free buffer page from the buffer pool and asking the kernel to put the data there by sync IO (as current PostgreSQL does) or async IO. And that's what I want to avoid. I'd like to use a dedicated thread/process to "break the ice" only, i.e., pull data from disk into the file system cache, so that the main thread will only issue *logical* reads.

Regards,
Qingqing
On Tue, Nov 29, 2005 at 02:53:36PM +1100, Gavin Sherry wrote:
> The second idea is using posix async IO at key points within the system
> to better parallelise CPU and IO work. There areas I think we could use
> async IO are: during sequential scans, use async IO to do pre-fetching of
> blocks; inside WAL, begin flushing WAL buffers to disk before we commit;
> and, inside the background writer/check point process, asynchronously
> write out pages and, potentially, asynchronously build new checkpoint segments.

I actually worked on this and got it to the stage where it wouldn't crash anymore. It basically added a function to bufmgr.c called PrefetchBuffer() which would initiate a request but not block. I then hooked a few strategic places to call this. In particular, during an index scan it would prefetch the next index block and the next few data blocks and then return them in order as they came in.

Unfortunately I can't really test it at its full potential because it uses glibc's default POSIX AIO, which is *lame*: no more than one outstanding request per fd, which for PostgreSQL is crappy.

There was some evidence that in an index scan of a highly uncorrelated index it did make a small difference, but I never got around to testing it fully. But bitmap scans already hugely reduce the cost of uncorrelated indexes.

It doesn't pass regression because index_getmulti doesn't do backward scans. Everything else works though. If anyone is interested in the code I can send it to them. The results on my system just weren't good enough to justify a lot more effort.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
First, we need a new term for a thread of execution that could be a thread or could be a process, I don't care. When discussing anything that is to run in parallel, the first thing that pops out of someone's mouth is "Don't you mean (thread/process)?" But that's an implementation detail and should not be considered during a planning phase, unless it is fundamental to the problem. Hence, the term TOE to mean "I don't really care if it is in its own address space, or the same address space.". However, I understand that this is not in common usage, so in the following discussion I use the term thread, as it is more correct than process. I am just not defining whether that thread is the only thread running in its process or not.

I've implemented this on another database product, using buf reading threads to pull the data all the way into the database cache. In testing on Unix production systems (4 CPU machines, large RAID devices, 100Gb+ databases), table scans performed 5 to 7 times faster; on MVS table scans are up to 10 times faster. But I never had much luck getting the performance to change on Windows. Partially, I think, it's because the machine I was using was IDE, not SCSI, so I was already greatly bottlenecked. Maybe SATA would be better? I haven't tested there, either.

Anyway, what I did was the following. When doing a sequential scan, we were starting at the beginning of the table and scanning forward. If I threw up some threads to read ahead, then my user thread and my read ahead threads would thrash on trying to lock the buffer slots. So, I had the read ahead threads start at some distance into the table, and work toward the beginning. The user thread would do its own I/O until it passed the read ahead threads.
I also broke the read ahead section into multiple contiguous sections, and had different threads read each section, so the user thread would only have a problem with the first section; by the time it was finished with that, the other sections would be read in. When the user thread got to about 80% of the nodes that got read ahead, it would schedule another section to be read.

+----------------------------------------------------------------+
| table                                                          |
+----------------------------------------------------------------+
(user->)      (<-readahead) (<-readahead) (<-readahead)

So above, the user thread is starting low in the table and working high; the readahead threads are starting higher (but not at the end of the table), and working low.

Like I said, this worked very well for me.

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.
--------------------------------
Better to remain silent and be thought a fool than to speak out and remove all doubt.
Abraham Lincoln
>threw up some threads to read ahead, then my user thread and my read >ahead threads would thrash on trying to lock the buffer slots. So, I >had the read ahead threads start at some distance into the table, and >work toward the beginning. The user thread would do its own I/O until > > Ah. The lightbulb went on. You want multiple outstanding I/O operations in the case that table or index spans multiple physical disks.
On Tue, Nov 29, 2005 at 09:45:30AM -0500, Pollard, Mike wrote:
> Anyway, what I did was the following. When doing a sequential scan, we
> were starting at the beginning of the table and scanning forward. If I
> threw up some threads to read ahead, then my user thread and my read
> ahead threads would thrash on trying to lock the buffer slots. So, I

<snip>

> so above, the user threads is starting low in the table and working
> high; the readahead threads are starting higher (but not at the end of
> the table), and working low.

Ok, this may be a really dumb question, but doesn't this rely on the fact that the table is smaller than the number of buffers? If the table is large, most of your data will be tossed out again by later data before it's been used by the backend.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
No, I only go x number of pages ahead of the user scan (where x is currently user defined, but it should be significantly smaller than your number of data buffers). I have found that reading about 16 MB ahead gives optimal performance, and on modern machines that isn't all that much memory. Once the user scan has processed most of that 16 MB, the next section of the tree is scheduled to be read. I don't keep the read ahead threads a constant distance ahead, because I found it to be more efficient if they occasionally had a lot of pages to read at once, rather than constantly having a few pages to read.

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.

-----Original Message-----
From: Martijn van Oosterhout [mailto:kleptog@svana.org]
Sent: Tuesday, November 29, 2005 10:06 AM
To: Pollard, Mike
Cc: Qingqing Zhou; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] ice-broker scan thread

<snip>

Ok, this may be a really dumb question, but doesn't this rely on the fact that the table is smaller than the amount of buffers? If the table is large most of your data will be tossed out again by later data before it's been used by the backend.
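The read-ahead scheme Mike describes (a bounded window ahead of the user scan, split into contiguous sections that are each read from high to low, with the next window scheduled once most of the current one is consumed) can be sketched as pure bookkeeping. The function names and parameters below are hypothetical, and the code is only an illustration of the description, not the SUPRA implementation.

```python
def plan_readahead(window_start, window_pages, n_sections, table_pages):
    """Split the read-ahead window into contiguous sections.

    Returns one list of page numbers per section. Each list runs from
    high to low, because the read-ahead threads work toward the user scan.
    """
    window_end = min(window_start + window_pages, table_pages)
    if window_end <= window_start:
        return []
    size = -(-(window_end - window_start) // n_sections)  # ceiling division
    sections = []
    for lo in range(window_start, window_end, size):
        hi = min(lo + size, window_end)
        sections.append(list(range(hi - 1, lo - 1, -1)))
    return sections


def should_schedule_next(user_page, window_start, window_pages, threshold=0.8):
    """True once the user scan has consumed `threshold` of the current window."""
    return user_page - window_start >= threshold * window_pages
```

Assuming 8 kB pages, the 16 MB window mentioned above would correspond to `window_pages=2048`; each inner list would be handed to one read-ahead thread.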
>Unfortunately I can't really test it at its full potential because it
>uses glibc's default POSIX AIO which is *lame*. No more than one
>outstanding request per fd which for PostgreSQL is crappy. There was
>

I had the impression from the kernel aio mailing list a while back that post-<some kernel version> Linux, the POSIX aio calls were forwarded to the kernel aio interface. Or are you saying that the POSIX API itself imposes that limitation?
On Tue, Nov 29, 2005 at 03:14:38PM +1100, Gavin Sherry wrote:
> On Mon, 28 Nov 2005, David Boreham wrote:
> > Gavin Sherry wrote:
> > > MySQL, Oracle and others implement read-ahead threads to simulate async IO
> >
> > I always believed that Oracle used async file I/O. Not that I've seen their

> The paper I linked to seemed to suggest that they weren't using async IO
> in 9.2 -- which is fairly old.

http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
"Getting Priorities Straight: Improving Linux Support for Database I/O"
by Hall and Bonnet
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

I think you've misread that paper. AFAICT it neither says nor even suggests that Oracle 9.2 does not use asynchronous I/O on Linux. In fact, it seems to strongly suggest exactly the opposite, that Oracle does use async I/O wherever it can.

Note they also reference this document, which as of 2002 and Linux kernel 2.4.x, was urging Oracle DBAs to use Oracle's kernel-based asynchronous I/O support whenever possible:

http://www.ixora.com.au/tips/use_asynchronous_io.htm

What Hall and Bonnet's paper DOES say is that both Oracle and MySQL InnoDB appear to use a "conservative" I/O submission policy, but Oracle does so more efficiently. They also argue that both Oracle and MySQL fail to utilize the "full potential" of Linux async I/O because of their conservative submission policies, and that an "aggressive" I/O submission policy would work better, but only if support for Prioritized I/O is added to Linux. They then proceed to add that support, and make some basic changes to InnoDB to partially take advantage of it.

Also interesting is their casual mention that for RDBMS workloads, the default Linux 2.6 disk scheduler "anticipatory" is inferior to the "deadline" scheduler. They base their (simple sounding) Prioritized I/O support on the deadline scheduler.

--
Andrew Piskorski <atp@piskorski.com>
http://www.piskorski.com/
On Tue, Nov 29, 2005 at 08:42:18AM -0700, David Boreham wrote:
> >Unfortunatly I can't really test it at it's full potential because it
> >uses glibc's default POSIX AIO which is *lame*. No more than one
> >outstanding request per fd which for PostgreSQL is crappy. There was
>
> I had the impression from the kernel aio mailing list a while back
> that post-<some kernel version> linux, the POSIX aio calls were
> forwarded to the kernel aio interface. Or are you saying that the
> POSIX API itself imposes that limitation ?

By default when you use aio you get the version in libc (-lrt IIRC), which has the issue I mentioned, probably because it's optimised for the lots-of-network-connections type of program, where multiple outstanding requests on a single fd are not meaningful. You can however link in some other library which gives you kernel support. However, I don't have a new enough kernel to have the kernel support, so I haven't tested that.

POSIX AIO doesn't prescribe either way.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
>By default when you use aio you get the version in libc (-lrt IIRC)
>which has the issue I mentioned, probably because it's probably
>optimised for the lots-of-network-connections type program where
>multiple outstanding requests on a single fd are not meaningful. You
>can however link in some other library which gives you kernel support.
>However, I don't have a new enough kernel to have the kernel support so
>I havn't tested that.
>

Actually, after reading up on the current state of things, I'm not sure you can even get POSIX aio on top of kernel aio in Linux. There are also a few limitations in the 2.6 aio implementation that might prove troublesome: for example it only works with O_DIRECT.

libaio gives userland access to the kernel aio api (which is different from POSIX aio).
On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:
> Anyway, what I did was the following. When doing a sequential scan, we
> were starting at the beginning of the table and scanning forward. If I
> threw up some threads to read ahead, then my user thread and my read
> ahead threads would thrash on trying to lock the buffer slots. So, I
> had the read ahead threads start at some distance into the table, and
> work toward the beginning.

I believe this is commonly called a synchronous scan.

-jwb
On Tue, Nov 29, 2005 at 10:28:57AM -0700, David Boreham wrote:
> Actually, after reading up on the current state of things, I'm not sure you
> can even get POSIX aio on top of kernel aio in Linux. There are also a
> few limitations in the 2.6 aio implementation that might prove troublesome:
> for example it only works with O_DIRECT.

Which is bizarre because it's semantically equivalent to having a separate thread doing the read() and sending you a signal when it's done.

What I'm thinking of testing is a join across two large tables so there is actually more than one outstanding request at a time. But it's irritating to have to code to a special api...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
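The point that async I/O is semantically equivalent to a separate thread doing the read() and notifying you when it is done can be sketched with a thread pool. This is only an illustration using Python's standard `concurrent.futures`; it is neither POSIX aio nor libaio, and the helper names `read_block` and `submit_reads` are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def read_block(path, offset, length):
    """Blocking read of one block; runs in a worker thread."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def submit_reads(pool, path, offsets, length):
    """Queue several reads at once and return their futures.

    Each future plays the role of an outstanding async request; calling
    .result() on it is the equivalent of waiting for completion.
    """
    return [pool.submit(read_block, path, off, length) for off in offsets]
```

With a pool created as `ThreadPoolExecutor(max_workers=4)`, at most four reads are in flight at a time, so the pool size also acts as a simple cap on the number of outstanding requests against one file.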
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote
>
> I wrote a program to simulate the sequential scan in PostgreSQL
> with/without ice-broker.
>
> We need more tests
>

If anybody has test results, I'd love to see them ...

Thanks,
Qingqing
On Tue, 29 Nov 2005, Andrew Piskorski wrote:
> I think you've misread that paper. AFAICT it neither says nor even
> suggests that Oracle 9.2 does not use asynchronous I/O on Linux. In
> fact, it seems to strongly suggest exactly the opposite, that Oracle
> does use async I/O whereever it can.

<snip>

> Also interesting is their casual mention that for RDBMS workloads, the
> default Linux 2.6 disk scheduler "anticipatory" is inferior to the
> "deadline" scheduler. They base their (simple sounding) Prioritized
> I/O support on the deadline scheduler.

Right. I seemed to recall that they had configured Oracle to use a kind of readahead thread rather than native async IO, but I was wrong. That's not material to the discussion at hand.

What we need to find out is whether we can easily integrate prefetching into PostgreSQL for some subset of the work we do, find non-trivial performance gains, and demonstrate them on more than one OS. Ideally, we'd see some non-trivial gain irrespective of the IO scheduler being used.

Thanks,
Gavin
On Tue, 29 Nov 2005, David Boreham wrote:
> >By default when you use aio you get the version in libc (-lrt IIRC)
> >which has the issue I mentioned, probably because it's probably
> >optimised for the lots-of-network-connections type program where
> >multiple outstanding requests on a single fd are not meaningful. You
> >can however link in some other library which gives you kernel support.
> >However, I don't have a new enough kernel to have the kernel support so
> >I havn't tested that.
>
> Actually, after reading up on the current state of things, I'm not sure you
> can even get POSIX aio on top of kernel aio in Linux. There are also a
> few limitations in the 2.6 aio implementation that might prove troublesome:
> for example it only works with O_DIRECT.
>
> libaio gives userland access to the kernel aio api (which is different
> from POSIX aio).

Yes. The O_DIRECT issue is my biggest concern about Linux at the moment. That being said, the plan is to only pre-fetch the next N blocks, where N < 32, and to read them into the local buffer cache. In a situation where space in the cache is low (and prefetched pages might be pushed out before we even get to read them), we need to provide such information to the readahead mechanism so that it can reduce the number of blocks it prefetches.

Gavin
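One way to read the plan above (prefetch at most N < 32 blocks, and fewer when cache space is low) is as a small sizing function. The policy below, including the `reserve` parameter and the name `prefetch_blocks`, is an assumption made for illustration; the message does not say how the reduction would actually be computed.

```python
MAX_PREFETCH = 31  # "N < 32" from the plan above


def prefetch_blocks(free_buffers, reserve=0, max_prefetch=MAX_PREFETCH):
    """How many blocks to prefetch, given how many buffers are free.

    Never asks for more blocks than can be held, so prefetched pages are
    less likely to be pushed out before the scan reaches them.
    `reserve` (hypothetical) keeps some buffers back for other work.
    """
    available = max(free_buffers - reserve, 0)
    return min(max_prefetch, available)
```

For example, with plenty of free buffers the function returns the full 31 blocks, and it falls to zero once the free count drops to the reserve.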
On Wed, 2005-11-30 at 08:30 +1100, Gavin Sherry wrote:

<snip>

> Yes. The O_DIRECT issue is my biggest concern about Linux at the moment.
> That being said, the plan is to only pre-fetch the next N blocks, where N
> < 32, and to read them into the local buffer cache. In a situation where
> space in the cache low (and prefetched pages might be pushed out before we
> even get to read them), we need to provide such information to the
> readahead mechanism so that it can reduce the number of blocks which it
> prefetches.

My understanding was that Linux at least has a reasonable readahead mechanism that works on the scale you suggest.

I think it's fair to assume that anybody who wants this can afford sufficient memory to make it worthwhile. Multiple processes per scan implies (low numbers of users or I/O overkill).

Best Regards,
Simon Riggs
On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:
> I've implemented this on another database product

You're scaring me. Is the information you describe in the public domain or is it intellectual property of any particular company? Are you sure? We just recovered from one patent scare.

Good to have you around though, if we're covered.

Best Regards,
Simon Riggs
Jeff,

On 11/29/05 9:35 AM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:
> On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:
>
>> Anyway, what I did was the following. When doing a sequential scan, we
>> were starting at the beginning of the table and scanning forward. If I
>> threw up some threads to read ahead, then my user thread and my read
>> ahead threads would thrash on trying to lock the buffer slots. So, I
>> had the read ahead threads start at some distance into the table, and
>> work toward the beginning.
>
> I believe this is commonly called a synchronous scan.

I think sync scan refers to the use of a scanner shared among concurrent queries, where they can join a scan in progress from its current location.

It sounds like the logic could be shared. Sync scan (as I've described above) is another important optimization we'd like to see.

- Luke
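A minimal sketch of the sync scan idea as Luke describes it: a query that joins a scan already in progress starts at the scan's current block rather than at block 0. The wrap-around back to the start of the table, and the name `sync_scan_order`, are assumptions added here so that the joining query still sees every block; the message itself only says that queries join from the current location.

```python
def sync_scan_order(current_block, n_blocks):
    """Block order for a query joining a shared scan at current_block.

    The new query starts where the scan in progress currently is, runs to
    the end of the table, then wraps around to cover the blocks it missed.
    """
    return [(current_block + i) % n_blocks for i in range(n_blocks)]
```

Every block is still visited exactly once; only the starting point changes, which lets concurrent queries share the same pages while they are hot in the cache.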
No, it's all right. In fact, I believe my boss spoke to Bruce about this idea in August. But I have permission to discuss the algorithm. I may even be able to get the code, but to be honest, it isn't that much; it would probably be just as easy to rewrite it as it would be to fit it into Postgres.

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.

-----Original Message-----
From: Simon Riggs [mailto:simon@2ndquadrant.com]
Sent: Tuesday, November 29, 2005 5:23 PM
To: Pollard, Mike
Cc: Qingqing Zhou; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] ice-broker scan thread

On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:
> I've implemented this on another database product

You're scaring me. Is the information you describe in the public domain or is it intellectual property of any particular company? Are you sure? We just recovered from one patent scare.

Good to have you around though, if we're covered.

Best Regards,
Simon Riggs
>Yes. The O_DIRECT issue is my biggest concern about Linux at the moment.
>That being said, the plan is to only pre-fetch the next N blocks, where N
>< 32, and to read them into the local buffer cache. In a situation where
>space in the cache low (and prefetched pages might be pushed out before we
>even get to read them), we need to provide such information to the
>readahead mechanism so that it can reduce the number of blocks which it
>prefetches.
>

Would you open a separate O_DIRECT handle, just for the prefetch?

My experience with O_DIRECT and databases in the past has not been great: what you gain from being able to control your own caching you lose (and more) in other ways.

BTW, has anyone tried O_DIRECT and the prefetch idea on Linux? I'm wondering if it may not work (because the read data won't get cached in the fs cache due to O_DIRECT).
Qingqing Zhou wrote:
>[also with reply to Gavin] look up dictionary for "gut-rot", got it ... Uh,
>this behavior is intended - I try to push enough requests shortly to kernel
>so that it understands that I am doing sequential scan, so it would pull the
>data from disk to file system cache more efficiently. Some file systems may
>have "free-behind" mechanism, but our main thread (who really process the
>query) should be fast enough before the data vanished.
>

I guess I was concerned that very large numbers of concurrent operations in flight on the same file handle at the same time might lead to poor performance or even instability, e.g. the kernel may build long linked lists, or it might create lock contention with itself, that kind of bad stuff. My thinking is that the kernel wasn't designed with applications in mind that fire off 10,000 concurrent reads against the same file.

>I guess this is also Gavin's point - I understand that will be two different
>methodologies to handle "read-ahead". If no other thread/process involved,
>then the main thread will be responsible to grab a free buffer page from
>bufferpool and ask the kernel to put the data there by sync IO (current
>PostgreSQL does) or async IOs. And that's what I want to avoid. I'd like to
>use a dedicated thread/process to "break the ice" only, i.e., pull data from
>disk to file system cache, so that the main thread will only issue *logical*
>read.
>

Right, understood. My point was that a thread with sync I/O and the query thread with async I/O are in fact logically identical. They're just two different implementation techniques for the same fundamental functionality. In some cases the non-thread implementation might be preferred (for example on a platform with no threads).