Thread: ice-broker scan thread

ice-broker scan thread

From
Qingqing Zhou
Date:
I am considering adding an "ice-broker scan thread" to accelerate PostgreSQL
sequential scan IO speed. The basic idea of this thread is just like the
"read-ahead" method, but the difference is that this one does not read the data
into the shared buffer pool directly; instead, it reads the data into the file
system cache, which makes the integration easy, and this is unique to
PostgreSQL.

What happens to the original sequential scan:
for (;;)
{
    /*
     * a physical read may happen, due to the current content of the
     * file system cache and whether the kernel is smart enough to
     * understand you want to do a sequential scan
     */
    physical or logical read a page;
    process the page;
}

What happens to the sequential scan with ice-broker:
for (;;)
{
    /* since the ice-broker has read the page in already */
    logical read a page with big chance;
    process the page;
}
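
Roughly, the ice-broker itself would be no more than a loop like the following
sketch (illustrative only - BLCKSZ, UNITSZ and stop_scan are the same names used
in the attached simulator below, and pread() is just one possible way to issue
the reads):

#include <unistd.h>

#define BLCKSZ  8192                    /* same page size as the attached simulator */
#define UNITSZ  4                       /* blocks per read-ahead request */

static volatile int stop_scan = 0;      /* set by the backend when the scan ends */

static void *
ice_broker(void *arg)
{
    int     fd = *(int *) arg;
    char    scratch[BLCKSZ * UNITSZ];   /* contents are simply thrown away */
    long    blkno = 0;

    /* read ahead of the real scan purely to warm the file system cache */
    while (!stop_scan)
    {
        if (pread(fd, scratch, sizeof(scratch), (off_t) blkno * BLCKSZ) <= 0)
            break;                      /* EOF or error: nothing more to warm */
        blkno += UNITSZ;
    }
    return NULL;
}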

I wrote a program to simulate the sequential scan in PostgreSQL
with/without the ice-broker. The results indicate this technique has the
following characteristics:
(1) The important factor in the speedup is how much CPU time PostgreSQL
spends on each data page. If PG is fast enough, then no speedup occurs;
otherwise a 10% to 20% speedup is expected, according to my tests.
(2) It uses more CPU - this is easy to understand, since it does more
work;
(3) The benefit also depends on other factors, like how smart your file
system ...

Here are the test results on my machine:
---
$#uname -a
Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
$#cat /proc/meminfo | grep MemTotal
MemTotal:      1030988 kB
$#cat /proc/cpuinfo | grep CPU
model name      : Intel(R) Pentium(R) 4 CPU 2.40GHz
$#./seqscan 10 $HOME/pginstall/bin/data/base/10794/18986 50
PostgreSQL sequential scan simulator configuration:
        Memory size: 943718400
        CPU cost per page: 50
        Scan thread read unit size: 4


With scan threads off - duration: 56862.738 ms
With scan threads on - duration: 40611.101 ms
With scan threads off - duration: 46859.207 ms
With scan threads on - duration: 38598.234 ms
With scan threads off - duration: 56919.572 ms
With scan threads on - duration: 47023.606 ms
With scan threads off - duration: 52976.825 ms
With scan threads on - duration: 43056.506 ms
With scan threads off - duration: 54292.979 ms
With scan threads on - duration: 42946.526 ms
With scan threads off - duration: 51893.590 ms
With scan threads on - duration: 42137.684 ms
With scan threads off - duration: 46552.571 ms
With scan threads on - duration: 41892.628 ms
With scan threads off - duration: 45107.800 ms
With scan threads on - duration: 38329.785 ms
With scan threads off - duration: 47527.787 ms
With scan threads on - duration: 38293.581 ms
With scan threads off - duration: 48810.656 ms
With scan threads on - duration: 39018.500 ms
---

Notice that the cpu_cost=50 above might look too big (if you look into the
code) - but in a concurrent situation, it is not that huge. Also, on my
Windows box (PIII, 800), a cpu_cost=5 is enough to show a benefit of
about 10%.

So in general, it does help in some situations, but it is not rocket science,
since we can't predict the performance of the file system. It is fairly
easy to integrate, and we should add a GUC parameter to control it.

We need more tests; any comments and tests are welcome,

Regards,
Qingqing

---

/*
 * seqscan.c
 *      PostgreSQL sequential scan simulator with helper scan thread
 *
 * Note
 *      I wrote this simulator to see if there is any benefit for a sequential scan to
 *      do read-ahead via another thread. The only thing you may want to change in the
 *      source file is MEMSZ; make it big enough to thrash your file system cache.
 *
 *      Use the following command to compile:
 *          $ gcc -O2 -Wall -pthread -lm seqscan.c -o seqscan
 *      To use it:
 *          $ ./seqscan <rounds> <datafile> <cpu_cost>
 *      In which rounds is how many times you want to run the test (notice each round includes
 *      two disk-burn tests), datafile is the path to any file (suggested size > 100M), and
 *      cpu_cost is the cost of processing each page of the file. Try different cpu_cost values.
 */
 

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <memory.h>
#include <errno.h>
#include <math.h>

#ifdef WIN32
#include <io.h>
#include <windows.h>
#define PG_BINARY        O_BINARY
#else
#include <unistd.h>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/file.h>
#define PG_BINARY        0
#endif

typedef char bool;
#define true    ((bool) 1)
#define false    ((bool) 0)

#define BLCKSZ    8192
#define UNITSZ    4
#define MEMSZ    (950*1024*1024)

char    *data_file;
int     cpu_cost;
volatile bool stop_scan;
char    thread_buffer[BLCKSZ*UNITSZ];

static void
cleanup_cache(void)
{
    char    *p;

    if (NULL == (p = (char *) malloc(MEMSZ)))
    {
        fprintf(stderr, "insufficient memory\n");
        exit(-1);
    }
    memset(p, 'a', MEMSZ);
    free(p);
}

#ifdef WIN32
bool    enable_aio = false;

static const unsigned __int64 epoch = 116444736000000000L;
static int
gettimeofday(struct timeval *tp, struct timezone *tzp)
{
    FILETIME    file_time;
    SYSTEMTIME  system_time;
    ULARGE_INTEGER ularge;

    GetSystemTime(&system_time);
    SystemTimeToFileTime(&system_time, &file_time);
    ularge.LowPart = file_time.dwLowDateTime;
    ularge.HighPart = file_time.dwHighDateTime;

    tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
    tp->tv_usec = (long) (system_time.wMilliseconds * 1000);

    return 0;
}

static void
sleep(int secs)
{
    SleepEx(secs * 1000, true);
}

static int
thread_open()
{
    HANDLE      fd;
    SECURITY_ATTRIBUTES sa;

    sa.nLength = sizeof(sa);
    sa.bInheritHandle = TRUE;
    sa.lpSecurityDescriptor = NULL;

    fd = CreateFile(data_file,
                    GENERIC_READ,
                    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                    &sa,
                    OPEN_EXISTING,
                    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN |
                    (enable_aio ? FILE_FLAG_OVERLAPPED : 0),
                    NULL);

    if (fd == INVALID_HANDLE_VALUE)
    {
        int     errCode;

        switch (errCode = GetLastError())
        {
            /* EMFILE, ENFILE should not occur from CreateFile. */
            case ERROR_PATH_NOT_FOUND:
            case ERROR_FILE_NOT_FOUND:  errno = ENOENT; break;
            case ERROR_FILE_EXISTS:     errno = EEXIST; break;
            case ERROR_ACCESS_DENIED:   errno = EACCES; break;
            default:
                fprintf(stderr, "thread_open failed: %d\n", errCode);
                errno = EINVAL;
        }

        return -1;
    }

    return (int) fd;
}

static int
thread_read(int fd, int blkno, size_t nblk, char *buf)
{
    long        offset = BLCKSZ * blkno;
    DWORD       nbytes = 0;
    OVERLAPPED  ol;

    memset(&ol, 0, sizeof(OVERLAPPED));
    ol.Offset = offset;
    ol.OffsetHigh = 0;

    if (ReadFile((HANDLE) fd, buf, BLCKSZ * nblk, &nbytes, &ol))
    {
        /* successfully done without delay */
    }
    else
    {
        int     errCode;

        switch (errCode = GetLastError())
        {
            case ERROR_IO_PENDING:
                break;
            case ERROR_HANDLE_EOF:
                break;
            default:
                /* unknown error occurred */
                fprintf(stderr, "asyncread failed: %d\n", errCode);
                exit(-1);
        }
    }

    return (int) nbytes;
}

static void
thread_close(int fd)
{
    CloseHandle((HANDLE) fd);
}

#else        /* non-windows platforms */

static int
thread_open()
{
    int     fd;

    fd = open(data_file, O_RDWR | PG_BINARY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "thread_open failed: %d\n", errno);
        exit(-1);
    }

    return fd;
}

static int
thread_read(int fd, int blkno, size_t nblk, char *buf)
{
    long        offset = BLCKSZ * blkno;
    long        nbytes;

    nbytes = lseek(fd, offset, SEEK_SET);
    nbytes = read(fd, buf, BLCKSZ * nblk);
    if (nbytes <= 0)
    {
        fprintf(stderr, "thread_read failed: %d\n", errno);
        exit(-1);
    }

    return nbytes;
}

static void
thread_close(int fd)
{
    close(fd);
}
#endif

#ifdef WIN32
static DWORD WINAPI
scan_thread(LPVOID args)
#else
static void *
scan_thread(void *args)
#endif
{
    int     i, fd;
    int     start, end;

    start = 0;
    end = (size_t) args;

    fd = thread_open();
    for (i = start; i < end; i += UNITSZ)
    {
        thread_read(fd, i, UNITSZ, (char *) thread_buffer);

        /* check if I was asked to stop */
        if (stop_scan == true)
            break;
    }
    thread_close(fd);

    return 0;
}

static int
init_scan(bool with_threads, size_t *nblocks)
{
    int     fd;

    /* open file for do_scan */
    fd = open(data_file, O_RDWR | PG_BINARY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "failed to open file %s\n", data_file);
        exit(-1);
    }

    *nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;
    if (*nblocks < 0)
    {
        fprintf(stderr, "failed to get file length %s\n", data_file);
        exit(-1);
    }

    if (with_threads)
    {
#ifndef WIN32
        pthread_t   thread;
#endif

        /* create scan threads */
        stop_scan = false;
#ifdef WIN32
        if (NULL == CreateThread(NULL, 0,
                                 scan_thread, (void *) (*nblocks),
                                 0, NULL))
#else
        if (pthread_create(&thread, NULL,
                           scan_thread, (void *) (*nblocks)))
#endif
        {
            fprintf(stderr, "failed to start scan thread");
            exit(-1);
        }
    }

    return fd;
}

static void
do_scan(int fd, size_t nblocks)
{
    int     i, j, k, nbytes;
    char    buffer[BLCKSZ];

    for (i = 0; i < nblocks; i++)
    {
        nbytes = lseek(fd, i * BLCKSZ, SEEK_SET);
        nbytes = read(fd, buffer, BLCKSZ);
        if (nbytes != BLCKSZ)
        {
            fprintf(stderr, "do_scan read failed\n");
            exit(-1);
        }

        /* pretend to do some CPU intensive analysis */
        for (k = 0; k < cpu_cost; k++)
        {
            for (j = (k * sizeof(int)) % BLCKSZ;
                 j < BLCKSZ / (5 * sizeof(int));
                 j += sizeof(int))
            {
                int     x, y;

                x = ((int *) buffer)[j];
                x = (int) pow((double) x, (double) (x + 1));
                y = (int) sin((double) x * x);
                ((int *) buffer)[j] = x * y;
            }
        }
    }
}

static void
close_scan(int fd)
{
    stop_scan = true;
    close(fd);
}

int
main(int argc, char *argv[])
{
    int     i, rounds, fd;
    size_t  nblocks;

    if (argc != 4)
    {
        fprintf(stderr, "usage: seqscan <rounds> <datafile> <cpu_cost>\n");
        exit(-1);
    }

    rounds = atoi(argv[1]);
    data_file = argv[2];
    cpu_cost = atoi(argv[3]);

    fd = init_scan(false, &nblocks);
    close_scan(fd);

    fprintf(stdout, "PostgreSQL sequential scan simulator configuration:\n"
            "\tMemory size: %u\n"
            "\tCPU cost per page: %d\n"
            "\tScan thread read unit size: %d\n\n",
            MEMSZ, cpu_cost, UNITSZ);

    for (i = 0; i < 2 * rounds; i++)
    {
        struct timeval start_t, stop_t;
        long    usecs;
        bool    enable = i % 2 ? true : false;

        /* eliminate system cached data */
        cleanup_cache();
        sleep(2);

        /* do the scan task */
        gettimeofday(&start_t, NULL);
        fd = init_scan(enable, &nblocks);
        do_scan(fd, nblocks);
        close_scan(fd);
        gettimeofday(&stop_t, NULL);

        /* measure the time */
        if (stop_t.tv_usec < start_t.tv_usec)
        {
            stop_t.tv_sec--;
            stop_t.tv_usec += 1000000;
        }
        usecs = (long) (stop_t.tv_sec - start_t.tv_sec) * 1000000
                + (long) (stop_t.tv_usec - start_t.tv_usec);
        fprintf(stdout, "With scan threads %s - duration: %ld.%03ld ms\n",
                enable ? "on" : "off",
                (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
                        (stop_t.tv_usec - start_t.tv_usec) / 1000),
                (long) (stop_t.tv_usec - start_t.tv_usec) % 1000);

        sleep(2);
    }

    exit(0);
}


Re: ice-broker scan thread

From
David Boreham
Date:
Qingqing Zhou wrote:

>I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
>sequential scan IO speed. The basic idea of this thread is just like the
>"read-ahead" method, but the difference is this one does not read the data
>into shared buffer pool directly, instead, it reads the data into file
>system cache, which makes the integration easy and this is unique to
>PostgreSQL.
>  
>
Interesting, and I wondered about this too. But for my taste the 
demonstrated benefit really
isn't large enough to make it worthwhile.
BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where 
it attempts
to detect the application's access pattern including if it is reading 
sequentially and even
if there is a 'stride' to the accesses when they're not contiguous. I 
would imagine that
other filesystems attempt similar tricks. So one might expect a simple 
linear prefetch
to not help much in the presence of such a filesystem.

Were you worried about the icebreaker thread getting too far ahead of 
the scan ?
If it did it might page out the data you're about to read, I think. Of 
course this could
be fixed by having the read ahead thread periodically check the current 
location being
read by the query thread and pausing if it's got too far ahead.

Anyway, the recent performance thread has been interesting to me because 
in all my career
I've never seen a database that scanned scads of data from disk to 
process a query.
Typically the problems I work on arrange to read the entire database 
into memory.
I think I need to get out more... ;)





Re: ice-broker scan thread

From
Gavin Sherry
Date:
On Mon, 28 Nov 2005, Qingqing Zhou wrote:

>
> I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
> sequential scan IO speed. The basic idea of this thread is just like the
> "read-ahead" method, but the difference is this one does not read the data
> into shared buffer pool directly, instead, it reads the data into file
> system cache, which makes the integration easy and this is unique to
> PostgreSQL.
>

MySQL, Oracle and others implement read-ahead threads to simulate async IO
'pre-fetching'. I've been experimenting with two ideas. The first is to
increase the readahead when we're doing sequential scans (see prototype
patch using posix fadvise attached). I've not got any hardware at the
moment which I can test this patch on but I am waiting on some dbt-3
results which should indicate whether fadvise is a good idea or a bad one.
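
To give a rough idea (this is just a sketch, not the attached patch; BLCKSZ and
the helper name are made up for illustration), the fadvise hint for a
sequential scan amounts to something like:

#include <fcntl.h>

#define BLCKSZ  8192

/*
 * Illustrative helper: hint that we are about to read nblocks blocks
 * sequentially starting at blkno.  POSIX_FADV_WILLNEED starts readahead
 * into the file system cache; POSIX_FADV_SEQUENTIAL asks for a larger
 * readahead window (on Linux it doubles the default).
 */
static void
hint_sequential_read(int fd, long blkno, long nblocks)
{
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ,
                         (off_t) nblocks * BLCKSZ,
                         POSIX_FADV_WILLNEED);
}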

The second idea is using posix async IO at key points within the system
to better parallelise CPU and IO work. The areas where I think we could use
async IO are: during sequential scans, use async IO to do pre-fetching of
blocks; inside WAL, begin flushing WAL buffers to disk before we commit;
and, inside the background writer/check point process, asynchronously
write out pages and, potentially, asynchronously build new checkpoint segments.
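
As an illustration only (not a proposal for the actual bufmgr interface - the
window size, buffers and names below are invented), a backend could keep a
small ring of outstanding aio_read() requests ahead of the block it is
currently processing:

#include <aio.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define BLCKSZ      8192
#define NPREFETCH   8                   /* illustrative window size */

static struct aiocb aio_ring[NPREFETCH];
static char     aio_buf[NPREFETCH][BLCKSZ];

/* start an asynchronous read of block blkno into slot 'slot' */
static void
start_prefetch(int fd, long blkno, int slot)
{
    struct aiocb *cb = &aio_ring[slot];

    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = aio_buf[slot];
    cb->aio_nbytes = BLCKSZ;
    cb->aio_offset = (off_t) blkno * BLCKSZ;
    if (aio_read(cb) != 0)
        perror("aio_read");             /* real code would fall back to a sync read */
}

/* wait for the read in 'slot' and return the number of bytes read */
static ssize_t
finish_prefetch(int slot)
{
    const struct aiocb *cbs[1] = {&aio_ring[slot]};

    while (aio_error(&aio_ring[slot]) == EINPROGRESS)
        aio_suspend(cbs, 1, NULL);
    return aio_return(&aio_ring[slot]);
}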

The motivation for using async IO is twofold: first, the results of this
paper[1] are compelling; second, modern OSs support async IO. I know that
Linux[2], Solaris[3], AIX and Windows all have async IO and I presume that
all their rivals have it as well.

The fundamental premise of the paper mentioned above is that if the
database is busy, IO should be busy. With our current block-at-a-time
processing, this isn't always the case. This is why Qingqing's read-ahead
thread makes sense. My reason for mailing is, however, that the async IO
results are more compelling than the read ahead thread.

I haven't had time to prototype whether we can easily implement async IO
but I am planning to work on it in December. The two main goals will be to
a) integrate and utilise async IO, at least within the executor context,
and b) build a primitive kind of scheduler so that we stop prefetching
when we know that there are a certain number of outstanding IOs for a
given device.

Thanks,

Gavin



[1] http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
[2] http://lse.sourceforge.net/io/aionotes.txt
[3] http://developers.sun.com/solaris/articles/event_completion.html - I'm
fairly sure they have a posix AIO wrapper around these routines, but I
cannot see it documented anywhere :-(

Re: ice-broker scan thread

From
Christopher Kings-Lynne
Date:
Qingqing,

>> I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
>> sequential scan IO speed. The basic idea of this thread is just like the
>> "read-ahead" method, but the difference is this one does not read the 
>> data
>> into shared buffer pool directly, instead, it reads the data into file
>> system cache, which makes the integration easy and this is unique to
>> PostgreSQL.

You probably mean "ice-breaker" by the way :)

Chris



Re: ice-broker scan thread

From
Tom Lane
Date:
Gavin Sherry <swm@linuxworld.com.au> writes:
> I haven't had time to prototype whether we can easily implement async IO

Just as with any suggestion to depend on threads, you are going to have
to show results that border on astounding to have any chance of getting
this in.  Otherwise the portability issues are just going to make it not
worth the trouble.
        regards, tom lane


Re: ice-broker scan thread

From
David Boreham
Date:
Gavin Sherry wrote:

> MySQL, Oracle and others implement read-ahead threads to simulate async IO

I always believed that Oracle used async file I/O. Not that I've seen their
code, but I'm fairly sure they funded the addition of kernel aio to Linux
a few years back.

But....Oracle comes from a time long ago when threads and decent
filesystems didn't exist, so some of the things they do may not be 
appropriate
to add to a product that doesn't have them today.

Now...network async I/O...that'd be really useful in my world...




Re: ice-broker scan thread

From
Gavin Sherry
Date:
On Mon, 28 Nov 2005, David Boreham wrote:

> Gavin Sherry wrote:
>
> > MySQL, Oracle and others implement read-ahead threads to simulate async IO
>
> I always believed that Oracle used async file I/O. Not that I've seen their
> code, but I'm fairly sure they funded the addition of kernel aio to Linux
> a few years back.

That's right.

>
> But....Oracle comes from a time long ago when threads and decent
> filesystems didn't exist, so some of the things they do may not be
> appropriate
> to add to a product that doesn't have them today.

The paper I linked to seemed to suggest that they weren't using async IO
in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
10g.

Gavin


Re: ice-broker scan thread

From
Mark Kirkwood
Date:
Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> 
>>I haven't had time to prototype whether we can easily implement async IO
> 
> 
> Just as with any suggestion to depend on threads, you are going to have
> to show results that border on astounding to have any chance of getting
> this in.  Otherwise the portability issues are just going to make it not
> worth the trouble.

Do these ideas require threads in principle? ISTM that there could be 
(additional) process(es) waiting to perform pre-fetching or async io, 
and we could use the usual IPC machinery to talk between them...

cheers

Mark


Re: ice-broker scan thread

From
David Boreham
Date:
> The paper I linked to seemed to suggest that they weren't using async IO
> in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
> 10g.

...<reads paper>... ok, interesting. Did they say that Oracle isn't using aio?
I can't see that. They say that Oracle has no more than one outstanding I/O
operation in flight per concurrent query, and they appear to think that's a
bad thing. I'm not seeing that myself. Perhaps once I sleep on it, it'll
become clear what they're getting at.

One theory for the lack of aio in Oracle as tested in that paper would be that
they were testing on Linux. Since aio is relatively new in Linux I wouldn't be
surprised if Oracle didn't actually use it until it's known to be widely
deployed in the field and to have proven reliability. Perhaps we've reached
that state around now, and so Oracle may not yet have released an aio-capable
Linux version of their RDBMS. Just a theory...someone from those tubular
towers lurking here could tell us for sure I guess...

Re: ice-broker scan thread

From
Gavin Sherry
Date:
On Mon, 28 Nov 2005, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> > I haven't had time to prototype whether we can easily implement async IO
>
> Just as with any suggestion to depend on threads, you are going to have
> to show results that border on astounding to have any chance of getting
> this in.  Otherwise the portability issues are just going to make it not
> worth the trouble.

The architecture I am looking at would not rely on threads.

I didn't want to jump on list and waive my hands until I had something to
show, but since Qingqing is looking at the issue I thought I better raise
it.

Gavin


Re: ice-broker scan thread

From
Mark Kirkwood
Date:
Gavin Sherry wrote:

> 
> The paper I linked to seemed to suggest that they weren't using async IO
> in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
> 10g.
> 

There have been async io type parameters in Oracle's init.ora files from 
(at least) 8i (disk_async_io=true IIRC) - on Solaris anyway. Whether 
this enabled real or simulated async io is probably a good question - I 
recall during testing turning it off and seeing kio()? or similar type 
calls become write()/read() in truss output.

regards

Mark


Re: ice-broker scan thread

From
Qingqing Zhou
Date:

On Mon, 28 Nov 2005, Mark Kirkwood wrote:
>
> Do these ideas require threads in principle? ISTM that there could be
> (additional) process(es) waiting to perform pre-fetching or async io,
> and we could use the usual IPC machinary to talk between them...
>

Right. I use threads because it is easy to write simulation program :-)

Regards,
Qingqing


Re: ice-broker scan thread

From
"Jonah H. Harris"
Date:
FYI, I've personally used Oracle 9.2.0.4's async IO on Linux and have seen several installations which make use of it also.


 
On 11/28/05, Gavin Sherry <swm@linuxworld.com.au> wrote:
On Mon, 28 Nov 2005, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au > writes:
> > I haven't had time to prototype whether we can easily implement async IO
>
> Just as with any suggestion to depend on threads, you are going to have
> to show results that border on astounding to have any chance of getting
> this in.  Otherwise the portability issues are just going to make it not
> worth the trouble.

The architecture I am looking at would not rely on threads.

I didn't want to jump on list and waive my hands until I had something to
show, but since Qingqing is looking at the issue I thought I better raise
it.

Gavin


Re: ice-broker scan thread

From
Qingqing Zhou
Date:

On Mon, 28 Nov 2005, Gavin Sherry wrote:
>
> MySQL, Oracle and others implement read-ahead threads to simulate async IO
> 'pre-fetching'.

According to my tests on Windows (using the attached program with
enable_aio changed to true), it seems aio doesn't help as a separate thread -
but maybe my usage is wrong ...

Regards,
Qingqing


Re: ice-broker scan thread

From
Qingqing Zhou
Date:

On Mon, 28 Nov 2005, Gavin Sherry wrote:
>
> I didn't want to jump on list and waive my hands until I had something to
> show, but since Qingqing is looking at the issue I thought I better raise
> it.
>

Don't worry :-) I separated the logic into a standalone program so that
more people can help on this issue.

Regards,
Qingqing


Re: ice-broker scan thread

From
"Qingqing Zhou"
Date:
"David Boreham" <david_list@boreham.org> wrote
>
> BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where 
> it attempts to detect the application's access pattern including if it is
> reading sequentially and even if there is a 'stride' to the accesses when
> they're not contiguous. I would imagine that other filesystems attempt 
> similar tricks. So one might expect a simple linear prefetch
> to not help much in the presence of such a filesystem.
>

So we need more tests. I understand how smart current file systems are, and 
it seems that depends on the interval at which you send the next file block 
read request (decided by the cpu_cost parameter in my program).

I imagine that on a multi-way machine with a strong IO device, the ice-breaker 
could do much better ...

> Were you worried about the icebreaker thread getting too far ahead of the 
> scan ? If it did it might page out the data you're about to read, I think. 
> Of course this could be fixed by having the read ahead thread periodically 
> check the current location being read by the query thread and pausing if 
> it's got too far ahead.
>

Right.

Regards,
Qingqing 




Re: ice-broker scan thread

From
David Boreham
Date:
Qingqing Zhou wrote:

> On Mon, 28 Nov 2005, Gavin Sherry wrote:
>> MySQL, Oracle and others implement read-ahead threads to simulate async IO
>> 'pre-fetching'.
>
> According to my tests on Windows (using the attached program with
> enable_aio changed to true), it seems aio doesn't help as a separate thread -
> but maybe my usage is wrong ...

I don't think your NT overlapped I/O code is quite right. At least
I think it will issue reads at a high rate without waiting for any of them
to complete. Beyond some point that has to give the kernel gut-rot.
But anyway, I wouldn't expect the use of aio to make any significant
difference in an already threaded test program. The point of aio is to allow
I/O concurrency _without_ the use of threads or multiple processes.
You could re-write your program to have a single thread but use aio.
In that case it should show the same read ahead benefit that you see
with the thread.
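
For comparison, an overlapped read that does wait for completion looks roughly
like the sketch below (illustrative only; the handle must have been opened
with FILE_FLAG_OVERLAPPED):

#include <windows.h>
#include <string.h>

/* Sketch: issue one overlapped read and block until it completes. */
static DWORD
overlapped_read(HANDLE fd, char *buf, DWORD nbytes, DWORD offset)
{
    OVERLAPPED  ol;
    DWORD       got = 0;

    memset(&ol, 0, sizeof(ol));
    ol.Offset = offset;
    ol.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    if (!ReadFile(fd, buf, nbytes, &got, &ol) &&
        GetLastError() == ERROR_IO_PENDING)
    {
        /* wait here, instead of piling up an unbounded number of requests */
        GetOverlappedResult(fd, &ol, &got, TRUE);
    }
    CloseHandle(ol.hEvent);
    return got;
}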

Re: ice-broker scan thread

From
Gavin Sherry
Date:
On Mon, 28 Nov 2005, Qingqing Zhou wrote:

>
>
> On Mon, 28 Nov 2005, Gavin Sherry wrote:
> >
> > MySQL, Oracle and others implement read-ahead threads to simulate async IO
> > 'pre-fetching'.
>
> Due to my tests on Windows (using the attached program and change
> enable_aio=true), seems aio doesn't help as a separate thread - but maybe
> because my usage is wrong ...

Right, I would imagine that it's very close. I intend to use kernel based
async IO so that we can have the prefetch effect of your sample program
without the need for threads.

Thanks,

Gavin


Re: ice-broker scan thread

From
"Qingqing Zhou"
Date:
"David Boreham" <david_list@boreham.org> wrote
>>
> I don't think your NT overlapped I/O code is quite right. At least
> I think it will issue reads at a high rate without waiting for any of them
> to complete. Beyond some point that has to give the kernel gut-rot.
>

[also with reply to Gavin] Looked up "gut-rot" in the dictionary, got it ... 
Uh, this behavior is intended - I try to push enough requests to the kernel in 
a short time so that it understands that I am doing a sequential scan, and so 
it will pull the data from disk into the file system cache more efficiently. 
Some file systems may have a "free-behind" mechanism, but our main thread 
(which really processes the query) should be fast enough to get there before 
the data vanishes.

>
> You could re-write your program to have a single thread but use aio.
> In that case it should show the same read ahead benefit that you see
> with the thread.
>

I guess this is also Gavin's point - I understand these would be two different 
methodologies for handling "read-ahead". If no other thread/process is 
involved, then the main thread is responsible for grabbing a free buffer page 
from the buffer pool and asking the kernel to put the data there via sync IO 
(what current PostgreSQL does) or async IOs. And that's what I want to avoid. 
I'd like to use a dedicated thread/process to "break the ice" only, i.e., pull 
data from disk into the file system cache, so that the main thread will only 
issue *logical* reads.

Regards,
Qingqing 




Re: ice-broker scan thread

From
Martijn van Oosterhout
Date:
On Tue, Nov 29, 2005 at 02:53:36PM +1100, Gavin Sherry wrote:
> The second idea is using posix async IO at key points within the system
> to better parallelise CPU and IO work. There areas I think we could use
> async IO are: during sequential scans, use async IO to do pre-fetching of
> blocks; inside WAL, begin flushing WAL buffers to disk before we commit;
> and, inside the background writer/check point process, asynchronously
> write out pages and, potentially, asynchronously build new checkpoint segments.

I actually worked on this and got it to the stage where it wouldn't
crash anymore. It basically added a command to bufmgr.c called
PrefetchBuffer() which would initiate a request but not block. I then
hooked a few strategic places to call this. In particular during an
index scan, it would prefetch the next index block and the next few
data blocks and then return them in order as they came in.
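
The patch isn't attached here, but the call pattern is roughly the sketch below
(the PrefetchBuffer() signature shown is invented for illustration, not the
real one from my patch; ReadBuffer()/ReleaseBuffer() are the ordinary bufmgr
calls):

/*
 * Hypothetical sketch only: PrefetchBuffer() is the experimental call
 * described above and its real signature is not shown in this mail.
 */
static void
fetch_heap_blocks(Relation heapRel, BlockNumber *blocks, int nblocks)
{
    int     i;

    /* kick off non-blocking reads for the next few heap blocks */
    for (i = 0; i < nblocks && i < 8; i++)
        PrefetchBuffer(heapRel, blocks[i]);

    /*
     * The normal reads then find the I/O already under way (or done)
     * instead of blocking on a cold request.
     */
    for (i = 0; i < nblocks; i++)
    {
        Buffer  buf = ReadBuffer(heapRel, blocks[i]);

        /* ... process the tuples on this page ... */
        ReleaseBuffer(buf);
    }
}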

Unfortunately I can't really test it at its full potential because it
uses glibc's default POSIX AIO, which is *lame*: no more than one
outstanding request per fd, which for PostgreSQL is crappy. There was
some evidence that in an index scan of a highly uncorrelated index
it did make a small difference, but I never got around to testing it
fully. But bitmap scans already hugely reduce the cost of uncorrelated
indexes.

It doesn't pass regression because index_getmulti doesn't do backward
scans. Everything else works though.

If anyone is interested in the code I can send it to them. The results
on my system just weren't good enough to justify a lot more effort.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: ice-broker scan thread

From
"Pollard, Mike"
Date:
First, we need a new term for a thread of execution, that could be a
thread or could be a process, I don't care.  When discussing anything
that is to run in parallel, the first thing that pops out of someone's
mouth is "Don't you mean (thread/process)?"  But that's an
implementation detail and should not be considered during a planning
phase, unless it is fundamental to the problem.  Hence, the term TOE to
mean "I don't really care if it is in it's own address space, or the
same address space.".  However, I understand that this is not in common
usage, so in the following discussion I use the term thread, as it is
more correct than process.  I am just not defining if that thread is the
only thread running in its process or not.

I've implemented this on another database product, using buf reading
threads  to pull the data all the way into the database cache.  In
testing on Unix production systems (4 CPU machines, large RAID devices,
100Gb+ databases), table scans performed 5 to 7 times faster; on MVS
table scans are up to 10 times faster.  But, I never had much luck on
getting the performance to change on Windows.  Partially, I think, it's
because the machine I was using was IDE, not SCSI, so I was already
greatly bottlenecked.  Maybe SATA would be better?  I haven't tested
there, either.

Anyway, what I did was the following.  When doing a sequential scan, we
were starting at the beginning of the table and scanning forward.  If I
threw up some threads to read ahead, then my user thread and my read
ahead threads would thrash on trying to lock the buffer slots.  So, I
had the read ahead threads start at some distance into the table, and
work toward the beginning.  The user thread would do its own I/O until
it passed the read ahead threads.  I also broke the read ahead section
into multiple contiguous sections, and had different threads read each
section, so the user thread would only have a problem with the first
section; by the time it was finished with that, the other sections would
be read in.  When the user thread got to about 80% of the nodes that got
read ahead, it would schedule another section to be read.

+----------------------------------------------------------------+
|   table                                                        |
+----------------------------------------------------------------+
  (user->)            (<-readahead)   (<-readahead)   (<-readahead)

So above, the user thread starts low in the table and works toward the high
end; the readahead threads start higher (but not at the end of the table)
and work toward the low end.

Like I said, this worked very well for me.
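
A rough sketch of one read ahead section (illustrative only, not our actual
code; it is shown as a plain pread() into a scratch buffer, whereas our
version pulls the pages into the database cache):

#include <unistd.h>

#define BLCKSZ  8192

typedef struct
{
    int     fd;
    long    first_blk;      /* low end of this section  */
    long    last_blk;       /* high end of this section */
} ReadAheadSection;

static void *
readahead_section(void *arg)
{
    ReadAheadSection *sec = (ReadAheadSection *) arg;
    char    buf[BLCKSZ];
    long    blk;

    /*
     * Start at the high end and work toward the beginning, so this TOE
     * and the user scan (moving low to high) don't fight over the same
     * buffer slots for long.
     */
    for (blk = sec->last_blk; blk >= sec->first_blk; blk--)
        if (pread(sec->fd, buf, BLCKSZ, (off_t) blk * BLCKSZ) <= 0)
            break;
    return NULL;
}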

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.
--------------------------------
Better to remain silent and be thought a fool than to speak out and
remove all doubt.
        Abraham Lincoln

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Qingqing Zhou
Sent: Tuesday, November 29, 2005 12:56 AM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] ice-broker scan thread


"David Boreham" <david_list@boreham.org> wrote
>>
> I don't think your NT overlapped I/O code is quite right. At least
> I think it will issue reads at a high rate without waiting for any of
them
> to complete. Beyond some point that has to give the kernel gut-rot.
>

[also with reply to Gavin] look up dictionary for "gut-rot", got it ...
Uh,
this behavior is intended - I try to push enough requests shortly to
kernel
so that it understands that I am doing sequential scan, so it would pull
the
data from disk to file system cache more efficiently. Some file systems
may
have "free-behind" mechanism, but our main thread (who really process
the
query) should be fast enough before the data vanished.

>
> You could re-write your program to have a single thread but use aio.
> In that case it should show the same read ahead benefit that you see
> with the thread.
>

I guess this is also Gavin's point - I understand that will be two
different
methodologies to handle "read-ahead". If no other thread/process
involved,
then the main thread will be responsible to grab a free buffer page from

bufferpool and ask the kernel to put the data there by sync IO (current
PostgreSQL does) or async IOs. And that's what I want to avoid. I'd like
to
use a dedicated thread/process to "break the ice" only, i.e., pull data
from
disk to file system cache, so that the main thread will only issue
*logical*
read.

Regards,
Qingqing





Re: ice-broker scan thread

From
David Boreham
Date:
>threw up some threads to read ahead, then my user thread and my read
>ahead threads would thrash on trying to lock the buffer slots.  So, I
>had the read ahead threads start at some distance into the table, and
>work toward the beginning.  The user thread would do its own I/O until
>  
>
Ah. The lightbulb went on. You want multiple outstanding I/O operations
in the case that the table or index spans multiple physical disks.




Re: ice-broker scan thread

From
Martijn van Oosterhout
Date:
On Tue, Nov 29, 2005 at 09:45:30AM -0500, Pollard, Mike wrote:
> Anyway, what I did was the following.  When doing a sequential scan, we
> were starting at the beginning of the table and scanning forward.  If I
> threw up some threads to read ahead, then my user thread and my read
> ahead threads would thrash on trying to lock the buffer slots.  So, I

<snip>

> so above, the user threads is starting low in the table and working
> high; the readahead threads are starting higher (but not at the end of
> the table), and working low.

Ok, this may be a really dumb question, but doesn't this rely on the
fact that the table is smaller than the number of buffers? If the table
is large most of your data will be tossed out again by later data
before it's been used by the backend.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: ice-broker scan thread

From
"Pollard, Mike"
Date:
No, I only go x number of pages ahead of the user scan (where x is
currently user defined, but it should be significantly smaller than your
number of data buffers).  I have found that reading about 16Mb ahead
gives optimal performance, and on modern machines isn't all that much
memory.  Once the user scan has processed most of that 16Mb, the next
section of the tree is scheduled to be read.  I don't keep the read ahead
threads a constant distance ahead, because I found it to be more
efficient if they occasionally had a lot of pages to read at once,
rather than constantly having a few pages to read.
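
For example, with 8Kb pages a 16Mb window is 2048 pages, and the trigger is
roughly the following (a sketch only; schedule_readahead is a made-up name):

#define READAHEAD_PAGES 2048            /* 2048 * 8Kb = 16Mb */

/* called by the user scan after it consumes each page (sketch only) */
static void
maybe_schedule_readahead(long pages_consumed, long *window_start)
{
    /* once ~80% of the current window is used, queue the next section */
    if (pages_consumed - *window_start >= (READAHEAD_PAGES * 8) / 10)
    {
        *window_start += READAHEAD_PAGES;
        schedule_readahead(*window_start, READAHEAD_PAGES); /* made-up name */
    }
}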

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.
--------------------------------
Better to remain silent and be thought a fool than to speak out and
remove all doubt.
        Abraham Lincoln

-----Original Message-----
From: Martijn van Oosterhout [mailto:kleptog@svana.org]
Sent: Tuesday, November 29, 2005 10:06 AM
To: Pollard, Mike
Cc: Qingqing Zhou; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] ice-broker scan thread

On Tue, Nov 29, 2005 at 09:45:30AM -0500, Pollard, Mike wrote:
> Anyway, what I did was the following.  When doing a sequential scan,
we
> were starting at the beginning of the table and scanning forward.  If
I
> threw up some threads to read ahead, then my user thread and my read
> ahead threads would thrash on trying to lock the buffer slots.  So, I

<snip>

> so above, the user threads is starting low in the table and working
> high; the readahead threads are starting higher (but not at the end of
> the table), and working low.

Ok, this may be a really dumb question, but doesn't this rely on the
fact that the table is smaller than the amount of buffers? If the table
is large most of your data will be tossed out again by later data
before it's been used by the backend.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
a
> tool for doing 5% of the work and then sitting around waiting for
someone
> else to do the other 95% so you can sue them.


Re: ice-broker scan thread

From
David Boreham
Date:
>Unfortunatly I can't really test it at it's full potential because it
>uses glibc's default POSIX AIO which is *lame*. No more than one
>outstanding request per fd which for PostgreSQL is crappy. There was
>  
>
I had the impression from the kernel aio mailing list a while back that
post-<some kernel version> linux, the POSIX aio calls were forwarded
to the kernel aio interface. Or are you saying that the POSIX API itself 
imposes
that limitation ?




Re: ice-broker scan thread

From
Andrew Piskorski
Date:
On Tue, Nov 29, 2005 at 03:14:38PM +1100, Gavin Sherry wrote:
> On Mon, 28 Nov 2005, David Boreham wrote:
> > Gavin Sherry wrote:
> > > MySQL, Oracle and others implement read-ahead threads to simulate async IO
> >
> > I always believed that Oracle used async file I/O. Not that I've seen their

> The paper I linked to seemed to suggest that they weren't using async IO
> in 9.2 -- which is fairly old.
  http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
  "Getting Priorities Straight: Improving Linux Support for Database I/O"
  by Hall and Bonnet
  Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005


I think you've misread that paper.  AFAICT it neither says nor even
suggests that Oracle 9.2 does not use asynchronous I/O on Linux.  In
fact, it seems to strongly suggest exactly the opposite, that Oracle
does use async I/O wherever it can.

Note they also reference this document, which as of 2002 and Linux
kernel 2.4.x, was urging Oracle DBAs to use Oracle's kernel-based
asynchronous I/O support whenever possible:
 http://www.ixora.com.au/tips/use_asynchronous_io.htm

What Hall and Bonnet's paper DOES say, is that both Oracle and MySQL
InnoDB appear to use a "conservative" I/O submission policy, but
Oracle does so more efficiently.  They also argue that both Oracle and
MySQL fail to utilize the "full potential" of Linux async I/O because
of their conservative submission policies, and that an "aggressive" I/O
submission policy would work better, but only if support for
Prioritized I/O is added to Linux.  They then proceed to add that
support, and make some basic changes to InnoDB to partially take
advantage of it.

Also interesting is their casual mention that for RDBMS workloads, the
default Linux 2.6 disk scheduler "anticipatory" is inferior to the
"deadline" scheduler.  They base their (simple sounding) Prioritized
I/O support on the deadline scheduler.

-- 
Andrew Piskorski <atp@piskorski.com>
http://www.piskorski.com/


Re: ice-broker scan thread

From
Martijn van Oosterhout
Date:
On Tue, Nov 29, 2005 at 08:42:18AM -0700, David Boreham wrote:
>
> >Unfortunatly I can't really test it at it's full potential because it
> >uses glibc's default POSIX AIO which is *lame*. No more than one
> >outstanding request per fd which for PostgreSQL is crappy. There was
> >
> I had the impression from the kernel aio mailing list a while back
> that post-<some kernel version> linux, the POSIX aio calls were
> forwarded to the kernel aio interface. Or are you saying that the
> POSIX API itself imposes that limitation ?

By default when you use aio you get the version in libc (-lrt IIRC),
which has the issue I mentioned, probably because it's optimised for the
lots-of-network-connections type of program where multiple outstanding
requests on a single fd are not meaningful. You can however link in some
other library which gives you kernel support. However, I don't have a new
enough kernel to have the kernel support, so I haven't tested that.

POSIX AIO doesn't prescribe either way.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: ice-broker scan thread

From
David Boreham
Date:
>By default when you use aio you get the version in libc (-lrt IIRC)
>which has the issue I mentioned, probably because it's probably
>optimised for the lots-of-network-connections type program where
>multiple outstanding requests on a single fd are not meaningful. You
>can however link in some other library which gives you kernel support.
>However, I don't have a new enough kernel to have the kernel support so
>I havn't tested that.
>  
>
Actually, after reading up on the current state of things, I'm not sure you
can even get POSIX aio on top of kernel aio in Linux. There are also a
few limitations in the 2.6 aio implementation that might prove troublesome:
for example it only works with O_DIRECT.

libaio gives userland access to the kernel aio api (which is different 
from POSIX aio).
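
Roughly, that kernel interface looks like the sketch below (illustrative only;
it assumes libaio is installed, you compile with -laio, and _GNU_SOURCE is
defined for O_DIRECT):

#define _GNU_SOURCE                     /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ  8192

/* Sketch: submit one O_DIRECT read through the kernel aio interface. */
static long
kernel_aio_read_block(const char *path, long blkno, void **out)
{
    io_context_t    ctx = 0;
    struct iocb     cb;
    struct iocb    *cbs[1] = {&cb};
    struct io_event ev;
    void           *buf;
    int             fd;

    fd = open(path, O_RDONLY | O_DIRECT);
    posix_memalign(&buf, 4096, BLCKSZ);     /* O_DIRECT needs aligned buffers */

    io_setup(1, &ctx);                      /* at most one outstanding event */
    io_prep_pread(&cb, fd, buf, BLCKSZ, (long long) blkno * BLCKSZ);
    io_submit(ctx, 1, cbs);

    io_getevents(ctx, 1, 1, &ev, NULL);     /* wait for the completion */

    io_destroy(ctx);
    close(fd);
    *out = buf;
    return (long) ev.res;                   /* bytes read, or negative errno */
}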








Re: ice-broker scan thread

From
"Jeffrey W. Baker"
Date:
On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:

> Anyway, what I did was the following.  When doing a sequential scan, we
> were starting at the beginning of the table and scanning forward.  If I
> threw up some threads to read ahead, then my user thread and my read
> ahead threads would thrash on trying to lock the buffer slots.  So, I
> had the read ahead threads start at some distance into the table, and
> work toward the beginning. 

I believe this is commonly called a synchronous scan.

-jwb


Re: ice-broker scan thread

From
Martijn van Oosterhout
Date:
On Tue, Nov 29, 2005 at 10:28:57AM -0700, David Boreham wrote:
> Actually, after reading up on the current state of things, I'm not sure you
> can even get POSIX aio on top of kernel aio in Linux. There are also a
> few limitations in the 2.6 aio implementation that might prove troublesome:
> for example it only works with O_DIRECT.

Which is bizarre because it's semantically equivalent to having a
separate thread doing the read() and sending you a signal when it's
done. What I'm thinking of testing is a join across two large tables so
there is actually more than one outstanding request at a time. But it's
irritating to have to code to a special api...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: ice-broker scan thread

From
"Qingqing Zhou"
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote
>
> I wrote a program to simulate the sequential scan in PostgreSQL
> with/without ice-broker.
>
> We need more tests
>

If anybody has a test results then I'd love to see it ...

Thanks,
Qingqing




Re: ice-broker scan thread

From
Gavin Sherry
Date:
On Tue, 29 Nov 2005, Andrew Piskorski wrote:

> On Tue, Nov 29, 2005 at 03:14:38PM +1100, Gavin Sherry wrote:
> > On Mon, 28 Nov 2005, David Boreham wrote:
> > > Gavin Sherry wrote:
> > > > MySQL, Oracle and others implement read-ahead threads to simulate async IO
> > >
> > > I always believed that Oracle used async file I/O. Not that I've seen their
>
> > The paper I linked to seemed to suggest that they weren't using async IO
> > in 9.2 -- which is fairly old.
>
>   http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
>   "Getting Priorities Straight: Improving Linux Support for Database I/O"
>   by Hall and Bonnet
>   Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
>
> I think you've misread that paper.  AFAICT it neither says nor even
> suggests that Oracle 9.2 does not use asynchronous I/O on Linux.  In
> fact, it seems to strongly suggest exactly the opposite, that Oracle
> does use async I/O whereever it can.
>
> Note they also reference this document, which as of 2002 and Linux
> kernel 2.4.x, was urging Oracle DBAs to use Oracle's kernel-based
> asynchronous I/O support whenever possible:
>
>   http://www.ixora.com.au/tips/use_asynchronous_io.htm
>
> What Hall and Bonnet's paper DOES say, is that both Oracle and MySQL
> InnoDB appear to use a "conservative" I/O submission policy, but
> Oracle does so more efficiently.  They also argue that both Oracle and
> MySQL fail to utilize the "full potential" of Linux async I/O because
> of their conservative submission policies, and that an "agressive" I/O
> submissions policy would work better, but only if support for
> Prioritized I/O is added to Linux.  They then proceed to add that
> support, and make some basic changes to InnoDB to partially take
> advantage of it.
>
> Also interesting is their casual mention that for RDBMS workloads, the
> default Linux 2.6 disk scheduler "anticipatory" is inferior to the
> "deadline" scheduler.  They base their (simple sounding) Prioritized
> I/O support on the deadline scheduler.
>

Right. I seemed to recall that they configured Oracle to use a kind of
readahead thread, not native async IO, but I am wrong. That's not material
to the discussion at hand.

What we need to find out is if we can easily integrate prefetching into
PostgreSQL for some subset of the work we do, find non-trivial performance
gains and demonstrate it on more than one OS. Ideally, we'd see some
non-trivial gain irrespective of the IO scheduler being used.

Thanks,

Gavin


Re: ice-broker scan thread

From
Gavin Sherry
Date:
On Tue, 29 Nov 2005, David Boreham wrote:

>
> >By default when you use aio you get the version in libc (-lrt IIRC)
> >which has the issue I mentioned, probably because it's probably
> >optimised for the lots-of-network-connections type program where
> >multiple outstanding requests on a single fd are not meaningful. You
> >can however link in some other library which gives you kernel support.
> >However, I don't have a new enough kernel to have the kernel support so
> >I havn't tested that.
> >
> >
> Actually, after reading up on the current state of things, I'm not sure you
> can even get POSIX aio on top of kernel aio in Linux. There are also a
> few limitations in the 2.6 aio implementation that might prove troublesome:
> for example it only works with O_DIRECT.
>
> libaio gives userland access to the kernel aio api (which is different
> from POSIX aio).

Yes. The O_DIRECT issue is my biggest concern about Linux at the moment.
That being said, the plan is to only pre-fetch the next N blocks, where N
< 32, and to read them into the local buffer cache. In a situation where
space in the cache is low (and prefetched pages might be pushed out before we
even get to read them), we need to provide such information to the
readahead mechanism so that it can reduce the number of blocks which it
prefetches.
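
Something like the following sketch is what I have in mind (entirely
hypothetical; the buffer-pressure signal and the names are invented):

#define MAX_PREFETCH 31                 /* never look more than N < 32 blocks ahead */

/*
 * Hypothetical: pick how many blocks to prefetch, shrinking the window
 * when the local buffer cache is under pressure so prefetched pages are
 * not pushed out before the scan reaches them.
 */
static int
prefetch_window(int free_local_buffers)
{
    int     n = MAX_PREFETCH;

    if (free_local_buffers < n)
        n = free_local_buffers / 2;     /* back off under cache pressure */
    return (n > 0) ? n : 0;
}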

Gavin


Re: ice-broker scan thread

From
Simon Riggs
Date:
On Wed, 2005-11-30 at 08:30 +1100, Gavin Sherry wrote:
> On Tue, 29 Nov 2005, David Boreham wrote:
> 
> >
> > >By default when you use aio you get the version in libc (-lrt IIRC)
> > >which has the issue I mentioned, probably because it's probably
> > >optimised for the lots-of-network-connections type program where
> > >multiple outstanding requests on a single fd are not meaningful. You
> > >can however link in some other library which gives you kernel support.
> > >However, I don't have a new enough kernel to have the kernel support so
> > >I havn't tested that.
> > >
> > >
> > Actually, after reading up on the current state of things, I'm not sure you
> > can even get POSIX aio on top of kernel aio in Linux. There are also a
> > few limitations in the 2.6 aio implementation that might prove troublesome:
> > for example it only works with O_DIRECT.
> >
> > libaio gives userland access to the kernel aio api (which is different
> > from POSIX aio).
> 
> Yes. The O_DIRECT issue is my biggest concern about Linux at the moment.
> That being said, the plan is to only pre-fetch the next N blocks, where N
> < 32, and to read them into the local buffer cache. In a situation where
> space in the cache low (and prefetched pages might be pushed out before we
> even get to read them), we need to provide such information to the
> readahead mechanism so that it can reduce the number of blocks which it
> prefetches.

My understanding was that Linux at least has a reasonable readahead
mechanism that works on the scale you suggest. 

I think it's fair to assume that anybody who wants this can afford
sufficient memory to make it worthwhile. Multiple processes per scan
implies either low numbers of users or I/O overkill.

Best Regards, Simon Riggs



Re: ice-broker scan thread

From
Simon Riggs
Date:
On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:
> I've implemented this on another database product

You're scaring me. Is the information you describe in the public domain
or is it intellectual property of any particular company? Are you sure?

We just recovered from one patent scare.

Good to have you around though, if we're covered.

Best Regards, Simon Riggs



Re: ice-broker scan thread

From
"Luke Lonergan"
Date:
Jeff,


On 11/29/05 9:35 AM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:

> On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:
>
>> Anyway, what I did was the following.  When doing a sequential scan, we
>> were starting at the beginning of the table and scanning forward.  If I
>> threw up some threads to read ahead, then my user thread and my read
>> ahead threads would thrash on trying to lock the buffer slots.  So, I
>> had the read ahead threads start at some distance into the table, and
>> work toward the beginning.
>
> I believe this is commonly called a synchronous scan.

I think sync scan refers to the use of a scanner shared among concurrent
queries, where they can join a scan in progress from its current location.

It sounds like the logic could be shared.  Sync scan (as I've described
above) is another important optimization we'd like to see.

- Luke




Re: ice-broker scan thread

From
"Pollard, Mike"
Date:
No, it's all right.  In fact, I believe my boss spoke to Bruce about
this idea in August.  But I have permission to discuss the algorithm.  I
may even be able to get the code, but to be honest, it isn't that much;
it would probably be just as easy for it to be re-written as it would be
to fit it into Postgres.

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.
--------------------------------
Better to remain silent and be thought a fool than to speak out and
remove all doubt.
        Abraham Lincoln

-----Original Message-----
From: Simon Riggs [mailto:simon@2ndquadrant.com]
Sent: Tuesday, November 29, 2005 5:23 PM
To: Pollard, Mike
Cc: Qingqing Zhou; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] ice-broker scan thread

On Tue, 2005-11-29 at 09:45 -0500, Pollard, Mike wrote:
> I've implemented this on another database product

You're scaring me. Is the information you describe in the public domain
or is it intellectual property of any particular company? Are you sure?

We just recovered from one patent scare.

Good to have you around though, if we're covered.

Best Regards, Simon Riggs



Re: ice-broker scan thread

From
David Boreham
Date:
>Yes. The O_DIRECT issue is my biggest concern about Linux at the moment.
>That being said, the plan is to only pre-fetch the next N blocks, where N
>< 32, and to read them into the local buffer cache. In a situation where
>space in the cache low (and prefetched pages might be pushed out before we
>even get to read them), we need to provide such information to the
>readahead mechanism so that it can reduce the number of blocks which it
>prefetches.
>
>
>  
>
Would you open a separate handle with O_DIRECT, just for the prefetch?

My experience with O_DIRECT and databases in the past has not been
great: what you gain from being able to control your own caching you lose
(and more) in other ways.

BTW, has anyone tried O_DIRECT and the prefetch idea on Linux?
I'm wondering if it may not work (because the read data won't get cached
in the fs cache due to O_DIRECT).





Re: ice-broker scan thread

From
David Boreham
Date:
Qingqing Zhou wrote:

>[also with reply to Gavin] look up dictionary for "gut-rot", got it ... Uh, 
>this behavior is intended - I try to push enough requests shortly to kernel 
>so that it understands that I am doing sequential scan, so it would pull the 
>data from disk to file system cache more efficiently. Some file systems may 
>have "free-behind" mechanism, but our main thread (who really process the 
>query) should be fast enough before the data vanished.
>  
>
I guess I was concerned that very large numbers of concurrent operations 
on the same file handle
in flight at the same time might lead to poor performance or even 
instability. e.g. the kernel may
make long linked lists, it might create lock contention with itself, 
that kind of bad stuff. My thinking
being that the kernel wasn't designed for applications that fire off 
10,000 concurrent reads against the same file.
the same file.

>I guess this is also Gavin's point - I understand that will be two different 
>methodologies to handle "read-ahead". If no other thread/process involved, 
>then the main thread will be responsible to grab a free buffer page from 
>bufferpool and ask the kernel to put the data there by sync IO (current 
>PostgreSQL does) or async IOs. And that's what I want to avoid. I'd like to 
>use a dedicated thread/process to "break the ice" only, i.e., pull data from 
>disk to file system cache, so that the main thread will only issue *logical* 
>read.
>  
>
Right, understood. My point was that a thread with sync I/O and the 
query thread with
async I/O are in fact logically identical. They're just two different 
implementation techniques
for the same fundamental functionality. In some cases the non-thread 
implementation might be preferred (for example on a platform with no threads).