Re: hackers-digest V1 #771 (safe/fast I/O) - Mailing list pgsql-hackers
From | Paul A Vixie |
---|---|
Subject | Re: hackers-digest V1 #771 (safe/fast I/O) |
Date | |
Msg-id | 199804122338.QAA03476@wisdom.rc.vix.com |
Responses | Re: [HACKERS] Re: hackers-digest V1 #771 (safe/fast I/O) |
List | pgsql-hackers |
mmap() is cool since it avoids copying data between kernel and user address spaces. However, mmap() is going to be either synchronous ("won't return 'til it has set up the page table stuff and maybe allocated backing store") or not ("will return immediately but your process will silently block if you try to access the address range before the back office work is done for the region"). There is no callback facility and no way to poll for region readiness.

aio_*() is cool since you can queue a read or write and then either get a callback when it's complete or poll it. However, there's no way to allocate the backing store before you start scribbling, so there is always a copy on aio_write(). And there's no page flipping in aio_read()'s definition, so unless you allocate your read buffers on page boundaries and unless your kernel is really smart, you're always going to see a copy in aio_read().

O_ASYNC and select() are only useful for externally synchronized I/O like TTY and network. select() always returns both readable and writable for either files in a file system or for block or character special disk files.

As far as I know, other than on the MASSCOMP (which more or less did what VMS did and what Win/NT now does in this area), no UNIX system, especially including POSIX.1B systems, has quite what's wanted for high performance transactional I/O.

True asynchrony means having the ability to choose when to block, and to parallelize computation with I/O, and to get more total work done per unit time by doing application level seek ordering and write buffering (avoiding excess mechanical movement).

In the last I/O intensive system I helped build here, we decided that mmap(), even with its periodic time losses, gave us better total throughput due to the lack of copy overhead. It helps if you both mmap things with a lot of regionality, and access them with high locality of reference. But it was the savings of memory bus bandwidth that bought us the most.
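Two of the points above are easy to try out. The following is only a minimal sketch, not code from the system described: the function names `sum_file_mmap` and `file_readiness` are made up for illustration. The first reads a file through a read-only mapping, so no read() call copies data into a user buffer, and any blocking happens silently at page-fault time. The second asks select() about a regular file with a zero timeout, which on a POSIX system should report it as both readable and writable regardless of what the disk is doing.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/select.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sum every byte of a file through a mapping.
 * The loop touches the mapped pages directly; there is no user-space
 * copy, but each first touch of a page may block on a page fault.
 * Returns -1 on error (an empty file is treated as an error here). */
long sum_file_mmap(const char *path)
{
    struct stat st;
    long sum = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];
    munmap(p, (size_t)st.st_size);
    return sum;
}

/* Hypothetical helper: poll a regular file with select() and a zero
 * timeout. Returns a bitmask (1 = readable, 2 = writable), -1 on error. */
int file_readiness(const char *path)
{
    fd_set rfds, wfds;
    struct timeval zero = {0, 0};
    int mask = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    assert(fd < FD_SETSIZE);        /* FD_SET is undefined beyond this */
    FD_ZERO(&rfds);
    FD_ZERO(&wfds);
    FD_SET(fd, &rfds);
    FD_SET(fd, &wfds);
    if (select(fd + 1, &rfds, &wfds, NULL, &zero) < 0) {
        close(fd);
        return -1;
    }
    if (FD_ISSET(fd, &rfds))
        mask |= 1;
    if (FD_ISSET(fd, &wfds))
        mask |= 2;
    close(fd);
    return mask;
}
```

Calling `file_readiness()` on an ordinary disk file is expected to give both bits set even though the descriptor was opened read-only, which is why select() tells you nothing about whether a disk read would actually block.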
#ifndef BUFFER_H
#define BUFFER_H

#include <stdio.h>
#include "misc.h"

#define BUF_SIZE 4096

typedef struct buffer {
	void *	opaque;
} buffer;

typedef enum bufprot {
	buf_ro, buf_rw
	/* Note that there is no buf_wo since RMW is the processor standard. */
} bufprot;

int		buf_init(int nmax, int grow);
int		buf_shutdown(FILE *);
int		buf_get(buffer *);
int		buf_mget(buffer *, int, off_t, bufprot);
int		buf_refcount(buffer);
void		buf_ref(buffer);
void		buf_unref(buffer);
void		buf_clear(buffer);
void		buf_add(buffer, size_t);
void		buf_sub(buffer, size_t);
void		buf_shift(buffer, size_t);
size_t		buf_used(buffer);
size_t		buf_avail(buffer);
void *		buf_used_ptr(buffer);
void *		buf_avail_ptr(buffer);
struct iovec	buf_used_iov(buffer);
struct iovec	buf_avail_iov(buffer);
region		buf_used_reg(buffer);
region		buf_avail_reg(buffer);
int		buf_printf(buffer, const char *, ...);

#endif /* !BUFFER_H */
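The header only declares the interface, so the exact semantics of the used/avail calls are not stated. As one possible reading, here is a toy sketch of a buffer with a used prefix and an available tail. Everything in it is an assumption: the `toy_` names, the flat in-struct storage, and the guesses that buf_add() marks freshly written bytes as used and that buf_shift() drops consumed bytes from the front. It is not the implementation behind buffer.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUF_SIZE 4096   /* mirrors BUF_SIZE in the header */

/* Hypothetical stand-in for the opaque buffer: a fixed block with a
 * used prefix [0, used) and an available tail [used, TOY_BUF_SIZE). */
typedef struct toy_buf {
    unsigned char data[TOY_BUF_SIZE];
    size_t used;
} toy_buf;

size_t toy_used(const toy_buf *b)  { return b->used; }
size_t toy_avail(const toy_buf *b) { return TOY_BUF_SIZE - b->used; }
void  *toy_used_ptr(toy_buf *b)    { return b->data; }
void  *toy_avail_ptr(toy_buf *b)   { return b->data + b->used; }

/* Guess at buf_add(): the caller has already written n bytes at
 * toy_avail_ptr(), so grow the used region to cover them. */
void toy_add(toy_buf *b, size_t n)
{
    assert(n <= toy_avail(b));
    b->used += n;
}

/* Guess at buf_shift(): discard n consumed bytes from the front and
 * slide the remainder down so the used region starts at offset 0. */
void toy_shift(toy_buf *b, size_t n)
{
    assert(n <= b->used);
    memmove(b->data, b->data + n, b->used - n);
    b->used -= n;
}
```

Under this reading, a reader loop would pass `toy_avail_ptr()` and `toy_avail()` to read(), call `toy_add()` with the byte count returned, and call `toy_shift()` once a parser has consumed data from `toy_used_ptr()`.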