Re: hackers-digest V1 #771 (safe/fast I/O) - Mailing list pgsql-hackers

From Paul A Vixie
Subject Re: hackers-digest V1 #771 (safe/fast I/O)
Msg-id 199804122338.QAA03476@wisdom.rc.vix.com
Responses Re: [HACKERS] Re: hackers-digest V1 #771 (safe/fast I/O)
List pgsql-hackers
mmap() is cool since it avoids copying data between kernel and user address
spaces.  However, mmap() is going to be either synchronous ("won't return 'til
it has set up the page table stuff and maybe allocated backing store") or not
("will return immediately but your process will silently block if you try to
access the address range before the back office work is done for the region").
There is no callback facility and no way to poll for region readiness.
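As a minimal sketch of the zero-copy access described above, assuming a POSIX system: the helper name sum_file_mmap is made up for illustration and is not from any existing code base. The bytes are read straight out of the mapping with no read() into a user buffer, and the first touch of each page is where the process may silently block.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sum the bytes of a file through a read-only
 * mapping.  Returns 0 on success, -1 on error or empty file. */
int sum_file_mmap(const char *path, unsigned long *sum)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;

    unsigned long total = 0;
    for (off_t i = 0; i < st.st_size; i++)
        total += p[i];          /* a page fault here can block the process */

    munmap(p, (size_t)st.st_size);
    *sum = total;
    return 0;
}
```

Nothing in this interface tells the caller whether a given page is already resident, which is the limitation noted above.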

aio_*() is cool since you can queue a read or write and then either get a
callback when it's complete or poll it.  However, there's no way to allocate
the backing store before you start scribbling, so there is always a copy on
aio_write().  And there's no page flipping in aio_read()'s definition, so
unless you allocate your read buffers on page boundaries and unless your
kernel is really smart, you're always going to see a copy in aio_read().
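For the polling style mentioned above, here is a rough sketch using the POSIX aio_read(), aio_error() and aio_return() calls. The helper name read_async_polled is hypothetical, the caller supplies the buffer, and the spin loop only marks the place where real computation would overlap the I/O. On some systems this needs to be linked with -lrt.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: queue one read with POSIX AIO and poll until it
 * finishes.  Returns the number of bytes read, or -1 on error. */
ssize_t read_async_polled(const char *path, void *buf, size_t len, off_t off)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no callback, we poll */

    if (aio_read(&cb) < 0) {
        close(fd);
        return -1;
    }

    /* The request is queued; a real program would do useful computation
     * here instead of spinning. */
    while (aio_error(&cb) == EINPROGRESS)
        ;

    int err = aio_error(&cb);       /* 0 on success, else an errno value */
    ssize_t n = aio_return(&cb);    /* must be called exactly once */
    close(fd);

    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;
}
```

The data still lands in the caller's buffer by way of a copy unless the kernel can do something smarter, which is the complaint above.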

O_ASYNC and select() are only useful for externally synchronized I/O like
TTY and network.  select() always reports both readable and writable for
files in a file system and for block or character special disk files.
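The select() behavior is easy to check. In this small sketch (the helper name poll_file_readiness is made up for illustration), a regular file opened read/write is passed to select() with a zero timeout, and the result is reported as a bit mask.

```c
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: ask select() about a regular file.
 * Returns a mask (1 = readable, 2 = writable), or -1 on error. */
int poll_file_readiness(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fd >= FD_SETSIZE) {         /* fd_set cannot hold this descriptor */
        close(fd);
        return -1;
    }

    fd_set rfds, wfds;
    FD_ZERO(&rfds);
    FD_ZERO(&wfds);
    FD_SET(fd, &rfds);
    FD_SET(fd, &wfds);

    struct timeval tv = { 0, 0 };   /* do not wait at all */
    int n = select(fd + 1, &rfds, &wfds, NULL, &tv);

    int mask = -1;
    if (n >= 0) {
        mask = 0;
        if (FD_ISSET(fd, &rfds))
            mask |= 1;
        if (FD_ISSET(fd, &wfds))
            mask |= 2;
    }
    close(fd);
    return mask;
}
```

For a regular file the mask comes back with both bits set regardless of whether the data is in memory, so it says nothing about whether a subsequent read() will wait on the disk.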

As far as I know, other than on the MASSCOMP (which more or less did what
VMS did and what Win/NT now does in this area), no UNIX system, especially
including POSIX.1b systems, has quite what's wanted for high-performance
transactional I/O.

True asynchrony means having the ability to choose when to block, and to
parallelize computation with I/O, and to get more total work done per unit time
by doing application-level seek ordering and write buffering (avoiding excess
mechanical movement).  In the last I/O intensive system I helped build here,
we decided that mmap(), even with its periodic time losses, gave us better
total throughput due to the lack of copy overhead.  It helps if you both mmap
things with a lot of regionality, and access them with high locality of
reference.  But it was the savings of memory bus bandwidth that bought us
the most.
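
Application-level seek ordering can be as simple as sorting the pending requests by file offset before issuing them. This is an illustrative sketch only, not the code from the system mentioned above; the io_req type and order_by_offset function are invented for the example.

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical pending-request record. */
typedef struct io_req {
    off_t  offset;      /* where in the file the transfer starts */
    size_t length;      /* how many bytes to transfer */
} io_req;

/* qsort() comparator: ascending file offset. */
static int cmp_offset(const void *a, const void *b)
{
    off_t x = ((const io_req *)a)->offset;
    off_t y = ((const io_req *)b)->offset;
    return (x > y) - (x < y);
}

/* Reorder a batch of requests so they can be issued in one sweep
 * across the file instead of seeking back and forth. */
void order_by_offset(io_req *reqs, size_t n)
{
    qsort(reqs, n, sizeof reqs[0], cmp_offset);
}
```

A batch queued as offsets 8192, 0, 4096 would then be issued as 0, 4096, 8192.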

#ifndef BUFFER_H
#define BUFFER_H

#include <stdio.h>
#include <sys/types.h>    /* off_t, size_t */
#include <sys/uio.h>    /* struct iovec */
#include "misc.h"

#define    BUF_SIZE        4096

typedef struct buffer {
    void *            opaque;
} buffer;

typedef enum bufprot {
    buf_ro,
    buf_rw
    /* Note that there is no buf_wo since RMW is the processor standard. */
} bufprot;

int        buf_init(int nmax, int grow);
int        buf_shutdown(FILE *);
int        buf_get(buffer *);
int        buf_mget(buffer *, int, off_t, bufprot);
int        buf_refcount(buffer);
void        buf_ref(buffer);
void        buf_unref(buffer);
void        buf_clear(buffer);
void        buf_add(buffer, size_t);
void        buf_sub(buffer, size_t);
void        buf_shift(buffer, size_t);
size_t        buf_used(buffer);
size_t        buf_avail(buffer);
void *        buf_used_ptr(buffer);
void *        buf_avail_ptr(buffer);
struct iovec    buf_used_iov(buffer);
struct iovec    buf_avail_iov(buffer);
region        buf_used_reg(buffer);
region        buf_avail_reg(buffer);
int        buf_printf(buffer, const char *, ...);

#endif /* !BUFFER_H */
