Thread: Sorting writes during checkpoint
Here is a patch for the TODO item "Consider sorting writes during checkpoint". It writes dirty buffers in block-number order during checkpoint so that the buffers are written sequentially. I proposed the patch before, but it was rejected because the 8.3 feature set had already been frozen at that time.
http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

I rewrote it to apply cleanly against current HEAD, but the concept has not changed at all: memorize a pair of (buf_id, BufferTag) for each dirty buffer in a palloc'd array at the start of the checkpoint, sort the array in BufferTag order, and write the buffers in that order.

There is a 10% performance win in pgbench on my machine with RAID-0 disks. There could be a larger benefit on RAID-5 disks, because random writes are slower than sequential writes there.

[HEAD]
tps = 1134.233955 (excluding connections establishing)

[HEAD with patch]
tps = 1267.446249 (excluding connections establishing)

[pgbench]
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: simple
number of clients: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000

[hardware]
2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)

[postgresql.conf]
shared_buffers = 2GB
wal_buffers = 4MB
checkpoint_segments = 64
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
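For readers who just want the shape of the idea without reading the attached patch, here is a minimal, self-contained sketch of "collect, sort, then write". The struct below is a simplified stand-in for the backend's BufferTag, and all values are invented for illustration; it is not the patch itself.

/*
 * Illustration only: sort the checkpoint's dirty-buffer list by
 * (relation file, block number) before writing, so the I/O is issued
 * sequentially instead of in buffer-pool order.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int         buf_id;         /* index into the shared buffer pool */
    unsigned    relfilenode;    /* which relation file (simplified) */
    unsigned    blocknum;       /* block within that file */
} BufAndTag;

static int
tag_cmp(const void *a, const void *b)
{
    const BufAndTag *lhs = a;
    const BufAndTag *rhs = b;

    if (lhs->relfilenode != rhs->relfilenode)
        return (lhs->relfilenode < rhs->relfilenode) ? -1 : 1;
    if (lhs->blocknum != rhs->blocknum)
        return (lhs->blocknum < rhs->blocknum) ? -1 : 1;
    return 0;
}

int
main(void)
{
    /* pretend these are the dirty buffers found at checkpoint start */
    BufAndTag   dirty[] = {
        {7, 16384, 90}, {2, 16390, 3}, {5, 16384, 12}, {1, 16384, 11}
    };
    int         n = sizeof(dirty) / sizeof(dirty[0]);
    int         i;

    qsort(dirty, n, sizeof(BufAndTag), tag_cmp);

    /* writes would now be issued in file/block order */
    for (i = 0; i < n; i++)
        printf("write buf %d: rel %u block %u\n",
               dirty[i].buf_id, dirty[i].relfilenode, dirty[i].blocknum);
    return 0;
}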
On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote:

> 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)

What is the disk controller in this system?  I'm specifically curious about what write cache was involved, so I can get a better feel for the hardware your results came from.  I'm busy rebuilding my performance testing systems right now; once that's done I can review this on a few platforms.

One thing that jumped out at me just reading the code is this happening inside BufferSync:

    buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));

If shared_buffers(=NBuffers) is set to something big, this could give some memory churn.  And I think it's a bad idea to allocate something this large at checkpoint time, because what happens if that fails?  Really not the time you want to discover there's no RAM left.  Since you're always going to need this much memory for the system to operate, and the current model has the system running a checkpoint >50% of the time, the only thing that makes sense to me is to allocate it once at server start time and be done with it.  That should improve performance over the original patch as well.

BufAndTag is a relatively small structure (5 ints).  Let's call it 40 bytes; even that's only a 0.5% overhead relative to the shared buffer allocation.  If we can speed checkpoints significantly with that much overhead it sounds like a good tradeoff to me.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> wrote:

> On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote:
>
> > 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)
>
> What is the disk controller in this system?  I'm specifically curious
> about what write cache was involved, so I can get a better feel for the
> hardware your results came from.

I used an HP ProLiant DL380 G5 with a Smart Array P400 controller with 256MB of cache
(http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/15351-15351-3328412-241644-241475-1121516.html)
and ext3fs on LVM on CentOS 5.1 (Linux version 2.6.18-53.el5).
The dirty region of the database was probably larger than the disk controller's cache.

> buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));
>
> If shared_buffers(=NBuffers) is set to something big, this could give some
> memory churn.  And I think it's a bad idea to allocate something this
> large at checkpoint time, because what happens if that fails?  Really not
> the time you want to discover there's no RAM left.

Hmm, but I think we need to copy buffer tags into bgwriter's local memory in order to avoid locking tags many times in the sorting.  Is it better to allocate the sorting buffers the first time and keep and reuse them from then on?

> BufAndTag is a relatively small structure (5 ints).  Let's call it 40
> bytes; even that's only a 0.5% overhead relative to the shared buffer
> allocation.  If we can speed checkpoints significantly with that much
> overhead it sounds like a good tradeoff to me.

I think sizeof(BufAndTag) is 20 bytes, because sizeof(int) is 4 on typical platforms (and if it is not, I should rewrite the patch so that it always is).  It is 0.25% of shared buffers; when shared_buffers is set to 10GB, it takes 25MB of process-local memory.

If we want to consume less memory for it, RelFileNode in BufferTag could be hashed and packed into an integer; the blockNum order is important for this purpose, but the RelFileNode order is not.  That would reduce the overhead to 12 bytes per page (0.15%).  Is it worth doing?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
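The arithmetic behind those figures, as a quick self-contained check. It assumes the 20-byte entry size and the default 8 kB block size mentioned above; with those inputs it comes out to roughly 25 MB and 0.24%, which is where the "25MB / 0.25%" numbers come from.

/* Back-of-the-envelope check of the overhead estimate above. */
#include <stdio.h>

int
main(void)
{
    const double block_size = 8192.0;                         /* bytes per buffer */
    const double shared_buffers = 10.0 * 1024 * 1024 * 1024;  /* 10 GB */
    const double entry_size = 20.0;                           /* bytes per BufAndTag */

    double nbuffers = shared_buffers / block_size;            /* ~1.3 million */
    double array_bytes = nbuffers * entry_size;               /* ~25 MB */

    printf("buffers: %.0f, array: %.1f MB (%.2f%% of shared_buffers)\n",
           nbuffers, array_bytes / (1024 * 1024),
           100.0 * array_bytes / shared_buffers);
    return 0;
}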
On Wed, 16 Apr 2008, ITAGAKI Takahiro wrote:

> The dirty region of the database was probably larger than the disk
> controller's cache.

Might be worthwhile to run with log_checkpoints on and collect some statistics there next time you're running these tests.  It's a good habit to get other testers into regardless; it's nice to be able to say something like "during the 15 checkpoints encountered during this test, the largest dirty area was 516MB while the median was 175MB".

> Hmm, but I think we need to copy buffer tags into bgwriter's local memory
> in order to avoid locking tags many times in the sorting.  Is it better to
> allocate the sorting buffers the first time and keep and reuse them from
> then on?

That's what I was thinking: allocate the memory when the background writer starts and just always have it there; the allocation you're doing is always the same size.  If it's in use 50% of the time anyway (which it is if you have checkpoint_completion_target at its default), why introduce the risk that an allocation will fail at checkpoint time?  Just allocate it once and keep it around.

> It is 0.25% of shared buffers; when shared_buffers is set to 10GB,
> it takes 25MB of process-local memory.

Your numbers are probably closer to correct.  I was being pessimistic about the size of all the integers just to demonstrate that it's not really a significant amount of memory even if they're large.

> If we want to consume less memory for it, RelFileNode in BufferTag could
> be hashed and packed into an integer

I personally don't feel it's worth making the code any more complicated than it needs to be just to save a fraction of a percent of the total memory used by the buffer pool.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Greg Smith <gsmith@gregsmith.com> wrote:
>> If shared_buffers(=NBuffers) is set to something big, this could give some
>> memory churn.  And I think it's a bad idea to allocate something this
>> large at checkpoint time, because what happens if that fails?  Really not
>> the time you want to discover there's no RAM left.

> Hmm, but I think we need to copy buffer tags into bgwriter's local memory
> in order to avoid locking tags many times in the sorting.

I updated this patch to permanently allocate the working array as Greg suggests, and to fix a bunch of commenting issues (attached).

However, I am completely unable to measure any performance improvement from it.  Given the possible risk of out-of-memory failures, I think the patch should not be applied without some direct proof of performance benefits, and I don't see any.

			regards, tom lane

Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.228
diff -c -r1.228 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	1 Jan 2008 19:45:51 -0000	1.228
--- src/backend/storage/buffer/bufmgr.c	4 May 2008 01:11:08 -0000
***************
*** 56,61 ****
--- 56,68 ----
  #define BUF_WRITTEN             0x01
  #define BUF_REUSABLE            0x02
  
+ /* Struct for BufferSync's internal to-do list */
+ typedef struct BufAndTag
+ {
+     int         buf_id;
+     BufferTag   tag;
+ } BufAndTag;
+ 
  /* GUC variables */
  bool        zero_damaged_pages = false;
  
***************
*** 986,991 ****
--- 993,1025 ----
  }
  
  /*
+  * qsort comparator for BufferSync
+  */
+ static int
+ bufandtagcmp(const void *a, const void *b)
+ {
+     const BufAndTag *lhs = (const BufAndTag *) a;
+     const BufAndTag *rhs = (const BufAndTag *) b;
+     int         r;
+ 
+     /*
+      * We don't much care about the order in which different relations get
+      * written, so memcmp is enough for comparing the relfilenodes,
+      * even though its behavior will be platform-dependent.
+      */
+     r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+     if (r != 0)
+         return r;
+ 
+     /* We do want blocks within a relation to be ordered accurately */
+     if (lhs->tag.blockNum < rhs->tag.blockNum)
+         return -1;
+     if (lhs->tag.blockNum > rhs->tag.blockNum)
+         return 1;
+     return 0;
+ }
+ 
+ /*
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
***************
*** 995,1004 ****
  static void
  BufferSync(int flags)
  {
      int         buf_id;
-     int         num_to_scan;
      int         num_to_write;
      int         num_written;
  
      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1029,1056 ----
  static void
  BufferSync(int flags)
  {
+     static BufAndTag *bufs_to_write = NULL;
      int         buf_id;
      int         num_to_write;
      int         num_written;
+     int         i;
+ 
+     /*
+      * We allocate the bufs_to_write[] array on first call and keep it
+      * around for the life of the process.  This is okay because in normal
+      * operation this function is only called within the bgwriter, so
+      * we won't have lots of large arrays floating around.  We prefer this
+      * way because we don't want checkpoints to suddenly start failing
+      * when the system gets under memory pressure.
+      */
+     if (bufs_to_write == NULL)
+     {
+         bufs_to_write = (BufAndTag *) malloc(NBuffers * sizeof(BufAndTag));
+         if (bufs_to_write == NULL)
+             ereport(FATAL,
+                     (errcode(ERRCODE_OUT_OF_MEMORY),
+                      errmsg("out of memory")));
+     }
  
      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
***************
*** 1033,1038 ****
--- 1085,1092 ----
          if (bufHdr->flags & BM_DIRTY)
          {
              bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+             bufs_to_write[num_to_write].buf_id = buf_id;
+             bufs_to_write[num_to_write].tag = bufHdr->tag;
              num_to_write++;
          }
  
***************
*** 1043,1061 ****
          return;                 /* nothing to do */
  
      /*
!      * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
!      * since we might as well dump soon-to-be-recycled buffers first.
!      *
!      * Note that we don't read the buffer alloc count here --- that should be
!      * left untouched till the next BgBufferSync() call.
       */
-     buf_id = StrategySyncStart(NULL, NULL);
-     num_to_scan = NBuffers;
      num_written = 0;
!     while (num_to_scan-- > 0)
      {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
  
          /*
           * We don't need to acquire the lock here, because we're only looking
--- 1097,1120 ----
          return;                 /* nothing to do */
  
      /*
!      * Sort the buffers-to-be-written into order by file and block number.
!      * This improves sequentiality of access for the upcoming I/O.
!      */
!     qsort(bufs_to_write, num_to_write, sizeof(BufAndTag), bufandtagcmp);
! 
!     /*
!      * Loop over all buffers to be written, and write the ones (still) marked
!      * with BM_CHECKPOINT_NEEDED.  Note that we don't need to recheck the
!      * buffer tag, because if the buffer has been reassigned it cannot have
!      * BM_CHECKPOINT_NEEDED still set.
       */
      num_written = 0;
!     for (i = 0; i < num_to_write; i++)
      {
!         volatile BufferDesc *bufHdr;
! 
!         buf_id = bufs_to_write[i].buf_id;
!         bufHdr = &BufferDescriptors[buf_id];
  
          /*
           * We don't need to acquire the lock here, because we're only looking
***************
*** 1077,1096 ****
              num_written++;
  
              /*
-              * We know there are at most num_to_write buffers with
-              * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
-              * num_written reaches num_to_write.
-              *
-              * Note that num_written doesn't include buffers written by
-              * other backends, or by the bgwriter cleaning scan.  That
-              * means that the estimate of how much progress we've made is
-              * conservative, and also that this test will often fail to
-              * trigger.  But it seems worth making anyway.
-              */
-             if (num_written >= num_to_write)
-                 break;
- 
-             /*
               * Perform normal bgwriter duties and sleep to throttle our
               * I/O rate.
               */
--- 1136,1141 ----
***************
*** 1098,1110 ****
                                   (double) num_written / num_to_write);
              }
          }
- 
-         if (++buf_id >= NBuffers)
-             buf_id = 0;
      }
  
      /*
!      * Update checkpoint statistics.  As noted above, this doesn't include
       * buffers written by other backends or bgwriter scan.
       */
      CheckpointStats.ckpt_bufs_written += num_written;
--- 1143,1152 ----
                                   (double) num_written / num_to_write);
              }
          }
      }
  
      /*
!      * Update checkpoint statistics.  The num_written count doesn't include
       * buffers written by other backends or bgwriter scan.
       */
      CheckpointStats.ckpt_bufs_written += num_written;
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.64
diff -c -r1.64 freelist.c
*** src/backend/storage/buffer/freelist.c	1 Jan 2008 19:45:51 -0000	1.64
--- src/backend/storage/buffer/freelist.c	4 May 2008 01:11:08 -0000
***************
*** 241,250 ****
  }
  
  /*
!  * StrategySyncStart -- tell BufferSync where to start syncing
   *
!  * The result is the buffer index of the best buffer to sync first.
!  * BufferSync() will proceed circularly around the buffer array from there.
   *
   * In addition, we return the completed-pass count (which is effectively
   * the higher-order bits of nextVictimBuffer) and the count of recent buffer
--- 241,251 ----
  }
  
  /*
!  * StrategySyncStart -- tell BgBufferSync where we are reclaiming buffers
   *
!  * The result is the buffer index of the next possible victim buffer.
!  * BgBufferSync() tries to keep the buffers immediately in front of this
!  * point clean.
   *
   * In addition, we return the completed-pass count (which is effectively
   * the higher-order bits of nextVictimBuffer) and the count of recent buffer
On Sun, 4 May 2008, Tom Lane wrote:

> However, I am completely unable to measure any performance improvement
> from it.  Given the possible risk of out-of-memory failures, I think the
> patch should not be applied without some direct proof of performance
> benefits, and I don't see any.

Fair enough.  There were some pgbench results attached to the original patch submission that gave me a good idea how to replicate the situation where there's some improvement.  I expect I can take a shot at quantifying that independently near the end of this month if nobody else gets to it before then (I'm stuck sorting out a number of OS-level issues right now before my testing system is online again).  I was planning to take a longer look at Greg Stark's prefetching work at that point as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes:
> On Sun, 4 May 2008, Tom Lane wrote:
>> However, I am completely unable to measure any performance improvement
>> from it.  Given the possible risk of out-of-memory failures, I think the
>> patch should not be applied without some direct proof of performance
>> benefits, and I don't see any.

> Fair enough.  There were some pgbench results attached to the original
> patch submission that gave me a good idea how to replicate the situation
> where there's some improvement.

Well, I tried a pgbench test similar to that one --- on smaller hardware than was reported, so it was a bit smaller test case, but it should have given similar results.  I didn't see any improvement; if anything it was a bit worse.  So that's what got me concerned.  Of course it's notoriously hard to get consistent numbers out of pgbench anyway, so I'd rather see some other test case ...

> I expect I can take a shot at quantifying that independently near the end
> of this month if nobody else gets to it before then (I'm stuck sorting out
> a number of OS-level issues right now before my testing system is online
> again).  I was planning to take a longer look at Greg Stark's prefetching
> work at that point as well.

Fair enough.  Unless someone can volunteer to test sooner, I think we should drop this item from the current commitfest queue.

			regards, tom lane
On Sun, 4 May 2008, Tom Lane wrote:

> Well, I tried a pgbench test similar to that one --- on smaller hardware
> than was reported, so it was a bit smaller test case, but it should have
> given similar results.

My pet theory on cases where sorting will help suggests you may need a write-caching controller for this patch to be useful.  I expect we'll see the biggest improvement in situations where the total amount of dirty buffers is larger than the write cache and the cache becomes blocked.  If you're not offloading to another device like that, the OS-level elevator sorting will handle sorting for you close enough to optimally that I doubt this will help much (and in fact may just get in the way).

> Of course it's notoriously hard to get consistent numbers out of pgbench
> anyway, so I'd rather see some other test case ...

I have some tools to run pgbench many times and look for patterns in the results, which works fairly well for the consistency part.  pgbench will dirty a very high percentage of the buffer cache by checkpoint time relative to how much work it does, which makes it close to a best case for confirming there is a potential improvement here.  I think a reasonable approach is to continue trying to quantify some improvement using pgbench with an eye toward also doing DBT2 tests, which provoke similar behavior at checkpoint time.  I suspect someone who already has a known good DBT2 lab setup with caching controller hardware (EDB?) might be able to do a useful test of this patch without too much trouble on their part.

> Unless someone can volunteer to test sooner, I think we should drop this
> item from the current commitfest queue.

This patch took a good step forward toward being committed this round with your review, which is the important part from my perspective (as someone who would like this to be committed if it truly works).  I expect that performance-related patches will often take more than one commitfest to pass through.

From the perspective of keeping the committer's plates clean, a reasonable system for this situation might be for you to bounce this into the rejected pile as "Returned for testing" immediately, to clearly remove it from the main queue.  A reasonable expectation there is that you might consider it again during May if someone gets back with said testing results before the 'fest ends.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes:
> On Sun, 4 May 2008, Tom Lane wrote:
>> Well, I tried a pgbench test similar to that one --- on smaller hardware
>> than was reported, so it was a bit smaller test case, but it should have
>> given similar results.

> ... If you're not offloading to another device like that, the OS-level
> elevator sorting will handle sorting for you close enough to optimally
> that I doubt this will help much (and in fact may just get in the way).

Yeah.  It bothers me a bit that the patch forces writes to be done "all of file A in order, then all of file B in order, etc".  We don't know enough about the disk layout of the files to be sure that that's good.  (This might also mean that whether there is a win is going to be platform and filesystem dependent ...)

>> Unless someone can volunteer to test sooner, I think we should drop this
>> item from the current commitfest queue.

> From the perspective of keeping the committer's plates clean, a reasonable
> system for this situation might be for you to bounce this into the
> rejected pile as "Returned for testing" immediately, to clearly remove it
> from the main queue.  A reasonable expectation there is that you might
> consider it again during May if someone gets back with said testing
> results before the 'fest ends.

Right, that's in the ground rules for commitfests: if the submitter can respond to complaints before the fest is over, we'll reconsider the patch.

			regards, tom lane
On Mon, 5 May 2008, Tom Lane wrote:

> It bothers me a bit that the patch forces writes to be done "all of file
> A in order, then all of file B in order, etc".  We don't know enough
> about the disk layout of the files to be sure that that's good.  (This
> might also mean that whether there is a win is going to be platform and
> filesystem dependent ...)

I think most platform and filesystem implementations have disk location correlated closely enough with block order that this particular issue isn't a large one.  If the writes are mainly going to one logical area (a single partition or disk array), it should be a win as long as the sorting step itself isn't introducing a delay.

I am concerned that in a more complicated case than pgbench, where the writes are spread across multiple arrays say, forcing writes in order may slow things down.  Example: let's say there are two tablespaces mapped to two arrays, A and B, that the data is being written to at checkpoint time.  In the current case, that I/O might be AABAABABBBAB, which is going to keep both arrays busy writing.  The sorted case would instead make that AAAAAABBBBBB, so only one array will be active at a time.  It may very well be the case that the improvement from lowering seeks on the writes to A and B is less than the loss coming from not keeping both continuously busy.  I think I can simulate this by using a modified pgbench script that works against an accounts1 and accounts2 with equal frequency, where 1 & 2 are actually on different tablespaces on two disks.

> Right, that's in the ground rules for commitfests: if the submitter can
> respond to complaints before the fest is over, we'll reconsider the
> patch.

The small optimization I was trying to suggest was that you just bounce this type of patch automatically to the "rejected for <x>" section of the commitfest wiki page in cases like these.  The standard practice on this sort of queue is to automatically reclassify when someone has made a pass over the patch, leaving the original source to re-open with more information.  That keeps the unprocessed part of the queue always shrinking, and as long as people know that they can get it reconsidered by submitting new results it's not unfair to them.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mon, 2008-05-05 at 00:23 -0400, Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
> > On Sun, 4 May 2008, Tom Lane wrote:
> >> Well, I tried a pgbench test similar to that one --- on smaller hardware
> >> than was reported, so it was a bit smaller test case, but it should have
> >> given similar results.
>
> > ... If
> > you're not offloading to another device like that, the OS-level elevator
> > sorting will handle sorting for you close enough to optimally that I doubt
> > this will help much (and in fact may just get in the way).
>
> Yeah.  It bothers me a bit that the patch forces writes to be done "all
> of file A in order, then all of file B in order, etc".  We don't know
> enough about the disk layout of the files to be sure that that's good.
> (This might also mean that whether there is a win is going to be
> platform and filesystem dependent ...)

No action on this seen since last commitfest, but I think we should do something with it, rather than just ignore it.

I agree with all the comments myself, so my proposed solution is to implement this as an I/O elevator hook.  The standard elevator would issue the writes in the order they come; an additional elevator in contrib would issue them file/block sorted.  That will make testing easier and will also give Itagaki his benefit, while allowing ongoing research.  If this solution is good enough for Linux, it ought to be good enough for us.

Note that if we do this for checkpoints, we should also do this for FlushRelationBuffers(), used during heap_sync(), for exactly the same reasons.  I would suggest calling it bulk_io_hook() or similar.

A further observation is that if there is an effect, it would be at the block-device level, i.e. per tablespace.  Sorting the writes so that we issue one tablespace at a time might at least help the I/O elevators/disk caches to work with the whole problem at once.  We might get a benefit on one tablespace but not on another.  Sorting by file might have inadvertently shown a benefit at the tablespace level on a larger server with spread-out data, whereas on Tom's test system I would guess just a single tablespace was used.

Anyway, I note that we don't have an easy way of sorting by tablespace, but I'm sure it would be possible to look up the tablespace for a file within a plugin.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
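To make the proposal a little more concrete, here is a rough sketch of what such a hook could look like. No bulk_io_hook exists in PostgreSQL, so the type, variable, and function names below are all hypothetical; this only illustrates the "default elevator does nothing, a contrib module can reorder" shape of the idea.

/*
 * Hypothetical sketch only -- nothing like this exists in the tree.
 * Core code would collect the checkpoint's to-do list and hand it to the
 * hook; the default is to leave the order untouched, while a loadable
 * module could install a replacement that sorts by relfilenode/block,
 * groups by tablespace, and so on.
 */
#include <stddef.h>

typedef struct BufAndTag BufAndTag;     /* as defined in the patch above */

typedef void (*bulk_io_hook_type) (BufAndTag *buffers, int nbuffers);

bulk_io_hook_type bulk_io_hook = NULL;  /* set by a preloaded module */

void
reorder_bulk_writes(BufAndTag *buffers, int nbuffers)
{
    if (bulk_io_hook != NULL)
        bulk_io_hook(buffers, nbuffers);
    /* otherwise the writes are issued in the order they were collected */
}

In this sketch, BufferSync() and FlushRelationBuffers() would both call reorder_bulk_writes() just before their write loops, which is what would let one plugin cover both the checkpoint and heap_sync() cases mentioned above.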
Simon Riggs <simon@2ndquadrant.com> writes:
> Anyway, I note that we don't have an easy way of sorting by tablespace,

Say what?  tablespace is the first component of relfilenode.

> but I'm sure it would be possible to look up the tablespace for a file
> within a plugin.

If the information weren't readily available from relfilenode, it would *not* be possible for a bufmgr plugin to look it up.  bufmgr is much too low-level to be dependent on performing catalog lookups.

			regards, tom lane
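For readers following along: the point is that the 8.3-era RelFileNode starts with the tablespace OID, so a buffer-tag sort already has the tablespace in hand with no lookup needed. A self-contained sketch follows; the types mirror the backend's definitions in simplified form, and the tablespace-first comparator is illustrative only -- the actual patch just memcmp's the whole rnode.

/* Simplified mirrors of the backend types (8.3 era), for illustration. */
typedef unsigned int Oid;
typedef unsigned int BlockNumber;

typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace OID -- the first field */
    Oid         dbNode;         /* database OID */
    Oid         relNode;        /* relation file OID */
} RelFileNode;

typedef struct BufferTag
{
    RelFileNode rnode;
    BlockNumber blockNum;
} BufferTag;

/* Orders writes by tablespace, then database, then relation, then block. */
int
tag_cmp_tablespace_first(const BufferTag *a, const BufferTag *b)
{
    if (a->rnode.spcNode != b->rnode.spcNode)
        return (a->rnode.spcNode < b->rnode.spcNode) ? -1 : 1;
    if (a->rnode.dbNode != b->rnode.dbNode)
        return (a->rnode.dbNode < b->rnode.dbNode) ? -1 : 1;
    if (a->rnode.relNode != b->rnode.relNode)
        return (a->rnode.relNode < b->rnode.relNode) ? -1 : 1;
    if (a->blockNum != b->blockNum)
        return (a->blockNum < b->blockNum) ? -1 : 1;
    return 0;
}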
On Fri, 2008-07-04 at 12:05 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Anyway, I note that we don't have an easy way of sorting by tablespace,
>
> Say what?  tablespace is the first component of relfilenode.

OK, that's a mistake... what about the rest of the idea?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 4 Jul 2008, Simon Riggs wrote:

> No action on this seen since last commitfest, but I think we should do
> something with it, rather than just ignore it.

Just no action worth reporting yet.  Over the weekend I finally reached the point where I've got a system that should be capable of independently replicating the setup where the improvement was reported, and I've started performance testing of the patch.  Getting useful checkpoint test results from pgbench is really a pain.

> Sorting by file might have inadvertently shown a benefit at the tablespace
> level on a larger server with spread-out data, whereas on Tom's test
> system I would guess just a single tablespace was used.

I doubt this has anything to do with it, only because the pgbench schema doesn't split into tablespaces usefully.  Almost all of the real action is on a single table, accounts.

My suspicion is that sorting only benefits in situations where you have a disk controller with a significant amount of RAM on it, but the server RAM is much larger.  In that case the sorting horizon of the controller itself is smaller than what the server can do, and the sorting makes it less likely you'll end up with the controller filled with unsorted stuff that takes a long time to clear.  In Tom's test, there's probably only 8 or 16MB worth of cache on the disk itself, so you can't get a large backlog of unsorted writes clogging the write pipeline.  But most server systems have 256MB or more of RAM there, and if you get that filled with seek-heavy writes (which might only clear at a couple of MB a second) the delay for that cache to empty can be considerable.

That said, I've got a 256MB controller here and have a very similar disk setup to the one positive results were reported on, but so far I don't see any significant difference after applying the sorted-writes patch.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, 2008-07-09 at 21:39 -0400, Greg Smith wrote:
> On Fri, 4 Jul 2008, Simon Riggs wrote:
>
> > No action on this seen since last commitfest, but I think we should do
> > something with it, rather than just ignore it.
>
> Just no action worth reporting yet.  Over the weekend I finally reached
> the point where I've got a system that should be capable of independently
> replicating the setup where the improvement was reported, and I've started
> performance testing of the patch.  Getting useful checkpoint test results
> from pgbench is really a pain.

I agree completely.  That's why I've suggested a plugin approach.  That way Itagaki can have his performance, the rest of us don't need to fret, and yet we hold open the door indefinitely for additional ways of doing it.  And we can test it on production systems with realistic workloads.

If one clear way emerges as best, we adopt that plugin permanently.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support