On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 29 October 2013 16:10, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
>>> I don't see much interest in insert-efficient indexes.
>>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.
What is pretty cool about this sort of thing is that there's no intrinsic reason the insertion buffer needs to be block-structured or disk-backed.
How do we commit to not spilling to disk when an unbounded number of indexes may exist and want to use this mechanism simultaneously? If it routinely needs to spill to disk, that would probably defeat the purpose of having it in the first place, but committing to never doing so seems extremely restrictive. As you say, it is also freeing, in terms of using pointers and such, but I think the restrictions would outweigh the freedom.
In theory, you can structure the in-memory portion of the tree any way you like, using pointers and arbitrary-size memory allocations and all that fun stuff. You need to log that there's a deferred insert (or commit to flushing the insertion buffer before every commit, which would seem to miss the point) so that recovery can reconstruct the in-memory data structure and flush it, but that's it: the WAL format need not know any other details of the in-memory portion of the tree. I think that, plus the ability to use pointers and so forth, might lead to significant performance gains.
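To make the above concrete, here is a rough sketch in Python, not a proposal for the actual implementation. Everything in it is hypothetical: the class name DeferredInsertIndex, the is_dead callback (standing in for a visibility check), the list standing in for the WAL, and the sorted list standing in for the on-disk btree. It assumes the tree's contents are durable as of each flush record, which glosses over how the flush itself would be logged. The point is only that the log records carry (key, tid) and nothing about how the buffer is laid out in memory, and that tuples already dead at flush time never reach the tree.

```python
import bisect
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeferredInsertIndex:
    # Hypothetical visibility check: returns True if the heap tuple is dead.
    is_dead: Callable[[int], bool]
    # Stand-in for the WAL: an append-only list of records.
    wal: list = field(default_factory=list)
    # In-memory insertion buffer. Its layout is private to this class;
    # the WAL records say nothing about it.
    buffer: dict = field(default_factory=dict)
    # Stand-in for the on-disk btree: a sorted list of (key, tid).
    tree: list = field(default_factory=list)

    def insert(self, key, tid):
        # Log only the fact that an insert was deferred, then buffer it.
        self.wal.append(("deferred_insert", key, tid))
        self.buffer.setdefault(key, []).append(tid)

    def flush(self):
        # Apply buffered inserts in key order, skipping tuples that have
        # died in the meantime. Returns the number of skipped tuples.
        skipped = 0
        for key in sorted(self.buffer):
            for tid in self.buffer[key]:
                if self.is_dead(tid):
                    skipped += 1
                    continue
                bisect.insort(self.tree, (key, tid))
        self.buffer.clear()
        self.wal.append(("flush",))
        return skipped

    @classmethod
    def recover(cls, wal, tree, is_dead):
        # Rebuild the in-memory buffer from the log: every deferred insert
        # after the last flush record is still pending.
        idx = cls(is_dead=is_dead, wal=list(wal), tree=list(tree))
        for rec in wal:
            if rec[0] == "deferred_insert":
                idx.buffer.setdefault(rec[1], []).append(rec[2])
            elif rec[0] == "flush":
                idx.buffer.clear()
        return idx
```

After a simulated crash, calling DeferredInsertIndex.recover(wal, tree, is_dead) and then flush() yields the same tree as flushing the original buffer would have. Swapping the dict for a skip list or anything else pointer-based would not change the log records.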
In practice, the topology of our shared memory segment makes this a bit tricky. The problem isn't so much that it's fixed size as that it lacks a real allocator, and that all the space used for shared_buffers is nailed down and can't be borrowed for other purposes.
I think the fixed size is also a real problem, especially given the ubiquitous advice not to set shared_buffers above 2 to 8 GB.