On Mon, Nov 4, 2013 at 5:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Of course, it's possible that even if we do get a shared memory
>> allocator, a hypothetical person working on this project might prefer
>> to make the data block-structured anyway and steal storage from
>> shared_buffers. So my aspirations in this area may not even be
>> relevant. But I wanted to mention them, just in case anyone else is
>> thinking about similar things, so that we can potentially coordinate.
>
> If anyone were going to work on an LSM tree, I would advise building a
> tree in shared/temp buffers first, then merging it with the main tree.
> The merge process could use the killed-tuple approach to mark entries
> as they are merged.
>
> The most difficult thing about buffering the inserts is deciding which
> poor sucker gets the task of cleaning up. That's probably better as an
> off-line process, which is where the work comes in. Non-shared-buffered
> approaches would add too much overhead to the main task.

Thing is, if you want crash-safety guarantees, you cannot use temp
(unlogged) buffers, and then you always have to flush WAL at each
commit. If the staging index is shared, that could mean a lot of WAL
(i.e. probably around double what a regular b-tree would generate),
since each entry would be WAL-logged twice: once going into the staging
index and again when it is merged into the main tree.

Process-private staging trees that get merged on commit (i.e.
transaction-scope staging trees), on the other hand, do not require WAL
logging and can use temp buffers, and since they don't outlive the
transaction, it's quite obvious who does the merging: the committer.
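
To make the idea concrete, here's a toy, PostgreSQL-agnostic sketch of
what I mean. The names (staging_insert, staging_commit) are made up and
sorted arrays stand in for the actual trees; the point is only that the
insert path touches nothing shared and writes no WAL, and all of the
shared-structure work is deferred to one well-defined merge at commit:

    /*
     * Toy sketch of a transaction-scope staging area: inserts go into a
     * small process-private buffer, and the committing process merges
     * them into the "main" sorted index at commit time.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define STAGING_MAX 1024

    typedef struct {
        int keys[STAGING_MAX];
        int nkeys;
    } StagingTree;          /* stands in for a temp-buffer staging tree */

    typedef struct {
        int *keys;
        int nkeys;
    } MainIndex;            /* stands in for the shared, WAL-logged index */

    static int cmp_int(const void *a, const void *b)
    {
        return (*(const int *) a > *(const int *) b) -
               (*(const int *) a < *(const int *) b);
    }

    /* Cheap insert: append only, no WAL, no shared buffers touched. */
    static void staging_insert(StagingTree *st, int key)
    {
        if (st->nkeys < STAGING_MAX)
            st->keys[st->nkeys++] = key;
    }

    /* Commit: sort the private buffer and merge it into the main index. */
    static void staging_commit(StagingTree *st, MainIndex *mi)
    {
        int *merged;
        int i = 0, j = 0, k = 0;

        qsort(st->keys, st->nkeys, sizeof(int), cmp_int);
        merged = malloc((mi->nkeys + st->nkeys) * sizeof(int));

        while (i < mi->nkeys && j < st->nkeys)
            merged[k++] = (mi->keys[i] <= st->keys[j]) ? mi->keys[i++]
                                                       : st->keys[j++];
        while (i < mi->nkeys)
            merged[k++] = mi->keys[i++];
        while (j < st->nkeys)
            merged[k++] = st->keys[j++];

        free(mi->keys);
        mi->keys = merged;
        mi->nkeys = k;
        st->nkeys = 0;      /* staging tree dies with the transaction */
    }

    int main(void)
    {
        MainIndex mi = { NULL, 0 };
        StagingTree st = { {0}, 0 };

        staging_insert(&st, 42);
        staging_insert(&st, 7);
        staging_insert(&st, 19);
        staging_commit(&st, &mi);   /* the committer does the merge */

        for (int i = 0; i < mi.nkeys; i++)
            printf("%d\n", mi.keys[i]);
        free(mi.keys);
        return 0;
    }

In the real thing the merge would of course go through the normal
index-insert path (and thus be WAL-logged once), but the per-row insert
cost during the transaction stays private and cheap.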

The question is what kind of workload that would speed up with any
significance, and whether the amount of work involved is worth the
speedup on those workloads.