Pluggable storage

From: Alvaro Herrera
Subject: Pluggable storage
Msg-id: 20160812231527.GA690404@alvherre.pgsql
List: pgsql-hackers
Many have expressed their interest in this topic, but I haven't seen any design of how it should work. Here's my attempt; I've been playing with this for some time now and I think what I propose here is a good initial plan. This will allow us to write permanent table storage that works differently from heapam.c. At this stage, I haven't thought through whether this is going to allow extensions to define new storage modules; I am focusing on AMs that can coexist with heapam in core.

The design starts with a new row type in pg_am, of type "s" (for "storage"). The handler function returns a struct of node type StorageAmRoutine. This contains functions for:

1) scans: beginscan, getnext, endscan;
2) tuples: tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the like;
3) operations on tuples that are part of slots: tuple_deform, materialize.

To support this, we introduce StorageTuple and StorageScanDesc. A StorageTuple represents a physical tuple coming from some storage AM; it is necessary to have a pointer to a StorageAmRoutine in order to manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple.
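To make the shape of this concrete, here is a rough sketch of what the StorageAmRoutine node could look like, modeled on FdwRoutine and IndexAmRoutine. All member names and signatures here are illustrative guesses, not a settled API:

/*
 * Sketch only: a possible shape for the StorageAmRoutine node, living
 * in some new header (access/storageamapi.h, say).  Names and
 * signatures are illustrative, not a settled API.
 */
#include "postgres.h"
#include "access/sdir.h"
#include "access/skey.h"
#include "executor/tuptable.h"
#include "nodes/nodes.h"
#include "storage/itemptr.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"

typedef void *StorageTuple;        /* opaque; for heapam.c this is a HeapTuple */
typedef void *StorageScanDesc;     /* opaque per-AM scan state */

typedef struct StorageAmRoutine
{
    NodeTag     type;

    /* scan functions */
    StorageScanDesc (*scan_begin) (Relation rel, Snapshot snapshot,
                                   int nkeys, ScanKey keys);
    StorageTuple (*scan_getnext) (StorageScanDesc scan,
                                  ScanDirection direction);
    void        (*scan_end) (StorageScanDesc scan);

    /*
     * Tuple functions.  The resulting TID comes back in a separate
     * output argument rather than being scribbled on the tuple (more
     * on that below).
     */
    void        (*tuple_insert) (Relation rel, TupleTableSlot *slot,
                                 CommandId cid, ItemPointer out_tid);
    /* ... tuple_update, tuple_delete, tuple_lock, set_oid, get_xmin
     * and the like would go here ... */

    /* speculative insertion (see item i below) */
    void        (*speculative_finish) (Relation rel, TupleTableSlot *slot);
    void        (*speculative_abort) (Relation rel, TupleTableSlot *slot);

    /* operations on tuples that are part of slots */
    void        (*tuple_deform) (TupleTableSlot *slot, int natts);
    void        (*materialize) (TupleTableSlot *slot);
    void        (*slot_clear_tuple) (TupleTableSlot *slot);
} StorageAmRoutine;

A storage AM's handler function, named in its pg_am row, would allocate and fill one of these and return it, just as FDW and index AM handlers do; the relcache would then hang on to the result, which is what rd_stamroutine below refers to.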
RelationData gains ->rd_stamroutine, a pointer to the StorageAmRoutine for the relation in question. Similarly, TupleTableSlot is augmented with a link to the StorageAmRoutine that handles the StorageTuple it contains (probably in most cases it's set at the same time as the tupdesc). This implies that routines such as ExecAssignScanType need to pass the StorageAmRoutine down from the relation to the slot.

The executor is modified so that instead of calling heap_insert etc. directly, it uses rel->rd_stamroutine to call these methods. The executor is still in charge of dealing with indexes, constraints, and anything else that's not the tuple storage itself (this is one major point in which this differs from FDWs).

This all looks simple enough, with one exception and a few notes:

exception a) ExecMaterializeSlot needs special consideration. It is used in two different ways: a1) the stated "make the tuple independent of any underlying storage" purpose, which is handled by ExecMaterializeSlot itself calling a method from the storage AM to do any byte copying as needed. ExecMaterializeSlot no longer returns a HeapTuple, because there might not be any. The second usage pattern, a2), is to create a HeapTuple that's passed to other modules which only deal with HeapTuples and not slots (triggers are the main case I noticed, but I think there are others, such as the executor itself wanting tuples as Datums for some reason). For the moment I'm handling this with a new ExecHeapifyTuple, which creates a HeapTuple from a slot regardless of the original tuple format.

note b) EvalPlanQual currently maintains an array of HeapTuples in EState->es_epqTuple. I think it works to replace that with an array of StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine methods in order to interact with it. Other than those changes, it seems okay.

note c) nodeSubplan has curTuple as a HeapTuple. It seems simple to replace this with an independent slot-based tuple.

note d) grp_firstTuple in nodeAgg / nodeSetOp. These are less simple than the above, but replacing the HeapTuple with a slot-based tuple seems doable too.

note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual. TupleTableSlot also seems a good replacement here. This has fallout in other users of EvalPlanQual, too.

note f) More widespread: MinimalTuples currently use a tweaked HeapTuple format. In the long run, it may be possible to replace them with a separate storage module specifically designed to handle tuples meant for tuplestores etc. That may simplify TupleTableSlot and execTuples. For the moment we keep tts_mintuple as it is; whenever a tuple is not already in heap format, we heapify it in order to put it in the store.

The current heapam.c routines need some changes. Current practice is that heap_insert, heap_multi_insert, heap_fetch and heap_update scribble on their input tuples to set the resulting ItemPointer in tuple->t_self. This is messy if we want StorageTuples to be abstract. I'm changing this so that the resulting ItemPointer is returned in a separate output argument, and the tuple itself is left alone. This is somewhat messy in the case of heap_multi_insert, because it returns several items; I think it's acceptable to return an array of ItemPointers in the same order as the input tuples. This works fine for the only caller, which is COPY in batch mode. The other routines' callers don't really care where the TID is returned, AFAICS.

Additional noteworthy items:

i) Speculative insertion: the speculative insertion token is no longer installed directly in the heap tuple by the executor (of course). Instead, the token becomes part of the slot. When the tuple_insert method is called, the insertion routine is in charge of setting the token from the slot into the storage tuple. The executor is in charge of calling method->speculative_finish() / abort() once the insertion has been confirmed by the indexes.

ii) execTuples has additional accessors for tuples-in-slot, such as ExecFetchSlotTuple and friends. I expect some of them to return abstract StorageTuples, others HeapTuples or MinimalTuples (possibly wrapped in Datums), depending on the callers. We might be able to cut down on these later; my first cut will try to avoid API changes, to keep fallout to a minimum.

iii) All tuples need to be identifiable by ItemPointers. Storage AMs that have different requirements will need careful additional thought across the board.

iv) System catalogs cannot use pluggable storage. We continue to use heap_open etc. in the DDL code, in order not to make this more invasive than it already is. We may lift this restriction later for specific catalogs, as needed.

v) Currently, one Buffer may be associated with one HeapTuple living in a slot; when the slot is cleared, the buffer pin is released. My current patch moves the buffer pin inside the heapam-based storage AM, so that the buffer is released by the ->slot_clear_tuple method. The rationale for doing this is that some storage AMs might want to keep several buffers pinned at once, and must not release those pins individually but in batches as the scan moves forward (say, a batch of tuples in a columnar storage AM has column values spread across many buffers; they must all be kept pinned until the scan has moved past the whole set of tuples). But I'm not really sure that this is a great design.
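To illustrate item v, here is a minimal sketch of what the heapam side of that could look like. HeapamSlotState and the slot's tts_storage field are names invented for this example, not existing fields:

/*
 * Sketch only: hypothetical per-slot state for the heapam-based
 * storage module, with the buffer pin owned by the AM rather than
 * by generic slot code.  HeapamSlotState and tts_storage are
 * invented names.
 */
#include "postgres.h"
#include "access/htup.h"
#include "executor/tuptable.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"

typedef struct HeapamSlotState
{
    HeapTuple   tuple;      /* physical tuple, possibly pointing into a page */
    Buffer      buffer;     /* pin held while the tuple lives in that page */
} HeapamSlotState;

/* ->slot_clear_tuple implementation for heapam */
static void
heapam_slot_clear_tuple(TupleTableSlot *slot)
{
    HeapamSlotState *state = (HeapamSlotState *) slot->tts_storage;

    if (state == NULL)
        return;
    if (BufferIsValid(state->buffer))
    {
        ReleaseBuffer(state->buffer);
        state->buffer = InvalidBuffer;
    }
    state->tuple = NULL;
}

A columnar AM's version of the same callback could instead keep an array of pinned buffers in its state struct and release the whole batch at once when the scan moves past it.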
I welcome comments on these ideas. My patch for this is nowhere near completion yet; expect things to change for items that I've overlooked, but I hope I didn't overlook any major ones. If things are handwavy, it is probably because I haven't fully figured them out yet.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services