Thread: Index only scan paving the way for "auto" clustered tables?
Hi all,

I wonder, could the recent work on index only scans pave the way for auto clustered tables?

Consider a wide, mostly insert table with some subset of columns that I'd like to cluster on. I'm after locality of tuples that are very frequently fetched together, but not keen on the downtime for a CLUSTER, nor the maintenance that it requires. Would it be a stretch to have an index that branches on the subset of "cluster" columns, but still stores all the columns, making it a covering index? Given that we can already index concurrently, such an index would not require downtime, and would be self-maintaining.

From my understanding of the index-only scan implementation, I suspect that such an index would effectively give locality, with some caveats… I'd expect the overhead of inserting into such a table would be high, perhaps prohibitive. Perhaps better ways have been discussed.

Stupid idea?

--Royce
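[Editor's sketch — not part of the original mail.] Royce's proposal is essentially a B-tree keyed on the "cluster" columns that also stores every other column, so frequently-fetched-together rows sit adjacent in the index and the heap is never visited. A minimal toy model in Python (all names invented; a sorted list stands in for the B-tree) illustrates both the locality win and the per-insert ordering cost he worries about:

```python
import bisect

class CoveringIndex:
    """Toy model: keyed on the "cluster" columns, but storing the full row,
    so a range fetch never has to touch the heap."""

    def __init__(self):
        self.keys = []     # cluster-column values, kept sorted
        self.entries = []  # (key, full row), parallel to self.keys

    def insert(self, key, row):
        # Every insert pays an ordered-insertion cost -- the overhead
        # Royce suspects may be prohibitive for a wide table.
        i = bisect.bisect_right(self.keys, key)
        self.keys.insert(i, key)
        self.entries.insert(i, (key, row))

    def range_fetch(self, lo, hi):
        # Rows with nearby cluster keys are physically adjacent in the
        # index, which is exactly the locality being asked for.
        start = bisect.bisect_left(self.keys, lo)
        stop = bisect.bisect_right(self.keys, hi)
        return [row for _, row in self.entries[start:stop]]

idx = CoveringIndex()
for key, row in [(2, "b1"), (1, "a"), (2, "b2"), (3, "c")]:
    idx.insert(key, row)
print(idx.range_fetch(1, 2))  # contiguous slice: ['a', 'b1', 'b2']
```

The fetch reads one contiguous slice of the structure, which is the point: with the real feature, that slice would be a run of index leaf pages rather than random heap pages.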
On Tue, Oct 11, 2011 at 7:08 AM, Royce Ausburn <royce.ml@inomial.com> wrote:
> I wonder, could the recent work on index only scans pave the way for auto
> clustered tables? Consider a wide, mostly insert table with some subset of
> columns that I'd like to cluster on. I'm after locality of tuples that are
> very frequently fetched together, but not keen on the downtime for a
> cluster, nor the maintenance that it requires. Would it be a stretch to
> have an index that branches on the subset of "cluster" columns, but still
> stores all the columns, making it a covering index? Given that we can
> already index concurrently, such an index would not require downtime, and
> would be self-maintaining. From my understanding of the index-only scan
> implementation, I suspect that such an index would effectively give
> locality, with some caveats…

I guess we could do that, but I'm not convinced there would be much benefit. The only thing you'd be saving would be the cost of keeping the tuples sorted by only the high-order columns rather than all of them, and I doubt that's significant.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
>> [implement "clustered index" as a covering index with all columns
>> which are present in the heap]

> I guess we could do that, but I'm not convinced there would be
> much benefit.

The "traditional" way to implement a clustered index is to have the leaf level of the index contain the tuples rather than pointers to the tuples. If we're going to do clustered tables, we might want to jump all the way to that, rather than a half-way solution which stores everything twice.

-Kevin
On Tue, Oct 11, 2011 at 3:02 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>>> [implement "clustered index" as a covering index with all columns
>>> which are present in the heap]
>> I guess we could do that, but I'm not convinced there would be
>> much benefit.
>
> The "traditional" way to implement a clustered index is to have the
> leaf level of the index contain the tuples rather than pointers to
> the tuples. If we're going to do clustered tables, we might want to
> jump all the way to that, rather than a half-way solution which
> stores everything twice.

Not a bad thought. Actually, I've been thinking for a while now that we might need a pluggable heapam, similar to the pluggable indexams we already have. Our current heap format has served us pretty well, but there are any number of things that we can't really do without changing it. Of course, if we came up with a new format that was better in every case, across the board, then perhaps we'd be willing to just replace the current format outright -- though even then, that would break pg_upgrade, which would be painful, to put it mildly. And it seems to me that there could easily be format changes that would make sense for particular cases, but not across the board, like:

- index-organized tables (heap is a btree, and secondary indexes reference the PK rather than the TID; this is how MySQL does it, and Oracle offers it as an option)
- WORM tables (no updates or deletes, and no inserts after creating transaction commits, allowing a much smaller tuple header)
- non-transactional tables (tuples visible as soon as they're written, again allowing for smaller tuple header; useful for internal stuff and perhaps for insert-only log tables)

Alternatively, we could try to graft the concept of a self-clustering table on top of the existing heap implementation. But I'm having trouble seeing how that would work. The TODO describes it as something like "maintain CLUSTER ordering", but that's a gross oversimplification, because we have no structure that would allow us to sensibly do any such thing... the current heap implementation is just that: a pile of stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
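[Editor's sketch — not part of the original mail.] The first bullet, index-organized tables, hinges on one structural choice: secondary indexes store primary-key values instead of physical TIDs, so rows may move inside the PK-ordered structure without invalidating secondary indexes, at the price of a second lookup. A toy Python model (dicts standing in for the btrees; all names invented) of that two-step lookup:

```python
class IndexOrganizedTable:
    """Toy model: the "heap" is ordered by primary key; a secondary index
    maps an indexed value to PKs, not to physical row locations."""

    def __init__(self):
        self.primary = {}    # pk -> full row (stands in for the PK btree)
        self.secondary = {}  # indexed value -> set of pks

    def insert(self, pk, row, indexed_col):
        self.primary[pk] = row
        self.secondary.setdefault(row[indexed_col], set()).add(pk)

    def lookup_by_secondary(self, value):
        # Two hops: secondary index yields PKs, then each PK is re-found
        # in the primary structure -- the extra traversal this layout pays
        # in exchange for secondary indexes that never need fixing when
        # rows move within the primary btree.
        return [self.primary[pk] for pk in sorted(self.secondary.get(value, ()))]

t = IndexOrganizedTable()
t.insert(1, {"id": 1, "city": "Oslo"}, "city")
t.insert(2, {"id": 2, "city": "Oslo"}, "city")
print(t.lookup_by_secondary("Oslo"))
```

This mirrors the trade-off mentioned for InnoDB: PK lookups and PK-range scans get locality for free, while every secondary-index lookup costs an extra descent.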
On Oct 11, 2011, at 21:27, Robert Haas wrote:
> Alternatively, we could try to graft the concept of a self-clustering
> table on top of the existing heap implementation. But I'm having
> trouble seeing how that would work. The TODO describes it as
> something like "maintain CLUSTER ordering", but that's a gross
> oversimplification, because we have no structure that would allow us
> to sensibly do any such thing... the current heap implementation is
> just that: a pile of stuff.

We could still be smarter about where we insert new rows in a clustered table, though.

Upon INSERT and UPDATE, we'd need to look up the leaf page where the new tuple will eventually go in the index we're supposed to maintain CLUSTER for. Then we'd check if any of the pages referenced there contains enough space, and if so place the new tuple there. If not, it'd go at the end.

best regards,
Florian Pflug
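[Editor's sketch — not part of the original mail.] Florian's heuristic is small enough to state as code. A hedged Python sketch, with invented structures: `leaf_pages_for` stands in for descending the CLUSTER index to the target leaf and collecting the heap pages its item pointers reference, and `free_space` stands in for the free space map:

```python
def place_tuple(key, size, leaf_pages_for, free_space, heap_end):
    """Pick the heap page for a new tuple with the given cluster key.
    leaf_pages_for(key) -> heap page numbers referenced by the index leaf
    the key belongs to; free_space: page -> bytes free (a stand-in FSM)."""
    for page in leaf_pages_for(key):
        if free_space.get(page, 0) >= size:
            # A page already holding neighbours of this key has room:
            # placing the tuple there preserves clustering.
            free_space[page] -= size
            return page
    # The "if not" branch -- the one Robert (below in the thread) suspects
    # would fire often enough to defeat the scheme: append at the end.
    return heap_end

free = {7: 40, 9: 120}
leaf = lambda key: [7, 9]   # pretend the target leaf references pages 7 and 9
print(place_tuple("k", 64, leaf, free, heap_end=42))   # fits on page 9
print(place_tuple("k", 200, leaf, free, heap_end=42))  # falls back to 42
```

How often the fallback fires clearly depends on fillfactor-style slack being reserved on the clustered pages, which the mail does not spell out.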
Robert Haas wrote:
"""
And it seems to me that there could easily be format changes that
would make sense for particular cases, but not across the board, like:

- index-organized tables (heap is a btree, and secondary indexes
reference the PK rather than the TID; this is how MySQL does it, and
Oracle offers it as an option)
- WORM tables (no updates or deletes, and no inserts after creating
transaction commits, allowing a much smaller tuple header)
- non-transactional tables (tuples visible as soon as they're written,
again allowing for smaller tuple header; useful for internal stuff and
perhaps for insert-only log tables)
"""

This is probably a silly idea, but I have been wondering about the following: instead of having visibility info in the row header, have a couple of row visibility slots in the page header. These slots could be shared between rows in the page, so that if you do a bulk insert/update/delete you would only use one slot. If the slots overflow, you would use an external slots buffer.

When a row is all-visible, no slot would be used at all. The xmin, xmax and cid would be in the slots. ctid would keep its current meaning, except when the external slots buffer is used; then ctid would point to the external slot, which would hold the real row header. I don't know if there are any other row header parts which could be shared. The external slots buffer would then contain xmin, xmax, cid and the real ctid.

Updates would write the new rows to another page in the heap, and old rows would stay in place, just as now. So there would not be anything resembling a redo log. Also, the external slots buffer would be small (18 bytes per row), so it would not fall out of cache too easily. The performance would suck if you had lots of small updates, or long-running transactions. On the other hand, in data warehousing, where bulk loads are normal and there are a lot of small rows, this could actually work.

As said, this is probably a silly idea. But as pluggable heap types came up, I thought to ask if this could actually work. If this kind of wondering post is inappropriate for this list, please tell me so that I can avoid them in the future.

- Anssi Kääriäinen
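[Editor's sketch — not part of the original mail.] The mechanics of Anssi's proposal can be made concrete with a toy model. Everything here is invented for illustration (slot count, structure names); the one property being demonstrated is that a bulk operation consumes a single shared slot, and that visibility info spills to an external buffer once the page's slots are exhausted:

```python
PAGE_SLOTS = 4  # assumed: "a couple of" slots per page header

class Page:
    """A heap page whose header holds a few visibility slots shared by rows."""
    def __init__(self):
        self.slots = {}  # slot id -> (xmin, xmax, cid), shared by many rows
        self.rows = {}   # row id -> slot id, or None once spilled externally

class ExternalSlots:
    """The external slots buffer (18 bytes per row in Anssi's estimate)."""
    def __init__(self):
        self.buf = {}    # (page id, row id) -> (xmin, xmax, cid, real ctid)

def record_visibility(page, page_id, row_ids, xmin, xmax, cid, external):
    """One bulk insert/update/delete shares a single slot among its rows."""
    if len(page.slots) < PAGE_SLOTS:
        slot = len(page.slots)
        page.slots[slot] = (xmin, xmax, cid)
        for r in row_ids:
            page.rows[r] = slot
    else:
        # Slots overflowed: visibility info (plus the real ctid) moves to
        # the external buffer, and the row's ctid would point there.
        for r in row_ids:
            external.buf[(page_id, r)] = (xmin, xmax, cid, (page_id, r))
            page.rows[r] = None
```

An all-visible row would simply have no slot reference at all, which is where the space saving over a per-tuple header comes from.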
Robert Haas <robertmhaas@gmail.com> writes:
> Alternatively, we could try to graft the concept of a self-clustering
> table on top of the existing heap implementation. But I'm having
> trouble seeing how that would work. The TODO describes it as
> something like "maintain CLUSTER ordering", but that's a gross
> oversimplification, because we have no structure that would allow us
> to sensibly do any such thing... the current heap implementation is
> just that: a pile of stuff.

I currently think that's the way to go, with some coarser granularity than tuple or page. Picture HOT inserts, if you will. That would be at the page level, but do we need that level of precision? I'm thinking that we need something more like segment-based here, or maybe some intermediate value would be good between a page of 8kB and a segment of 1GB, but I'm not so sure.

We would have to track the bounds of each segment for the indexed columns, and maintain them, and the planner would have to exercise pruning at the segment level. So going down too far in granularity would have a negative impact on planning performance (too much data to play with), and anyway a server that needs that kind of optimization can certainly handle a couple of GB in its file system cache.

So, it's quite hand-wavy still, but Segment Exclusion has been discussed here already, and it seems to me that's the next thing we need. Call it partial Seq Scan and HOT inserts if new names are your thing :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
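[Editor's sketch — not part of the original mail.] The pruning step Dimitri describes is the easy half to show: given tracked min/max bounds of the indexed column per segment, a "partial seq scan" only visits segments whose bounds can overlap the predicate. A hedged Python sketch (the bounds structure is invented; the hard, unshown part is maintaining those bounds under inserts):

```python
def prune_segments(segment_bounds, lo, hi):
    """segment_bounds: one (seg_min, seg_max) pair per segment for the
    indexed column. Return the segment numbers a scan for the predicate
    lo <= col <= hi must still visit; all others are excluded."""
    return [i for i, (smin, smax) in enumerate(segment_bounds)
            if smax >= lo and smin <= hi]

# Four segments of a mostly-insert table; bounds may overlap where
# clustering is imperfect (segments 1 and 2 here).
bounds = [(0, 99), (100, 199), (180, 260), (300, 400)]
print(prune_segments(bounds, 150, 200))  # -> [1, 2]
```

This is also why granularity matters in his cost argument: the planner work is linear in the number of bound pairs, so page-level bounds (131,072 pairs per GB at 8kB pages) would be far more data to play with than segment-level ones.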
On Tue, Oct 11, 2011 at 4:00 PM, Florian Pflug <fgp@phlo.org> wrote:
> On Oct11, 2011, at 21:27 , Robert Haas wrote:
>> Alternatively, we could try to graft the concept of a self-clustering
>> table on top of the existing heap implementation. But I'm having
>> trouble seeing how that would work. The TODO describes it as
>> something like "maintain CLUSTER ordering", but that's a gross
>> oversimplification, because we have no structure that would allow us
>> to sensibly do any such thing... the current heap implementation is
>> just that: a pile of stuff.
>
> We could still be smarter about where we insert new rows in a clustered
> table, though.
>
> Upon INSERT and UPDATE, we'd need to lookup the leaf page where the new
> tuple will eventually go in the index we're supposed to maintain CLUSTER
> for. Then we'd check if any of the pages referenced there contains enough
> space, and if so place the new tuple there. If not it'd go at the end.

That's an interesting idea, but my guess is that the "if not" clause would trigger frequently enough to make this not work very well. Of course, we'd need to test it to know for sure....

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 11, 2011 at 4:21 PM, Kääriäinen Anssi <anssi.kaariainen@thl.fi> wrote:
> This is probably a silly idea, but I have been wondering about the
> following idea: Instead of having visibility info in the row header,
> have a couple of row visibility slots in the page header. These slots
> could be shared between rows in the page, so that if you do a bulk
> insert/update/delete you would only use one slot. If the slots
> overflow, you would use external slots buffer.
>
> When the row is all visible, no slot would be used at all.

I've thought about stuff like this, too. One idea would be to make the slots just be items on the page. Each tuple includes a 2-byte field which points to the item number (on the same page) where the XMAX, CID, and CTID information for the update/delete of that tuple is stored. If the tuple isn't updated or deleted, it points back to the tuple's own item ID.

When a tuple is updated or deleted, add an item to the page with the necessary XMAX/CMAX/CTID and set the pointer to the new item ID. If the page is full, allocate an "overflow block" and store the overflow block number in the page header. Add the new item to the overflow block and set the 2-byte field to point to it, setting the high bit or something like that to indicate that the data is in the overflow block rather than the data block.

The main objection to this idea I see is that if the overflow blocks get much use, the overall performance might end up being poor.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
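[Editor's sketch — not part of the original mail.] Robert's variant is concrete enough to model: the 2-byte field is self-referential for a live tuple, and on delete it is repointed at a new item holding XMAX/CMAX/CTID, spilling to an overflow block (flagged via the high bit) when the page is full. A toy Python version, with invented names and a list standing in for the overflow block:

```python
OVERFLOW_FLAG = 0x8000  # high bit of the 2-byte pointer field (name invented)

class HeapPage:
    def __init__(self):
        self.items = []       # item number -> tuple data or visibility info
        self.overflow = None  # the "overflow block", allocated lazily

    def add_tuple(self, data):
        """A new tuple's 2-byte field points back at its own item ID,
        meaning 'not updated or deleted'."""
        self.items.append(data)
        return len(self.items) - 1

    def mark_deleted(self, xmax_cmax_ctid, page_full=False):
        """Store XMAX/CMAX/CTID as a new item and return the value the
        deleted tuple's 2-byte field should now hold."""
        if not page_full:
            self.items.append(xmax_cmax_ctid)
            return len(self.items) - 1
        if self.overflow is None:
            self.overflow = []  # allocate the overflow block on first use
        self.overflow.append(xmax_cmax_ctid)
        # High bit set: the visibility data lives in the overflow block.
        return OVERFLOW_FLAG | (len(self.overflow) - 1)
```

The objection in the mail maps directly onto the `page_full` branch: every pointer with the flag set costs an extra block read at visibility-check time.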