Thread: Question about indexes
How feasible would it be to have a btree index on ctid? I'm thinking it ought to work simply enough for the normal case of insert/delete/update, but I'm not completely certain how vacuum, vacuum full, and cluster would interact. You may think this would be utterly useless, but I have a cunning plan.

-- greg
Greg Stark <gsstark@mit.edu> writes:
> How feasible would it be to have a btree index on ctid?

Why would you want one? Direct access by ctid beats out an index lookup every time. In any case, vacuum and friends would break such an index entirely.

        regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
>> How feasible would it be to have a btree index on ctid?
>
> Why would you want one? Direct access by ctid beats out an index lookup
> every time.

Of course. But as I mentioned, I have a cunning plan.

If you have two indexes (a,ctid) and (b,ctid) and do a query where a=1 and b=2, then it would be particularly easy to combine the two efficiently.

If specially marked btree indexes -- or even all btree indexes -- implicitly had ctid as a final sort key after all the index columns, then it would essentially obviate the need for bitmap indexes. They wouldn't have the space advantage, but they could be combined using arbitrary boolean expressions without looking at the actual tuples. This is essentially what is in the TODO about using bitmaps, but without having to do any extra sorts.

This would only really be an advantage for particularly wide tables where the combination of boolean clauses narrows the result set down a lot more than any one clause does.

> In any case, vacuum and friends would break such an index entirely.

That was what I was afraid of.

-- greg
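A minimal sketch of the combination Greg describes, assuming both index scans return their matches already sorted by ctid; ItemPtr and the hard-coded "scan results" are hypothetical stand-ins, not PostgreSQL types or APIs:

#include <stdio.h>

/* stand-in for a heap tuple pointer: (block number, line pointer) */
typedef struct { unsigned block; unsigned short offset; } ItemPtr;

static int itemptr_cmp(ItemPtr x, ItemPtr y)
{
    if (x.block != y.block)
        return x.block < y.block ? -1 : 1;
    if (x.offset != y.offset)
        return x.offset < y.offset ? -1 : 1;
    return 0;
}

/* Intersect two ctid-sorted result streams with a linear merge,
 * producing the AND of the two quals without touching the heap. */
static size_t merge_intersect(const ItemPtr *a, size_t na,
                              const ItemPtr *b, size_t nb,
                              ItemPtr *out)
{
    size_t i = 0, j = 0, k = 0;

    while (i < na && j < nb)
    {
        int c = itemptr_cmp(a[i], b[j]);

        if (c < 0)
            i++;
        else if (c > 0)
            j++;
        else
        {
            out[k++] = a[i];
            i++;
            j++;
        }
    }
    return k;
}

int main(void)
{
    /* pretend these came from scans of (a,ctid) for a=1 and (b,ctid) for b=2 */
    ItemPtr a_hits[] = { {1,3}, {2,7}, {5,1}, {9,4} };
    ItemPtr b_hits[] = { {2,7}, {5,1}, {8,2} };
    ItemPtr out[4];
    size_t n = merge_intersect(a_hits, 4, b_hits, 3, out);

    for (size_t i = 0; i < n; i++)
        printf("(%u,%u)\n", out[i].block, (unsigned) out[i].offset);
    return 0;
}

The same merge shape handles OR (take either side) and difference, which is what makes the pre-sorted ordering attractive in this sketch.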
Greg Stark <gsstark@mit.edu> writes:
> If you have two indexes (a,ctid) and (b,ctid) and do a query where a=1 and b=2,
> then it would be particularly easy to combine the two efficiently.

> If specially marked btree indexes -- or even all btree indexes -- implicitly
> had ctid as a final sort key after all the index columns, then it would
> essentially obviate the need for bitmap indexes.

I don't think so. You are thinking only of exact-equality queries --- as soon as the WHERE clause describes a range of index entries, the readout wouldn't be sorted by ctid anyway.

Combining indexes via a bitmap intermediate step (which is not really the same thing as bitmap indexes, IIUC) seems like a more robust approach than relying on the index entries to be in ctid order.

But if we did want to sort indexes that way, we could do it today, I think. The ctid is already stored in index entries (it is the "payload", remember...) and we could use it as a tiebreaker when determining insertion position. This doesn't have the problems that putting ctid into the user columns would, because the system knows about that ctid as being special; the difficulty with ctid in the user columns is the code not knowing that it'd need to change on a tuple move.

        regards, tom lane
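A toy illustration of the tiebreaker Tom mentions: compare index entries by their key first and, only on equal keys, by the heap tuple pointer the entry already carries. IndexEntry and TidStub are invented names for the sketch, not the actual btree item layout:

#include <stdio.h>
#include <stdint.h>

typedef struct { uint32_t block; uint16_t offset; } TidStub;   /* heap ctid stand-in */
typedef struct { int key; TidStub heap_tid; } IndexEntry;      /* one index entry */

static int tid_cmp(TidStub a, TidStub b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/* Key comparison with the ctid acting as an implicit final sort column,
 * so equal-key entries come out in heap order. */
static int entry_cmp(const IndexEntry *a, const IndexEntry *b)
{
    if (a->key != b->key)
        return a->key < b->key ? -1 : 1;
    return tid_cmp(a->heap_tid, b->heap_tid);
}

int main(void)
{
    IndexEntry e1 = { 42, {5, 3} };
    IndexEntry e2 = { 42, {5, 9} };      /* same key: the ctid breaks the tie */

    printf("%d\n", entry_cmp(&e1, &e2)); /* prints -1 */
    return 0;
}

Whether vacuum full and cluster would complicate maintaining that ordering is exactly the question raised in the follow-up.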
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I don't think so. You are thinking only of exact-equality queries ---
> as soon as the WHERE clause describes a range of index entries, the
> readout wouldn't be sorted by ctid anyway.

But then even bitmap indexes would fail in that way too, or at least have a lot of extra cost that would have to be taken into account based on the number of values in the range.

> Combining indexes via a bitmap intermediate step (which is not really
> the same thing as bitmap indexes, IIUC) seems like a more robust
> approach than relying on the index entries to be in ctid order.

I would see that as the next step. But it seems to me it would be only a small set of queries where it would really help enough to outweigh the extra work of the sort. Whereas if the ctids are already pre-sorted, the extra cost is fairly low. Sort of like the difference in cost between a merge join where both sides have to be sorted and a merge join where both sides are pre-sorted.

> But if we did want to sort indexes that way, we could do it today,
> I think. The ctid is already stored in index entries (it is the
> "payload", remember...) and we could use it as a tiebreaker when
> determining insertion position. This doesn't have the problems that
> putting ctid into the user columns would, because the system knows
> about that ctid as being special; the difficulty with ctid in the user
> columns is the code not knowing that it'd need to change on a tuple move.

That's exactly what I was thinking. I just don't know how badly it would complicate the vacuum/vacuum full/cluster code, and whether those are the only cases to worry about.

Note that the space saving of bitmap indexes is still a substantial factor. Using btree indexes, the I/O cost of doing multiple index scans plus a table scan of the relevant pages would still be quite substantial. So this doesn't completely obviate the need for bitmap indexes, but I think it would remove a lot of the pressure from people who just need them to handle a few select queries.

-- greg
Greg Stark <gsstark@mit.edu> writes:
>> Combining indexes via a bitmap intermediate step (which is not really
>> the same thing as bitmap indexes, IIUC) seems like a more robust
>> approach than relying on the index entries to be in ctid order.

> I would see that as the next step. But it seems to me it would be only a small
> set of queries where it would really help enough to outweigh the extra work of
> the sort.

What sort? The whole point of a bitmap is that it makes it easy to visit the tuples in heap order. You scan the index, you set the appropriate bits in the bitmap, and then you scan the bitmap and go to the heap tuples that have their bits set. If you are using multiple indexes you can AND or OR their results at the bitmap phase before you go to the heap.

An implementation of this kind would not produce tuples in index order, so if you have an ORDER BY to satisfy then you end up doing an explicit sort after you have the tuples. It would be up to the planner to consider this cost versus the advantages of being able to use multiple indexes; we'd certainly want to keep the existing scan mechanism as an available alternative. But if the query is suited to multiple indexes I suspect it'd be a win pretty often.

> Note that the space saving of bitmap indexes is still a substantial factor.

I think you are still confusing what I'm talking about with a bitmap index, ie, a persistent structure on-disk. It's not that at all, but a transient structure built in-memory during an index scan.

I'm a little dubious that true bitmap indexes would be worth building for Postgres. Seems like partial indexes cover the same sorts of applications and are more flexible.

        regards, tom lane
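A toy, self-contained illustration of the flow Tom describes - scan each index, set bits, AND the bitmaps, then visit the heap in page order. All names, sizes, and the fetch_heap_tuple() placeholder are invented for the sketch and are not existing PostgreSQL structures:

#include <stdint.h>
#include <stdio.h>

#define MAX_ITEMS_PER_PAGE 128                 /* assumed, for illustration */
#define NPAGES             1024
#define NBITS   (NPAGES * MAX_ITEMS_PER_PAGE)
#define NWORDS  (NBITS / 64)

typedef struct { uint64_t words[NWORDS]; } LinearBitmap;

/* setting a bit is O(1): just index and set */
static void bitmap_set(LinearBitmap *bm, unsigned page, unsigned item)
{
    unsigned bit = page * MAX_ITEMS_PER_PAGE + item;
    bm->words[bit / 64] |= UINT64_C(1) << (bit % 64);
}

/* combining two index results is a word-at-a-time AND (OR works the same way) */
static void bitmap_and(LinearBitmap *dst, const LinearBitmap *src)
{
    for (size_t i = 0; i < NWORDS; i++)
        dst->words[i] &= src->words[i];
}

static void fetch_heap_tuple(unsigned page, unsigned item)    /* placeholder */
{
    printf("visit heap page %u, item %u\n", page, item);
}

int main(void)
{
    static LinearBitmap a, b;                  /* zero-initialized */

    /* "scan" of the index on column a, for a = 1 */
    bitmap_set(&a, 2, 7); bitmap_set(&a, 5, 1); bitmap_set(&a, 9, 4);
    /* "scan" of the index on column b, for b = 2 */
    bitmap_set(&b, 2, 7); bitmap_set(&b, 5, 1); bitmap_set(&b, 8, 2);

    bitmap_and(&a, &b);                        /* combine before touching the heap */

    /* a linear walk of the bitmap yields the matches in heap (page) order */
    for (unsigned bit = 0; bit < NBITS; bit++)
        if (a.words[bit / 64] & (UINT64_C(1) << (bit % 64)))
            fetch_heap_tuple(bit / MAX_ITEMS_PER_PAGE, bit % MAX_ITEMS_PER_PAGE);
    return 0;
}

A dense linear array like this wastes space on untouched pages; the per-page tree and lossy variants discussed later in the thread address that.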
Tom Lane <tgl@sss.pgh.pa.us> writes:
> In any case, this discussion is predicated on the assumption that the
> operations involving the bitmap are a significant fraction of the total
> time, which I think is quite uncertain. Until we build it and profile
> it, we won't know that.

The other thought I had was that it would be difficult to tell when to follow this path. Since the main case where it wins is when the individual indexes aren't very selective but the combination is very selective, and we don't have inter-column correlation statistics ...

-- greg
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Greg Stark <gsstark@mit.edu> writes:
>> I would see that as the next step. But it seems to me it would be only a small
>> set of queries where it would really help enough to outweigh the extra work of
>> the sort.
>
> What sort?

To build the in-memory bitmap you effectively have to do a sort. If the tuples come out of the index in heap order then you can combine them without having to go through that step.

> I'm a little dubious that true bitmap indexes would be worth building
> for Postgres. Seems like partial indexes cover the same sorts of
> applications and are more flexible.

I'm clear on the distinction. I think bitmap indexes still have a place, but if regular btree indexes could be combined efficiently then that would be an even narrower niche.

Partial indexes are very handy, and they're useful in corner cases where bitmap indexes are useful, such as flags for special types of records. But I think bitmap indexes are specifically wanted by certain types of data warehousing applications where you have an index on virtually every column and then want to do arbitrary boolean combinations of all of them. btree indexes would generate more I/O scanning all the indexes than just doing a sequential scan would, whereas bitmap indexes are much denser on disk.

However, my experience leans more towards the OLTP side, and I very rarely saw applications like this.

-- greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> What sort?

> To build the in-memory bitmap you effectively have to do a sort.

Hm, you're thinking that the operation of inserting a bit into a bitmap has to be at least O(log N). Seems to me that that depends on the data structure you use. In principle it could be O(1), if you use a true bitmap (linear array) -- just index and set the bit. You might be right that practical data structures would be O(log N), but I'm not totally convinced.

> If the tuples come out of the index in heap order then you can combine
> them without having to go through that step.

But considering the restrictions implied by that assumption --- no range scans, no non-btree indexes --- I doubt we will take the trouble to implement that variant. We'll want to do the generalized bitmap code anyway.

In any case, this discussion is predicated on the assumption that the operations involving the bitmap are a significant fraction of the total time, which I think is quite uncertain. Until we build it and profile it, we won't know that.

        regards, tom lane
Some potentially helpful background comments on the discussion so far...

> Tom Lane writes
>> Greg Stark writes
>> Note that the space saving of bitmap indexes is still a substantial
>> factor.
> I think you are still confusing what I'm talking about with a bitmap index,
> ie, a persistent structure on-disk. It's not that at all, but a transient
> structure built in-memory during an index scan.

Oracle allows the creation of bitmap indices as persistent data structures. The "space saving" of bitmap indices is only a saving when compared with btree indices. If you don't have them at all because they are built dynamically when required, as Tom is suggesting, then you "save" even more space.

Maintaining a bitmap index is a costly operation. You tend to want to build them on "characteristic" columns, of which there tend to be more in a database than the "partial/full identity" columns on which you build btrees (forgive the vagueness of that comment), so you end up with loads of the damn things, and the space soon adds up. It can be hard to judge which ones are the important ones, especially when each is used by a different user/group. Building them dynamically is a good way of answering the question "which ones are needed?". Ever seen 58 indices on a table? Don't go there.

My vote would be to implement the dynamic building capability, then return to implement a persistent structure later if that seems like it would be a further improvement. [The option would be nice]

If we do it dynamically, as Tom suggests, then we don't have to code the index maintenance logic at all and the functionality will be with us all the sooner. Go Tom!

> Tom Lane writes
> In any case, this discussion is predicated on the assumption that the
> operations involving the bitmap are a significant fraction of the total
> time, which I think is quite uncertain. Until we build it and profile
> it, we won't know that.

Dynamically building the bitmaps has been the strategy in use by Teradata for nearly a decade on many large data warehouses. I can personally vouch for the effectiveness of this approach - I was surprised when Oracle went for the persistent option. Certainly in that case building the bitmaps adds much less time than is saved overall by the better total query strategy.

> Greg Stark writes
>> To build the in-memory bitmap you effectively have to do a sort.

Not sure on this latter point: I think I agree with Greg here, but want to believe Tom, because requiring a sort will definitely add time.

To shed some light in this area, some other major implementations are:

In Teradata, tables are stored based upon a primary index, which is effectively an index-organised table. The index pointers are stored in sorted order, in lock step with the blocks of the associated table - no sort required. (The ordering is based upon a hashed index, but that doesn't change the technique.)

Oracle's tables/indexes use heaps/btrees also, though they do provide an index-organised table feature similar to Teradata's. Maybe the lack of heap/btree consistent ordering in Oracle, and their subsequent design choice of persistent bitmap indices, is an indication for PostgreSQL too?

In Oracle, bitmap indices are an important precursor to the star join technique. AFAICS it is still possible to have a star join plan without having persistent bitmap indices. IMHO, the longer term goal of a good star join plan is an important one - that may influence the design selection for this discussion.

Hope some of that helps,

Best regards, Simon Riggs
A small comment on Oracle's implementation of persistent bitmap indexes:

Oracle's bitmap index gets locked by concurrent DML, i.e. it suits OLAP (basically read-only data warehouses) but in no way OLTP.

IMHO,
Laimis

> Maybe the lack of heap/btree consistent ordering in Oracle,
> and their subsequent design choice of persistent bitmap
> indices, is an indication for PostgreSQL too?
<lnd@hnit.is> writes:
> A small comment on Oracle's implementation of persistent bitmap indexes:
>
> Oracle's bitmap index gets locked by concurrent DML, i.e. it suits OLAP
> (basically read-only data warehouses) but in no way OLTP.

I knew this. I think they figured that was ok because bitmap indexes were mainly intended to solve data warehouse problems anyways.

Thinking out loud here, I wonder whether this would be less of a problem for Postgres. Since tuples are never updated in place, there would never be a need to lock the entire bitmap until a transaction completes. There would never be as much concurrency as with btrees, assuming there was any kind of compression on the bitmap, but I don't see any reason why a long-term lock would have to be held for updates.

Even regular vacuum might not have to lock anything for long, just long enough to clear the bits. And vacuum full/cluster already take table locks anyways.

I think the problem Oracle ran into was that storing rollback ids in the bitmap is untenable. The whole point of persistent bitmap indexes is to store a very dense representation that represents thousands of records per page. Allocating space to store thousands of pending transaction ids, and having thousands of old versions of the page in the rollback segment, would defeat the purpose.

-- greg
Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> In any case, this discussion is predicated on the assumption that the
>> operations involving the bitmap are a significant fraction of the total
>> time, which I think is quite uncertain. Until we build it and profile
>> it, we won't know that.
>
> The other thought I had was that it would be difficult to tell when to follow
> this path. Since the main case where it wins is when the individual indexes
> aren't very selective but the combination is very selective, and we don't have
> inter-column correlation statistics ...

I like the idea of building in-memory bitmapped indexes.

In your example, if you are restricting on A and B, and have no (A,B) index but an A index and a B index, why wouldn't you always create an in-memory bitmapped index from indexes A and B, unless index A hits only a few rows? In fact, from the optimizer statistics, you can guess how many bits you will hit from index A and index B, so we only have to decide whether it is better to take the more restrictive index and do heap lookups for those rows, or to scan the second index as well and then hit the heap. The only thing combined (A,B) statistics would tell you is how many heap matches you will find. The time to scan the A and B indexes and create the bitmap is already guessable from the single-column statistics.

Also, what does an in-memory bitmapped index look like? Is it:

    value: bitmap ... value: bitmap ...

with the values organized in a btree fashion?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Also, what does an in-memory bitmapped index look like?

One idea that might work: a binary search tree in which each node represents a single page of the table, and contains a bit array with one bit for each possible item number on the page. You could never need more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits in a node, or about 36 bytes at default BLCKSZ --- for most tables you could probably prove it would be a great deal less. You only allocate nodes for pages that have at least one interesting row.

I think this would represent a reasonable compromise between size and insertion speed. It would only get large if the indexscan output demanded visiting many different pages --- but at some point you could abandon index usage and do a sequential scan, so I think that property is okay.

A variant is to make the per-page bit arrays be entries in a hash table with page number as hash key. This would reduce insertion to a nearly constant-time operation, but the drawback is that you'd need an explicit sort at the end to put the per-page entries into page number order before you scan 'em. You might come out ahead anyway, not sure.

Or we could try a true linear bitmap (indexed by page number times max-items-per-page plus item number) that's compressed in some fashion, probably just by eliminating large runs of zeroes. The difficulty here is that inserting a new one-bit could be pretty expensive, and we need it to be cheap.

Perhaps someone can come up with other better ideas ...

        regards, tom lane
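A sketch of the first structure Tom outlines - a binary search tree keyed by block number, one node per heap page that has at least one matching row, each node holding a small per-item bit array. The constants and names are illustrative only; the real per-page bound would come from BLCKSZ / (sizeof(HeapTupleHeaderData) + sizeof(ItemIdData)):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ITEMS_PER_PAGE 288                    /* assumed upper bound for the sketch */
#define WORDS_PER_PAGE ((ITEMS_PER_PAGE + 31) / 32)

typedef struct PageNode
{
    uint32_t          blockno;                /* search key: heap page number */
    uint32_t          bits[WORDS_PER_PAGE];   /* one bit per item number on the page */
    struct PageNode  *left, *right;
} PageNode;

/* Find or create the node for a page, then set the bit for one item;
 * nodes exist only for pages with at least one interesting row. */
static PageNode *bitmap_insert(PageNode *root, uint32_t blockno, unsigned item)
{
    if (root == NULL)
    {
        root = calloc(1, sizeof(PageNode));
        root->blockno = blockno;
    }
    if (blockno < root->blockno)
        root->left = bitmap_insert(root->left, blockno, item);
    else if (blockno > root->blockno)
        root->right = bitmap_insert(root->right, blockno, item);
    else
        root->bits[item / 32] |= UINT32_C(1) << (item % 32);
    return root;
}

/* In-order traversal visits the marked pages in block-number order,
 * which is exactly the order a heap scan wants. */
static void bitmap_walk(const PageNode *node,
                        void (*visit)(uint32_t blockno, unsigned item))
{
    if (node == NULL)
        return;
    bitmap_walk(node->left, visit);
    for (unsigned i = 0; i < ITEMS_PER_PAGE; i++)
        if (node->bits[i / 32] & (UINT32_C(1) << (i % 32)))
            visit(node->blockno, i);
    bitmap_walk(node->right, visit);
}

static void print_visit(uint32_t blockno, unsigned item)
{
    printf("page %u, item %u\n", (unsigned) blockno, item);
}

int main(void)
{
    PageNode *root = NULL;

    root = bitmap_insert(root, 7, 3);
    root = bitmap_insert(root, 2, 10);
    root = bitmap_insert(root, 7, 5);
    bitmap_walk(root, print_visit);           /* page 2 first, then page 7 */
    return 0;
}

ANDing two such trees would mean walking both in page order and intersecting matching nodes' bit arrays word by word; the hash-table variant trades that ordered walk for cheaper insertion plus a final sort.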
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Also, what does an in-memory bitmapped index look like?
>
> One idea that might work: a binary search tree in which each node
> represents a single page of the table, and contains a bit array with
> one bit for each possible item number on the page. You could never need
> more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits
> in a node, or about 36 bytes at default BLCKSZ --- for most tables you
> could probably prove it would be a great deal less. You only allocate
> nodes for pages that have at least one interesting row.

Actually, I think I made a mistake. I was wondering what on-disk bitmapped indexes look like.

-- 
  Bruce Momjian  |  http://candle.pha.pa.us  |  pgman@candle.pha.pa.us
Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> In any case, this discussion is predicated on the assumption that the
>> operations involving the bitmap are a significant fraction of the total
>> time, which I think is quite uncertain. Until we build it and profile
>> it, we won't know that.
>
> The other thought I had was that it would be difficult to tell when to follow
> this path. Since the main case where it wins is when the individual indexes
> aren't very selective but the combination is very selective, and we don't have
> inter-column correlation statistics ...

We actually have heap access cost and index access cost. You could compare the cost of looking at all of index A's heap rows vs. the cost of also scanning index B and then hitting hopefully fewer heap rows.

-- 
  Bruce Momjian  |  http://candle.pha.pa.us  |  pgman@candle.pha.pa.us
On Fri, Jan 30, 2004 at 09:48:19AM -0500, Tom Lane wrote:
> A variant is to make the per-page bit arrays be entries in a hash table
> with page number as hash key. This would reduce insertion to a nearly
> constant-time operation, but the drawback is that you'd need an explicit
> sort at the end to put the per-page entries into page number order
> before you scan 'em. You might come out ahead anyway, not sure.

Is there a reason to sort the pages before scanning them? The result won't come out sorted one way or the other.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Para tener más hay que desear menos" ("To have more, desire less")
Alvaro Herrera wrote:
> On Fri, Jan 30, 2004 at 09:48:19AM -0500, Tom Lane wrote:
>> A variant is to make the per-page bit arrays be entries in a hash table
>> with page number as hash key. This would reduce insertion to a nearly
>> constant-time operation, but the drawback is that you'd need an explicit
>> sort at the end to put the per-page entries into page number order
>> before you scan 'em. You might come out ahead anyway, not sure.
>
> Is there a reason to sort the pages before scanning them? The result won't
> come out sorted one way or the other.

I think the goal would be to hit the heap in sequential order as much as possible. When we are reading right from the index, we haven't collected all the heap values in one place; but since here we have them in memory, we might as well sort them. I don't think that is a requirement, just a performance enhancement - or at least that is my guess.

-- 
  Bruce Momjian  |  http://candle.pha.pa.us  |  pgman@candle.pha.pa.us
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Alvaro Herrera wrote:
>> Is there a reason to sort the pages before scanning them? The result won't
>> come out sorted one way or the other.

> I think the goal would be to hit the heap in sequential order as much as
> possible.

Exactly. Also, it'll be harder to AND or OR two bitmaps together if you don't presort their pages. (I think the hash representation could do OR without presort, but not AND.)

        regards, tom lane
On Jan 30, 2004, at 6:31 AM, Bruce Momjian wrote:
> I like the idea of building in-memory bitmapped indexes.

Me too (FWIW)! This thread is really interesting, as the whole idea would help to solve the biggest issue with my (currently stalled) project to integrate Xapian as a full-text search engine. Getting index scans used in all situations would drastically improve performance.

eric
Tom Lane wrote on Fri, 30.01.2004 at 16:48:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Also, what does an in-memory bitmapped index look like?
>
> One idea that might work: a binary search tree in which each node
> represents a single page of the table, and contains a bit array with
> one bit for each possible item number on the page. You could never need
> more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits
> in a node, or about 36 bytes at default BLCKSZ --- for most tables you
> could probably prove it would be a great deal less. You only allocate
> nodes for pages that have at least one interesting row.

Another idea would be using bitmaps where we have just one bit per database page, and doing a seq scan but only over the marked pages.

Even when allocated in full, such indexes would occupy just 1/(8kB * 8 bits) of the amount of data they describe, so the index for a 1GB table would be 1GB/(8kB * 8) = 16 kilobytes (2 pages).

Also, such indexes, if persistent, could be used (together with the FSM) when deciding the placement of new tuples, so they provide a form of clustering.

This would of course be most useful for data-warehouse type operations, where the database is significantly bigger than memory.

And the seq scan over the bitmap should not be done in simple page order, but rather in two passes:
 1. over those pages which are already in cache (either PostgreSQL's or the system's, if we find a way to get such info from the system);
 2. in sequential order over the rest.

> I think this would represent a reasonable compromise between size and
> insertion speed. It would only get large if the indexscan output
> demanded visiting many different pages --- but at some point you could
> abandon index usage and do a sequential scan, so I think that property
> is okay.

One case where an almost-full intermediate bitmap could be needed is when doing a star join or just an AND of several conditions, where each single index spans a significant part of the table but the result does not.

> A variant is to make the per-page bit arrays be entries in a hash table
> with page number as hash key. This would reduce insertion to a nearly
> constant-time operation, but the drawback is that you'd need an explicit
> sort at the end to put the per-page entries into page number order
> before you scan 'em. You might come out ahead anyway, not sure.
>
> Or we could try a true linear bitmap (indexed by page number times
> max-items-per-page plus item number) that's compressed in some fashion,
> probably just by eliminating large runs of zeroes. The difficulty here
> is that inserting a new one-bit could be pretty expensive, and we need
> it to be cheap.
>
> Perhaps someone can come up with other better ideas ...

I have also contemplated a scenario where we could use some not-quite-max power-of-2 bits-per-page linear bitmap, and mark intra-page wraps (when we try to mark a point past that not-quite-max number in a page) in a high bit (or another bitmap), making the info for that page "folded". An example would be setting bit 40 in a 32-bits/page index - this would set bit 40&31 and mark the page folded.

When combining such indexes using AND or OR, we need some special handling of folded pages, but we could still get non-folded (0) results out of an AND of two folded pages if the bits are distributed nicely.

--------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
> Another idea would be using bitmaps where we have just one bit per
> database page, and doing a seq scan but only over the marked pages.

That seems a bit too lossy for me, but I really like your later idea about folding. Generalizing that a little, we can choose any fold point we like. We could allocate, say, one 32-bit word per page and set the (i mod 32) bit when item i is fingered by the index. After retrieving the heap page, we'd need to test all the valid rows that have item numbers matching a set bit mod 32. On typical tables (with circa 100 items per page) this would require testing only about 3 rows per page. ORing and ANDing of such bitmaps still works, with the understanding that it's lossy and you have to double check each retrieved tuple.

If the fold point is above about 100, your idea of keeping track of whether we actually set any wrapped-around bits would become useful, but below that I think we'd just be wasting a bit.

        regards, tom lane
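A sketch of the lossy folding just described: one 32-bit word per heap page, where item number i sets bit (i mod 32). ANDing and ORing remain simple word operations, at the price of rechecking any row whose item number collides mod 32 when the page is read. The names and fixed sizes are invented for the example, not existing PostgreSQL structures:

#include <stdint.h>
#include <stdio.h>

#define NPAGES 1024                           /* assumed table size, for illustration */

typedef struct { uint32_t pages[NPAGES]; } LossyBitmap;

static void lossy_set(LossyBitmap *bm, unsigned page, unsigned item)
{
    bm->pages[page] |= UINT32_C(1) << (item % 32);    /* fold item number mod 32 */
}

static void lossy_and(LossyBitmap *dst, const LossyBitmap *src)
{
    for (unsigned p = 0; p < NPAGES; p++)
        dst->pages[p] &= src->pages[p];
}

/* When the heap page is finally read, only rows whose item number maps to a
 * set bit need the full qual recheck; with ~100 items per page and 32 bits,
 * that is roughly 3 candidate rows per set bit. */
static int lossy_candidate(const LossyBitmap *bm, unsigned page, unsigned item)
{
    return (bm->pages[page] >> (item % 32)) & 1;
}

int main(void)
{
    static LossyBitmap a, b;                  /* zero-initialized */

    lossy_set(&a, 5, 7);
    lossy_set(&a, 5, 39);                     /* 39 mod 32 == 7: collides with item 7 */
    lossy_set(&b, 5, 7);

    lossy_and(&a, &b);

    printf("item 7 candidate:  %d\n", lossy_candidate(&a, 5, 7));   /* 1 */
    printf("item 39 candidate: %d\n", lossy_candidate(&a, 5, 39));  /* 1 - lossy false positive */
    printf("item 8 candidate:  %d\n", lossy_candidate(&a, 5, 8));   /* 0 */
    return 0;
}

The shared-bit behaviour shown for items 7 and 39 is exactly what the following messages turn on: a set bit cannot safely be cleared at vacuum time, which matters only if such a structure were made persistent.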
Tom Lane wrote on Sat, 31.01.2004 at 01:02:
> Hannu Krosing <hannu@tm.ee> writes:
>> Another idea would be using bitmaps where we have just one bit per
>> database page, and doing a seq scan but only over the marked pages.
>
> That seems a bit too lossy for me,

I originally thought of it in the context of data warehousing and persistent bitmap indexes. There, the use of these same bitmaps for clustering would un-lossify this approach.

> but I really like your later idea
> about folding. Generalizing that a little, we can choose any fold point
> we like. We could allocate, say, one 32-bit word per page and set the
> (i mod 32) bit when item i is fingered by the index. After retrieving
> the heap page, we'd need to test all the valid rows that have item
> numbers matching a set bit mod 32. On typical tables (with circa 100
> items per page) this would require testing only about 3 rows per page.
> ORing and ANDing of such bitmaps still works, with the understanding
> that it's lossy and you have to double check each retrieved tuple.
>
> If the fold point is above about 100, your idea of keeping track of
> whether we actually set any wrapped-around bits would become useful,
> but below that I think we'd just be wasting a bit.

Not only wasting bits, but also making the code hairier - we can't just do simple ANDs and ORs.

--------------
Hannu
Tom Lane <tgl@sss.pgh.pa.us> writes:
> That seems a bit too lossy for me, but I really like your later idea
> about folding. Generalizing that a little, we can choose any fold point
> we like. We could allocate, say, one 32-bit word per page and set the
> (i mod 32) bit when item i is fingered by the index. After retrieving
> the heap page, we'd need to test all the valid rows that have item
> numbers matching a set bit mod 32. On typical tables (with circa 100
> items per page) this would require testing only about 3 rows per page.
> ORing and ANDing of such bitmaps still works, with the understanding
> that it's lossy and you have to double check each retrieved tuple.

That would make it really hard to ever clear the bits. What do you do when you vacuum and one of the tuples is no longer needed? You can't be sure you can clear the bit in the index, because there could be multiple tuples represented by that one set bit. You would have to test the condition on the other tuples covered by the bit to see if it can be cleared.

-- greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> ORing and ANDing of such bitmaps still works, with the understanding
>> that it's lossy and you have to double check each retrieved tuple.

> That would make it really hard to ever clear the bits.

We're speaking of in-memory bitmaps constructed on-the-fly here. You're right that it wouldn't work for persistent indexes, but I'm not very interested in that case at the moment ...

        regards, tom lane