Thread: A Silly Idea for Vertically-Oriented Databases
Be forewarned - this is probably a very long post, and I'm just a mere mortal (i.e. an admin) who doesn't write copious amounts of C code. Take the following posts and suggestions with a grain of salt.

So I've been seeing/hearing all of the hoopla over vertical databases (column stores), and how they'll not only slice bread but also make toast, etc. I've done some quick searches for past articles on "C-Store", "Vertica", "Column Store", and "Vertical Database", and have seen little discussion on this. And then a funny thought occurs to me - when I look at the directory structure and file layout of a PostgreSQL database, I see that each OID corresponds to a table, which corresponds to (generally) a single file. Then I have a second funny thought - what if there were a low-friction, low-cost-of-implementation way to bring similar advantages to PostgreSQL without major alterations, recoding, etc.? Finally it occurs to me that PostgreSQL already does something similar, but it could do it so much better, with only one language change and minor changes to the storage layout.

So here's my plum-crazy proposal (and I've made some before - see http://archives.postgresql.org/pgsql-odbc/2006-10/msg00040.php - and they not only made it into production, but they are in active use by me on a weekly basis - Thanks Hiroshi!!!), bear with me...

Make one small, very tiny syntactic change to "CREATE TABLE" that includes a new keyword, "COLUMN-STORE" or something similar. I don't care where it appears as long as it's after the "CREATE TABLE". You would not have to break any existing SQL conventions, PostgreSQL would continue to be SQL compliant, and given the odd wording, I highly doubt that the folks who work on SQL keywords will end up using it at any point in time. If adding COLUMN-STORE is objectionable because it will "cloud the compliance of the language", then simply move the functionality into the tablespace functionality. In hindsight, it might even belong there instead.
So, instead of specifying it by table, create a tablespace that has an attribute "Column Storage" set as active. When inactive, it uses the traditional "one-file-per-table" layout.

Make each table column capable of receiving an OID. This will be critical for the following steps...

If a table is created with "COLUMN-STORE" as an option, then it will continue to behave in the same way it always has, but the storage will be different. Each column in the table will be represented by a single file, with the file name being (naturally) the OID. INSERT/UPDATE/DELETE would function as it always has, etc. Nothing would change - except how the data is stored. The existing TOAST mechanisms continue to work, because the engine would treat each file as a single-column table!

One additional "column" would be added to the store, an invisible one that not only tracks the OID for the "rows" in this type of setup, but also the state of the row. Let's call this the "Control Column". Given that the metadata size for the row would be fixed/constant, we won't have to worry about what is in the other "columns" and "rows"; they can be any size. BTW, the "Control Column" would be just another column from the storage engine's point of view. It just happens to be one that no-one can see, other than the database (and maybe the administrator).

When you go to VACUUM a table, you would treat each column as a single-column table, so if a row is a candidate for VACUUM reclamation, then it will adjust each "column" an equal amount. Under no circumstances would you have columns "out of sync", so if a record goes, it means each adjacent column goes with it. This sounds disk-intensive at first, until you realize that the admin will have made a conscious decision to use this format, and understands the advantages/drawbacks of this method. So it takes a little longer to VACUUM - I don't care, because as an admin I will have specified this layout for a reason: to do OLAP, not OLTP.
Which means I rarely VACUUM it. Add to this the high efficiency you would gain by packing more records into buffers per read, and most of the losses you take in re-reading data would really not amount to as big a loss as you might think.

DELETE would simply mark a row off as deleted in the "Control Column". If the storage engine needed to reclaim a row, it would not have to look any further than the "Control Column" to find an empty spot where it could overwrite data. INSERT/UPDATE continue to work as they always have. The storage engine would perceive each "column" as a single-column table, meaning that the existing TOAST mechanisms continue to work! Nothing needs to change there. The real change would be that the table's columns would be "split up" into individual updates, and the "Control Column" would be used to keep all of the records in sync.

Why bother with this? Because, when all is said and done, you will find yourself with a rough equivalent of a column-store database, with all of the OLAP goodness that people are looking for. You have little if any impact on the admin's/users' perception, other than a flag that was checked somewhere and forgotten about in the database. From the storage engine's perspective, you have many, many small 1-column tables to take care of, and they all update at the same "place" at the same "time" to keep the records in sync when you recompose a row. TOAST and large object storage work the same as before - nothing changes, and that's as it should be.

All with what would be (hopefully) a minor change to the storage backend. We're not talking about brain surgery on existing, tested code, but rather a new feature that uses existing features in place. As TOAST improves, so does this feature. As caching improves, so does the feature again. And so on.
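The control-column bookkeeping described above can be sketched in miniature. This is a toy model only, not PostgreSQL code - the class name, the LIVE/DEAD states, and the use of plain lists to stand in for the per-column files are all invented for illustration:

```python
# Toy sketch of the proposed layout: one "file" (here, a list) per column,
# plus an invisible control column that tracks row state and keeps all the
# column files in sync. All names are hypothetical.

LIVE, DEAD = "live", "dead"

class ColumnStoreTable:
    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}  # one "file" per column
        self.control = []  # control column: fixed-size per-row state

    def insert(self, row):
        # Reuse a reclaimed slot if the control column shows one, else append.
        # Either way, every column file is adjusted by the same amount.
        for i, state in enumerate(self.control):
            if state == DEAD:
                for name in self.columns:
                    self.columns[name][i] = row[name]
                self.control[i] = LIVE
                return i
        for name in self.columns:
            self.columns[name].append(row[name])
        self.control.append(LIVE)
        return len(self.control) - 1

    def delete(self, rowid):
        # DELETE only marks the control column; the column files are untouched.
        self.control[rowid] = DEAD

    def scan(self, wanted_columns):
        # A projection touches only the requested column "files".
        for i, state in enumerate(self.control):
            if state == LIVE:
                yield tuple(self.columns[c][i] for c in wanted_columns)
```

Deleting a row only flips its control entry; a later INSERT can overwrite that slot, and a scan that projects one column never reads the other columns' "files" - which is the whole point of the layout.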
I've been in a bit of a hurry to blurt all of this out, and I'm sure that I've forgotten something along the way, so if you find something missing, please be patient - I had to write all of this in about 20 minutes or less, and I didn't have a lot of time.
Avery,

> Make one small, very tiny syntactic change to "CREATE TABLE" that
> includes a new keyword, "COLUMN-STORE" or something similar.

If someone writes the rest of the code, I doubt the syntax will be the holdup. But writing an efficient C-store table mechanism is much harder than I think you think it is; Vertica worked on it for a year and failed, and Paraccel took two years to succeed. FYI, Paraccel is based on Postgres.

So, put up a pgfoundry project and start hacking a c-store table; I'm sure you'll get interest if you can make something work.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Avery,

<my ramblings snipped>

> If someone writes the rest of the code, I doubt the syntax will be the
> holdup. But writing an efficient C-store table mechanism is much harder
> than I think you think it is; Vertica worked on it for a year and failed,
> and Paraccel took two years to succeed. FYI, Paraccel is based on
> Postgres.
>
> So, put up a pgfoundry project and start hacking a c-store table; I'm sure
> you'll get interest if you can make something work.
>
> --
> --Josh

Well, I did say it was a *crazy* idea. :-)

Given that I would be starting from the ground floor, learning not only the innards of PostgreSQL but also C coding as well, I would probably have to overcome near-insurmountable odds to make this project take off. Still, if I was crazy enough to think it, maybe I'll be crazy enough to try for it. ;-)

Just ignore my 2nd posting, I was trying to clarify some of the ramblings I was typing.
On 9/7/07, Avery Payne <apayne@pcfruit.com> wrote:
>
> Avery,
>
> <my ramblings snipped>
>
> > If someone writes the rest of the code, I doubt the syntax will be the
> > holdup. But writing an efficient C-store table mechanism is much harder
> > than I think you think it is; Vertica worked on it for a year and failed,
> > and Paraccel took two years to succeed. FYI, Paraccel is based on
> > Postgres.
> >
> > So, put up a pgfoundry project and start hacking a c-store table; I'm sure
> > you'll get interest if you can make something work.
> >
> > --
> > --Josh
>
> Well, I did say it was a *crazy* idea. :-)
>
> Given that I would be starting from the ground floor, learning not only
> the innards of PostgreSQL but also C coding as well, I would probably
> have to overcome near-insurmountable odds to make this project take off.
> Still, if I was crazy enough to think it, maybe I'll be crazy enough to
> try for it. ;-)
>
> Just ignore my 2nd posting, I was trying to clarify some of the
> ramblings I was typing.

For whatever it's worth, I was reading about the same things today and came up with the same basic idea, without the same level of implementation details you've talked about, Avery. And it sounds really neat. Hard, but neat, and potentially worth it if, say, Paraccel doesn't open source their stuff first :)

But I don't know the PostgreSQL internals either, though I've been known to throw together some C code now and again. So in short, there's interest. Whether there's collective skill/free time/motivation/insanity/etc. enough to make something useful of the interest is another question altogether. :)

- Josh/eggyknap
On Fri, 2007-09-07 at 13:52 -0700, Avery Payne wrote:
> So I've been seeing/hearing all of the hoopla over vertical databases
> (column stores), and how they'll not only slice bread but also make
> toast, etc. I've done some quick searches for past articles on
> "C-Store", "Vertica", "Column Store", and "Vertical Database", and have
> seen little discussion on this.

I looked at doing this a while back, for similar reasons.

ISTM we would be able to do this fairly well if we implemented index-only columns, i.e. columns that don't exist in the heap, only in an index. Taken to the extreme, all columns could be removed from the heap and placed in an index (or indexes). Only the visibility information would remain on the heap.

Syntax for this would be an ALTER TABLE SET STORAGE command, with a new type of storage definition that would only be accepted if an index has already been defined on the table which includes the specified column.

Doing this per column would be a big win over vertical databases, since AFAIK they *have* to do this to *every* column, even if it is not beneficial to do so. Every existing index plan works immediately.

The main annoyance is retrieving a column value that doesn't exist on the heap. That would require a new kind of plan that involves preparing the index(es) by sorting them on htid and then doing a left merge join with the main heap. By now, most people will be screaming at their monitors "what an idiot, that's gonna REALLY suck". True, but this is the same thing that column-oriented databases have to do also, so it would be taking the best that vertical databases have to offer and accepting the consequences.

There are some other plan possibilities also, apart from this basic value retrieval, but they would require further index API changes; I'm not certain of those being primary use cases, however.
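The retrieval plan described above - prepare the index by sorting its entries on heap TID, then left-merge-join against the heap - can be sketched roughly as follows. This is a toy Python model, not executor code; the function name and the (value, tid) data shape are invented for illustration:

```python
# Sketch of retrieving an index-only column: the index holds (value, tid)
# pairs, is sorted by heap TID, and is left-merge-joined against the TIDs
# produced by the heap scan. Names and shapes are illustrative only.

def fetch_index_only_column(heap_tids, index_entries):
    """heap_tids: TIDs from the heap scan, in ascending TID order.
    index_entries: (value, tid) pairs from the index, in any order."""
    by_tid = sorted(index_entries, key=lambda e: e[1])  # "prepare" the index
    out, i = [], 0
    for tid in heap_tids:  # left merge join on TID
        while i < len(by_tid) and by_tid[i][1] < tid:
            i += 1
        if i < len(by_tid) and by_tid[i][1] == tid:
            out.append((tid, by_tid[i][0]))
        else:
            out.append((tid, None))  # no index entry for this row
    return out
```

Both inputs are consumed in TID order, so the join is a single linear pass - which is exactly why the sort-on-htid preparation step is needed first.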
Vertical databases honestly do have their uses, and there are many kinds of marketing queries that have complex WHERE clauses yet only simple SELECT clauses. There are a number of industry-specific database products that have utilised this technique to good effect for a number of years.

So ISTM the main changes would be executor changes to allow retrieving column values of index-only columns when they are required, and to modify the insert/update tuple code somewhat.

I thought maybe we could call it COAST: Column-Oriented Attribute Storage Technique. :-)

--
Simon Riggs
2ndQuadrant
http://www.2ndQuadrant.com
Simon Riggs wrote:
> ISTM we would be able to do this fairly well if we implemented
> Index-only columns. i.e. columns that don't exist in the heap, only in
> an index.
>
> Taken to the extreme, all columns could be removed from the heap and
> placed in an index(es). Only the visibility information would remain on
> the heap.

Wouldn't the extreme be - even the visibility information is in the index? :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Mark Mielke wrote:
> Simon Riggs wrote:
> > ISTM we would be able to do this fairly well if we implemented
> > Index-only columns. i.e. columns that don't exist in the heap, only in
> > an index.
> > Taken to the extreme, all columns could be removed from the heap and
> > placed in an index(es). Only the visibility information would remain on
> > the heap.
>
> Wouldn't the extreme be - even the visibility information is in the index?

Maybe we could extract the visibility information and put it in a storage separate from the heap. A seqscan would be a bit slower (have to check the heap and the visibility storage), but some index scans could be faster (can check the index and visibility storage, skipping the heap), and updates would be faster too (append into the heap, update visibility storage). This would allow something closer to index-only scans. Of course, the key is to allow efficient lookup of visibility info given a TID.

Has this been explored before?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
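The index-scan case sketched above - check the index and the visibility storage, skipping the heap - might look like this in miniature. Toy Python only: the dict-based index and visibility store are stand-ins, and the hard part Alvaro raises (efficient TID-keyed visibility lookup) is simply assumed away here:

```python
# Sketch of an index scan against a separate visibility store: visibility
# flags live outside the heap, keyed by TID, so the scan never touches a
# heap page. All data shapes are hypothetical.

def index_scan(index, visibility, key):
    """index: {key: [tid, ...]}; visibility: {tid: bool}.
    Returns the TIDs matching `key` that are currently visible."""
    # Consult only the index and the visibility store; skip the heap.
    return [tid for tid in index.get(key, []) if visibility.get(tid, False)]
```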
> ISTM we would be able to do this fairly well if we implemented
> Index-only columns. i.e. columns that don't exist in the heap, only in
> an index.
> Taken to the extreme, all columns could be removed from the heap and
> placed in an index(es). Only the visibility information would remain on
> the heap.

So, let me understand this correctly - you want to index the columns and use the index to reconstruct the data? Some kind of "implicit" reconstruction?

> Doing this per column would be a big win over vertical databases
> since AFAIK they *have* to do this to *every* column, even if it is not
> beneficial to do so.

<snip>

I was thinking about something a little more crude - each column being a free-standing table, but "viewed" by the client as a single entity, a kind of "data federation". The idea was that the existing storage mechanism wouldn't be altered, and with a little sleight-of-hand, we could extend the mechanism without hampering things like clustering, future extensions, optimizations, etc. No changes to MVCC would be needed, because it would work above and through it.

The idea was that for a "wide" table you target only a few columns, so you could read sequentially without the penalty of having to seek to a new offset for each record. Instead of processing all the columns, you process a single column, which means less data to read for the same number of records. That gain starts to slope off when you specify more and more columns from the table.

Throw in selective indexing (similar to what you were talking about) and suddenly we can reclaim some of that lost speed.
We'll make a kind of "compressed index", where the key turns into a hash that points to (is attached to?) a bucket, and the bucket contains the offsets of all the records that relate to that key. Other tricks can be employed, such as run-length encoding entire ranges of offsets, etc., to keep this really, really small. Really small = faster and faster to read, using more CPU than I/O. And I/O is still more of an issue than CPU at this point.

Then again, if you ditch the column altogether and use a "compressed index" to reconstruct the data implicitly, now we're close to what you were talking about (assuming I understand you correctly, and also assuming that PostgreSQL doesn't already do this with indexes). So, if the column is indexed, then maybe split it off into a "compressed index", and if not, keep it in the main table outright? I guess I really need to think about this a bit more before I start delving into code.

<Vertical DB market discussion - snipped>

> I thought maybe we can call it COAST, Column-oriented attribute storage technique, :-)

I like it. :-) I just wish I would have read this before applying for a project name at pgfoundry; the current proposal is given as "pg-cstore".
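The bucket-plus-run-length-encoding idea above can be sketched as follows. A toy model: a dict stands in for the hash structure, and all names are made up for illustration:

```python
# Sketch of the "compressed index": each distinct key maps to a bucket of
# row offsets, and each bucket is run-length encoded so that dense ranges
# of offsets collapse to a single (start, length) pair.

def rle_encode(offsets):
    """Collapse sorted row offsets into (start, length) runs."""
    runs = []
    for off in offsets:
        if runs and off == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extend current run
        else:
            runs.append((off, 1))  # start a new run
    return runs

def rle_decode(runs):
    """Expand (start, length) runs back into explicit offsets."""
    return [start + i for start, length in runs for i in range(length)]

def build_compressed_index(column):
    """Map each distinct value in a column to its RLE-compressed offsets."""
    buckets = {}
    for off, value in enumerate(column):
        buckets.setdefault(value, []).append(off)
    return {value: rle_encode(offs) for value, offs in buckets.items()}
```

On low-cardinality columns with long runs of repeated values - common in the OLAP data this thread is about - the runs stay tiny, trading a little CPU on decode for much less I/O, as the text argues.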
Avery Payne wrote:
> > I thought maybe we can call it COAST, Column-oriented attribute storage technique, :-)
>
> I like it. :-) I just wish I would have read this before applying for a project name
> at pgfoundry; the current proposal is given as "pg-cstore".

You can email the pgfoundry admins at gforge-admins@pgfoundry.org and ask for it to be cancelled instead of approved, and resubmit with the other name.

--
Alvaro Herrera http://www.PlanetPostgreSQL.org/
"In fact, the basic problem with Perl 5's subroutines is that they're not crufty enough, so the cruft leaks out into user-defined code instead, by the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6)
On Mon, 2007-09-10 at 11:01, Alvaro Herrera wrote:
> Mark Mielke wrote:
> > Simon Riggs wrote:
> > > ISTM we would be able to do this fairly well if we implemented
> > > Index-only columns. i.e. columns that don't exist in the heap, only in
> > > an index.
> > > Taken to the extreme, all columns could be removed from the heap and
> > > placed in an index(es). Only the visibility information would remain on
> > > the heap.
> >
> > Wouldn't the extreme be - even the visibility information is in the index?
>
> Maybe we could extract the visibility information and put it in a
> storage separate from the heap. A seqscan would be a bit slower (have
> to check the heap and the visibility storage) but some index scans could
> be faster (can check the index and visibility storage, skipping the
> heap), and updates would be faster too (append into the heap, update
> visibility storage). This would allow something closer to index-only
> scans. Of course, the key is to allow efficient lookup of visibility
> info given a TID.
>
> Has this been explored before?

I wrote to this list about a vague plan for doing it some months ago.

And I don't think that even seqscans need to be slower, as at least for a typical data warehouse application the visibility info can be compressed a lot by even a simple RLE compression, and thus you just do the seqscan like this:

    while 1:
        get next visible range
        read in the visible range, without checking each tuple

I suspect that this can be made much more cache-friendly than the current scheme.

(A guess: I even go as far as to claim that a separate visibility heap _may_ be more cache-friendly even uncompressed, so if we have every second tuple deleted, it will still be faster (at least for bigger tuples) to look up visibility in a dense visibility array and then access just the visible tuples.
Here even a separate heap may not be needed; just rearranging the heap structure to have the visibility info at the beginning of the page, together with the tuple pointers, may be a win.)

Another bonus of a separate visibility heap is that it can be maintained separately - that is, it can be shuffled around, CIDs cleaned up, compressed, etc. much more effectively than the current one - which, among other things, will make VACUUM much more efficient, as neither all-visible nor all-dead pages need to be visited at all. It can probably be made to serve as both a DSM and a VACUUM map also. And it can be used for finding the "best" insert spot for CLUSTER-preserving inserts/updates.

On the subject of vertical or column-per-heap storage, I see the biggest obstacle to be the use of TIDs as tuple identifiers, as the "real" TIDs (pageno: 32 bits, tupleno: 16 bits) will be different for each column. So if we have a table with a 1 kB text column, we will also have to store just 8 ints per page in the int column if we want to preserve a unique TID->tuple mapping.

What we could do, of course, is start defining the TID as just a 48-bit tuple number and then have some rules for finding the PAGE:NR "tid" from that. It would be easy for fixed-size columns:

    (PAGE, NR) = (TID div items_per_page, TID mod items_per_page)

but more complicated for VARLEN fields. All the ideas I've had so far for solving this have quite a lot of downsides.

--------------
Hannu
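For the fixed-size case, Hannu's rule can be written out directly. A sketch under stated assumptions: the 8 kB page size matches stock PostgreSQL, but the function name and flat-TID scheme are just illustrations of the idea, not real code:

```python
# Sketch of the flat-TID scheme: the TID is a plain tuple number, and each
# column derives its own (page, slot) from its own fixed item size.

PAGE_SIZE = 8192  # bytes; the default PostgreSQL page size

def tid_to_page_slot(tid, item_size):
    """Fixed-size columns: (PAGE, NR) = (TID div items_per_page,
    TID mod items_per_page)."""
    items_per_page = PAGE_SIZE // item_size
    return tid // items_per_page, tid % items_per_page
```

This also illustrates the obstacle in the text: a 1 kB item gives 8 items per page, so the same flat TID lands on very different (page, slot) pairs for a wide column versus a 4-byte int column, and VARLEN fields break the arithmetic entirely.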