Thread: Column stores
Column stores are all the rage now in the data warehousing world as an alternative to traditional approaches and to MPP approaches that include Greenplum. There's one company, InfoBright, that's offering a column-store engine, not open source, for MySQL. Would a column store work within PostgreSQL from a technical point-of-view and is anyone pursuing this? (As an aside, I explored something along these lines myself in the mid-'90s using InterBase (now Firebird) and Illustra, building on top of blobs to store data arrays.) Seth -- Seth Grimes Alta Plana Corp, analytical computing & data management Intelligent Enterprise magazine (CMP), Contributing Editor grimes@altaplana.com http://altaplana.com 301-270-0795
On Mon, 2008-01-28 at 12:04 -0800, Seth Grimes wrote: > Column stores are all the rage now in the data warehousing world as an > alternative to traditional approaches and to MPP approaches that include > Greenplum. There's one company, InfoBright, that's offering a > column-store engine, not open source, for MySQL. > > Would a column store work within PostgreSQL from a technical point-of-view > and is anyone pursuing this? > > (As an aside, I explored something along these lines myself in the > mid-'90s using InterBase (now Firebird) and Illustra, building on top of > blobs to store data arrays.) I've looked into doing this to see how hard it would be. The main thing to consider is what it can be used for. Column stores don't have the same use case as row stores, as Stonebraker himself points out. That's a slightly different thought than "it just goes faster", which is the 0.1% summary of his research touted by the marketing department. The column approach is basically the same thing as having all indexes, but not actually storing the row in the heap. So it's smaller and more efficient for many types of query, but not all. It's a fairly drastic move to say we know enough about the types of queries people will run that we can just not store the entire row. Some databases might know that but generally Data Warehouses aim for business flexibility, not just performance at any price. Any implementation for Postgres would gain benefit from blending row and column approaches within the same database, as an option rather than as a must-have. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Hi Seth, On Mon, 2008-01-28 at 12:04 -0800, Seth Grimes wrote: > Column stores are all the rage now in the data warehousing world as an > alternative to traditional approaches and to MPP approaches that include > Greenplum. There's one company, InfoBright, that's offering a > column-store engine, not open source, for MySQL. That's funny - all of the column approaches that are "the rage" in DW are MPP in addition to column store. If people wanted a non-MPP column store they buy Sybase IQ. So - not only do the raging column companies have to do a new version of the old column idea (that avoids the big drawbacks of near read-only behaviors, specialized "projection" operations, etc), but they also have to get the MPP part right. Speaking from our experience - of the two, it's a lot harder to get all the MPP parts right than it is to add a column store module. Not that there aren't some good ideas coming from these discussions - our point of view at Greenplum is that column store / compression isn't a company or product, it's a feature. Further, it's one we've also got covered :-) - Luke
No, not all column stores are MPP. I believe that Brighthouse from Infobright isn't although the company is looking into it. Compression and a columnar storage are separate things that can go hand-in-hand. Does Greenplum have both or just compression? Thanks, Seth On Wed, 30 Jan 2008, Luke Lonergan wrote: > Hi Seth, > > On Mon, 2008-01-28 at 12:04 -0800, Seth Grimes wrote: >> Column stores are all the rage now in the data warehousing world as an >> alternative to traditional approaches and to MPP approaches that include >> Greenplum. There's one company, InfoBright, that's offering a >> column-store engine, not open source, for MySQL. > > That's funny - all of the column approaches that are "the rage" in DW > are MPP in addition to column store. If people wanted a non-MPP column > store they buy Sybase IQ. > > So - not only do the raging column companies have to do a new version of > the old column idea (that avoids the big drawbacks of near read-only > behaviors, specialized "projection" operations, etc), but they also have > to get the MPP part right. Speaking from our experience - of the two, > it's a lot harder to get all the MPP parts right than it is to add a > column store module. > > Not that there aren't some good ideas coming from these discussions - > our point of view at Greenplum is that column store / compression isn't > a company or product, it's a feature. Further, it's one we've also got > covered :-) > > - Luke > -- Seth Grimes Alta Plana Corp, analytical computing & data management Intelligent Enterprise magazine (CMP), Contributing Editor grimes@altaplana.com http://altaplana.com 301-270-0795
Hi Seth, On 2/1/08 11:41 AM, "Seth Grimes" <grimes@altaplana.com> wrote: > No, not all column stores are MPP. I believe that Brighthouse from > Infobright isn't although the company is looking into it. They're not one that I'd say is "all the rage" - two reasons: MySQL doesn't work for DW queries and more importantly doing a different storage engine doesn't solve the DW performance and scale problems. WRT "all the rage" - you'd have to point to more than press releases about technology to make that case. We've beaten the "all the rage" companies in every POC so far and I haven't seen or heard of significant reference installations. > Compression and a columnar storage are separate things that can go > hand-in-hand. Does Greenplum have both or just compression? See above :-) - Luke
Luke Lonergan wrote: >> ...column stores... > > WRT "all the rage" - you'd have to point to more than press releases about > technology to make that case. We've beaten the "all the rage" companies in > every POC so far and I haven't seen or heard of significant reference > installations. I thought KX System's KDB is a column-oriented database which had impressive reference customers (Fidelity, BofA, JPMorgan, etc) with their column store database since the early 90s. But yeah - I'm curious why all the hype around new column oriented databases when it seems people already determined what niche applications they're great for quite some time ago.
Ron, On 2/1/08 11:18 PM, "Ron Mayer" <rm_pg@cheapcomplexdevices.com> wrote: > I thought KX System's KDB is a column-oriented database which > had impressive reference customers (Fidelity, BofA, JPMorgan, etc) > with their column store database since the early 90s. > > But yeah - I'm curious why all the hype around new column oriented > databases when it seems people already determined what niche > applications they're great for quite some time ago. Exactly. SybaseIQ, KDB, etc have proven to have their excellent uses. Resurrecting the column idea and proclaiming it the solution to all DW problems is pretty off-base IMO. The limitations are real, that's why it's not the general answer even to the old DW problem, much less the new one which involves much more real time load/query and multi-user work. The complete answer uses parallelism and processing embedded within data, which is a much harder technology to get right. Grafting a column store module onto MySQL does not fix it's lack of complex query capabilities and make it scale for DW. - Luke
Isn't Kdb pretty much about queries on what I'll call "data vectors," whether streamed or stored, particularly time series? That is, you wouldn't have complex queries that cross a lot of fields. The hype around column stores is generated by the entry of industry luminaries into the field, attracting venture money that funds high-priced marketing talent and initiatives, substantiated by *likely* (even if not yet proven) applicability of the technology to a much broader array of business problems, given the data explosion, than was possible in the past. No, the approach is not yet established in the mainstream even if there are notable successes for certain types of application. Most observers of the data-warehousing world would say the same about open-source and open-source-based DBMSes. Clareos (Crosscut) was a more generalized column-store that failed but then so did Great Bridge (not that GB targeted data warehousing). Past failures don't mean that Vertica or Greenplum, for example, won't go on to win broader-market success. Seth On Fri, 1 Feb 2008, Ron Mayer wrote: > Luke Lonergan wrote: >>> ...column stores... >> >> WRT "all the rage" - you'd have to point to more than press releases >> about technology to make that case. We've beaten the "all the rage" >> companies in every POC so far and I haven't seen or heard of >> significant reference installations. > > I thought KX System's KDB is a column-oriented database which > had impressive reference customers (Fidelity, BofA, JPMorgan, etc) > with their column store database since the early 90s. > > But yeah - I'm curious why all the hype around new column oriented > databases when it seems people already determined what niche > applications they're great for quite some time ago. >
This response is going to veer off into commercial areas not directly related to PostgreSQL. I hope noone minds. Let me know privately if you do. On Fri, 1 Feb 2008, Luke Lonergan wrote: > On 2/1/08 11:41 AM, "Seth Grimes" <grimes@altaplana.com> wrote: > >> No, not all column stores are MPP. I believe that Brighthouse from >> Infobright isn't although the company is looking into it. > > They're not one that I'd say is "all the rage" - two reasons: MySQL > doesn't work for DW queries and more importantly doing a different > storage engine doesn't solve the DW performance and scale problems. No, but they're a counterexample to your incorrect statement that all column stores are also MPP. Doing a different storage engine targets current MySQL users who would prefer to expand the scope of what they (try to) do with MySQL before they switch DBMSes or bring in a second DBMS. That approach makes sense in many cases. > WRT "all the rage" - you'd have to point to more than press releases > about technology to make that case. We've beaten the "all the rage" > companies in every POC so far and I haven't seen or heard of significant > reference installations. No I don't. "Rages" are about image, not about substance. "Rages" die away when the substance doesn't match the image. I'd like to hear about additional, significant Greenplum reference installations. The only one I've seen, not for want of asking, is Frontier Airlines, and when I looked at it a year ago, the application was limited to fare optimiation or something like that with a small number of users. Perhaps you can send me information about others. I see PR from a year ago on Greenplum's site for a deal with Smart Communications in the Philippines. How has that worked out for instance? The only announcements I see posted to Greenplum's site are about industry deals and the like. Gartner says that Greenplum "has been quietly adding clients ... for about two years." So how about describing some? >> Compression and a columnar storage are separate things that can go >> hand-in-hand. Does Greenplum have both or just compression? > > See above :-) Since you belittle column stores, I take it the answer is No, just compression. By reputation you're a fine technologist, Luke, but would it hurt to just answer the question? Seth -- Seth Grimes Alta Plana Corp, analytical computing & data management Intelligent Enterprise magazine (CMP), Contributing Editor grimes@altaplana.com http://altaplana.com 301-270-0795
Hi Seth, On 2/2/08 4:52 AM, "Seth Grimes" <grimes@altaplana.com> wrote: >> They're not one that I'd say is "all the rage" - two reasons: MySQL >> doesn't work for DW queries and more importantly doing a different >> storage engine doesn't solve the DW performance and scale problems. > > No, but they're a counterexample to your incorrect statement that all > column stores are also MPP. Nah - I said that all "raging" column stores are MPP and I make that contention still :-) > Doing a different storage engine targets current MySQL users who would > prefer to expand the scope of what they (try to) do with MySQL before they > switch DBMSes or bring in a second DBMS. That approach makes sense in > many cases. Sure - "current MySQL users" that want a little more bang than they can get now is far away from the very active DW marketplace. > No I don't. "Rages" are about image, not about substance. "Rages" die > away when the substance doesn't match the image. Yep! > The only announcements I see posted to Greenplum's site are about industry > deals and the like. Gartner says that Greenplum "has been quietly adding > clients ... for about two years." So how about describing some? Are you a Gartner customer? If so, you can get the magic quadrant report and they will talk to you in detail about significant Greenplum installations. > I'd like to hear about additional, significant Greenplum reference > installations. The only one I've seen, not for want of asking, is > Frontier Airlines, and when I looked at it a year ago, the application was > limited to fare optimiation or something like that with a small number of > users. Perhaps you can send me information about others. I see PR from a > year ago on Greenplum's site for a deal with Smart Communications in the > Philippines. How has that worked out for instance? Our approach has been quiet progress - we're about to get a lot less quiet. Part of the challenge of this market is that the demands of large scale enterprise data warehouse customers are very high and it takes time to earn a reputation as a trusted provider. BTW - I love the Frontier example, the DAXPY guys doing revenue optimization using Pentaho on Greenplum is a beautiful example of how it should work. Frontier is able to do things they couldn't otherwise do because the technology revolutionized the price point at which multiple TB of data could be used for fast query. I'm happy to say that DAXPY has now sold the same solution to other airlines using Greenplum and Frontier remains a reference for us. - Luke
On Sat, 2 Feb 2008 04:52:36 -0800 (PST) Seth Grimes <grimes@altaplana.com> wrote: > Since you belittle column stores, I take it the answer is No, just > compression. By reputation you're a fine technologist, Luke, but > would it hurt to just answer the question? > > Seth Greenplum's collection of customers or the way they deploy those customers is none of our business. Further discussion of technical deployment albeit interesting are off topic for this list. Joshua D. Drake
Attachment
Hi Simon, Seth, On 1/30/08 10:44 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote: > Any implementation for Postgres would gain benefit from blending row and > column approaches within the same database, as an option rather than as > a must-have. Agreed and this is a known "must-have" for a modern column store to be competitive. IMO the main benefits claimed and attainable by column stores are derived from: 1) differentially better compression for some columns being separated from others 2) a tendency for the implementations to remove abstraction from the executor and do more than one row at a time, thereby improving processor efficiency. See the X100 project for an example Claimed benefits that aren't that generally applicable include the ability to operate directly on compressed data, thereby improving memory bandwidth and CPU usage. The major drawbacks include another point that you mentioned Simon - there is a lot of overhead to the approach of putting isolated columns out to disk and having to re-assemble them all the time. If you are doing a query with a few columns chosen from a hundred in a table, then this easily pays off, but how many people are running with flattened schemas today? If there are enough to make a market, then we'd see a lot more simple approaches to getting the work done - we'd not have needed to spend 4 years building what we have for instance. For this reason, some of the column implementations have created an approach that looks a whole lot like indexing - they're calling it "projection" and I think it is what it sounds like. They have to "pre-run" the queries through an analyzer that chooses which columns to project and they may even duplicate the data into the new columns - sounds familiar - it's an index. If the best approach to column for DW is to create indexes, that's another old idea and we've got plenty of that in Postgres. That said - PG needs the true bitmap index to compete IMO, along with index-only access. Lastly there's the update problem. When you vertically partition the schema, you have to coordinate the updates across all of the columns, each of which are compressed differently. It's not an unsolvable problem, it's just another overhead associated with the approach, but it's a big one. Taking all of the above into consideration, there's an approach I'd like to see in Postgres that can deliver everything the column people claim without the drawbacks. +1 to have a technical discussion about it. - Luke