Thread: Column stores

Column stores

From
Seth Grimes
Date:
Column stores are all the rage now in the data warehousing world as an
alternative to traditional approaches and to MPP approaches that include
Greenplum.  There's one company, InfoBright, that's offering a
column-store engine, not open source, for MySQL.

Would a column store work within PostgreSQL from a technical point-of-view
and is anyone pursuing this?

(As an aside, I explored something along these lines myself in the
mid-'90s using InterBase (now Firebird) and Illustra, building on top of
blobs to store data arrays.)

                     Seth



--
Seth Grimes   Alta Plana Corp, analytical computing & data management
               Intelligent Enterprise magazine (CMP), Contributing Editor
grimes@altaplana.com       http://altaplana.com    301-270-0795

Re: Column stores

From
Simon Riggs
Date:
On Mon, 2008-01-28 at 12:04 -0800, Seth Grimes wrote:
> Column stores are all the rage now in the data warehousing world as an
> alternative to traditional approaches and to MPP approaches that include
> Greenplum.  There's one company, InfoBright, that's offering a
> column-store engine, not open source, for MySQL.
>
> Would a column store work within PostgreSQL from a technical point-of-view
> and is anyone pursuing this?
>
> (As an aside, I explored something along these lines myself in the
> mid-'90s using InterBase (now Firebird) and Illustra, building on top of
> blobs to store data arrays.)

I've looked into doing this to see how hard it would be.

The main thing to consider is what it can be used for. Column stores
don't have the same use case as row stores, as Stonebraker himself
points out. That's a slightly different thought than "it just goes
faster", which is the 0.1% summary of his research touted by the
marketing department.

The column approach is basically the same thing as having all indexes,
but not actually storing the row in the heap. So it's smaller and more
efficient for many types of query, but not all. It's a fairly drastic
move to say we know enough about the types of queries people will run
that we can just not store the entire row. Some databases might know
that but generally Data Warehouses aim for business flexibility, not
just performance at any price.

Any implementation for Postgres would gain benefit from blending row and
column approaches within the same database, as an option rather than as
a must-have.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Column stores

From
Luke Lonergan
Date:
Hi Seth,

 On Mon, 2008-01-28 at 12:04 -0800, Seth Grimes wrote:
> Column stores are all the rage now in the data warehousing world as an
> alternative to traditional approaches and to MPP approaches that include
> Greenplum.  There's one company, InfoBright, that's offering a
> column-store engine, not open source, for MySQL.

That's funny - all of the column approaches that are "the rage" in DW are
MPP in addition to column store.  If people wanted a non-MPP column store
they buy Sybase IQ.

So - not only do the raging column companies have to do a new version of the
old column idea (that avoids the big drawbacks of near read-only behaviors,
specialized "projection" operations, etc), but they also have to get the MPP
part right.  Speaking from our experience - of the two, it's a lot harder to
get all the MPP parts right than it is to add a column store module.

Not that there aren't some good ideas coming from these discussions - our
point of view at Greenplum is that column store / compression isn't a
company or product, it's a feature.  Further, it's one we've also got
covered :-)

- Luke


Re: Column stores

From
Seth Grimes
Date:
No, not all column stores are MPP.  I believe that Brighthouse from
Infobright isn't although the company is looking into it.

Compression and a columnar storage are separate things that can go
hand-in-hand.  Does Greenplum have both or just compression?

Thanks,

                     Seth

On Wed, 30 Jan 2008, Luke Lonergan wrote:

> Hi Seth,
>
> On Mon, 2008-01-28 at 12:04 -0800, Seth Grimes wrote:
>> Column stores are all the rage now in the data warehousing world as an
>> alternative to traditional approaches and to MPP approaches that include
>> Greenplum.  There's one company, InfoBright, that's offering a
>> column-store engine, not open source, for MySQL.
>
> That's funny - all of the column approaches that are "the rage" in DW
> are MPP in addition to column store.  If people wanted a non-MPP column
> store they buy Sybase IQ.
>
> So - not only do the raging column companies have to do a new version of
> the old column idea (that avoids the big drawbacks of near read-only
> behaviors, specialized "projection" operations, etc), but they also have
> to get the MPP part right.  Speaking from our experience - of the two,
> it's a lot harder to get all the MPP parts right than it is to add a
> column store module.
>
> Not that there aren't some good ideas coming from these discussions -
> our point of view at Greenplum is that column store / compression isn't
> a company or product, it's a feature.  Further, it's one we've also got
> covered :-)
>
> - Luke
>

--
Seth Grimes   Alta Plana Corp, analytical computing & data management
               Intelligent Enterprise magazine (CMP), Contributing Editor
grimes@altaplana.com       http://altaplana.com    301-270-0795

Re: Column stores

From
Luke Lonergan
Date:
Hi Seth,

On 2/1/08 11:41 AM, "Seth Grimes" <grimes@altaplana.com> wrote:

> No, not all column stores are MPP.  I believe that Brighthouse from
> Infobright isn't although the company is looking into it.

They're not one that I'd say is "all the rage" - two reasons: MySQL doesn't
work for DW queries and more importantly doing a different storage engine
doesn't solve the DW performance and scale problems.

WRT "all the rage" - you'd have to point to more than press releases about
technology to make that case.  We've beaten the "all the rage" companies in
every POC so far and I haven't seen or heard of significant reference
installations.

> Compression and a columnar storage are separate things that can go
> hand-in-hand.  Does Greenplum have both or just compression?

See above :-)

- Luke


Re: Column stores

From
Ron Mayer
Date:
Luke Lonergan wrote:
>> ...column stores...
>
> WRT "all the rage" - you'd have to point to more than press releases about
> technology to make that case.  We've beaten the "all the rage" companies in
> every POC so far and I haven't seen or heard of significant reference
> installations.

I thought KX System's KDB is a column-oriented database which
had impressive reference customers (Fidelity, BofA, JPMorgan, etc)
with their column store database since the early 90s.

But yeah - I'm curious why all the hype around new column oriented
databases when it seems people already determined what niche
applications they're great for quite some time ago.


Re: Column stores

From
Luke Lonergan
Date:
Ron,

On 2/1/08 11:18 PM, "Ron Mayer" <rm_pg@cheapcomplexdevices.com> wrote:

> I thought KX System's KDB is a column-oriented database which
> had impressive reference customers (Fidelity, BofA, JPMorgan, etc)
> with their column store database since the early 90s.
>
> But yeah - I'm curious why all the hype around new column oriented
> databases when it seems people already determined what niche
> applications they're great for quite some time ago.

Exactly.

SybaseIQ, KDB, etc have proven to have their excellent uses.  Resurrecting
the column idea and proclaiming it the solution to all DW problems is pretty
off-base IMO.  The limitations are real, that's why it's not the general
answer even to the old DW problem, much less the new one which involves much
more real time load/query and multi-user work.

The complete answer uses parallelism and processing embedded within data,
which is a much harder technology to get right.  Grafting a column store
module onto MySQL does not fix it's lack of complex query capabilities and
make it scale for DW.

- Luke


Re: Column stores

From
Seth Grimes
Date:
Isn't Kdb pretty much about queries on what I'll call "data vectors,"
whether streamed or stored, particularly time series?  That is, you
wouldn't have complex queries that cross a lot of fields.

The hype around column stores is generated by the entry of industry
luminaries into the field, attracting venture money that funds high-priced
marketing talent and initiatives, substantiated by *likely* (even if not
yet proven) applicability of the technology to a much broader array of
business problems, given the data explosion, than was possible in the
past.

No, the approach is not yet established in the mainstream even if there
are notable successes for certain types of application.  Most observers of
the data-warehousing world would say the same about open-source and
open-source-based DBMSes.

Clareos (Crosscut) was a more generalized column-store that failed but
then so did Great Bridge (not that GB targeted data warehousing).  Past
failures don't mean that Vertica or Greenplum, for example, won't go on to
win broader-market success.

                     Seth


On Fri, 1 Feb 2008, Ron Mayer wrote:

> Luke Lonergan wrote:
>>> ...column stores...
>>
>> WRT "all the rage" - you'd have to point to more than press releases
>> about technology to make that case.  We've beaten the "all the rage"
>> companies in every POC so far and I haven't seen or heard of
>> significant reference installations.
>
> I thought KX System's KDB is a column-oriented database which
> had impressive reference customers (Fidelity, BofA, JPMorgan, etc)
> with their column store database since the early 90s.
>
> But yeah - I'm curious why all the hype around new column oriented
> databases when it seems people already determined what niche
> applications they're great for quite some time ago.
>

Re: Column stores

From
Seth Grimes
Date:
This response is going to veer off into commercial areas not directly
related to PostgreSQL.  I hope noone minds.  Let me know privately if you
do.

On Fri, 1 Feb 2008, Luke Lonergan wrote:

> On 2/1/08 11:41 AM, "Seth Grimes" <grimes@altaplana.com> wrote:
>
>> No, not all column stores are MPP.  I believe that Brighthouse from
>> Infobright isn't although the company is looking into it.
>
> They're not one that I'd say is "all the rage" - two reasons: MySQL
> doesn't work for DW queries and more importantly doing a different
> storage engine doesn't solve the DW performance and scale problems.

No, but they're a counterexample to your incorrect statement that all
column stores are also MPP.

Doing a different storage engine targets current MySQL users who would
prefer to expand the scope of what they (try to) do with MySQL before they
switch DBMSes or bring in a second DBMS.  That approach makes sense in
many cases.

> WRT "all the rage" - you'd have to point to more than press releases
> about technology to make that case.  We've beaten the "all the rage"
> companies in every POC so far and I haven't seen or heard of significant
> reference installations.

No I don't.  "Rages" are about image, not about substance.  "Rages" die
away when the substance doesn't match the image.

I'd like to hear about additional, significant Greenplum reference
installations.  The only one I've seen, not for want of asking, is
Frontier Airlines, and when I looked at it a year ago, the application was
limited to fare optimiation or something like that with a small number of
users.  Perhaps you can send me information about others.  I see PR from a
year ago on Greenplum's site for a deal with Smart Communications in the
Philippines.  How has that worked out for instance?

The only announcements I see posted to Greenplum's site are about industry
deals and the like.  Gartner says that Greenplum "has been quietly adding
clients ... for about two years."  So how about describing some?

>> Compression and a columnar storage are separate things that can go
>> hand-in-hand.  Does Greenplum have both or just compression?
>
> See above :-)

Since you belittle column stores, I take it the answer is No, just
compression.  By reputation you're a fine technologist, Luke, but would it
hurt to just answer the question?

                     Seth



--
Seth Grimes   Alta Plana Corp, analytical computing & data management
               Intelligent Enterprise magazine (CMP), Contributing Editor
grimes@altaplana.com       http://altaplana.com    301-270-0795

Re: Column stores

From
Luke Lonergan
Date:
Hi Seth,

On 2/2/08 4:52 AM, "Seth Grimes" <grimes@altaplana.com> wrote:

>> They're not one that I'd say is "all the rage" - two reasons: MySQL
>> doesn't work for DW queries and more importantly doing a different
>> storage engine doesn't solve the DW performance and scale problems.
>
> No, but they're a counterexample to your incorrect statement that all
> column stores are also MPP.

Nah - I said that all "raging" column stores are MPP and I make that
contention still :-)

> Doing a different storage engine targets current MySQL users who would
> prefer to expand the scope of what they (try to) do with MySQL before they
> switch DBMSes or bring in a second DBMS.  That approach makes sense in
> many cases.

Sure - "current MySQL users" that want a little more bang than they can get
now is far away from the very active DW marketplace.

> No I don't.  "Rages" are about image, not about substance.  "Rages" die
> away when the substance doesn't match the image.

Yep!

> The only announcements I see posted to Greenplum's site are about industry
> deals and the like.  Gartner says that Greenplum "has been quietly adding
> clients ... for about two years."  So how about describing some?

Are you a Gartner customer?  If so, you can get the magic quadrant report
and they will talk to you in detail about significant Greenplum
installations.

> I'd like to hear about additional, significant Greenplum reference
> installations.  The only one I've seen, not for want of asking, is
> Frontier Airlines, and when I looked at it a year ago, the application was
> limited to fare optimiation or something like that with a small number of
> users.  Perhaps you can send me information about others.  I see PR from a
> year ago on Greenplum's site for a deal with Smart Communications in the
> Philippines.  How has that worked out for instance?

Our approach has been quiet progress - we're about to get a lot less quiet.

Part of the challenge of this market is that the demands of large scale
enterprise data warehouse customers are very high and it takes time to earn
a reputation as a trusted provider.

BTW - I love the Frontier example, the DAXPY guys doing revenue optimization
using Pentaho on Greenplum is a beautiful example of how it should work.
Frontier is able to do things they couldn't otherwise do because the
technology revolutionized the price point at which multiple TB of data could
be used for fast query.  I'm happy to say that DAXPY has now sold the same
solution to other airlines using Greenplum and Frontier remains a reference
for us.

- Luke


Re: Column stores

From
"Joshua D. Drake"
Date:
On Sat, 2 Feb 2008 04:52:36 -0800 (PST)
Seth Grimes <grimes@altaplana.com> wrote:

> Since you belittle column stores, I take it the answer is No, just
> compression.  By reputation you're a fine technologist, Luke, but
> would it hurt to just answer the question?
>
>                      Seth

Greenplum's collection of customers or the way they deploy those
customers is none of our business. Further discussion of technical
deployment albeit interesting are off topic for this list.

Joshua D. Drake

Attachment

Re: Column stores

From
Luke Lonergan
Date:
Hi Simon, Seth,

On 1/30/08 10:44 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

> Any implementation for Postgres would gain benefit from blending row and
> column approaches within the same database, as an option rather than as
> a must-have.

Agreed and this is a known "must-have" for a modern column store to be
competitive.

IMO the main benefits claimed and attainable by column stores are derived
from:
1) differentially better compression for some columns being separated from
others
2) a tendency for the implementations to remove abstraction from the
executor and do more than one row at a time, thereby improving processor
efficiency.  See the X100 project for an example

Claimed benefits that aren't that generally applicable include the ability
to operate directly on compressed data, thereby improving memory bandwidth
and CPU usage.

The major drawbacks include another point that you mentioned Simon - there
is a lot of overhead to the approach of putting isolated columns out to disk
and having to re-assemble them all the time.  If you are doing a query with
a few columns chosen from a hundred in a table, then this easily pays off,
but how many people are running with flattened schemas today?  If there are
enough to make a market, then we'd see a lot more simple approaches to
getting the work done - we'd not have needed to spend 4 years building what
we have for instance.

For this reason, some of the column implementations have created an approach
that looks a whole lot like indexing - they're calling it "projection" and I
think it is what it sounds like.  They have to "pre-run" the queries through
an analyzer that chooses which columns to project and they may even
duplicate the data into the new columns - sounds familiar - it's an index.

If the best approach to column for DW is to create indexes, that's another
old idea and we've got plenty of that in Postgres.  That said - PG needs the
true bitmap index to compete IMO, along with index-only access.

Lastly there's the update problem.  When you vertically partition the
schema, you have to coordinate the updates across all of the columns, each
of which are compressed differently.  It's not an unsolvable problem, it's
just another overhead associated with the approach, but it's a big one.

Taking all of the above into consideration, there's an approach I'd like to
see in Postgres that can deliver everything the column people claim without
the drawbacks.  +1 to have a technical discussion about it.

- Luke