Thread: COLUMNAR postgreSQL ?

COLUMNAR postgreSQL ?

From
"Juan Bru"
Date:

Hi,

 

 

Is there any plan to develop a PostgreSQL Columnar release?

 

I’m a researcher in the scope of Health (computer scientist, DBA background), used to work very well with PostgrSQL. Recently I’m facing analysis on 100M record tables so a columnar database could be a better option, but I would like to stay using postgreSQL.

 

Thanks in advance

 

 

 

Juan Bru

http://www.linkedin.com/in/juanbru

 

Re: COLUMNAR postgreSQL ?

From
Simon Riggs
Date:
On Tue, Sep 20, 2011 at 8:48 AM, Juan Bru <juan_bru@yahoo.es> wrote:

> Is there any plan to develop a PostgreSQL Columnar release?
>
> I’m a researcher in the scope of Health (computer scientist, DBA
> background), used to work very well with PostgrSQL. Recently I’m facing
> analysis on 100M record tables so a columnar database could be a better
> option, but I would like to stay using postgreSQL.

It would be useful to get some balanced viewpoints on this. I see you
have Alterian experience, so if you are using both it could be
valuable info. I've never heard anyone describe the downsides of
columnar datastores, presumably there are some?

My understanding is that columnar will work well for queries like this

SELECT count(*)
FROM table
WHERE col1 AND col2 AND col3

but less well when we include the columns in the SELECT clause.

Would you be able to give some viewpoints and measurements on that?

Do you think some funding could be available for that?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: COLUMNAR postgreSQL ?

From
Craig Ringer
Date:
On 09/20/2011 03:48 PM, Juan Bru wrote:

Hi,

 

 

Is there any plan to develop a PostgreSQL Columnar release?

 

I’m a researcher in the scope of Health (computer scientist, DBA background), used to work very well with PostgrSQL. Recently I’m facing analysis on 100M record tables so a columnar database could be a better option, but I would like to stay using postgreSQL.


To what extent would your needs be satisfied by a covering index or index-oriented table?

That is: Are you doing deep data mining where you have LOTS of different columns you're interested in finding patterns in, or do you only have a few important columns you need to retrieve most of the time, plus a lot of less-used data?

--
Craig Ringer

Re: COLUMNAR postgreSQL ?

From
Ondrej Ivanič
Date:
Hi,

On 20 September 2011 18:16, Simon Riggs <simon@2ndquadrant.com> wrote:
> It would be useful to get some balanced viewpoints on this. I see you
> have Alterian experience, so if you are using both it could be
> valuable info. I've never heard anyone describe the downsides of
> columnar datastores, presumably there are some?

Inserts are slower. I haven't done proper benchmark because there is
no need for thousands inserts per sec in our database.

> My understanding is that columnar will work well for queries like this
>
> SELECT count(*)
> FROM table
> WHERE col1 AND col2 AND col3
>
> but less well when we include the columns in the SELECT clause.

Columnar store is good if:
- you are selecting less than 60% of the total row size (our table has
400 cols and usual query needs 5 - 10 cols)
- aggregates: count(*), avg(), ...

In some cases columnar store is able to beat Postgres + High IOPS
(250k+) SSD card

> Would you be able to give some viewpoints and measurements on that?

Check this:
http://www.greenplum.com/community/forums/showthread.php?499-enable_mergejoin-and-random_page_cost&p=1553#post1553



--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: COLUMNAR postgreSQL ?

From
"Tomas Vondra"
Date:
On 21 Září 2011, 0:13, Ondrej Ivanič wrote:
> Hi,
>
> On 20 September 2011 18:16, Simon Riggs <simon@2ndquadrant.com> wrote:
>> It would be useful to get some balanced viewpoints on this. I see you
>> have Alterian experience, so if you are using both it could be
>> valuable info. I've never heard anyone describe the downsides of
>> columnar datastores, presumably there are some?
>
> Inserts are slower. I haven't done proper benchmark because there is
> no need for thousands inserts per sec in our database.
>
>> My understanding is that columnar will work well for queries like this
>>
>> SELECT count(*)
>> FROM table
>> WHERE col1 AND col2 AND col3
>>
>> but less well when we include the columns in the SELECT clause.
>
> Columnar store is good if:
> - you are selecting less than 60% of the total row size (our table has
> 400 cols and usual query needs 5 - 10 cols)
> - aggregates: count(*), avg(), ...

Where did those numbers come from? What columnar database are you using?
What options were used (e.g. compression)?

> In some cases columnar store is able to beat Postgres + High IOPS
> (250k+) SSD card

What do you mean by "in some cases"? If that means a DWH/DSS workloads,
then it's apples to oranges I guess.

SSDs are great for OLTP workloads (with a lot of random I/O). With DWH/DSS
workloads, the performance gain is much smaller (not worth the money in
some cases).

With this kind of workload the IOPS is not that important, the sequential
reads is. And SSDs are not significantly faster in case of sequential I/O
- you can usually achieve the same sequential performance with spinners
for less money).

So I wouldn't be surprised if a columnar store on a spinner beats
PostgreSQL on a SSD, when running a DWH/DSS workload. Actually I'd expect
that.

This is not a rant against columnar stores - I find them very interesting.


Tomas


Re: COLUMNAR postgreSQL ?

From
Ondrej Ivanič
Date:
Hi,

2011/9/21 Tomas Vondra <tv@fuzzy.cz>:
>> Columnar store is good if:
>> - you are selecting less than 60% of the total row size (our table has
>> 400 cols and usual query needs 5 - 10 cols)
>> - aggregates: count(*), avg(), ...
>
> Where did those numbers come from? What columnar database are you using?
> What options were used (e.g. compression)?

Aster's nCluster and Greenplum with no and maximum compression (there
was no difference between compression level 5 and 9 but hoge
difference between compression level 0 and 9) and partitioned.

>> In some cases columnar store is able to beat Postgres + High IOPS
>> (250k+) SSD card
>
> What do you mean by "in some cases"? If that means a DWH/DSS workloads,
> then it's apples to oranges I guess.

> SSDs are great for OLTP workloads (with a lot of random I/O). With DWH/DSS
> workloads, the performance gain is much smaller (not worth the money in
> some cases).

Yes, our DB is hybrid: we need OLAP solution with OLTP performance.
Schema si very simple star schema and is multitenant. So "random io"
to "seq io" is 80% : 20% but most of the queries are simple aggregates
and select queries (drill downs, dicing, slicing, summaries, and
queries generated by machine learning algos). Users are anxious if
they have to wait for more than 30 sec.

> With this kind of workload the IOPS is not that important, the sequential
> reads is. And SSDs are not significantly faster in case of sequential I/O
> - you can usually achieve the same sequential performance with spinners
> for less money).

yes, you are right:
seq IO: FusionIO is 3-5 times faster than our Hitachi SAN. SAN is 5-10
times faster than local SAS 15k drive. Random IO is completely
different story.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)