Thread: Use Postgres as a column store by creating one table per column

Use Postgres as a column store by creating one table per column

From
Lev Kokotov
Date:
Is it efficient to use Postgres as a column store by creating one table per column?

I would query it with something like `[...] UNION SELECT value AS <table> FROM <table> WHERE value = <value> UNION [...]` to build a row.

I'm thinking since Postgres stores tables in continuous blocks of 16MB each (I think that's the default page size?) I would get efficient reads and with parallel queries I could benefit from multiple cores.

Thanks!

Best,
Lev

Re: Use Postgres as a column store by creating one table per column

From
Justin Pryzby
Date:
On Tue, May 21, 2019 at 09:28:07PM -0700, Lev Kokotov wrote:
> Is it efficient to use Postgres as a column store by creating one table per
> column?
> 
> I would query it with something like `[...] UNION SELECT value AS <table>
> FROM <table> WHERE value = <value> UNION [...]` to build a row.

I think you mean JOIN not UNION.

It'd be awful (At one point I tried it very briefly).  If you were joining 2,
10 column tables, that'd be 19 joins.  I imagine the tables would be "serial id
unique, float value" or similar, so the execution might not be terrible, as
it'd be using an index lookup for each column.  But the planner would suffer,
badly.  Don't even try to read EXPLAIN.

Actually, the execution would also be hitting at least 2x files per "column"
(one for the index and one for the table data), so that's not great.

Also, the overhead of a 2-column table is high, so your DB would be much bigger
and have very high overhead.  Sorry to reference a 2ndary source, but..
https://stackoverflow.com/questions/13570613/making-sense-of-postgres-row-sizes

> I'm thinking since Postgres stores tables in continuous blocks of 16MB each
> (I think that's the default page size?) I would get efficient reads and
> with parallel queries I could benefit from multiple cores.

Default page size is 8kb

Justin



Re: Use Postgres as a column store by creating one table per column

From
George Neuner
Date:
On Tue, 21 May 2019 21:28:07 -0700, Lev Kokotov <lev.kokotov@gmail.com> 
wrote:

 >Is it efficient to use Postgres as a column store by creating one 
table per
 >column?
 >
 >I would query it with something like `[...] UNION SELECT value AS <table>
 >FROM <table> WHERE value = <value> UNION [...]` to build a row.

I think you mean JOIN.

You'd need more than that: Postgresql uses MVCC for concurrency, so 
whenever you update any row in a table, the ordering of the rows within 
the table changes.  And the JOIN operation inherently is unordered - you 
need to sort the result deliberately to control ordering.

To emulate a column-store, at the very least you need a way to associate 
values from different "columns" that belong to the same "row" of the 
virtual table.  IOW, every value in every "column" needs an explicit 
"row" identifier.  E.g.,

    col1 = { rowid, value1 }, col2 = { rowid, value2 }, ...

For performance you would need to have indexes on at least the rowid in 
each of the "column" tables.

This is a bare minimum and can only work if the columns of your virtual 
table and the queries against it are application controlled or 
statically known.  If you want to do something more flexible that will 
support ad hoc table modifications, elastically sized values (strings, 
bytes, arrays, JSON, XML), etc. this example is not suffice and the 
implementation can get very complicated very quickly


Justin Pryzby was not joking when he said the performance could be awful 
... at least as compared to a more normal row-oriented structure.  
Performance of a query that involves more than a handful of "columns", 
in general, will be horrible.  It is up to you to decide whether some 
(maybe little) increase in performance in processing *single* columns 
will offset likely MASSIVE loss of performance in processing multiple 
columns.


 >I'm thinking since Postgres stores tables in continuous blocks of 16MB 
each
 >(I think that's the default page size?)

Default page size is 8 KB.  You'd have to recompile to change that, and 
it might break something - a whole lot of code depends on the knowing 
the size of storage pages.


George




Re: Use Postgres as a column store by creating one table per column

From
Tomas Vondra
Date:
On Thu, May 23, 2019 at 01:08:42AM -0400, George Neuner wrote:
>On Tue, 21 May 2019 21:28:07 -0700, Lev Kokotov 
><lev.kokotov@gmail.com> wrote:
>
>>Is it efficient to use Postgres as a column store by creating one 
>table per
>>column?
>>
>>I would query it with something like `[...] UNION SELECT value AS <table>
>>FROM <table> WHERE value = <value> UNION [...]` to build a row.
>
>I think you mean JOIN.
>
>You'd need more than that: Postgresql uses MVCC for concurrency, so 
>whenever you update any row in a table, the ordering of the rows 
>within the table changes.  And the JOIN operation inherently is 
>unordered - you need to sort the result deliberately to control 
>ordering.
>
>To emulate a column-store, at the very least you need a way to 
>associate values from different "columns" that belong to the same 
>"row" of the virtual table.  IOW, every value in every "column" needs 
>an explicit "row" identifier.  E.g.,
>
>   col1 = { rowid, value1 }, col2 = { rowid, value2 }, ...
>
>For performance you would need to have indexes on at least the rowid 
>in each of the "column" tables.
>
>This is a bare minimum and can only work if the columns of your 
>virtual table and the queries against it are application controlled or 
>statically known.  If you want to do something more flexible that will 
>support ad hoc table modifications, elastically sized values (strings, 
>bytes, arrays, JSON, XML), etc. this example is not suffice and the 
>implementation can get very complicated very quickly
>
>
>Justin Pryzby was not joking when he said the performance could be 
>awful ... at least as compared to a more normal row-oriented 
>structure.  Performance of a query that involves more than a handful 
>of "columns", in general, will be horrible.  It is up to you to decide 
>whether some (maybe little) increase in performance in processing 
>*single* columns will offset likely MASSIVE loss of performance in 
>processing multiple columns.
>

Maybe take a look at this paper:

    http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

which essentially compares this approach to a "real" column store.

It certainly won't give you performance comparable to column store, it
adds quite a bit of overhead (disk space because of row headers, CPU
because of extra joins, etc.).

And it can't give you the column-store benefits - compression and/or
more efficient execution.

>
>>I'm thinking since Postgres stores tables in continuous blocks of 
>16MB each
>>(I think that's the default page size?)
>
>Default page size is 8 KB.  You'd have to recompile to change that, 
>and it might break something - a whole lot of code depends on the 
>knowing the size of storage pages.
>
>

Right. And the largest page size is 64kB. But 8kB is a pretty good
trade-off, in most cases.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Use Postgres as a column store by creating one table per column

From
Stephen Frost
Date:
Greetings,

* Lev Kokotov (lev.kokotov@gmail.com) wrote:
> Is it efficient to use Postgres as a column store by creating one table per
> column?

Short answer is no, not in a traditional arrangement, anyway.  The tuple
overhead would be extremely painful.  It's possible to improve on that,
but it requires sacrificing what the tuple header gives you- visibility
information, along with some other things.  The question will be if
that's acceptable or not.

> I'm thinking since Postgres stores tables in continuous blocks of 16MB each
> (I think that's the default page size?) I would get efficient reads and
> with parallel queries I could benefit from multiple cores.

The page size in PG is 8k, not 16MB.

Thanks,

Stephen

Attachment

Re: Use Postgres as a column store by creating one table per column

From
Mark Kirkwood
Date:
On 22/05/19 4:28 PM, Lev Kokotov wrote:
> Is it efficient to use Postgres as a column store by creating one 
> table per column?
>
> I would query it with something like `[...] UNION SELECT value AS 
> <table> FROM <table> WHERE value = <value> UNION [...]` to build a row.
>
> I'm thinking since Postgres stores tables in continuous blocks of 16MB 
> each (I think that's the default page size?) I would get efficient 
> reads and with parallel queries I could benefit from multiple cores.
>
>
Take a look at Zedstore, which is a column store built to plug into v12 
storage layer:

https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w%40mail.gmail.com