Re: Use Postgres as a column store by creating one table per column - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Use Postgres as a column store by creating one table per column
Date
Msg-id 20190524180629.42d3u4m7a3jgh5hq@development
Whole thread Raw
In response to Re: Use Postgres as a column store by creating one table per column  (George Neuner <gneuner2@comcast.net>)
List pgsql-performance
On Thu, May 23, 2019 at 01:08:42AM -0400, George Neuner wrote:
>On Tue, 21 May 2019 21:28:07 -0700, Lev Kokotov 
><lev.kokotov@gmail.com> wrote:
>
>>Is it efficient to use Postgres as a column store by creating one 
>table per
>>column?
>>
>>I would query it with something like `[...] UNION SELECT value AS <table>
>>FROM <table> WHERE value = <value> UNION [...]` to build a row.
>
>I think you mean JOIN.
>
>You'd need more than that: Postgresql uses MVCC for concurrency, so 
>whenever you update any row in a table, the ordering of the rows 
>within the table changes.  And the JOIN operation inherently is 
>unordered - you need to sort the result deliberately to control 
>ordering.
>
>To emulate a column-store, at the very least you need a way to 
>associate values from different "columns" that belong to the same 
>"row" of the virtual table.  IOW, every value in every "column" needs 
>an explicit "row" identifier.  E.g.,
>
>   col1 = { rowid, value1 }, col2 = { rowid, value2 }, ...
>
>For performance you would need to have indexes on at least the rowid 
>in each of the "column" tables.
>
>This is a bare minimum and can only work if the columns of your 
>virtual table and the queries against it are application controlled or 
>statically known.  If you want to do something more flexible that will 
>support ad hoc table modifications, elastically sized values (strings, 
>bytes, arrays, JSON, XML), etc. this example is not suffice and the 
>implementation can get very complicated very quickly
>
>
>Justin Pryzby was not joking when he said the performance could be 
>awful ... at least as compared to a more normal row-oriented 
>structure.  Performance of a query that involves more than a handful 
>of "columns", in general, will be horrible.  It is up to you to decide 
>whether some (maybe little) increase in performance in processing 
>*single* columns will offset likely MASSIVE loss of performance in 
>processing multiple columns.
>

Maybe take a look at this paper:

    http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

which essentially compares this approach to a "real" column store.

It certainly won't give you performance comparable to column store, it
adds quite a bit of overhead (disk space because of row headers, CPU
because of extra joins, etc.).

And it can't give you the column-store benefits - compression and/or
more efficient execution.

>
>>I'm thinking since Postgres stores tables in continuous blocks of 
>16MB each
>>(I think that's the default page size?)
>
>Default page size is 8 KB.  You'd have to recompile to change that, 
>and it might break something - a whole lot of code depends on the 
>knowing the size of storage pages.
>
>

Right. And the largest page size is 64kB. But 8kB is a pretty good
trade-off, in most cases.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Generic Plans for Prepared Statement are 158155 times slower than Custom Plans
Next
From: Stephen Frost
Date:
Subject: Re: Use Postgres as a column store by creating one table per column