Re: [HACKERS] [WIP]Vertical Clustered Index (columnar storeextension) - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: [HACKERS] [WIP]Vertical Clustered Index (columnar storeextension)
Date
Msg-id 570c522f-3736-c0ac-83db-0313318d939d@BlueTreble.com
Whole thread Raw
In response to Re: [HACKERS] [WIP]Vertical Clustered Index (columnar store extension)  (Haribabu Kommi <kommi.haribabu@gmail.com>)
List pgsql-hackers
On 1/16/17 10:09 PM, Haribabu Kommi wrote:
> Yes, that' correct. Currently with this approach, it is not possible to
> ditch the
> heap completely. This approach is useful for the cases, where the user wants
> to store only some columns as part of clustered index.

Ahh, that's unfortunate. Billion row+ tables are becoming rather common, 
and that 24GB of overhead starts becoming very painful. It's actually a 
lot worse considering there will be at least one index on the table, so 
100GB+ of overhead isn't that uncommon.

>     Another complication is that one of the big advantages of a CSTORE
>     is allowing analysis to be done efficiently on a column-by-column
>     (as opposed to row-by-row) basis. Does your patch by chance provide
>     that?
>
> Not the base patch that I shared. But the further patches provides the
> data access
> column-by-column basis using the custom plan methods.

Great, that's something else that a column store really needs to be 
successful. Something else I suspect is necessary is a faster/better way 
to eliminate chunks of rows from scans.

Just as an example, with my simple array-based approach, you can store a 
range type along with each array that contains the min and max values 
for the array. That means any query that wants values between 50 and 100 
can include a clause that filters on range types that overlap with 
[50,100]. That can be indexed very efficiently and is fast to run checks 
against.

>     Generally speaking, I do think the idea of adding support for this
>     as an "index" is a really good starting point, since that part of

... as discussed elsewhere in the thread, adding a bunch of hooks is 
probably not a good way to do this. :/

>     That would be a great way to gain knowledge on what users would want
>     to see in a column store, something else I suspect we need. It would
>     also be far less code than what you or Alvaro are proposing. When it
>     comes to large changes that don't have crystal-clear requirements, I
>     think that's really important.
>
> The  main use case of this patch is to support mixed load environments,
> where both OLTP and OLAP queries are possible. The advantage of
> proposed patch design is, providing good performance to OLAP queries
> without affecting OLTP.

Yeah, that's a big part of what I was envisioning with my array-based 
approach. In simple terms, there would be a regular row-based table, and 
an array-based table, with a view that allows seamless querying into 
both (re-presenting the array-storage on a per-row basis). There would 
be a periodic process that moves entire sets of rows from the row 
storage into the array storage.

If you updated or deleted a row that was part of an array, the contents 
of the entire array could be moved back into row-based storage. After a 
period of time, rows would get moved back into array storage. Or the 
array could be modified in place, but you need to be very careful about 
bloating the array storage if you do that.

The big missing piece here is getting the planner to intelligently 
handle a mixed row/column store. As I mentioned, you can easily add 
range type fields to greatly increase performance, but they won't do any 
good unless the appropriate filters get added. It's not THAT hard to do 
that by hand, but it'd be great if there was a more automated method. 
Such a method might also be very useful for transforming expressions 
like date_part('quarter', ...) into something that could use existing 
indexes.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



pgsql-hackers by date:

Previous
From: Petr Jelinek
Date:
Subject: Re: [HACKERS] Logical replication failing when foreign key present
Next
From: Corey Huinker
Date:
Subject: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands:\quit_if, \quit_unless)