Re: [HACKERS] [WIP]Vertical Clustered Index (columnar store extension) - Mailing list pgsql-hackers

From Haribabu Kommi
Subject Re: [HACKERS] [WIP]Vertical Clustered Index (columnar store extension)
Date
Msg-id CAJrrPGefW1VomwWUS9BaLoXyHEAu9dLcW5s3JNNp5FxA0R5cFw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [WIP]Vertical Clustered Index (columnar storeextension)  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Responses Re: [HACKERS] [WIP]Vertical Clustered Index (columnar storeextension)  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
List pgsql-hackers


On Sun, Jan 8, 2017 at 2:01 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 12/29/16 9:55 PM, Haribabu Kommi wrote:
The tuples which don't have multiple copies or frozen data will be moved
from WOS to ROS periodically by the background worker process or autovauum
process. Every column data is stored separately in it's relation file. There
is no transaction information is present in ROS. The data in ROS can be
referred with tuple ID.

Would updates be handled via the delete mechanism you described then?

Updates are handled similar like delete operations, but there are some extra
index insert operations occurs in this index even when the update is of HOT
type, because of TID-CRID mapping.
 
In this approach, the column data is present in both heap and columnar
storage.

ISTM one of the biggest reasons to prefer a column store over heap is to ditch the 24 byte overhead, so I'm not sure how much of a win this is.

Yes, that' correct. Currently with this approach, it is not possible to ditch the
heap completely. This approach is useful for the cases, where the user wants
to store only some columns as part of clustered index.


Another complication is that one of the big advantages of a CSTORE is allowing analysis to be done efficiently on a column-by-column (as opposed to row-by-row) basis. Does your patch by chance provide that?

Not the base patch that I shared. But the further patches provides the data access
column-by-column basis using the custom plan methods. 
 
Generally speaking, I do think the idea of adding support for this as an "index" is a really good starting point, since that part of the system is pluggable. It might be better to target getting only what needs to be in core into core to begin with, allowing the other code to remain an extension for now. I think there's a lot of things that will be discovered as we start moving into column stores, and it'd be very unfortunate to accidentally paint the core code into a corner somewhere.

Yes, it is possible to add only the code that is required in the core and keep the other part
as extension. Without providing the complete clustered index approach, I doubt whether
the necessary hooks and it's code gets accepted to the core.
 
As a side note, it's possible to get a lot of the benefits of a column store by using arrays. I've done some experiments with that and got an 80-90% space reduction, and most queries saw improved performance as well (there were a few cases that weren't better). The biggest advantage to this approach is people could start using it today, on any recent version of Postgres.

Interesting experiment.
 
That would be a great way to gain knowledge on what users would want to see in a column store, something else I suspect we need. It would also be far less code than what you or Alvaro are proposing. When it comes to large changes that don't have crystal-clear requirements, I think that's really important.

The  main use case of this patch is to support mixed load environments,
where both OLTP and OLAP queries are possible. The advantage of
proposed patch design is, providing good performance to OLAP queries
without affecting OLTP.

Regards,
Hari Babu
Fujitsu Australia

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal
Next
From: Amit Kapila
Date:
Subject: Re: [HACKERS] Cache Hash Index meta page.