Thread: [WIP]Vertical Clustered Index (columnar store extension) - take2

[WIP]Vertical Clustered Index (columnar store extension) - take2

From
"Aya Iwata (Fujitsu)"
Date:

Hi All,

 

Suggestions

==========

 

When analyzing real-time data collected by PostgreSQL,

it can be difficult to tune the current PostgreSQL server for satisfactory performance.

Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory column store function that holds data in a state suitable for business analysis and is also expected to improve analysis performance.

With VCI, you can also expect to run analysis 7.8 times faster. This is achieved by the analytics engine, which optimizes parallel processing of column-oriented data, in addition to the fact that VCI stores data in a columnar format, enabling efficient retrieval of the columns needed for analysis.

 

Similar Features

============

 

One column store feature available with postgres is Citus Columnar Table.

If you introduces the citus extension, which allows columnar tables to be used using the columnar access method.

This function is intended to analyze the accumulated data. Therefore, you cannot update or delete data.

VCI supports data updates and deletions. This enables you to analyze not only the accumulated data but also the data that occurs in real time.

 

Implementing VCI

============

 

To introduce an updatable column store, we explain how updates to row-oriented data are propagated to column-oriented data.

 

VCI has two storage areas.

 

- Write Optimized Storage (WOS)

- Read Optimized Storage (ROS)

 

Describes WOS.

The WOS stores data for all columns in the VCI in a row-oriented format.

All newly added data is stored in the WOS relation along with the transaction information.

Using WOS to delete and update newly added data has no significant performance impact compared to deleting from columnar storage.

 

ROS is the storage area where all column data is stored.

 

When inserting/updating/deleting, data is written synchronously to WOS. It does not compress or index the data.

This avoids the overhead of converting to a columnar while updating the data.

After a certain amount of data accumulates in the WOS, the ROS control daemon converts it to column data asynchronously with updates.

Column data transformation compresses and indexes the data and writes it to ROS.

 

Describes searching for data.

Since there are two storage formats, the SELECT process needs to convert the WOS data to local ROS to determine whether it is visible or invisible. This conversion cost depends on the number of tuples present in the WOS file. This may introduce some performance overhead.

Obtain search results by referencing the local ROS and referencing the ROS in parallel.

 

These implementation ideas are also posted on Fujitsu's blog for your reference. [1]

 

Past discussions

===========

 

We've proposed features before. [2]

This thread also explains the details, so please check it.

 

In a previous thread, we suggested implementing a modification to the PostgreSQL backend code.

Based on the FB we received at that time, we think we need to re-implement this feature in pluggable storage using the table access method API.

I also got a FB of the features I needed from a PostgreSQLPro member. We believe it is necessary to deal with these issues in stages.

- Need to provide vector processing for nodes (filter, grand aggregate, aggregation with group by...) to speed up computation

- Requires parallel processing support such as scanning

 

It is assumed that the re-implementation will also require additional functionality to the current Table Access Method API.

It is useful not only for VCI but also for other access methods.

Therefore, we decided to propose the VCI feature to the community and proceed with development.

 

 

Request matter

===========

 

Are members of the PostgreSQL hackers interested in VCI features?

We welcome your comments and suggestions on this feature.

In particular, if you have any questions, required features, or implementations, please let me know.

 

 

[1] https://www.postgresql.fastware.com/blog/improve-data-analysis-performance-without-impacting-business-transactions-with-vertical-clustered-index

 

[2]https://www.postgresql.org/message-id/CAJrrPGfaC7WC9NK6PTTy6YN-NN+hCy8xOLAh2doYhVg5d6HsAA@mail.gmail.com

 

Regards,

Aya Iwata

FUJITSU LIMITED

RE: [WIP]Vertical Clustered Index (columnar store extension) - take2

From
"Aya Iwata (Fujitsu)"
Date:
Hi Alvaro san,

I am sorry for my late reply. I continue to work on proposing VCI feature to the community.

> I think this is definitely an important and welcome development.
> I'm looking forward to patches in this area.

Thank you!
I am currently preparing to share VCI designs with PGConf.dev.
I look forward to sharing more about VCI with you.


Best regards,
Aya Iwata
FUJITSU LIMITED

RE: [WIP]Vertical Clustered Index (columnar store extension) - take2

From
"Aya Iwata (Fujitsu)"
Date:
Hi Yura san,


> I just don't get, why it should be "in-memory"? All the same things you
> describe further, but storing in paged index on-disk with caching
> through shared_buffers - why this way it wouldn't work?

We make the columnar store resident in memory for maximum search performance.
But I'm not very particular about this. Comments are welcome.

Best regards,
Aya Iwata
FUJITSU LIMITED


> -----Original Message-----
> From: Yura Sokolov <y.sokolov@postgrespro.ru>
> Sent: Wednesday, January 15, 2025 11:44 PM
> To: pgsql-hackers@lists.postgresql.org
> Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
> 
> 07.10.2024 17:53, Aya Iwata (Fujitsu) wrote:
> > Hi All,
> >
> > Suggestions
> >
> > ==========
> >
> > When analyzing real-time data collected by PostgreSQL,
> >
> > it can be difficult to tune the current PostgreSQL server for
> > satisfactory performance.
> >
> > Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory
> > column store function that holds data in a state suitable for business
> > analysis and is also expected to improve analysis performance.
> 
> I just don't get, why it should be "in-memory"? All the same things you
> describe further, but storing in paged index on-disk with caching
> through shared_buffers - why this way it wouldn't work?
> 
> 


08.04.2025 13:29, Aya Iwata (Fujitsu) wrote:
> Hi Yura san,
> 
> 
>> I just don't get, why it should be "in-memory"? All the same things you
>> describe further, but storing in paged index on-disk with caching
>> through shared_buffers - why this way it wouldn't work?
> 
> We make the columnar store resident in memory for maximum search performance.
> But I'm not very particular about this. Comments are welcome.

I just wanted to say: there is no need to be super fast.
There is the need to be remarkably faster than it is now.

ClickHouse, DuckDB, Vertica - they are not in-memory, they are disk based.
But they are very fast.
If PostgreSQL will be just as twice slower as ClickHouse, it will be very
great! Most of users will not setup ClickHouse at all then, because twice
slower is still very fast.

Databases could be very huge. Even when they are in "columnar" format,
which usually consumes less space. And memory is still costs more than disk
space.

Certainly there are users who think they need "in-memory". But the truth is
very few of them really need "in-memory".

All of this is just my opinion. I could be wrong.

>> -----Original Message-----
>> From: Yura Sokolov <y.sokolov@postgrespro.ru>
>> Sent: Wednesday, January 15, 2025 11:44 PM
>> To: pgsql-hackers@lists.postgresql.org
>> Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
>>
>> 07.10.2024 17:53, Aya Iwata (Fujitsu) wrote:
>>> Hi All,
>>>
>>> Suggestions
>>>
>>> ==========
>>>
>>> When analyzing real-time data collected by PostgreSQL,
>>>
>>> it can be difficult to tune the current PostgreSQL server for
>>> satisfactory performance.
>>>
>>> Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory
>>> column store function that holds data in a state suitable for business
>>> analysis and is also expected to improve analysis performance.
>>
>> I just don't get, why it should be "in-memory"? All the same things you
>> describe further, but storing in paged index on-disk with caching
>> through shared_buffers - why this way it wouldn't work?


-- 
regards
Yura Sokolov aka funny-falcon