Home > mailing lists

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 - Mailing list pgsql-hackers

From	Jim Nasby
Subject	Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
Date	June 5 01:51:16
Msg-id	CAMFBP2piL84fv6cxyd7z+OgKjkBhNW1UDx=NFZamtJ2tc4ik8g@mail.gmail.com Whole thread Raw
In response to	Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 (Jim Nasby <jnasby@upgrade.com>)
List	pgsql-hackers

Tree view

On Wed, Jun 4, 2025 at 5:19 PM Jim Nasby <jnasby@upgrade.com> wrote:

What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
columnar format in a separate fork. And using either that from custom
scans, or the heap as a fallback for cases not supported by VCI.

Yeah, there'd definitely need to be some kind of proxy... I'm just suggesting that we don't *have* to do that as a separate fork...

(tl;dr: there are some key things that can only be implemented in the engine that would enable much more complex features to be added at the SQL level, without requiring tons of C code to implement the larger feature)

Oh, one other thing worth mentioning... it's actually not terribly hard to build a column-store in userspace today: just turn every column of a table into an array and set the TOAST target low enough so that it all gets toasted. I tested that many years ago, and even though I couldn't set the toast target back then saw some really encouraging results... provided that I constructed my queries carefully (I also had range fields for each column that stored the min/max of each array, so the planner could completely skip de-toasting anything that would not contain values of interest.)

The reason I never went anywhere with this concept is it'd be very hard for most folks to write queries that performed well. The transform from column back to row-based was actually hidden behind a view (a bunch of unnest()'s) - but if you didn't make use of the range fields in your query you lost a lot (but not all[1]) of the performance gain. I know that I could have used the hooks to teach the planner how to do this, but it would have been a huge amount of work (at least for me) to do so.

It did occur to me recently that a generic system for teaching the optimizer additional transforms it could make would be generally useful. By far the biggest example would be a way to teach it that

WHERE timestamp_field :: date = '2025-6-4'

is the same thing as

WHERE timestamp_field >= '2025-6-4' AND timestamp_field < '2025-6-5'

That would be extremely helpful in a lot of environments. There are definitely other cases where you can apply the same kinds of logic. In particular, such a feature (if generic enough) would make it possible to write simple queries against a view that transformed columnar data (stored as arrays) back into a row format *and* apply additional predicates that would make those queries highly efficient - all done via pure SQL.

[1] In my testing (which used the taxicab database) there was still a performance gain from storing the data as arrays, even if the queries to access it took no special efforts to eliminate unnecessary data. The reason is that TOAST meant that the base data was being compressed. In fact, testing showed that there was a win even if you didn't treat each individual column as an array; you could simply store an array of a composite type and still see a win.

pgsql-hackers by date:

From: Bruce Momjian
Date: 05 June, 01:40:11
Subject: Re: PG 18 release notes draft committed

From: Thomas Munro
Date: 05 June, 02:35:23
Subject: Re: Custom Glibc collation version strings under LOCPATH

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 - Mailing list pgsql-hackers

Previous

Next