Home > mailing lists

Thread: [WIP]Vertical Clustered Index (columnar store extension) - take2

[WIP]Vertical Clustered Index (columnar store extension) - take2

From

"Aya Iwata (Fujitsu)"

Date:

07 October 2024, 17:53:19

Hi All,

Suggestions

==========

When analyzing real-time data collected by PostgreSQL,

it can be difficult to tune the current PostgreSQL server for satisfactory performance.

Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory column store function that holds data in a state suitable for business analysis and is also expected to improve analysis performance.

With VCI, you can also expect to run analysis 7.8 times faster. This is achieved by the analytics engine, which optimizes parallel processing of column-oriented data, in addition to the fact that VCI stores data in a columnar format, enabling efficient retrieval of the columns needed for analysis.

Similar Features

============

One column store feature available with postgres is Citus Columnar Table.

If you introduces the citus extension, which allows columnar tables to be used using the columnar access method.

This function is intended to analyze the accumulated data. Therefore, you cannot update or delete data.

VCI supports data updates and deletions. This enables you to analyze not only the accumulated data but also the data that occurs in real time.

Implementing VCI

============

To introduce an updatable column store, we explain how updates to row-oriented data are propagated to column-oriented data.

VCI has two storage areas.

- Write Optimized Storage (WOS)

- Read Optimized Storage (ROS)

Describes WOS.

The WOS stores data for all columns in the VCI in a row-oriented format.

All newly added data is stored in the WOS relation along with the transaction information.

Using WOS to delete and update newly added data has no significant performance impact compared to deleting from columnar storage.

ROS is the storage area where all column data is stored.

When inserting/updating/deleting, data is written synchronously to WOS. It does not compress or index the data.

This avoids the overhead of converting to a columnar while updating the data.

After a certain amount of data accumulates in the WOS, the ROS control daemon converts it to column data asynchronously with updates.

Column data transformation compresses and indexes the data and writes it to ROS.

Describes searching for data.

Since there are two storage formats, the SELECT process needs to convert the WOS data to local ROS to determine whether it is visible or invisible. This conversion cost depends on the number of tuples present in the WOS file. This may introduce some performance overhead.

Obtain search results by referencing the local ROS and referencing the ROS in parallel.

These implementation ideas are also posted on Fujitsu's blog for your reference. [1]

Past discussions

===========

We've proposed features before. [2]

This thread also explains the details, so please check it.

In a previous thread, we suggested implementing a modification to the PostgreSQL backend code.

Based on the FB we received at that time, we think we need to re-implement this feature in pluggable storage using the table access method API.

I also got a FB of the features I needed from a PostgreSQLPro member. We believe it is necessary to deal with these issues in stages.

- Need to provide vector processing for nodes (filter, grand aggregate, aggregation with group by...) to speed up computation

- Requires parallel processing support such as scanning

It is assumed that the re-implementation will also require additional functionality to the current Table Access Method API.

It is useful not only for VCI but also for other access methods.

Therefore, we decided to propose the VCI feature to the community and proceed with development.

Request matter

===========

Are members of the PostgreSQL hackers interested in VCI features?

We welcome your comments and suggestions on this feature.

In particular, if you have any questions, required features, or implementations, please let me know.

[1] https://www.postgresql.fastware.com/blog/improve-data-analysis-performance-without-impacting-business-transactions-with-vertical-clustered-index

[2]https://www.postgresql.org/message-id/CAJrrPGfaC7WC9NK6PTTy6YN-NN+hCy8xOLAh2doYhVg5d6HsAA@mail.gmail.com

Regards,

Aya Iwata

FUJITSU LIMITED

RE: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

"Aya Iwata (Fujitsu)"

Date:

08 April, 10:41:38

Hi Alvaro san,

I am sorry for my late reply. I continue to work on proposing VCI feature to the community.

> I think this is definitely an important and welcome development.
> I'm looking forward to patches in this area.

Thank you!
I am currently preparing to share VCI designs with PGConf.dev.
I look forward to sharing more about VCI with you.


Best regards,
Aya Iwata
FUJITSU LIMITED

RE: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

"Aya Iwata (Fujitsu)"

Date:

08 April, 13:29:48

Hi Yura san,


> I just don't get, why it should be "in-memory"? All the same things you
> describe further, but storing in paged index on-disk with caching
> through shared_buffers - why this way it wouldn't work?

We make the columnar store resident in memory for maximum search performance.
But I'm not very particular about this. Comments are welcome.

Best regards,
Aya Iwata
FUJITSU LIMITED


> -----Original Message-----
> From: Yura Sokolov <y.sokolov@postgrespro.ru>
> Sent: Wednesday, January 15, 2025 11:44 PM
> To: pgsql-hackers@lists.postgresql.org
> Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
> 
> 07.10.2024 17:53, Aya Iwata (Fujitsu) wrote:
> > Hi All,
> >
> > Suggestions
> >
> > ==========
> >
> > When analyzing real-time data collected by PostgreSQL,
> >
> > it can be difficult to tune the current PostgreSQL server for
> > satisfactory performance.
> >
> > Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory
> > column store function that holds data in a state suitable for business
> > analysis and is also expected to improve analysis performance.
> 
> I just don't get, why it should be "in-memory"? All the same things you
> describe further, but storing in paged index on-disk with caching
> through shared_buffers - why this way it wouldn't work?
> 
>

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Yura Sokolov

Date:

08 April, 18:24:20

08.04.2025 13:29, Aya Iwata (Fujitsu) wrote:
> Hi Yura san,
> 
> 
>> I just don't get, why it should be "in-memory"? All the same things you
>> describe further, but storing in paged index on-disk with caching
>> through shared_buffers - why this way it wouldn't work?
> 
> We make the columnar store resident in memory for maximum search performance.
> But I'm not very particular about this. Comments are welcome.

I just wanted to say: there is no need to be super fast.
There is the need to be remarkably faster than it is now.

ClickHouse, DuckDB, Vertica - they are not in-memory, they are disk based.
But they are very fast.
If PostgreSQL will be just as twice slower as ClickHouse, it will be very
great! Most of users will not setup ClickHouse at all then, because twice
slower is still very fast.

Databases could be very huge. Even when they are in "columnar" format,
which usually consumes less space. And memory is still costs more than disk
space.

Certainly there are users who think they need "in-memory". But the truth is
very few of them really need "in-memory".

All of this is just my opinion. I could be wrong.

>> -----Original Message-----
>> From: Yura Sokolov <y.sokolov@postgrespro.ru>
>> Sent: Wednesday, January 15, 2025 11:44 PM
>> To: pgsql-hackers@lists.postgresql.org
>> Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
>>
>> 07.10.2024 17:53, Aya Iwata (Fujitsu) wrote:
>>> Hi All,
>>>
>>> Suggestions
>>>
>>> ==========
>>>
>>> When analyzing real-time data collected by PostgreSQL,
>>>
>>> it can be difficult to tune the current PostgreSQL server for
>>> satisfactory performance.
>>>
>>> Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory
>>> column store function that holds data in a state suitable for business
>>> analysis and is also expected to improve analysis performance.
>>
>> I just don't get, why it should be "in-memory"? All the same things you
>> describe further, but storing in paged index on-disk with caching
>> through shared_buffers - why this way it wouldn't work?

-- 
regards
Yura Sokolov aka funny-falcon

RE: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

"Aya Iwata (Fujitsu)"

Date:

13 May, 04:08:34

Hello,

I found some dead codes in 0001 patches. I removed.
Here are new patches.

Regards,
Aya Iwata
Fujitsu Limited

Attachment

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

"Aya Iwata (Fujitsu)"

Date:

17 May, 19:31:30

Hi hackers,

I will share the notes on the discussions in PGConf.dev. Thanks all for participation.

Feedback on the Advanced Patch Feedback Session;

- A Basic idea that allows both OLTP/OLAP workloads on the same table is OK
  - Buffering mechanism has already been used by existing code GIN, IIUC
- IAM seems a more proper approach than TAM
  - One reason is that VCI can only optimize the data lookup stuff
  - TAM needs all possible queries; it's too much
- Divide the patch more and more
  - to allow committers to consider pros and cons and push one by one
  - 15 patches is the maximum amount
  - Separate features to some committable group

Sawada san suggested several points;
1. Find parts that are useful not only for the VCI and dispatching new threads
2. Develop codes incrementally. E.g., VCI has a custom scan, but we may
be able to give up on the first version


Also, I have questions and advice below on May 16th.

- How to handle visibility?
  - What if transactions that insert tuples are aborted?
- VCI vs Index Only Scan
  - VCI seems the same as Index Only Scan
  - VCI has compression. So, is the amount of size better than the Index?
- How about the efficiency for storage? How much data can we store?
- using TAM (Alvalo's suggestion)
  - Most of the code can be ported from the heap. ROS and WOS can exist as forks
    of main files, like .vm and .fm. API can be like:

```
CREATE TABLE foo (id, product, price) USING vci WITH (to_be_indexed=price);
```

Again, I really appreciate your efforts.


Regards,
Aya Iwata
Fujitsu Limited

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Tomas Vondra

Date:

24 May, 00:29:18

Hello Iwata-san,

Thanks for presenting the patch at pgconf.dev last week, and for the
discussions at the APFS session. As promised I did a quick review of the
patch this week, trying to understand the design, the trade-offs etc.

I'm still trying to understand the patch, so please correct me when I
got something wrong.

Before I get to the review comments (mostly about the first part) and
questions, let me comment on a couple items from your earlier response:

On 5/17/25 18:31, Aya Iwata (Fujitsu) wrote:
> Hi hackers,
> 
> I will share the notes on the discussions in PGConf.dev. Thanks all for participation.
>  
> Feedback on the Advanced Patch Feedback Session;
>  
> - A Basic idea that allows both OLTP/OLAP workloads on the same table is OK
>   - Buffering mechanism has already been used by existing code GIN, IIUC

Yes. GIN has a "fastupdate" feature, which allows accumulating entries
in a buffer, and merging them into the index later (as a batch, which
makes it more efficient). This is similar to the WOS/ROS architecture.

AFAIK some other columnar databases use a similar WOS/ROS architecture,
with similar goal (to move the index updates into a background job, and
make it more efficient).

I expect there to be a trade-off between the WOS size and query
performance. Larger WOS allows larger updates / better efficiency, but
it likely also means slower queries (that also have to scan the WOS).

> - IAM seems a more proper approach than TAM
>   - One reason is that VCI can only optimize the data lookup stuff
>   - TAM needs all possible queries; it's too much

I believe this (IAM vs. TAM) is very much an open question.

At the conference the goal of the patch was explained as a way to
improve performance with analytical queries, while maintaining OLTP
performance and features. From this point of view, it makes sense to
consider VCI to be an "accelerator" feature. Which is what indexes do,
i.e. it's a way to speed-up queries, but we can do without them if
that's faster.

If VCI got implemented as TAM, i.e. serving as the primary storage, we
wouldn't have this option - we'd have to do VCI even for OLTP workloads
(e.g. random inserts / selects), and that sounds quite inefficient. Not
to mention the VCI would need to support everything heap does, and it
supports a lot. VCI doesn't even support arbitrary data types, etc.

But this hinges on the assumption that the VCI can be implemented as an
IAM, right? And that IAM is a good fit for the stuff VCI needs to do.
Because if it can't be implemented as IAM, then the simplicity argument
is quite moot. I'm not saying it can't be done as IAM, I don't have a
clear opinion on that yet.

Another reason to think about this is that after the talk, someone in
the audience pointed out this patch originally started on Postgres 9.5,
and that back then there was no TAM interface. So when implementing it
there was no IAM vs. TAM choice, IAM was the only way to do something
like this, and thus it was used. Maybe if TAM was available back then,
it'd be done differently? Not sure.

Also, Alvaro seemed to think TAM is the way to go, and in order to keep
the OLTP performance he suggested to use both heap and VCI at the same
time, in different "forks". I'm not sure how would that work, or if we
can already do that - AFAIK we can't, because ForkNumber does not allow
adding custom forks. We'd have to relax that, or invent some sort of
federated TAM (that just multiplexes it to two TAMs). Maybe.

But it's not like the IAM approach doesn't need to do this. The first
patch had to add stuff to a lot of random places to make this work. And
some of the places touch stuff that we don't expect indexes to worry
about, like ALTER TABLE, etc.

> - Divide the patch more and more
>   - to allow committers to consider pros and cons and push one by one
>   - 15 patches is the maximum amount
>   - Separate features to some committable group
>  

Yes. Both parts need to be divided into small (but meaningful) pieces. I
don't recall where the "15 patches" came from, I don't think there's
some sort of maximum number of patches. All I care about is that the
patches need to be small-ish, and still make sense. Ideally, each part
should be committable on it's own. This makes it easier to test/review,
but also easier to agree on individual parts - in which case we may
commit the "preparatory" pieces and not have to rebase them forever.

The 0001 part is ~83K, which is not entirely crazy, but I can easily
imagine it split into ~10 parts, each addressing a different subset. For
0002 it's even more important, because no one can review a 1.3MB patch.

> Sawada san suggested several points;
> 1. Find parts that are useful not only for the VCI and dispatching new threads
> 2. Develop codes incrementally. E.g., VCI has a custom scan, but we may
> be able to give up on the first version
>  

Yes, I think this makes perfect sense.

I'd probably keep it in a single thread (essentially as a single patch
series), because it's very useful to see the "bigger picture" why the
parts are important. But if you find something that can stand on it's
own, then sure. But you'll still need to keep the part in the main patch
series (it should not be necessary for people to collect patches from
different threads, and it would also break the CI testing commitfests
patches).

Breaking 0002 into smaller parts makes perfect sense. Something like:

- minimum working patch, with only serial sequential scan
- add parallel sequential scan
- add other custom scans, one by one ...
- ...

> 
> Also, I have questions and advice below on May 16th.
>  
> - How to handle visibility?
>   - What if transactions that insert tuples are aborted?
> - VCI vs Index Only Scan
>   - VCI seems the same as Index Only Scan
>   - VCI has compression. So, is the amount of size better than the Index?
> - How about the efficiency for storage? How much data can we store?
> - using TAM (Alvalo's suggestion)
>   - Most of the code can be ported from the heap. ROS and WOS can exist as forks
>     of main files, like .vm and .fm. API can be like:
>  
> ```
> CREATE TABLE foo (id, product, price) USING vci WITH (to_be_indexed=price);
> ```
> 
Yes, I think those questions are important. Not only for understanding
the current patch, but also to understand what trade-offs the solution
can make (e.g. about the consistency/visibility model), etc.

What I really miss in the current patch is some sort of READMEs with

- high-level design of the VCI indexes

- description of the consistency/visibility model (does it behave the
same way as querying the heap, can it be out of sync for, ...)

- WOS/ROS architecture (when are rows inserted into WOS, promoted into
ROS, what triggers that, ...)

- what's the in-memory / on-disk format

- some places in the patch mention "internal VCI tables" but I have no
idea what that is

- how does the execution work? compression? crucial points to consider
for optimal performance, etc.

- limitations (temporary - can be relaxed in the future, permanent -
inherent to the columnar design) and trade-offs

- what are the various custom scan executor nodes

- what "background" processes happen (custom workers, ...)

- anything else substantial for understanding the design

Maybe there's some of this in the 0002 patch, but I haven't stumbled
over it so far.

------------------------------------------------------------------------

Finally, some review comments about the 0001 patch - it's mostly in the
order as appearing in git diff, file by file. Some of the points repeat,
because multiple places made me think about the same thing. And some of
it was mentioned above.

- It seems the VCI is implemented as an index AM, but it seems to be
trying to do some TAM stuff, e.g. with visibility info etc. For example,
I don't think indexes generally care about XIDs that deleted rows, etc.
Similarly, I can't think of another index AM that'd try to hook into
tablecmds.c or execExpr.

- Not sure what's the right choice (makes sense to have it as index on
top of heap, so that we do regular stuff for OLTP queries, and columnar
stuff using the index for analytics). But it seems it does stuff
expected from table AM and not index AM.

- But making a table AM would make it much more demanding (it'd have to
support storing all data types, ...) and would hurt OLTP performance.

- I very much support the idea of implementing this as an extension (so
minimal infrastructure changes to core + extension in contrib or
entirely outside).

- But the patch is not there, not even with 0001, because there's a
number of places mentioning "vci". There should be no reference to VCI
in core, it should be generic extensibility infrastructure. It's a good
indication of places that may need some extensibility.


relation_open
------------------
- Why is this change (disabling assert enforcing proper locking) needed?
There's a comment in lock.c which mentions VCI parallel workers, but
it's suspicious.


reloptions.c
------------------
- Changes that should be moved to the contrib module Why should in-core
reloptions know about this? See for example how "contrib/bloom" defined
reloptions.


heapam.c
------------------
- add_index_delete_hook, not sure I understand the purpose of this,
likely related to the consistency model, needs explanation. Why is it
prefixed with "add_"?

- If we need such DELETE hook, heapam.c is not where to call it, it
should be called from the same "layer" adding tuples to indexes (which
certainly does not happen in heap_insert).

- Furthermore, if we need this, why should it be a hook and not a new
callback in IndexAmRoutine, just like aminsert()?


initscan
------------------
- Isn't it using keep_startblock in a wrong way? it certainly is not
meant to indicate "keep the old number of blocks"

- Why is this even needed? This is in heapam, so why would this even
matter for VCI, which is an index? Shouldn't that be mostly independent
from which TAM is used by the table?

heapam_handler.c
------------------
- Why does this need to preserve the IndexHeapTuple at all? And why
should it be done in heapam_handler? Again, shouldn't IAM be mostly
independent of this?

- Using a global variable does not seem great. We probably can't have
concurrent calls (only a single heapam_index_build_range_scan call at a
time), but globals are dubious.

- If VCI really needs to remember the tuple, there should be a "proper"
way to explicitly pass it (adjust the IndexAmRoutine callback to have a
new argument or something like that).


heapam_visibility.c
-------------------
- It's probably sensible to make visibility snapshots extensible, i.e.
to allow defining custom snapshot types etc. But the patch still does
that by hard-coding stuff in core, even if the actual "visibility" code
is in extension.

- But this doesn't really allow custom snapshot types, those are still
hardcoded SNAPSHOT_VCI_WOS2ROS / SNAPSHOT_VCI_LOCALROS.

- The add_snapshot_satisfies_hook does not seem like the way to make
this extensible - what if two extensions want to do this. First
HeapTupleSatisfiesVisibility would have to know about all the
extensions, and it could easily lead to infinite loop. Because we call
the first hook, and it calls HeapTupleSatisfiesVisibility again.

- This is probably a good example of VCI trying to do TAM stuff. TAM
interface has tuple_satisfies_snapshot, which seems to have the same
purpose as the hook. Maybe not (it still assumes regular snapshots).

- But I still don't understand why should a new index AM require changes
to how heap AM does visibility checks (or any visibility checks in
general, as indexes don't have visibility info in PG). And why would it
make sense for heap to know about this. IAM is the layer *above* TAM, in
a way.


xlogfuncs.c
------------------
- RECOVERY_VCI_PAUSE_REQUESTED - what's that for? why do we need a
separate state from RECOVERY_PAUSE_REQUESTED?


xlogrecovery.c
------------------
- I don't see why we should not log that the recovery is paused because
of VCI. I sure users would still like to know why it's paused:

  /* If pause requested by VCI, the log is not output. */
  else if (GetRecoveryPauseState() != RECOVERY_VCI_PAUSE_REQUESTED)


dependency.c
------------------
- add_drop_relation_hook raises mostly the same questions as
add_index_delete_hook

- Don't we already have event triggers to do this kind of stuff in an
extensible way?

- Why should an index even care about the table getting dropped? I mean,
it'll get dropped too, before the table. So why this?


index.c
------------------

- Why do we need add_reindex_index_hook at all? we already have ambuild
callback, right?

- It seems really suspicious ConstructTupleDescriptor gains the ability
to index system attributes, why is that needed? surely the VCI does not
need to index system attributes, right? Or why is that needed?

- It seems add_reindex_index_hook is used to "skip" rebuilding the
index, but why would it be useful/necessary, and why should it be done here?

- Why couldn't ambuild() just return? How do I rebuild the index if
something goes wrong (e.g. it gets corrupted). Or to remove bloat, if
that applies to VCI.


explain.c
------------------
- T_CustomPlanMarkPos - so this introduces a new type of custom plan
node? How is that different from CustomScan? We don't even have MarkPos
plan in core, right? So what does it do?

- It even appears to be exactly the same as CustomScan in most places,
but it also introduces some new callbacks (ExplainCustomPlanTargetRel
and SetBoundCustomScan).

- If useful, this should be a separate patch (custom scan improvements).
And it definitely needs to update the CustomScan docs in SGML.

- I don't understand why we need ExplainPropertySortGroupKeys, does not
align with the rest of the API (ExplainProperty is for different data
types - float, text, ...). Maybe expose show_sort_group_keys? Or just
copy show_sort_group_keys into the extension ...


indexcmds.c
------------------
- Yet another place referencing "vci" explicitly (not extensible, should
not be in core)

- Not sure I understand the limitations mentioned there.


tablecmds.c
------------------
- add_alter_tablespace_hook, add_alter_table_change_owner_hook,
  add_alter_table_change_schema_hook

- Why does VCI this even need to handle these actions?

- And we have event triggers, so why have these new hooks?

- Not sure what the code in ATExecSetRelOptions does (another example
that handles "vci" AM explicitly, in a non-extensible way).

- Is it maybe rebuilding some reloptions? Maybe it's about the attnums?


vacuum.c
------------------
- add_skip_vacuum_hook is used to "disable" vacuum for internal VCI tables.

- Why. And what is "internal table"?


execAmi.c
------------------
- what's T_CustomPlanMarkPos?


execExpr.c
------------------
- Why does this expose ExecReadyExpr, ExecCreateExprSetupSteps?

- Isn't this a bit too fragile?

- Probably needed because VciExecInitExpr needs a new argument? But why
does it need that?


execExprInterp.c
------------------
- Why does this need ExprEvalVar_hook and ExprEvalParam_hook?

- Doesn't even check if the hooks are != NULL

- CASE_EEOP_VCI_VAR, CASE_EEOP_VCI_PARAM_EXEC - not extensible, should
not reference VCI explicitly

- VciExprEvalVarHook (for llvmjit?)



execGrouping.c
------------------
- LookupTupleHashEntry_internal removes comment, why?


execProcnode.c, execScan.c
------------------
- New MarkPos node, apparetly, but why do we need that? Why can't we
just build the Materialize when constructing the plan?


execUtils.c
------------------
- again, disables locking check, very suspicious


nodeCustom.c
------------------
- Isn't it strange that ExecInitCustomScan short-circuits the init,
skipping most of the initialization for selected scans?

- If anything, it should not hard-code the scan names like this. Why is
this necessary? is the rest of the initialization irrelevant, or what?
Maybe this is sign it should be table AM rather than index AM?

- But IOS scans would also do this initialization, so ...


nodeModifytable.c
------------------
- add_should_index_insert_hook another strange hook

- Doesn't the hook apply to all relations all the time, not just those
with VCI indexes etc.?


gen_node_support.c
------------------
- So is CustomPlanMarkPos a proper node, or why do we need it? why is
CustomPlan not enough?

- What's SmcAllocSetContext for? Doesn't seem to be mentioned anywhere.


params.c
------------------
- ExecInitParam_hook seems unused, no?


allpaths.c
------------------
- set_plain_rel_pathlist - I don't understand why this checks the VCI
vs. parallelism like this, what it's trying to prevent. Maybe if you
don't want certain plan shape, don't generate the partial paths in that
case from the CustomScan planner callback?

- It seems this tries to allow parallel seqscan, but not other more
complex parallel scans (like parallel agg/join ...), but that could be
done in callback, no?

- Again, we should not reference the index AM this explicitly, it's not
extensible.


createplan.c
------------------
- I don't think we want to make copy_plan_costsize part of API. Just a
detail, though.


autovacuum.c
------------------
- Why does this need a custom autovacuum launcher, or what's the purpose
of this code? is there a special autovacuum launcher for VCI indexes?

- Again, explicit references to "VCI", should not be done like this.


bgworker.c
------------------
- I have no idea what's the point of these changes. seems to be enabling
canceling bgworkers, but that seems like a generic improvement. In which
case it should be a separate patch.

- No idea if it's correct etc.

- Seems to be done to allow CountOtherDBBackends() to cancel backends,
so that it can be dropped etc.

- There apparently are bgworkers doing cleanup for VCI, and we want to
treat them like autovacuum workers (which get interrupted in various
cases instead of waiting forever).

- But this will cancel all bgworkers, no? Not just the VCI ones ...


lock.c
------------------
- this seems a bit strange, surely this should use the same locking
approach as other parallel queries

/*
 * Don't lock for VCI parallel workers and other type of workers should go
 * in normal flow, In case if there is any change in background worker
 * name for VCI parallel workers, the following code also needs an update.
 * FIXME: Try to use the community parallelism code, so that we don't need
 * our own VCI parallel infrastructure.
 */
if (AmBackgroundWorkerProcess() && strstr(MyBgworkerEntry->bgw_name,
"backend="))

- Presumably this is why some of the Asserts on locking were commented
out earlier.


lwlock.c
------------------
- Vci_lwlock_kind seems unused.


timestamp.c
------------------
- I believe the IntervalAggState is internal on purpose, why should it
be made external? It also means timestamp.h has to worry about FRONTEND.


relcache.c
------------------
- add_skip_vci_index_hook - yet another hook, same questions as for the
other hooks earlier

- It seems suspicious we have to skip VCI indexes like this during
planning in RelationGetIndexAttrBitmap, not sure why that's needed? Why
should't we cost the index and maybe reject it based on that?

- Again, references VCI explicitly, so not extensible.


mctx.c
------------------
- why MemoryContextMethods shouldn't be "const"?


common.c
pg_dump.c
pg_dump.h
------------------
- isvciview breaks extensibility, tied to a particular IAM / extension.
Needs to be abstracted / made extensible.

- What is "vci view table" for, actually?


reloptions.h
------------------
- Why does this need to add an entry into the relopt_kind enum? It
should be extensible, see the "contrib/bloom".


xlogrecovery.h
------------------
- Why is RECOVERY_VCI_PAUSE_REQUESTED separate from
RECOVERY_PAUSE_REQUESTED?


timestamp.h
------------------
- The FRONTEND stuff is a sign making IntervalAggState public may not be
a good idea.


executor.h
------------------
- How come it's suddenly OK to return entry->additional without the if
conditions in TupleHashEntryGetAdditional?

- Ah, it's now stored as a separate palloc chunk, not right after the
tuple? Well, maybe that's a separately useful improvement? should be a
separate patch, then. No opinion on whether it's a good idea.

- What's with the struct LimitState; ? Why is it needed?


extensible.h
------------------
- Adds SetBoundCustomScan / ExplainCustomPlanTargetRel, but then it
should also add them to the SGML docs.


memnodes.h
------------------
- What's SmcAllocSetContext? It's not defined anywhere.


params.h
------------------
- Adds noNeedLoadFromMain, loadedFromMain to ParamExecData, but it
doesn't seem to be used anywhere (in either patch).


plannodes.h
------------------
- Adds plan_no to Plan, but it's not set anywhere in core, only in the
VCI extension - which seems wrong (e.g. what if there are two extensions
that want to set it, and they disagree?)

- Either if we want/need this, it should be initialized by core during
planning, in a non-ambiguous way.

- But also how is this different from the existing plan_node_id?


autovacuum.h
------------------
- It seems a bit strage to add "built-in" bgworkers for an extension. If
the extension really needs some sort of workers, it should be defined in
the extension.


bgworker.h
------------------
- So what's the problem with canceling bgworkers? Should this be done
  as a separate patch?


itemptr.h
------------------
- Why do we need this SizeOfIptrData definition? How is it different
from sizeof(ItemPointerData)?


lwlock.h
------------------
- LWLockKind / Vci_lwlock_kind seems unused


numeric.h
------------------
- Why does this expose NumericVar, when the second patch redefines the
struct anyway? also, isn't that fragile? although, we're unlikely to
tweak the NumericVar definition.


rel.h
------------------
- Again, not extensible way to do stuff in an extension.

- We should not have pieces specific to AM in StdRdOptions.

- Doesn't AM have custom reloptions now? Like contrib/bloom?


I also tried to build 0002, but unfortunately that fails with:

executor/vci_executor.c: In function ‘vci_setup_executor_hook’:
executor/vci_executor.c:115:28: error: assignment to
‘ExecutorStart_hook_type’ {aka ‘void (*)(QueryDesc *, int)’} from
incompatible pointer type ‘_Bool (*)(QueryDesc *, int)’
[-Wincompatible-pointer-types]
  115 |         ExecutorStart_hook = vci_executor_start_routine;
      |                            ^
executor/vci_executor.c: In function ‘vci_executor_start_routine’:
executor/vci_executor.c:161:28: error: void value not ignored as it
ought to be
  161 |                 plan_valid = executor_start_prev(queryDesc, eflags);
      |                            ^
executor/vci_executor.c:163:28: error: void value not ignored as it
ought to be
  163 |                 plan_valid = standard_ExecutorStart(queryDesc,
eflags);
      |                            ^
make: *** [../../src/Makefile.global:973: executor/vci_executor.o] Error 1

The extension is not added to contrib/Makefile, so "make world" does not
trigger this failure.



regards

-- 
Tomas Vondra

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Jim Nasby

Date:

04 June, 20:59:18

On Fri, May 23, 2025 at 4:29 PM Tomas Vondra <tomas@vondra.me> wrote:

Also, Alvaro seemed to think TAM is the way to go, and in order to keep
the OLTP performance he suggested to use both heap and VCI at the same
time, in different "forks". I'm not sure how would that work, or if we
can already do that - AFAIK we can't, because ForkNumber does not allow
adding custom forks. We'd have to relax that, or invent some sort of
federated TAM (that just multiplexes it to two TAMs). Maybe.

But it's not like the IAM approach doesn't need to do this. The first
patch had to add stuff to a lot of random places to make this work. And
some of the places touch stuff that we don't expect indexes to worry
about, like ALTER TABLE, etc.

I suspect another option would be to handle this with table inheritance: have one child that is heap-based, a second that's VCI, and a background job to move data from heap to VCI (and vice-versa for updates and maybe deletes).

Note that you could actually implement all that in user-space. Personally I'd much rather have a way to do pure VCI / column-store sooner and manage it myself than have to wait another release (or more) to get a complete solution...

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Tomas Vondra

Date:

04 June, 21:15:58

On 6/4/25 19:59, Jim Nasby wrote:
> 
> 
> On Fri, May 23, 2025 at 4:29 PM Tomas Vondra <tomas@vondra.me
> <mailto:tomas@vondra.me>> wrote:
> 
>     Also, Alvaro seemed to think TAM is the way to go, and in order to keep
>     the OLTP performance he suggested to use both heap and VCI at the same
>     time, in different "forks". I'm not sure how would that work, or if we
>     can already do that - AFAIK we can't, because ForkNumber does not allow
>     adding custom forks. We'd have to relax that, or invent some sort of
>     federated TAM (that just multiplexes it to two TAMs). Maybe.
> 
>     But it's not like the IAM approach doesn't need to do this. The first
>     patch had to add stuff to a lot of random places to make this work. And
>     some of the places touch stuff that we don't expect indexes to worry
>     about, like ALTER TABLE, etc.
> 
> 
> I suspect another option would be to handle this with table inheritance:
> have one child that is heap-based, a second that's VCI, and a background
> job to move data from heap to VCI (and vice-versa for updates and maybe
> deletes).
> 
> Note that you could actually implement all that in user-space.
> Personally I'd much rather have a way to do pure VCI / column-store
> sooner and manage it myself than have to wait another release (or more)
> to get a complete solution...  

I don't see how could this ever work with the optimizer, which assumes
scanning an inheritance hierarchy means scanning all parts. But this
would require making planner "smarter" to know it should scan only one
of the child relations. And I believe it's not possible to do that while
constructing scans for the heap/VCI parts, those places are not aware of
what other parts are being scanned etc.

Sure, you could do this in "user-space" by constructing queries that
reference either the heap or VCI part. But then why put that into
inheritance tree at all? It certainly does not help with moving data
between the parts.

I may be missing something, of course. But it's not clear to me how is
this supposed to work ...

What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
columnar format in a separate fork. And using either that from custom
scans, or the heap as a fallback for cases not supported by VCI.

regars

-- 
Tomas Vondra

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Jim Nasby

Date:

05 June, 01:19:35

On Wed, Jun 4, 2025 at 1:16 PM Tomas Vondra <tomas@vondra.me> wrote:

On 6/4/25 19:59, Jim Nasby wrote:
>
>
> On Fri, May 23, 2025 at 4:29 PM Tomas Vondra <tomas@vondra.me
> <mailto:tomas@vondra.me>> wrote:
>
> Also, Alvaro seemed to think TAM is the way to go, and in order to keep
> the OLTP performance he suggested to use both heap and VCI at the same
> time, in different "forks". I'm not sure how would that work, or if we
> can already do that - AFAIK we can't, because ForkNumber does not allow
> adding custom forks. We'd have to relax that, or invent some sort of
> federated TAM (that just multiplexes it to two TAMs). Maybe.
>
> But it's not like the IAM approach doesn't need to do this. The first
> patch had to add stuff to a lot of random places to make this work. And
> some of the places touch stuff that we don't expect indexes to worry
> about, like ALTER TABLE, etc.
>
>
> I suspect another option would be to handle this with table inheritance:
> have one child that is heap-based, a second that's VCI, and a background
> job to move data from heap to VCI (and vice-versa for updates and maybe
> deletes).
>
> Note that you could actually implement all that in user-space.
> Personally I'd much rather have a way to do pure VCI / column-store
> sooner and manage it myself than have to wait another release (or more)
> to get a complete solution...

I don't see how could this ever work with the optimizer, which assumes
scanning an inheritance hierarchy means scanning all parts. But this
would require making planner "smarter" to know it should scan only one
of the child relations. And I believe it's not possible to do that while
constructing scans for the heap/VCI parts, those places are not aware of
what other parts are being scanned etc.

Right; I was envisioning that one child would be a conventional heap that stored very recent data and another child would be columnar in nature. So you'd definitely want to always look at both children.

I am making an assumption (based on the comment about multiple forks) that we'd have some way to handle VCI without having an actual heap.

Sure, you could do this in "user-space" by constructing queries that
reference either the heap or VCI part. But then why put that into
inheritance tree at all? It certainly does not help with moving data
between the parts.

Right; I only brought it up because just having a working column-store would be a big win, even if you had to code something to deal with any DML that wasn't already batch up. Of course it would be better if it just did the RightThing(TM) out of the box... but the perfect can be the enemy of the good.

What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
columnar format in a separate fork. And using either that from custom
scans, or the heap as a fallback for cases not supported by VCI.

Yeah, there'd definitely need to be some kind of proxy... I'm just suggesting that we don't *have* to do that as a separate fork...

Of course I could also just be missing something :)

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Jim Nasby

Date:

05 June, 01:51:16

On Wed, Jun 4, 2025 at 5:19 PM Jim Nasby <jnasby@upgrade.com> wrote:

What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
columnar format in a separate fork. And using either that from custom
scans, or the heap as a fallback for cases not supported by VCI.

Yeah, there'd definitely need to be some kind of proxy... I'm just suggesting that we don't *have* to do that as a separate fork...

(tl;dr: there are some key things that can only be implemented in the engine that would enable much more complex features to be added at the SQL level, without requiring tons of C code to implement the larger feature)

Oh, one other thing worth mentioning... it's actually not terribly hard to build a column-store in userspace today: just turn every column of a table into an array and set the TOAST target low enough so that it all gets toasted. I tested that many years ago, and even though I couldn't set the toast target back then saw some really encouraging results... provided that I constructed my queries carefully (I also had range fields for each column that stored the min/max of each array, so the planner could completely skip de-toasting anything that would not contain values of interest.)

The reason I never went anywhere with this concept is it'd be very hard for most folks to write queries that performed well. The transform from column back to row-based was actually hidden behind a view (a bunch of unnest()'s) - but if you didn't make use of the range fields in your query you lost a lot (but not all[1]) of the performance gain. I know that I could have used the hooks to teach the planner how to do this, but it would have been a huge amount of work (at least for me) to do so.

It did occur to me recently that a generic system for teaching the optimizer additional transforms it could make would be generally useful. By far the biggest example would be a way to teach it that

WHERE timestamp_field :: date = '2025-6-4'

is the same thing as

WHERE timestamp_field >= '2025-6-4' AND timestamp_field < '2025-6-5'

That would be extremely helpful in a lot of environments. There are definitely other cases where you can apply the same kinds of logic. In particular, such a feature (if generic enough) would make it possible to write simple queries against a view that transformed columnar data (stored as arrays) back into a row format *and* apply additional predicates that would make those queries highly efficient - all done via pure SQL.

[1] In my testing (which used the taxicab database) there was still a performance gain from storing the data as arrays, even if the queries to access it took no special efforts to eliminate unnecessary data. The reason is that TOAST meant that the base data was being compressed. In fact, testing showed that there was a win even if you didn't treat each individual column as an array; you could simply store an array of a composite type and still see a win.

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Peter Smith

Date:

19 June, 07:14:16

Here are the v8 patches. The main changes are as follows:

v8-0001 VCI - changes to postgres core
- same

v8-0002 VCI module - main
- extracted "compression" related code from this main patch
- applied Timur's top-up patch [1] re "session_preload_libraries"
- removed some dead code

v8-0003 VCI module - documentation
- removed mentions about compression

~~~

To avoid too much noise, other extracted patches (below) will be
maintained off-list.
0004 VCI module - compression
0005 VCI module - hothash
0006 VCI module - defer WOS2ROS

======
[1] https://www.postgresql.org/message-id/c4e314ef0a58050c8b7847ac1852555876674a69.camel%40postgrespro.ru

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Peter Smith

Date:

25 June, 05:22:57

Hi.

Here are the latest v9* patches. These have following changes:

0001
-  fixes some of the minor quirks reported by Tomas [1].

0002
- fixes to added address some of Timur's feedback/patches [2]
- test code updated to remove unsupported type
- test code updated to remove dynamic costings

======
[1] Tomas - https://www.postgresql.org/message-id/a748aa6b-c7e6-4d02-a590-ab404d590448%40vondra.me
[2] Timur - https://www.postgresql.org/message-id/6a6058fc089f89561b2545f024953e4daa0b8561.camel%40postgrespro.ru

Kind Regards,
Peter Smith
Fujitsu Australia

Attachment

Re: [WIP]Vertical Clustered Index (columnar store extension) - take2

From

Timur Magomedov

Date:

27 June, 17:24:41

On Wed, 2025-06-25 at 12:22 +1000, Peter Smith wrote:
> Hi.
>
> Here are the latest v9* patches. These have following changes:
>
> 0001
> -  fixes some of the minor quirks reported by Tomas [1].
>
> 0002
> - fixes to added address some of Timur's feedback/patches [2]
> - test code updated to remove unsupported type
> - test code updated to remove dynamic costings
>
Hello Peter!

Thanks for updates.
I was trying to move some code from Postgres core patch to contrib/vci.
Started with moving VCI options from StdRdOptions struct and
reloptions.c, reloptions.h files into contrib/vci like Tomas suggested
earlier. Getting new relopt kind using add_reloption_kind() instead of
hardcoded RELOPT_KIND_VCI and so on, just like in contrib/bloom.

During this work I've found that those VCI-specific options,
"vci_column_ids" and "vci_dropped_column_ids", are only read in case
vci_IsExtendedToMoreThan32Columns() is true. And
vci_IsExtendedToMoreThan32Columns() itself checks if vci_column_ids
option is non-empty string to return true. AFAIK no one sets those
options and vci_IsExtendedToMoreThan32Columns() always returns false.

There are comments in code mentioning that one can create VCI index
with more than 32 columns using vci_create(). However, there is no such
a function anywhere. I guess it is now impossible to create VCI index
with more than 32 (INDEX_MAX_KEYS) columns. Maybe the options were used
to store column information for indexes created with vci_create() but
this function has gone.

So, giving the ideas above, can we remove those vci_column_ids,
vci_dropped_column_ids options and all the code that assumes
vci_IsExtendedToMoreThan32Columns() to be true? This would make core
patch a bit shorter, specifically there will be no changes in
reloptions.{c,h} files and no changes in StdRdOptions struct in rel.h.

--
Regards,
Timur Magomedov