Re: Questions about btree_gin vs btree_gist for low cardinality columns - Mailing list pgsql-general

From Jeff Janes
Subject Re: Questions about btree_gin vs btree_gist for low cardinality columns
Date
Msg-id CAMkU=1xpw_U6ZvymatbOXi5A=e8=NmCzyLmAiBKNO3B6qKqijQ@mail.gmail.com
Whole thread Raw
In response to Questions about btree_gin vs btree_gist for low cardinality columns  (Jeremy Finzel <finzelj@gmail.com>)
Responses Re: Questions about btree_gin vs btree_gist for low cardinality columns
List pgsql-general
On Fri, May 24, 2019 at 11:26 AM Jeremy Finzel <finzelj@gmail.com> wrote:
I have been hoping for clearer direction from the community about specifically btree_gin indexes for low cardinality columns (as well as low cardinality multi-column indexes).  In general there is very little discussion about this both online and in the docs.  Rather, the emphasis for GIN indexes discussed is always on full text search of JSON indexing, not btree_gin indexes.

However, I have never been happy with the options open to me for indexing low cardinality columns and was hoping this could be a big win.  Often I use partial indexes as a solution, but I really want to know how many use cases btree_gin could solve better than either a normal btree or a partial index.

What does "low cardinality" mean here?  For example, I have a million different items with an item_id, but on average about 30 copies of each item.  The inventory table has a row for each copy (copies are not fully interchangeable, so can't be just be a count in some other table).  A million is not usually considered a low number, but I find a gin index (over btree_gin on the item_id) useful here as it produces a much smaller index due to not repeating the item_id each time and due to compressing the tid list, even though there are only about 30 tids to compress.

(You mention basically the reciprocal of this, 30 distinct values with 3,000,000 copies each, but I don't if that was your actual use case, or just the case you were reading about in the documents you were reading) 
 

Here are my main questions:

1.

"The docs say regarding *index only scans*: The index type must support index-only scans. B-tree indexes always do. GiST and SP-GiST indexes support index-only scans for some operator classes but not others. Other index types have no support. The underlying requirement is that the index must physically store, or else be able to reconstruct, the original data value for each index entry. As a counterexample, GIN indexes cannot support index-only scans because each index entry typically holds only part of the original data value."

This is confusing to say "B-tree indexes always do" and "GIN indexes cannot support index-only scans", when we have a btree_gin index type.  Explanation please ???

B-tree is the name of a specific index implementation in PostgreSQL.  That is what is referred to here.   btree_gin offers operators to be used in a GIN index to mimic/implement b-tree data structures/algorithms, but that doesn't make it the same thing. It is just a GIN index doing btree things.  Perhaps PostgreSQL should use a "brand name" for their specific implementation of the default index type, to distinguish it from the generic algorithm description.
 

Is it true that for a btree_gin index on a regular column, "each index entry typically holds only part of the original data value"?  Do these still not support index only scans?  Could they?  I can't see why they shouldn't be able to for a single indexed non-expression field?

For single column using a btree_gin operator, each index entry holds the entire data value.  But the system doesn't know about that in any useful way, so it doesn't implement index only scans, other than in the special case where the value does not matter, like a 'count(*)'.  Conceptually perhaps this could be fixed, but I don't see it happening. Since an index-only scan is usually not much good with only a single-column index, I don't see much excitement to improve things here. If if there is more than one column in the index, then for GIN the entire value from both columns would not be stored in the same index entry, so in this case it can't use an index-only scan even conceptually to efficiently fetch the value of one column based on the supplied value of another one.  
 

2. 

Lack of index only scans is definitely a downside.  However, I see basically identical performance, but way less memory and space usage, for gin indexes.  In terms of read-only performance, if index only scans are not a factor, why not always recommend btree_gin indexes instead of regular btree for low cardinality fields, which will yield similar performance but use far, far less space and resources?

GIN indexes over btree_gin operators do not support inequality or BETWEEN queries efficiently.  True read-onlyness is often talked about but rarely achieved, I would be reluctant to design a system around management promises that something won't ever change.  Btree indexes are way more thoroughly tested than GIN.  When I became interested in using them in more-or-less the way you describe, I started torture testing on them and quickly found some bugs (hopefully all fixed now, but I wouldn't bet my life on it).  btree_gin covers the most frequently used data types, but far from all of them.  And as described above, multi-column GIN indexes are just entirely different from multi-column B-Tree indexes, and generally not very useful.  And in the case of very low cardinality, I think indexing those at all will often only be useful as a component in a regular multi-column index, where the selectivity gets multiplied in with other columns in an efficient way.

With those caveats in mind, I would and do recommend them.  But where would such a recommendation be stored, other than email archives or blog posts?  Is there a part of the documentation you would propose amending? 
 

3.

This relates to 2.  I understand the write overhead can be much greater for GIN indexes, which is why the fastupdate feature exists.  But again, in those discussions in the docs, it appears to me they are emphasizing that penalty more for full text or json GIN indexes.  Does the same overhead apply to a btree_gin index on a regular column with no expressions?

It is not the same overhead.  With FTS, inserting a row (or non-HOT updating a row) without fastupdate means you would need to change a lot of individual index leaf pages, one for each token in the document.  With btree_gin, that particular thing should not be a problem.  If you have 30 values each with 1,000,000 rows, updating the gin index might cause more WAL traffic than a similar B-Tree index, because more of the leaf page may need to get modified in each update as you are recompressing a large list, but the difference is not that large, maybe a factor of 2 in my tests, and it might depend on many factors like fastupdate setting.  I do recall that the replay of GIN WAL  onto a standby was annoying slow, but I haven't tested it with your particular set up and not in the most recent versions.

Cheers,

Jeff

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: psql: FATAL: the database system is starting up
Next
From: Tom K
Date:
Subject: Re: psql: FATAL: the database system is starting up