Thread: new index type with clustering in mind.

new index type with clustering in mind.

From
"Jack Douglas"
Date:

Hi

 

Please forgive this question if it is naïve.

 

Would the following be practical to implement:

 

A btree-like index type that points to *pages* rather than individual rows. Ie if there are many rows in a page with the same data (in the indexed columns), only one index entry will exist. In its normal use case, this index would be much smaller than a regular index on the same columns which would contain one entry for each individual row.

 

To reduce complexity (eg MVCC/snapshot related issues), index entries would be added when a row is inserted, but they would not be removed when the row is updated/deleted (or when an insert is rolled back): this would cause index bloat over time in volatile tables but this would be acceptable for the use case I have in mind. So in essence, an entry in the index would indicate that there *may* be matching rows in the page, not that there actually are.

 

For data where there is a high degree of clustering on the columns of the index, this index would be quite efficient at locating matching rows: the index itself would be small and most pages would contain largely rows that match. (In practice it would behave similar to a ‘cluster’ index in Oracle, without the guarantee of clustering).

 

This brings me on to my use case: a method to efficiently keep a table largely clustered (with ‘clustered’ more loosely defined than the usual Postgres definition) without locking more than a page at a time. The above index could be scanned periodically for entries where multiple keys point to the same page. Of that set of keys/pages, the subset where the same key appears in multiple pages can be clustered by simply deleting and reinserting the rows, key by key. In this case, the pages affected by the deletes could be locked briefly in order to safely delete the corresponding index entries without conflict with other transactions. If a lock can’t be obtained the process could simply skip and move on to the next candidate key/pages.

 

Although this kind of ‘clustered’ data is less ordered that a normal Postgres cluster, it retains some of the useful properties: particularly that data for a particular key is likely to be found in a small number of pages rather than scattered evenly through the entire table, requiring potentially far more io to select.

 

Many thanks in advance and my apologies once again if this kind of thing has been rejected before or is obviously impractical.

Jack

 

 

 

 

 

Re: new index type with clustering in mind.

From
Martijn van Oosterhout
Date:
On Sat, May 24, 2014 at 05:58:37PM +0100, Jack Douglas wrote:
> Would the following be practical to implement:

> A btree-like index type that points to *pages* rather than individual rows.
> Ie if there are many rows in a page with the same data (in the indexed
> columns), only one index entry will exist. In its normal use case, this
> index would be much smaller than a regular index on the same columns which
> would contain one entry for each individual row.

> To reduce complexity (eg MVCC/snapshot related issues), index entries would
> be added when a row is inserted, but they would not be removed when the row
> is updated/deleted (or when an insert is rolled back): this would cause
> index bloat over time in volatile tables but this would be acceptable for
> the use case I have in mind. So in essence, an entry in the index would
> indicate that there *may* be matching rows in the page, not that there
> actually are.

It's an interesting idea, but, how can you *ever* delete index entries?
I.e. is there a way to maintain the index without rebuilding it
regularly?

Maybe there's something you could do with tracking all the entries that
point to one page or something, or a counter.  Because really, the fact
that the item pointer in a btree index includes the item number is only
really needed for deletion.  Postgres always has to read in the whole
page anyway, so if you can find a way around that it might be an
interesting improvement.

Mind you, hash indexes could get this almost free, except they're not
crash safe.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

Attachment

Re: new index type with clustering in mind.

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, May 24, 2014 at 05:58:37PM +0100, Jack Douglas wrote:
>> Would the following be practical to implement:
>> A btree-like index type that points to *pages* rather than individual rows.

> It's an interesting idea, but, how can you *ever* delete index entries?
> I.e. is there a way to maintain the index without rebuilding it
> regularly?

The discussions at PGCon pointed out that with the posting-list
compression logic added in 9.4, GIN indexes are pretty close to this
already.  Multiple items on the same heap page will typically only take
one byte of index space per item; but there is an identifiable entry, so
you don't get into these questions of when VACUUM should remove entries,
and it's not lossy so you're not forced to pay the overhead of rechecking
every entry on the linked-to page.

Not to say that 9.4 GIN is necessarily the last word on the subject, but
it would be worth testing it out before deciding that we need something
better.  (beta1 is out.  It needs testing.  Hint hint.)

            regards, tom lane


Re: new index type with clustering in mind.

From
"Jack Douglas"
Date:
> The discussions at PGCon pointed out that with the posting-list
compression logic added in 9.4, GIN indexes are pretty close to this
already.  Multiple items on the same heap page will typically only take one
byte of index space per item; but there is an identifiable entry, so you
don't get into these questions of when VACUUM should remove entries, and
it's not lossy so you're not forced to pay the overhead of rechecking every
entry on the linked-to page.

> Not to say that 9.4 GIN is necessarily the last word on the subject, but
it would be worth testing it out before deciding that we need something
better.  (beta1 is out.  It needs testing.  Hint hint.)

Hint taken, and first impressions are positive: the compression is very
efficient for the kind of scenario I'm imagining where the key is
deliberately chosen so that the average page has one distinct key. I have a
25Mb gin 'cluster' index on a table where an equivalent regular btree index
is 10 times as large.

So the questions are, a) is this kind of clustering broadly useful (ie not
just to me), b) how much effort will it be to implement a 'vacuum-like'
operation that scans a designated index and performs the relevant
delete/inserts to achieve this kind of clustering? And c) if it is broadly
useful and not a major implementation mountain to climb, is it something
that might be added to the todo list?

If someone can tell me how to decode a `ctid` into a page number (discarding
the row number portion - is there a better way than `
(replace(replace(ctid::text,'(','{'),')','}')::integer[])[1]`), I should be
able to show some analysis demonstrating this working, albeit inefficiently
as I'll have to scan the table itself for the page/key statistics. Would
that sort of analysis be helpful?

Kindest regards
Jack

PS It occurs to me that the btree_gin documentation page for 9.4,
http://www.postgresql.org/docs/9.4/static/btree-gin.html, might benefit from
including some mention of index compression when discussing the relative
performance of regular and gin btree indexes.



Re: new index type with clustering in mind.

From
"Jack Douglas"
Date:
> > > To reduce complexity (eg MVCC/snapshot related issues), index entries
> > > would be added when a row is inserted, but they would not be removed
> > > when the row is updated/deleted (or when an insert is rolled back):

> > It's an interesting idea, but, how can you *ever* delete index entries?
> > I.e. is there a way to maintain the index without rebuilding it
> > regularly?
>
> The discussions at PGCon pointed out that with the posting-list
compression
> logic added in 9.4, GIN indexes are pretty close to this already.
Multiple
> items on the same heap page will typically only take one byte of index
space
> per item; but there is an identifiable entry, so you don't get into these
> questions of when VACUUM should remove entries, and it's not lossy so
> you're not forced to pay the overhead of rechecking every entry on the
> linked-to page.

The README file in the source code for GIN indexes says: "Note: There is no
delete operation in the key (entry) tree. The reason for this is that in our
experience, the set of distinct words in a large corpus changes very slowly.
This greatly simplifies the code and concurrency algorithms.". Does that
mean that for the case where GIN is used as a simple replacement for btree,
my initial suggestion above ("...would be added when a row is inserted, but
they would not be removed...") is effectively what happens already with GIN
indexes?

---

I've written up a test to demonstrate the principle of this 'clustering
lite': http://dba.stackexchange.com/q/66292/1396

In brief, it shows the benefit of this sort of clustering (much lower io
versus unclustered, and no exclusive lock or heavy WAL generation versus
current clustering implementations) with a workable way of achieving it in
the current release. The basic idea is:

1. turn off autovacuum for the table
2. check each block to determine the degree of clustering
3. delete and re-insert all the rows from blocks below a clustering
threshold
4. manually vacuum to free those (complete) blocks
5. repeat steps 2-4 as regularly as necessary

A weakness is that is requires a full table scan is required. All that is
needed to avoid the full-scan would be a way of getting `blkno` from an
index-only scan, which as far as I can tell is not currently possible, is
it?



Re: new index type with clustering in mind.

From
"Jack Douglas"
Date:
> in 9.4, GIN indexes are pretty close to this already

Do I understand correctly that BRIN indexes will be even closer to this?

Kindest regards
Jack

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 24 May 2014 22:46
To: Martijn van Oosterhout
Cc: Jack Douglas; pgsql-general@postgresql.org
Subject: Re: [GENERAL] new index type with clustering in mind.

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, May 24, 2014 at 05:58:37PM +0100, Jack Douglas wrote:
>> Would the following be practical to implement:
>> A btree-like index type that points to *pages* rather than individual
rows.

> It's an interesting idea, but, how can you *ever* delete index entries?
> I.e. is there a way to maintain the index without rebuilding it
> regularly?

The discussions at PGCon pointed out that with the posting-list compression
logic added in 9.4, GIN indexes are pretty close to this already.  Multiple
items on the same heap page will typically only take one byte of index space
per item; but there is an identifiable entry, so you don't get into these
questions of when VACUUM should remove entries, and it's not lossy so you're
not forced to pay the overhead of rechecking every entry on the linked-to
page.

Not to say that 9.4 GIN is necessarily the last word on the subject, but it
would be worth testing it out before deciding that we need something better.
(beta1 is out.  It needs testing.  Hint hint.)

            regards, tom lane



Re: new index type with clustering in mind.

From
Alvaro Herrera
Date:
Jack Douglas wrote:
> > in 9.4, GIN indexes are pretty close to this already
>
> Do I understand correctly that BRIN indexes will be even closer to this?
>

Yeah, in a way.  You could say they are closer from the opposite end.
There is one index tuple in a BRIN index for each page range (contiguous
set of pages); each index tuple contains a "summary" of what in that
page range.  There are no exact entries.  If the values are randomly
scattered, the index is useless; all page ranges will have to be scanned
for possibly matching tuples.  If the values are perfectly clustered,
the index is optimal because you scan the minimal set of pages.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: new index type with clustering in mind.

From
"Jack Douglas"
Date:
> If the values are perfectly clustered, the index is optimal because you
scan the minimal set of pages.

That's the bit I'm particularly interested in, as my plan would be to keep
the pages well clustered: http://dba.stackexchange.com/a/66293/1396

Do you see any blocker preventing BRIN being used for a continuous
background re-clustering job (in parallel with or as part of vacuum),
similar to the mechanism I experimented with before? If not is this
something there might be support for adding to the TODO list?

Kindest regards
Jack




Re: new index type with clustering in mind.

From
Alvaro Herrera
Date:
Jack Douglas wrote:
> > If the values are perfectly clustered, the index is optimal because you
> scan the minimal set of pages.
>
> That's the bit I'm particularly interested in, as my plan would be to keep
> the pages well clustered: http://dba.stackexchange.com/a/66293/1396
>
> Do you see any blocker preventing BRIN being used for a continuous
> background re-clustering job (in parallel with or as part of vacuum),
> similar to the mechanism I experimented with before? If not is this
> something there might be support for adding to the TODO list?

In principle, CLUSTER sucks, and having data clustered is clearly good,
so improvements in this area are certainly welcome.

If you were to propose some general mechanism that works for any index,
I don't see that we would reject having it work specifically for BRIN;
having it for BRIN only would be strange.  (I guess it's good enough if
it works for btree and BRIN.  Not sure about GiST and SP-GiST; GIN
clearly is of no use here.)

Currently, one issue you're going to face is that brin doesn't rescan a
range to find the tighest possible summary tuple.  Thinking in min/max
terms, once a tuple with a very high or very low is inserted, the range
doesn't get any smaller once that tuple is deleted from the range.  You
would need to find a way to fix that.  (The easiest way is to REINDEX
the whole thing, of course, but that processes the whole table and not
just some portion of it.)

Another issue is how to find the best possible ordering.  For minmax
opclasses it's easy, but for other opclass designs it's less clear what
to do.  Even for minmax you need to find some way to communicate to the
system what's the order to follow ...

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: new index type with clustering in mind.

From
"Jack Douglas"
Date:
> Currently, one issue you're going to face is that brin doesn't rescan a
range to
> find the tighest possible summary tuple.

That's going to be an issue I think, thanks for mentioning it. We'd need
some sort of mechanism for achieving this without a complete REINDEX, even
if it only reset the min/max when all the blocks in the range are entirely
cleared out. Ah well :)

> Another issue is how to find the best possible ordering.  For minmax
> opclasses it's easy, but for other opclass designs it's less clear what to
do.
> Even for minmax you need to find some way to communicate to the system
> what's the order to follow ...

Do you mean the ordering for the clustered table tuples or the ordering of
index tuples in the BRIN index? I'm the former because I'm also assuming you
always scan an entire BRIN index as there isn't a trivial way of optimizing
the index scan for ranges (unless you 'granulate' the ranges along the lines
of this: http://dba.stackexchange.com/a/22295/1396)?

If you mean the clustering order, for the use cases I'm concerned with it
isn't important - as long as tuples with the same cluster key gravitate
towards the same blocks, it often doesn't matter what order those blocks are
in because the main mission is to reduce the number of blocks scanned.