Thread: proposal for smaller indexes on index-ordered tables

proposal for smaller indexes on index-ordered tables

From
"Jeffrey Baker"
Date:
The way I read it, the current btree index stores the index value and the TID of every tuple having that value.  When
youhave a table with three columns, you index one of them and you get an index which is practically as large as the
tableitself.<br /><br />Supposing the table is generally or strictly ordered by the column to be indexed, it would be
morecompact if the index stored ranges of tuples.  Instead of storing the TID of every tuple with that value, the index
wouldstore a first and last TID, between which all tuples have the value.<br /><br />Example: table with one million
rowsindexed on a column having one thousand distinct values.  Table is in-order by the indexed column.  The traditional
indexwould contain a million TIDs, whereas a range index would contain only two thousand.  The range index would be 500
timessmaller, more likely to be cached, etc.<br /><br />Thoughts?<br /><br />-jwb<br /> 

Re: proposal for smaller indexes on index-ordered tables

From
"Jonah H. Harris"
Date:
On Tue, Jun 24, 2008 at 4:34 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
> Supposing the table is generally or strictly ordered by the column to be
> indexed, it would be more compact if the index stored ranges of tuples.
> Instead of storing the TID of every tuple with that value, the index would
> store a first and last TID, between which all tuples have the value.

There are several databases which implement this idea.  Unfortunately,
Postgres does not yet ensure that indexed tables remain indexed.  As
such, an index such as this would soon be ineffective.  IIRC, Heikki
has done some work on keeping clustered tables clustered, but it
hasn't yet made it into core.

-- 
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/


Re: proposal for smaller indexes on index-ordered tables

From
Zoltan Boszormenyi
Date:
Jeffrey Baker írta:
> The way I read it, the current btree index stores the index value and
> the TID of every tuple having that value.  When you have a table with
> three columns, you index one of them and you get an index which is
> practically as large as the table itself.
>
> Supposing the table is generally or strictly ordered by the column to
> be indexed, it would be more compact if the index stored ranges of
> tuples.  Instead of storing the TID of every tuple with that value,
> the index would store a first and last TID, between which all tuples
> have the value.
>
> Example: table with one million rows indexed on a column having one
> thousand distinct values.  Table is in-order by the indexed column. 
> The traditional index would contain a million TIDs, whereas a range
> index would contain only two thousand.  The range index would be 500
> times smaller, more likely to be cached, etc.
>
> Thoughts?
>
> -jwb

Example with your theory:
One (not yet committed) transaction changes one tuple
that was in the middle of a range before but the tuple's indexed
column changed. What would you do?
You need to keep track of multiple index versions:
1. the range has to be split for the not-yet-committed modifier transaction,   it might need to re-read the same
table.
2. the old range has to be kept for reader transactions that still see
the old data
Imagine you have thousands of UPDATEs in flight on different rows.

Or you introduce "readers has to wait for writers locks" and
"updaters has to wait for other updaters on the same range"
that the MVCC implementation nicely avoids.

Look at MaxDB once, you'll appreciate PostgreSQL then.
MaxDB stores table tuples in the order of its primary key,
it uses a balanced btree for that. This means slower INSERTs and
UPDATEs and decreased concurrency compared to PostgreSQL.

Best regards,
Zoltán Böszörményi

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/



Re: proposal for smaller indexes on index-ordered tables

From
"Jeffrey Baker"
Date:
On Tue, Jun 24, 2008 at 1:59 PM, Zoltan Boszormenyi <zb@cybertec.at> wrote:
Jeffrey Baker írta:
> The way I read it, the current btree index stores the index value and
> the TID of every tuple having that value.  When you have a table with
> three columns, you index one of them and you get an index which is
> practically as large as the table itself.
>
> Supposing the table is generally or strictly ordered by the column to
> be indexed, it would be more compact if the index stored ranges of
> tuples.  Instead of storing the TID of every tuple with that value,
> the index would store a first and last TID, between which all tuples
> have the value.
>
> Example: table with one million rows indexed on a column having one
> thousand distinct values.  Table is in-order by the indexed column.
> The traditional index would contain a million TIDs, whereas a range
> index would contain only two thousand.  The range index would be 500
> times smaller, more likely to be cached, etc.
>
> Thoughts?
>
> -jwb

Example with your theory:
One (not yet committed) transaction changes one tuple
that was in the middle of a range before but the tuple's indexed
column changed. What would you do?

Insert the new tuple at the end of the table and add another range to the index.  Leave the old tuple in place and don't touch the original index range.
 
You need to keep track of multiple index versions:
1. the range has to be split for the not-yet-committed modifier transaction,
   it might need to re-read the same table.
2. the old range has to be kept for reader transactions that still see
the old data

This is only true if you update the tuple in-place. 
 
Imagine you have thousands of UPDATEs in flight on different rows.

I'm quite aware of the problems of maintaining such a table and index, but the fact is that data warehouse type tables may never be updated after being created.  The particular application I'm struggling with does a SELECT ... INTO ... ORDER BY to make an ordered table for querying every night.  The problem is it takes longer, much longer, to create the index than to create the table, and in the end the index is as big as half the table anyway.

So this type of index would only be useful for an essentially read-only table.  I agree.

Quite another proposal would be to somehow instruct the database that the table is strictly in-order by a column and allow a binary search access method.  Then you don't need any index at all.

-jwb

Re: proposal for smaller indexes on index-ordered tables

From
Zoltan Boszormenyi
Date:
Jeffrey Baker írta:
> On Tue, Jun 24, 2008 at 1:59 PM, Zoltan Boszormenyi <zb@cybertec.at
> <mailto:zb@cybertec.at>> wrote:
>
>     Jeffrey Baker írta:
>     > The way I read it, the current btree index stores the index
>     value and
>     > the TID of every tuple having that value.  When you have a table
>     with
>     > three columns, you index one of them and you get an index which is
>     > practically as large as the table itself.
>     >
>     > Supposing the table is generally or strictly ordered by the
>     column to
>     > be indexed, it would be more compact if the index stored ranges of
>     > tuples.  Instead of storing the TID of every tuple with that value,
>     > the index would store a first and last TID, between which all tuples
>     > have the value.
>     >
>     > Example: table with one million rows indexed on a column having one
>     > thousand distinct values.  Table is in-order by the indexed column.
>     > The traditional index would contain a million TIDs, whereas a range
>     > index would contain only two thousand.  The range index would be 500
>     > times smaller, more likely to be cached, etc.
>     >
>     > Thoughts?
>     >
>     > -jwb
>
>     Example with your theory:
>     One (not yet committed) transaction changes one tuple
>     that was in the middle of a range before but the tuple's indexed
>     column changed. What would you do?
>
>
> Insert the new tuple at the end of the table and add another range to
> the index.  Leave the old tuple in place and don't touch the original
> index range.

This is what I described below but I only mentioned the index part:

>     You need to keep track of multiple index versions:
>     1. the range has to be split for the not-yet-committed modifier
>     transaction,
>        it might need to re-read the same table.
>     2. the old range has to be kept for reader transactions that still see
>     the old data
>
>
> This is only true if you update the tuple in-place.

Why? If you update in-place then the above is not needed.
You just need to serialize transactions but there goes concurrency.

>     Imagine you have thousands of UPDATEs in flight on different rows.
>
>
> I'm quite aware of the problems of maintaining such a table and index,
> but the fact is that data warehouse type tables may never be updated
> after being created.  The particular application I'm struggling with
> does a SELECT ... INTO ... ORDER BY to make an ordered table for
> querying every night.  The problem is it takes longer, much longer, to
> create the index than to create the table, and in the end the index is
> as big as half the table anyway.
>
> So this type of index would only be useful for an essentially
> read-only table.  I agree.
>
> Quite another proposal would be to somehow instruct the database that
> the table is strictly in-order by a column and allow a binary search
> access method.  Then you don't need any index at all.

CLUSTER tablename USING indexname;
It's useful for little changing very large tables.

> -jwb
>


-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/



Re: proposal for smaller indexes on index-ordered tables

From
Tom Lane
Date:
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> I'm quite aware of the problems of maintaining such a table and index, but
> the fact is that data warehouse type tables may never be updated after being
> created.  The particular application I'm struggling with does a SELECT ...
> INTO ... ORDER BY to make an ordered table for querying every night.  The
> problem is it takes longer, much longer, to create the index than to create
> the table, and in the end the index is as big as half the table anyway.

There's something wrong with that: sorting the table rows surely ought
to take longer than sorting the same number of (smaller) index entries.
Have you done any profiling to find out what the problem is?  Perhaps
there's something wrong with the setting of maintenance_work_mem (vs
work_mem).
        regards, tom lane


Re: proposal for smaller indexes on index-ordered tables

From
"Jeffrey Baker"
Date:
On Tue, Jun 24, 2008 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> I'm quite aware of the problems of maintaining such a table and index, but
> the fact is that data warehouse type tables may never be updated after being
> created.  The particular application I'm struggling with does a SELECT ...
> INTO ... ORDER BY to make an ordered table for querying every night.  The
> problem is it takes longer, much longer, to create the index than to create
> the table, and in the end the index is as big as half the table anyway.

There's something wrong with that: sorting the table rows surely ought
to take longer than sorting the same number of (smaller) index entries.
Have you done any profiling to find out what the problem is?  Perhaps
there's something wrong with the setting of maintenance_work_mem (vs
work_mem).

For this query, work_mem is 100MB and maintenance_work_mem is 1GB, on a system with 8GB of memory.  Notably I just installed a new storage subsystem and upgraded to 8.3.1 less than a week ago, so my experience with this instance is somewhat limited.  Creating the table in this case takes half an hour and then indexing it requires almost an hour.  Subsequently analyzing the table takes less than a minute, with statistics set to maximum.

Query performance is excellent.  I was just brainstorming on ways to save time on the creation.

-jwb

Re: proposal for smaller indexes on index-ordered tables

From
Tom Lane
Date:
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> For this query, work_mem is 100MB and maintenance_work_mem is 1GB, on a
> system with 8GB of memory.  Notably I just installed a new storage subsystem
> and upgraded to 8.3.1 less than a week ago, so my experience with this
> instance is somewhat limited.  Creating the table in this case takes half an
> hour and then indexing it requires almost an hour.

These numbers seem to me to be pretty strong evidence that
maintenance_work_mem = 1GB is a mistake.  Try it at 100MB and then some
intermediate values.

Now, *why* it is a mistake is interesting to speculate about, but
let's confirm the theory first.
        regards, tom lane


Re: proposal for smaller indexes on index-ordered tables

From
"Kevin Grittner"
Date:
>>> On Tue, Jun 24, 2008 at  4:54 PM, in message
<7020.1214344479@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> "Jeffrey Baker" <jwbaker@gmail.com> writes:
>> Creating the table in this case takes half an
>> hour and then indexing it requires almost an hour.
> 
> These numbers seem to me to be pretty strong evidence that
> maintenance_work_mem = 1GB is a mistake.  Try it at 100MB and then
some
> intermediate values.
> 
> Now, *why* it is a mistake is interesting to speculate about, but
> let's confirm the theory first.
Could this be related to hint bit rewrites during indexing?
Would a vacuum between creation and indexing be a good way to tell?
-Kevin


Re: proposal for smaller indexes on index-ordered tables

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> Now, *why* it is a mistake is interesting to speculate about, but
>> let's confirm the theory first.
> Could this be related to hint bit rewrites during indexing?

If so, changing maintenance_work_mem won't improve the situation.

What I personally suspect is that Jeff's index build is swapping like
crazy, or else there's just some problem in the sort code for such a
large sort arena.  But let's get some evidence about how the index build
time varies with maintenance_work_mem before jumping to conclusions.

> Would a vacuum between creation and indexing be a good way to tell?

Yeah, that might be a useful experiment to try too.  It wouldn't improve
the overall time AFAICS, but it would give us some idea how much of the
indexing time was really spent on hintbits.
        regards, tom lane


Re: proposal for smaller indexes on index-ordered tables

From
"Jeffrey Baker"
Date:
On Tue, Jun 24, 2008 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, *why* it is a mistake is interesting to speculate about, but
>> let's confirm the theory first.

> Could this be related to hint bit rewrites during indexing?

If so, changing maintenance_work_mem won't improve the situation.

What I personally suspect is that Jeff's index build is swapping like
crazy, or else there's just some problem in the sort code for such a
large sort arena.  But let's get some evidence about how the index build
time varies with maintenance_work_mem before jumping to conclusions.

Well it definitely isn't that, because the machine doesn't even have a swap area defined.  vmstat during the table creation and index creation look really quite different.  During the table sort there's a heavy r/w traffic 12-20MB/s, during the index creation it's lower.  But seem to be CPU limited (i.e. one CPU is maxed out the whole time, and iowait is not very high).

I guess nobody has any interest in my proposal, only in the departure of my described experience from expected behavior :-(

Re: proposal for smaller indexes on index-ordered tables

From
"Jaime Casanova"
Date:
On Tue, Jun 24, 2008 at 3:50 PM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> On Tue, Jun 24, 2008 at 4:34 PM, Jeffrey Baker <jwbaker@gmail.com> wrote:
>> Supposing the table is generally or strictly ordered by the column to be
>> indexed, it would be more compact if the index stored ranges of tuples.
>> Instead of storing the TID of every tuple with that value, the index would
>> store a first and last TID, between which all tuples have the value.
>
> There are several databases which implement this idea.  Unfortunately,
> Postgres does not yet ensure that indexed tables remain indexed.
>

Just for the records. you mean *ordered* tables, don't you?
Postgres does not yet ensure that ordered tables remain ordered.

--
regards,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Guayaquil - Ecuador
Cel. (593) 87171157


Re: proposal for smaller indexes on index-ordered tables

From
Tom Lane
Date:
"Jeffrey Baker" <jwbaker@gmail.com> writes:
> I guess nobody has any interest in my proposal, only in the departure of my
> described experience from expected behavior :-(

Well, we certainly should try to understand the unexpected behavior
in detail before we consider solutions.  Per Sir A.C. Doyle, it is a
capital mistake to theorize in advance of the data.

(It's probably also worth noting that this community's historical
interest has not been in read-only or even read-mostly data.  We'd
not be willing to pay all that MVCC overhead if we thought we were
just a warehouse of static data.  If that's what you want, maybe
you need some other DBMS.)
        regards, tom lane


Re: proposal for smaller indexes on index-ordered tables

From
"Heikki Linnakangas"
Date:
Jeffrey Baker wrote:
> The way I read it, the current btree index stores the index value and the
> TID of every tuple having that value.  When you have a table with three
> columns, you index one of them and you get an index which is practically as
> large as the table itself.
> 
> Supposing the table is generally or strictly ordered by the column to be
> indexed, it would be more compact if the index stored ranges of tuples.
> Instead of storing the TID of every tuple with that value, the index would
> store a first and last TID, between which all tuples have the value.

Search the archives for the Grouped Index Tuples, also known as 
Clustered Indexes patch I worked on in spring 2007. It did almost 
exactly that.

It didn't make it into 8.3 for various reasons: the patch was quite 
invasive, the design and performance characteristics were not 
well-understood by fellow hackers, and there was not that much interest 
in it back then.

I'm not working on it or planning to work on it for now, but if you're 
interested, the patch is still out there. It requires a lot of work, but 
the design is still viable. I'm certainly willing to help with it if 
someone wants to pick it up.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com