Re: The need for clustered indexes to boost TPC-V performance - Mailing list pgsql-performance

From Claudio Freire
Subject Re: The need for clustered indexes to boost TPC-V performance
Date
Msg-id CAGTBQpY4J2on5=gN5-ykSpZkxVeGHAfiZwu0mXpqsSXwoxxnPg@mail.gmail.com
Whole thread Raw
In response to The need for clustered indexes to boost TPC-V performance  (Reza Taheri <rtaheri@vmware.com>)
List pgsql-performance
On Tue, Jul 3, 2012 at 8:13 PM, Reza Taheri <rtaheri@vmware.com> wrote:
> So I looked more closely at the indexes. I chose the CASH_TRANSACTION
> table since it has a single index, and we can compare it more directly to the
> Dell data. If you look at page 34 of http://bit.ly/QeWXhE, the index size of CT
> is 1,278,720KB for 6,120,529,488 rows. That’s less than one byte of index
> per data row!  How could that be?  Well, MS SQL used a “clustered index”
> for CT, i.e., the data is held in the leaf pages of the index B-Tree.
> The data and index are in one data structure. Once you lookup the index,
> you also have the data at zero additional cost. For PGSQL, we had to create
> a regular index, which took up 55GB. Once you do the math, this works out
> to around 30 bytes per row. I imagine we have the 15-byte key along with a
> couple of 4-byte or 8-byte pointers.
...
> So MS SQL beats PGSQL by a) having a lower I/O rate due to no competition
> for the buffer pool from indexes (except for secondary indexes); and b) by
> getting the data with a free lookup, whereas we have to work our way down
> both the index and the data trees.

15-byte key?

What about not storing the keys, but a hash, for leaf nodes?

Assuming it can be made to work for both "range" and "equality" scans,
holding only hashes on leaf nodes would reduce index size, but how
much?

I think it's doable, and I could come up with a spec if it's worth it.
It would have to scan the heap for only two extra index pages (the
extremes that cannot be ruled out) and hash collisions, which doesn't
seem like a big loss versus the reduced index.

pgsql-performance by date:

Previous
From: Daniel Farina
Date:
Subject: Re: The need for clustered indexes to boost TPC-V performance
Next
From: Reza Taheri
Date:
Subject: Re: Introducing the TPC-V benchmark, and its relationship to PostgreSQL