The need for clustered indexes to boost TPC-V performance - Mailing list pgsql-performance

Following the earlier email introducing the TPC-V benchmark, and that we are developing an industry standard benchmarking kit for TPC-V using PostgreSQL, here is a specific performance issue we have run into.

 

In running a TPC-E prototype of the benchmark on an 8-core Nehalem blade and a disk array with 14 SSDs and 90 spinning drives, we noticed that we are doing a lot more I/O than the TPC-E benchmark is supposed to produce.  Digging deeper, we noticed that the I/O rate (around 28K IOPS) was not unreasonable for our combination of SQL queries/table and index sizes/buffer pool size. What was unreasonable was the large size of the tables, and especially, of the indexes.

 

To put this in perspective, let us compare our situation to a published TPC-E result on MS SQL at http://bit.ly/QeWXhE. This was run on a similar server, and the database size is close to ours. Our table and index sizes should be 32.5% of the MS SQL size (for those who care, we populated the database with 300,000 customers and 125 Initial Trade Days; they built with 385,000 customers and 300 ITD). Look at page 34 of the disclosure for the table and index sizes, and focus on the large tables. For our large tables, this is what I am seeing:

 

 

                           List of relations

Schema |        Name        | Type  | Owner |    Size    | Description

--------+--------------------+-------+-------+------------+-------------

public | trade              | table | tpce  | 402 GB     |

 public | cash_transaction   | table | tpce  | 309 GB     |

 public | trade_history      | table | tpce  | 291 GB     |

 public | settlement         | table | tpce  | 203 GB     |

 public | holding_history    | table | tpce  | 183 GB     |

 public | daily_market       | table | tpce  | 21 GB      |

 public | holding            | table | tpce  | 15 GB      |

 

                               List of relations

Schema |  Name   | Type  | Owner |       Table        |  Size   | Description

--------+---------+-------+-------+--------------------+---------+-------------

public | idx_th  | index | tpce  | trade_history      | 186 GB  |

 public | idx_hh2 | index | tpce  | holding_history    | 133 GB  |

 public | idx_hh  | index | tpce  | holding_history    | 126 GB  |

 public | idx_t2  | index | tpce  | trade              | 119 GB  |

 public | idx_t3  | index | tpce  | trade              | 110 GB  |

 public | idx_se  | index | tpce  | settlement         | 63 GB   |

 public | idx_t   | index | tpce  | trade              | 62 GB   |

 public | idx_ct  | index | tpce  | cash_transaction   | 55 GB   |

 public | idx_h2  | index | tpce  | holding            | 12 GB   |

 

I don’t know Dell’s exact I/O rate, but judging by their storage configuration and what’s expected of the benchmark, we are several times too high. (Even after cutting the database size by a factor of 10, we are around twice the IOPS rate we should be at.)

 

Comparing the table sizes, we are close to 2X larger (more on this in a later note). But the index size is what stands out. Our overall index usage (again, after accounting for different numbers of rows) is 4.8X times larger. 35% of our I/Os are to the index space. I am guessing that the 4.8X ballooning has something to do with this, and that in itself explains a lot about our high I/O rate, as well as higher CPU/tran cycles compared to MS SQL (we are  2.5-3 times slower).

 

So I looked more closely at the indexes. I chose the CASH_TRANSACTION table since it has a single index, and we can compare it more directly to the Dell data. If you look at page 34 of http://bit.ly/QeWXhE, the index size of CT is 1,278,720KB for 6,120,529,488 rows. That’s less than one byte of index per data row!  How could that be?  Well, MS SQL used a “clustered index” for CT, i.e., the data is held in the leaf pages of the index B-Tree. The data and index are in one data structure. Once you lookup the index, you also have the data at zero additional cost. For PGSQL, we had to create a regular index, which took up 55GB. Once you do the math, this works out to around 30 bytes per row. I imagine we have the 15-byte key along with a couple of 4-byte or 8-byte pointers.

 

So MS SQL beats PGSQL by a) having a lower I/O rate due to no competition for the buffer pool from indexes (except for secondary indexes); and b) by getting the data with a free lookup, whereas we have to work our way down both the index and the data trees.

 

Dell created a clustered index for every single one of the 33 tables. Folks, past experiences with relational databases and TPC benchmarks tells me this could affect the bottom line performance of the benchmark by as much as 2X.

 

Checking online, the subject of clustered indexes for PostgreSQL comes up often. PGSQL does have a concept called “clustered table”, which means a table has been organized in the order of an index. This would help with sequential accesses to a table, but has nothing to do with this problem. PGSQL folks sometimes refer to what we want as “integrated index”.

 

Is the PGSQL community willing to invest in a feature that a) has been requested by many others already; and b) can make a huge difference in a benchmark that can lend substantial credibility to PGSQL performance?


Thanks,
Reza

 

pgsql-performance by date:

Previous
From: Reza Taheri
Date:
Subject: Introducing the TPC-V benchmark, and its relationship to PostgreSQL
Next
From: Craig Ringer
Date:
Subject: Re: Introducing the TPC-V benchmark, and its relationship to PostgreSQL