Re: Index/Function organized table layout - Mailing list pgsql-hackers
| From | James Rogers | 
|---|---|
| Subject | Re: Index/Function organized table layout | 
| Date | |
| Msg-id | 1065117048.9267.39.camel@localhost.localdomain Whole thread Raw | 
| In response to | Re: Index/Function organized table layout (Alvaro Herrera <alvherre@dcc.uchile.cl>) | 
| Responses | Re: Index/Function organized table layout | 
| List | pgsql-hackers | 
On Wed, 2003-10-01 at 09:29, Alvaro Herrera wrote: > On Wed, Oct 01, 2003 at 11:37:38AM -0400, Tom Lane wrote: > > Hm, are you sure that smarter buffer management wouldn't serve the > > purpose? > > It doesn't help when there a lot of access locality in searching. In my > case I want to select some thousands of records that were inserted very > apart from each other, but are logically very near. Having this > pseudoheap that is ordered by definition helps very much with the > selection; the current heap requires me to bring to buffers lots of > uninteresting tuples, whichever buffer management algorithm is used, > because they are in the same page as interesting tuples. Yes, what Alvaro said. For very large tables that routinely run modest range queries, it can be very expensive in terms of cache efficiency if tuples that are closely grouped and ordered logically are scattered throughout the heap. The requirement to buffer a lot of unrelated data for typical case queries can greatly reduce the cache hit rate if the active portion of the data is already quite large relative to the physical RAM available. To give a real world example, a standard query on one of our tables that has not been CLUSTER-ed recently (i.e. within the last several days) generates an average of ~2,000 cache misses. Recently CLUSTER-ed, it generates ~0 cache misses on average. Needless to say, one is *much* faster than the other. The problem is that the number of buffers required to satisfy this query with the tuples scattered is enough to make it swap out the buffers of another competing query on another table that is also running. The result is that performance grinds to a halt as processes are competing with each other and trying to swap out each others buffers, resulting in a lot less *actual* buffering than should be occurring given the amount of data actually being queried. In my case, not only does CLUSTER-ing increase the number of concurrent queries possible without disk thrashing by an integer factor, but the number of buffers touched on a query that generates a cache misses is greatly reduced as well. The problem is that CLUSTER-ing is costly and index-organizing some of the tables would reduce the buffer needs, since the index tuple in these cases are almost as large as the heap tuples they reference. The classic scenario for this is when you have a large collection of time-series data stored in a table, with each series keyed to another table. The the typical tuple distribution creates pathological behaviors when buffer space becomes tight. Cheers, -James Rogersjamesr@best.com
pgsql-hackers by date: