Re: Index/Function organized table layout - Mailing list pgsql-hackers

From James Rogers
Subject Re: Index/Function organized table layout
Date
Msg-id 1065117048.9267.39.camel@localhost.localdomain
Whole thread Raw
In response to Re: Index/Function organized table layout  (Alvaro Herrera <alvherre@dcc.uchile.cl>)
Responses Re: Index/Function organized table layout
List pgsql-hackers
On Wed, 2003-10-01 at 09:29, Alvaro Herrera wrote:
> On Wed, Oct 01, 2003 at 11:37:38AM -0400, Tom Lane wrote:
> > Hm, are you sure that smarter buffer management wouldn't serve the
> > purpose?
> 
> It doesn't help when there a lot of access locality in searching.  In my
> case I want to select some thousands of records that were inserted very
> apart from each other, but are logically very near.  Having this
> pseudoheap that is ordered by definition helps very much with the
> selection; the current heap requires me to bring to buffers lots of
> uninteresting tuples, whichever buffer management algorithm is used,
> because they are in the same page as interesting tuples.


Yes, what Alvaro said.

For very large tables that routinely run modest range queries, it can be
very expensive in terms of cache efficiency if tuples that are closely
grouped and ordered logically are scattered throughout the heap.  The
requirement to buffer a lot of unrelated data for typical case queries
can greatly reduce the cache hit rate if the active portion of the data
is already quite large relative to the physical RAM available.

To give a real world example, a standard query on one of our tables that
has not been CLUSTER-ed recently (i.e. within the last several days)
generates an average of ~2,000 cache misses.  Recently CLUSTER-ed, it
generates ~0 cache misses on average.  Needless to say, one is *much*
faster than the other.  The problem is that the number of buffers
required to satisfy this query with the tuples scattered is enough to
make it swap out the buffers of another competing query on another table
that is also running.  The result is that performance grinds to a halt
as processes are competing with each other and trying to swap out each
others buffers, resulting in a lot less *actual* buffering than should
be occurring given the amount of data actually being queried.

In my case, not only does CLUSTER-ing increase the number of concurrent
queries possible without disk thrashing by an integer factor, but the
number of buffers touched on a query that generates a cache misses is
greatly reduced as well.  The problem is that CLUSTER-ing is costly and
index-organizing some of the tables would reduce the buffer needs, since
the index tuple in these cases are almost as large as the heap tuples
they reference.

The classic scenario for this is when you have a large collection of
time-series data stored in a table, with each series keyed to another
table.  The the typical tuple distribution creates pathological
behaviors when buffer space becomes tight.

Cheers,

-James Rogersjamesr@best.com






pgsql-hackers by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: Thoughts on maintaining 7.3
Next
From: James Rogers
Date:
Subject: Re: Index/Function organized table layout