Thread: keeping an index in memory

keeping an index in memory

From
Rajarshi Guha
Date:
Hi, relating to my previous queries on doing spatial searches on 10M
rows, it seems that most of my queries return within 2 minutes.
Generally this is not too bad, though faster is always better.

Interestingly, it appears that the CUBE index for the table in
question is about 3GB (the table itself is about 14GB). Not knowing
the details of the postgres internals, I assume that when a query
tries to use the index, it will need to read a 3GB file. Is this a
correct assumption?
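
As an aside, the sizes quoted above can be checked directly from SQL.
A minimal sketch, using 'mytable' and 'mytable_cube_idx' as placeholder
names for the actual table and CUBE index:

    -- Heap size, one index's size, and the total (heap + indexes + TOAST).
    SELECT pg_size_pretty(pg_relation_size('mytable'))          AS table_size,
           pg_size_pretty(pg_relation_size('mytable_cube_idx')) AS index_size,
           pg_size_pretty(pg_total_relation_size('mytable'))    AS total_size;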

In such a situation, is there a way to keep the index in memory? My
machine has 8GB installed and currently has about 7.4GB free RAM (64
bit linux 2.6.9)

A side effect of the size of the index is that if I do a query that
performs a seq scan (say using cube_distance) it takes longer than
when an index is used, but not significantly longer. And this is on a
10M row table.
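
For context, the two query shapes in question look roughly like this
(table and column names are placeholders, and the contained-in operator
spelling for cube may differ between releases; older ones use @ / ~
rather than @> / <@):

    -- Bounding-box query: the GiST (CUBE) index can be used here.
    SELECT id
    FROM   mytable
    WHERE  coords <@ cube(array[0,0,0], array[10,10,10]);

    -- Distance query: cube_distance() in the WHERE clause means a seq scan.
    SELECT id
    FROM   mytable
    WHERE  cube_distance(coords, cube(array[5,5,5])) < 2.0;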

What strategies do people follow when the index becomes very big?

Thanks,

-------------------------------------------------------------------
Rajarshi Guha  <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Does Ramanujan know Polish?
                --  E.B. Ross



Re: keeping an index in memory

From
"Scott Marlowe"
Date:
On 10/20/07, Rajarshi Guha <rguha@indiana.edu> wrote:
> Hi, relating to my previous queries on doing spatial searches on 10M
> rows, it seems that most of my queries return within 2 minutes.
> Generally this is not too bad, though faster is always better.
>
> Interestingly, it appears that the CUBE index for the table in
> question is about 3GB (the table itself is about 14GB). Not knowing
> the details of the postgres internals, I assume that when a query
> tries to use the index, it will need to read a 3GB file. Is this a
> correct assumption?

Not sure.  I assume not, but will wait for someone more conversant
with pgsql index usage to post that answer.

> In such a situation, is there a way to keep the index in memory? My
> machine has 8GB installed and currently has about 7.4GB free RAM (64
> bit linux 2.6.9)

The kernel will tend to keep it in memory.  Usually it does a pretty
good job of that.

> A side effect of the size of the index is that if I do a query that
> performs a seq scan (say using cube_distance) it takes longer than
> when an index is used, but not significantly longer. And this is on a
> 10M row table.

Depending on how much of the table you're getting back, eventually a
seq scan will outperform an index scan, because in pgsql-land, you
ALWAYS have to hit the table whether there's an index entry or not,
for visibility reasons.  So, at some point, a certain percentage of
the table being retrieved (1 to 25%) will trigger a sequential scan,
and rightly so.
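
A hedged sketch of how to see which plan actually gets chosen, and how
long each variant really takes (names are placeholders):

    -- Show the chosen plan and the actual execution time.
    EXPLAIN ANALYZE
    SELECT id
    FROM   mytable
    WHERE  coords <@ cube(array[0,0,0], array[10,10,10]);

    -- For comparison only: discourage seq scans in this session,
    -- then re-run the EXPLAIN ANALYZE above.
    SET enable_seqscan = off;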

> What strategies do people follow when the index becomes very big?

You can only fit so much data into memory.  Once a db gets big enough
that it simply can't be stuffed into memory, you need a fast storage
subsystem.

Re: keeping an index in memory

From
Bill Moran
Date:
Rajarshi Guha <rguha@indiana.edu> wrote:
>
> Hi, relating to my previous queries on doing spatial searches on 10M
> rows, it seems that most of my queries return within 2 minutes.
> Generally this is not too bad, though faster is always better.
>
> Interestingly, it appears that the CUBE index for the table in
> question is about 3GB (the table itself is about 14GB). Not knowing
> the details of the postgres internals, I assume that when a query
> tries to use the index, it will need to read a 3GB file. Is this a
> correct assumption?
>
> In such a situation, is there a way to keep the index in memory? My
> machine has 8GB installed and currently has about 7.4GB free RAM (64
> bit linux 2.6.9)

Free or cached/buffered?  Your OS should be using most of that to
buffer disk blocks.

> A side effect of the size of the index is that if I do a query that
> performs a seq scan (say using cube_distance) it takes longer than
> when an index is used, but not significantly longer. And this is on a
> 10M row table.
>
> What strategies do people follow when the index becomes very big?

What version of PG are you using and what is your shared_buffers setting?

With 8G of RAM, you should start with shared_buffers around 2 - 3G, if
you're using a modern version of PG.  With that much shared memory, a
large portion of that index should stay in RAM, as long as it's being
used often enough that PG doesn't swap it for other data.
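
To illustrate, a rough postgresql.conf sketch along those lines for an
8GB box (starting points only, not a tuned recommendation):

    # postgresql.conf (PostgreSQL 8.2, 8GB RAM) - rough starting points
    shared_buffers       = 2GB   # needs a restart; kernel SHMMAX may need raising
    effective_cache_size = 6GB   # planner hint: roughly what the OS can cache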

--
Bill Moran
http://www.potentialtech.com

Re: keeping an index in memory

From
Martijn van Oosterhout
Date:
On Sun, Oct 21, 2007 at 07:36:00AM -0400, Bill Moran wrote:
> What version of PG are you using and what is your shared_buffers setting?
>
> With 8G of RAM, you should start with shared_buffers around 2 - 3G, if
> you're using a modern version of PG.  With that much shared memory, a
> large portion of that index should stay in RAM, as long as it's being
> used often enough that PG doesn't swap it for other data.

With that much memory, the index is likely to remain in memory no
matter what size shared_memory he has. Anything in shared_memory is
going to be in the system cache anyway. I wonder if there's something
else we haven't been told, like how big the actual table is and whether
there are any other large tables/indexes.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: keeping an index in memory

From
Rajarshi Guha
Date:
On Oct 21, 2007, at 7:36 AM, Bill Moran wrote:

> Rajarshi Guha <rguha@indiana.edu> wrote:
>>
>> Hi, relating to my previous queries on doing spatial searches on 10M
>> rows, it seems that most of my queries return within 2 minutes.
>> Generally this is not too bad, though faster is always better.
>>
>> Interestingly, it appears that the CUBE index for the table in
>> question is about 3GB (the table itself is about 14GB). Not knowing
>> the details of the postgres internals, I assume that when a query
>> tries to use the index, it will need to read a 3GB file. Is this a
>> correct assumption?
>>
>> In such a situation, is there a way to keep the index in memory? My
>> machine has 8GB installed and currently has about 7.4GB free RAM (64
>> bit linux 2.6.9)
>
> Free or cached/buffered?  Your OS should be using most of that to
> buffer disk blocks.

Ah, correct. Yes, they are cached/buffered.

>> A side effect of the size of the index is that if I do a query that
>> performs a seq scan (say using cube_distance) it takes longer than
>> when an index is used, but not significantly longer. And this is on a
>> 10M row table.
>>
>> What strategies do people follow when the index becomes very big?
>
> What version of PG are you using and what is your shared_buffers
> setting?

8.2.5

My original shared_buffers setting was 128MB.

> With 8G of RAM, you should start with shared_buffers around 2 - 3G, if
> you're using a modern version of PG.

I can do that but I'm a little confused. Earlier postings on the list
indicate that shared_buffers should be about 10% of the system RAM
and that effective_cache_size can be a large fraction of RAM.

As a result I had effective_cache_size set to 2500MB

Thanks for the pointers

-------------------------------------------------------------------
Rajarshi Guha  <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
How I wish I were what I was when I wished I were what I am.



Re: keeping an index in memory

From
Rajarshi Guha
Date:
On Oct 21, 2007, at 10:40 AM, Martijn van Oosterhout wrote:

> On Sun, Oct 21, 2007 at 07:36:00AM -0400, Bill Moran wrote:
>> What version of PG are you using and what is your shared_buffers
>> setting?
>>
>> With 8G of RAM, you should start with shared_buffers around 2 -
>> 3G, if
>> you're using a modern version of PG.  With that much shared memory, a
>> large portion of that index should stay in RAM, as long as it's being
>> used often enough that PG doesn't swap it for other data.
>
> With that much memory, the index is likely to remain in memory no
> matter what size shared_memory he has. Anything in shared_memory is
> going to be in the system cache anyway. I wonder if there's something
> else we havn't been told, like how big the actual table is and whether
> there are any other large tables/indexes.

The table itself is about 10M rows corresponding to 14GB. The only
other index on this table is a btree index whose size is ~300MB. The
machine is not running anything else.

Now, it might just be the case that given the size of the index, I
cannot make bounding box queries (which will use the CUBE index) go
any faster. But I am surprised that the other type of query (using
cube_distance, which by definition must use a seq scan) takes only
slightly longer. If nothing else, scanning through 14GB of data
should be roughly four to five times slower than scanning through
3GB of data.


-------------------------------------------------------------------
Rajarshi Guha  <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
All laws are simulations of reality.
    -- John C. Lilly



Re: keeping an index in memory

From
Tom Lane
Date:
Rajarshi Guha <rguha@indiana.edu> writes:
> Now, it might just be the case that given the size of the index, I
> cannot make bounding box queries (which will use the CUBE index) go
> any faster. But I am surprised that the other type of query (using
> cube_distance, which by definition must use a seq scan) takes only
> slightly longer. If nothing else, scanning through 14GB of data
> should be roughly four to five times slower than scanning through
> 3GB of data.

A single index probe should not touch anything like all of the index ---
unless your data is such that the index is very non-optimally laid out.
GiST should work well if there are lots of subsets of the data that
have bounding boxes disjoint from other subsets'.  If not, maybe you
need to reconsider your data representation.

Have you done any examination of how much of the index gets touched
during a typical query?  I'd try turning on stats_block_level and see
how the delta in pg_statio_all_indexes.idx_blks_read compares to the
index size.
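
In concrete terms, that check could look something like this (the
index name is a placeholder, and stats_block_level must be enabled
first):

    -- Note the counters, run a typical query, then look at the delta.
    SELECT idx_blks_read, idx_blks_hit
    FROM   pg_statio_all_indexes
    WHERE  indexrelname = 'mytable_cube_idx';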

            regards, tom lane

Re: keeping an index in memory

From
Gregory Stark
Date:
"Rajarshi Guha" <rguha@indiana.edu> writes:

> The table itself is about 10M rows corresponding to 14GB.

Each row is on average 1.4kB? Perhaps you should send more details of the
table definition and the typical size of each column. It's possible the
columns you're selecting on are being stored out of line ("toasted"), which
would hurt performance if you're often accessing many of those columns.

If that's not the case, then you may have a lot of dead space in your table,
which would decrease performance.
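
One way to check both possibilities ('mytable' is a placeholder name):
compare the heap size with the total size including TOAST and indexes,
and let VACUUM VERBOSE report dead row versions:

    -- Heap alone vs. heap plus TOAST and indexes.
    SELECT pg_size_pretty(pg_relation_size('mytable'))       AS heap_only,
           pg_size_pretty(pg_total_relation_size('mytable')) AS total;

    -- Reports dead row versions and reusable space for the table.
    VACUUM VERBOSE mytable;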

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: keeping an index in memory

From
"Scott Marlowe"
Date:
On 10/21/07, Rajarshi Guha <rguha@indiana.edu> wrote:
>
> > With 8G of RAM, you should start with shared_buffers around 2 - 3G, if
> > you're using a modern version of PG.
>
> I can do that but I'm a little confused. Earlier postings on the list
> indicate that shared_buffers should be about 10% of the system RAM
> and that effective_cache_size can be a large fraction of RAM.

That was true with 7.4 and before because cache management in those
versions wasn't very efficient.  With 8.0 and above, PostgreSQL can
handle much larger shared_buffers settings.

Re: keeping an index in memory

From
Rajarshi Guha
Date:
On Oct 21, 2007, at 12:56 PM, Gregory Stark wrote:

> "Rajarshi Guha" <rguha@indiana.edu> writes:
>
>> The table itself is about 10M rows corresponding to 14GB.
>
> Each row is on average 1.4kB ?

Yes, though some rows may be tens of KB.

> Perhaps you should send more details of the
> table definition and the typical size of each column. It's possible
> you have
> the columns you're selecting on being stored out of line
> ("toasted") which
> would hurt performance if you're often accessing many of those
> columns.

The SELECT simply retrieves the cube column and another text column
which just contains IDs. One of the columns can be large and does
involve toasting, but that column is never selected in the query.

However, after running a few random queries, it appears that the
index gets loaded into memory, after which all my queries (I tested
10,000) run in under 10 seconds. Even for relatively large bounding
boxes I get query times under 60 seconds.

Thanks to everybody for pointing me in the right direction!

-------------------------------------------------------------------
Rajarshi Guha  <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04  06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
The Heineken Uncertainty Principle:
You can never be sure how many beers you had last night.