Re: Memory usage - indexes - Mailing list pgsql-performance

From Greg Smith
Subject Re: Memory usage - indexes
Msg-id 4C9CE768.3040304@2ndquadrant.com
In response to Memory usage - indexes  (Tobias Brox <tobixen@gmail.com>)
Responses Re: Memory usage - indexes
List pgsql-performance

Tobias Brox wrote:
> 1) Are there any good ways to verify my hypothesis?

You can easily confirm whether the contents of the PostgreSQL buffer
cache are what you think they are by installing pg_buffercache.  My
paper and sample queries at
http://www.pgcon.org/2010/schedule/events/218.en.html go over that.
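
As a quick sketch of the kind of query those materials walk through,
something like this shows which relations occupy the most space in
shared_buffers (assumes the pg_buffercache module is installed in the
current database):

-- Top 10 relations by number of pages cached in shared_buffers
SELECT
  c.relname,
  count(*) AS buffers
FROM
  pg_buffercache b
  JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE
  b.reldatabase IN (0, (SELECT oid FROM pg_database
                        WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;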

You can also integrate that with a look at the OS level information by
using pgfincore:  http://www.pgcon.org/2010/schedule/events/261.en.html
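
As a rough sketch of its use (placeholder table name here; check the
pgfincore documentation for the exact output columns on your version):

-- How much of this relation the OS page cache currently holds
SELECT * FROM pgfincore('some_table');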

I've found that if shared_buffers is set to a largish size, you can find
out enough information from looking at it to have a decent idea what's
going on without going to that depth.  But it's available if you want it.


> 2) I've seen it discussed earlier on this list ... shared_buffers vs
> OS caches.  Some claims that it has very little effect to adjust the
> size of the shared buffers.  Anyway, isn't it a risk that memory is
> wasted because important data is stored both in the OS cache and the
> shared buffers?

The risk of overlap is overrated.  What's much more likely to actually
happen is that you'll have good data in shared_buffers, then run
something that completely destroys the OS cache (multiple seq scans just
below the "ring buffer" threshold, multiple large index scans, raging
VACUUM work).  Having copies of the most important pieces that stay in
shared_buffers despite the OS cache being demolished is much more
important to preserving decent performance than the concern about double
buffering database and OS contents--that only happens on trivial
workloads where there's not constant churn on the OS cache throwing
pages out like crazy.

I have easily measurable improvements on client systems from increasing
shared_buffers into the 4GB - 8GB range.  Popular indexes move into
there, stay there, and only get written out at checkpoint time rather
than all the time.  However, if you write heavily enough that much of
this space gets dirty fast, you may not be able to go that high
before checkpoint issues start to make such sizes impractical.
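
pg_buffercache helps watch for that too; a simple sketch (again
assuming the module is installed) to see how much of shared_buffers is
dirty at any moment:

-- Dirty vs. total pages currently in shared_buffers
SELECT
  sum(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty_buffers,
  count(*) AS total_buffers
FROM pg_buffercache;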

> What would happen if using almost all the available
> memory for shared buffers?  Or turn it down to a bare minimum and let
> the OS do almost all the cache handling?
>

The useful upper limit normally works out to be somewhere between 4GB
and 1/2 of RAM.  Using minimal values works for some people,
particularly on Windows, but you can measure that doing so generates far
more disk I/O activity than using a moderate-sized cache, by
instrumenting pg_stat_bgwriter the way I describe in my talk.
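
The raw counters are one query away; take a couple of snapshots over
time and diff them to see where writes are going (column names per the
pg_stat_bgwriter view in 8.3 and later):

-- Background writer and checkpoint activity counters
SELECT
  checkpoints_timed,
  checkpoints_req,
  buffers_checkpoint,
  buffers_clean,
  buffers_backend,
  buffers_alloc
FROM pg_stat_bgwriter;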

> 3) We're discussing to drop some overlapping indexes ... i.e. to drop
> one out of two indexes looking like this:
>
> some_table(a)
> some_table(a,b)
>
> Would the query "select * from some_table where a=?" run slower if we
> drop the first index?  Significantly?
>

Yes, it would run slower, because now it has to sort through blocks in a
larger index in order to find anything.  How significant that is depends
on the relative size of the indexes.  To give a simple example, if (a)
is 1GB, while (a,b) is 2GB, you can expect dropping (a) to halve the
speed of index lookups.  Fatter indexes just take longer to navigate
through.
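
You can measure the sizes involved before deciding; a sketch, with the
index names being placeholders for whatever yours are actually called:

-- Compare the on-disk size of the two overlapping indexes
SELECT
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE indexrelname IN ('some_table_a_idx', 'some_table_a_b_idx');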

> (in our situation I found that the number of distinct b's for each a
> is low and that the usage stats on the second index is quite low
> compared with the first one, so I think we'll drop the second index).
>

You are thinking correctly here now.  If the addition of b to the index
isn't buying you significant increases in selectivity, just get rid of
it and work only with the index on a instead.
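
One way to sanity check that conclusion first, sketched against your
example table (names hypothetical):

-- If distinct_b is small for the common values of a, the (a,b)
-- index isn't adding much selectivity
SELECT
  a,
  count(DISTINCT b) AS distinct_b
FROM some_table
GROUP BY a
ORDER BY distinct_b DESC
LIMIT 10;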

> 4) We're discussing to drop other indexes.  Does it make sense at all
> as long as we're not experiencing problems with inserts/updates?  I
> suppose that if the index isn't used it will remain on disk and won't
> affect the memory usage ... but what if the index is rarely used ...
> wouldn't it be better to do a seqscan on a table that is frequently
> accessed and mostly in memory than to consult an index that is stored
> on the disk?
>

Don't speculate; measure exactly how much each index is actually being
used and evaluate them on a case by case basis.  If they're not being
used, they're just adding overhead in many ways, and you should drop them.

There are a bunch of "find useless index" scripts floating around the
web (I think I swiped ideas from Robert Treat and Josh Berkus to build
mine); here's the one I use now:

SELECT
  schemaname as nspname,
  relname,
  indexrelname,
  idx_scan,
  pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM
  pg_stat_user_indexes i
  JOIN pg_index USING (indexrelid)
WHERE
  indisunique IS false
ORDER BY idx_scan,pg_relation_size(i.indexrelid) DESC;

Anything that bubbles to the top of that list, you probably want to get
rid of.  Note that this ignores UNIQUE indexes, which you can't drop
anyway, but which are still used to answer queries.  You might choose to
include them anyway but just flag them in the output if the goal is to
see how often they are used.
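
If you want that variation, the change is small; a sketch that keeps
the unique indexes in the list but flags them:

SELECT
  schemaname as nspname,
  relname,
  indexrelname,
  idx_scan,
  indisunique,
  pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM
  pg_stat_user_indexes i
  JOIN pg_index USING (indexrelid)
ORDER BY idx_scan,pg_relation_size(i.indexrelid) DESC;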

P.S. You seem busy re-inventing pgstatspack this week:
http://pgfoundry.org/projects/pgstatspack/ does all of this "take a
snapshot of the stats and store it in the database for future analysis"
work for you.  Working on that instead of continuing to hack individual
storage/retrieval scripts for each statistics counter set would be a
better contribution to the PostgreSQL community.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book

