Thread: Memory usage - indexes

Memory usage - indexes

From
Tobias Brox
Date:
We've come to a tipping point with one of our database servers; it's
generally quite loaded, but up until recently it was handling the load
well.  Now we're seeing that it struggles to process all the selects
fast enough.  Sometimes we're observing some weird lock-like behaviour
(see my other post on that), but most of the time the database server
is just not capable of handling the load fast enough, causing the
queries to pile up in the pg_stat_activity view.

My main hypothesis is that all the important indexes used to fit snugly
into memory, and now they don't.  We'll eventually get the server moved
over to new and improved hardware, but while waiting for that to happen
we need to focus on reducing the memory footprint of the database.  I
have some general questions now ...

1) Are there any good ways to verify my hypothesis?  Some months ago I
thought of running some small memory-gobbling program on the database
server just to see how much memory I could remove before we would see
indications of the database being overloaded.  It seems a bit radical,
but I think the information learned from such an experiment would be
very useful ... and we never managed to set up any testing environment
that faithfully replicates production traffic.  Anyway, it's sort of
too late now that we're already observing performance problems even
without the memory gobbling script running.

2) I've seen it discussed earlier on this list ... shared_buffers vs
OS caches.  Some claim that it has very little effect to adjust the
size of the shared buffers.  Anyway, isn't there a risk that memory is
wasted because important data is stored both in the OS cache and the
shared buffers?  What would happen if we used almost all the available
memory for shared buffers?  Or turned it down to a bare minimum and let
the OS do almost all the cache handling?

3) We're discussing dropping some overlapping indexes ... i.e. dropping
one of two indexes looking like this:

some_table(a)
some_table(a,b)

Would the query "select * from some_table where a=?" run slower if we
drop the first index?  Significantly?

(In our situation I found that the number of distinct b's for each a
is low, and that the usage stats on the second index are quite low
compared with the first one, so I think we'll drop the second index.)
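
One way I suppose we could verify it on a test copy is to hide the
candidate index inside a transaction and compare plans (a sketch; the
index name is hypothetical, and note that DROP INDEX holds an exclusive
lock on the table until the transaction ends):

```sql
-- DDL is transactional in PostgreSQL, so the drop can be rolled back.
BEGIN;
DROP INDEX some_table_a_b_idx;   -- hypothetical name for some_table(a,b)
EXPLAIN ANALYZE SELECT * FROM some_table WHERE a = 42;
ROLLBACK;                        -- the index is back, nothing changed
```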

4) We're discussing dropping other indexes.  Does it make sense at all
as long as we're not experiencing problems with inserts/updates?  I
suppose that if an index isn't used it will remain on disk and won't
affect the memory usage ... but what if the index is rarely used?
Wouldn't it be better to do a seqscan on a table that is frequently
accessed and mostly in memory than to consult an index that is stored
on the disk?

Sorry for all the stupid questions ;-)

Re: Memory usage - indexes

From
Mark Kirkwood
Date:
On 24/09/10 09:50, Tobias Brox wrote:
> We've come to a tipping point with one of our database servers, it's
> generally quite loaded but up until recently it was handling the load
> well [...]
>
> Sorry for all the stupid questions ;-)


All good questions! Before (or maybe as well as) looking at index sizes
vs memory I'd check to see if any of your commonly run queries have
suddenly started to use different plans due to data growth, e.g.:

- index scan to seq scan (perhaps because effective_cache_size is too
small now)
- hash agg to sort (work_mem too small now)

We had a case of the first point happen here a while ago; the symptoms
looked very like what you are describing.

Re index size, you could try indexes like:

some_table(a)
some_table(b)

which may occupy less space, and the optimizer can bitmap AND/OR them to
work like the compound index some_table(a,b).
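
For example, something like this (hypothetical names, just to illustrate
the plan shape):

```sql
CREATE INDEX some_table_a_idx ON some_table (a);
CREATE INDEX some_table_b_idx ON some_table (b);

-- When filtering on both columns the planner can combine the two small
-- indexes with a BitmapAnd, instead of needing the compound (a,b) index:
EXPLAIN SELECT * FROM some_table WHERE a = 1 AND b = 2;
```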

regards

Mark

Re: Memory usage - indexes

From
"Kevin Grittner"
Date:
Tobias Brox <tobixen@gmail.com> wrote:

> Sorry for all the stupid questions ;-)

I'm with Mark -- I didn't see any stupid questions there.

Where I would start, though, is by checking the level of bloat.  One
long-running query under load, or one query which updates or deletes
a large number of rows, can put you into this state.  If you find
serious bloat you may need to schedule a maintenance window for
aggressive work (like CLUSTER) to fix it.

Even before that, however, I would spend some time looking at the
patterns of I/O under `vmstat 1` or `iostat 1` to get a sense of
where the bottlenecks are.  If you time-stamp the rows from vmstat
you can match them up against events in your log and periods of
slow response.
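
Something along these lines does the time-stamping (a sketch; any
equivalent works):

```shell
# Prefix each vmstat sample with a timestamp so the samples can later
# be matched against entries in the PostgreSQL log.
vmstat 1 | while read -r line; do
  printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
done
```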

-Kevin

Re: Memory usage - indexes

From
Bob Lunney
Date:
Tobias,

Consult pg_statio_user_indexes to see which indexes have been used and
how much.  Indexes with comparatively low usage rates aren't helping you
much and are candidates for elimination.  Also, partitioning large
tables can help, since the indexes on each partition are smaller than
one huge index on the original table.
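
A sketch of the kind of query I mean, combining usage counts with block
I/O from the standard statistics views:

```sql
-- Low idx_scan combined with high idx_blks_read relative to
-- idx_blks_hit suggests an index that costs more than it helps.
SELECT s.relname,
       s.indexrelname,
       s.idx_scan,
       io.idx_blks_read,
       io.idx_blks_hit
  FROM pg_stat_user_indexes s
  JOIN pg_statio_user_indexes io USING (indexrelid)
 ORDER BY s.idx_scan ASC, io.idx_blks_read DESC;
```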

Good luck!

Bob Lunney

--- On Thu, 9/23/10, Tobias Brox <tobixen@gmail.com> wrote:

> From: Tobias Brox <tobixen@gmail.com>
> Subject: [PERFORM] Memory usage - indexes
> To: pgsql-performance@postgresql.org
> Date: Thursday, September 23, 2010, 5:50 PM
> We've come to a tipping point with one of our database servers, [...]
>
> Sorry for all the stupid questions ;-)

Re: Memory usage - indexes

From
Tobias Brox
Date:
On 24 September 2010 18:23, Bob Lunney <bob_lunney@yahoo.com> wrote:
> Consult pg_statio_user_indexes to see which indexes have been used
> and how much.

What are the main differences between pg_statio_user_indexes and
pg_stat_user_indexes?

>   Indexes with comparatively low usage rates aren't helping you much and are
> candidates for elimination.

No doubt about that - but the question was, would it really help us to
drop those indexes?

I think the valid reasons for dropping indexes would be:

1) To speed up inserts, updates and deletes

2) To use less disk space

3) Possibly to speed up the nightly vacuum (it wouldn't be an issue with
autovacuum, though)

4) To use less memory?

I'm not at all concerned about 1 and 2 above - we don't have any
performance issues on the write side, and we have plenty of disk
capacity.  We are still doing the nightly vacuum thing, and it does
hurt us a bit since it's dragging out ever longer.  Anyway, it's
number four I'm wondering most about - is it anything to be concerned
about or not for the least frequently used indexes?  An index that
isn't being used would just stay on disk anyway, right?  And if there
are limited memory resources, the indexes that are most frequently
used would fill up the cache space anyway?  Those are my thoughts at
least - are they way off?

We did have a similar experience some years ago - everything was
running very fine all until one day some semi-complicated,
very-frequently-run selects started taking several seconds to run
rather than tens of milliseconds.  I found that we had two slightly
overlapping indexes like this ...

  account_transaction(customer_id, trans_type)
  account_transaction(customer_id, trans_type, created)

Both of those indexes were heavily used.  I simply dropped the first
one, and the problems disappeared.  I assume that both indexes fit
snugly into memory up to some point, but one day they started competing
for the limited memory space; dropping the redundant index solved the
problem until the next hardware upgrade.  I would never have found
those indexes by searching for the least used indexes in the
pg_stat(io)_user_indexes view.

Re: Memory usage - indexes

From
Tobias Brox
Date:
On 24 September 2010 00:12, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> All good questions! Before (or maybe as well as) looking at index sizes vs
> memory I'd check to see if any of your commonly run queries have suddenly
> started to use different plans due to data growth, e.g:
>
> - index scan to seq scan (perhaps because effective_cache_size is too small
> now)
> - hash agg to sort (work_mem too small now)

It would be trivial if we had a handful of different queries and knew
the plans by heart ... but our setup is slightly more complex than that.
I would have to log the plans, wouldn't I?  How would you go about it?
I was having some thoughts of making up some script to scan through the
postgres log, extract some stats on the queries run, and even do some
explains and store the query plans.

We've started to chase down seq scans (causing us to create even
more indexes and eat up more memory...).  I have set up a simple
system for archiving stats from pg_stat_user_tables now, like this:

insert into tmp_pg_stat_user_tables select *,now() as snapshot from
pg_stat_user_tables ;

NBET=> \d tmp_delta_pg_stat_user_tables
       View "public.tmp_delta_pg_stat_user_tables"
      Column      |           Type           | Modifiers
------------------+--------------------------+-----------
 duration         | interval                 |
 relname          | name                     |
 seq_scan         | bigint                   |
 seq_tup_read     | bigint                   |
 idx_scan         | bigint                   |
 idx_tup_fetch    | bigint                   |
 n_tup_ins        | bigint                   |
 n_tup_upd        | bigint                   |
 n_tup_del        | bigint                   |
 n_tup_hot_upd    | bigint                   |
 n_live_tup       | bigint                   |
 n_dead_tup       | bigint                   |
 last_vacuum      | timestamp with time zone |
 last_autovacuum  | timestamp with time zone |
 last_analyze     | timestamp with time zone |
 last_autoanalyze | timestamp with time zone |
View definition:
 SELECT now() - b.snapshot AS duration, a.relname, a.seq_scan -
b.seq_scan AS seq_scan, a.seq_tup_read - b.seq_tup_read AS
seq_tup_read, a.idx_scan - b.idx_scan AS idx_scan, a.idx_tup_fetch -
b.idx_tup_fetch AS idx_tup_fetch, a.n_tup_ins - b.n_tup_ins AS
n_tup_ins, a.n_tup_upd - b.n_tup_upd AS n_tup_upd, a.n_tup_del -
b.n_tup_del AS n_tup_del, a.n_tup_hot_upd - b.n_tup_hot_upd AS
n_tup_hot_upd, a.n_live_tup, a.n_dead_tup, a.last_vacuum,
a.last_autovacuum, a.last_analyze, a.last_autoanalyze
   FROM pg_stat_user_tables a, tmp_pg_stat_user_tables b
  WHERE b.snapshot = (( SELECT max(tmp_pg_stat_user_tables.snapshot) AS max
           FROM tmp_pg_stat_user_tables)) AND b.relname = a.relname;
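
(In case anyone wants to replicate this: the snapshot table itself can
be bootstrapped in one statement - a sketch, using the same names as
above.)

```sql
-- One-time setup: copy the columns of pg_stat_user_tables and add a
-- snapshot timestamp; later snapshots are appended with the insert
-- statement shown earlier.
CREATE TABLE tmp_pg_stat_user_tables AS
  SELECT *, now() AS snapshot
    FROM pg_stat_user_tables;
```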

Re: Memory usage - indexes

From
Brad Nicholson
Date:
  On 10-09-24 12:46 PM, Tobias Brox wrote:
> On 24 September 2010 18:23, Bob Lunney<bob_lunney@yahoo.com>  wrote:
>> Consult pg_statio_user_indexes to see which indexes have been used
>> and how much.
> What are the main differences between pg_statio_user_indexes and
> pg_stat_user_indexes?
>

The pg_stat_* views give you usage information (for indexes: number of
scans, numbers of tuples read/fetched).  The pg_statio_* views give you
information about block reads and block hits.


> I'm not at all concerned about 1 and 2 above - we don't have any
> performance issues on the write side, and we have plenty of disk
> capacity.  We are still doing the nightly vacuum thing, and it does
> hurt us a bit since it's dragging out ever longer.

Why is the vacuum dragging out over time?  Is the size of your data
increasing, are you doing more writes that leave dead tuples, or are
your tables and/or indexes getting bloated?

Also, is there a reason why you do nightly vacuums instead of letting
autovacuum handle the work?  We started doing far less vacuuming when we
let autovacuum handle things.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



Re: Memory usage - indexes

From
Tobias Brox
Date:
On 24 September 2010 19:16, Brad Nicholson <bnichols@ca.afilias.info> wrote:
[Brad Nicholson]
> Why is the vacuum dragging out over time?  Is the size of your data
> increasing, are you doing more writes that leave dead tuples, or are your
> tables and/or indexes getting bloated?

Digressing a bit here ... but the biggest reason is the data size increasing.

We do have some bloat problems as well - every now and then we decide
to shut down the operation, use pg_dump to dump the entire database to
an SQL file, and restore it.  The benefits are dramatic: the space
requirement goes down a lot, and often some of our performance problems
go away after such an operation.

> Also, is there a reason why you do nightly vacuums instead of letting
> autovacuum handle the work?

If it were up to me, we would have autovacuum turned on.  We've had
one bad experience when the autovacuumer decided to start vacuuming
one of the biggest tables at the worst possible moment - and someone
figured autovacuum was a bad idea.  I think we would probably still
need regular vacuums to avoid that happening, but with autovacuum on,
maybe we could have managed with regular vacuums only once a week or
so.

>  We started doing far less vacuuming when we let
> autovacuum handle things.

What do you mean, that you could run regular vacuum less frequently,
or that the regular vacuum would go faster?

Re: Memory usage - indexes

From
"Kevin Grittner"
Date:
Tobias Brox <tobixen@gmail.com> wrote:

> If it were up to me, we would have autovacuum turned on.  We've
> had one bad experience when the autovacuumer decided to start
> vacuuming one of the biggest tables at the worst possible moment -
> and someone figured autovacuum was a bad idea.  I think we would
> probably still need regular vacuums to avoid that happening,
> but with autovacuum on, maybe we could have managed with regular
> vacuums only once a week or so.

Right, there's really no need to turn autovacuum off; if you hit it
during normal operations you've got enough bloat that it's going to
tend to start dragging down performance if it *doesn't* run, and if
you don't want it kicking in on really big tables during the day, a
nightly or weekly scheduled vacuum can probably prevent that.

Two other points -- you can adjust how aggressively autovacuum runs;
if it's having a noticeable impact on concurrent queries, try a
small adjustment to autovacuum cost numbers.  Also, if you're not on
8.4 (or higher!) yet, the changes in free space management and
vacuums justify the upgrade all by themselves.
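
A sketch of what such an adjustment can look like (the per-table storage
parameter syntax assumes 8.4 or later; the table name is hypothetical):

```sql
-- Per-table: make autovacuum sleep longer between cost batches on a
-- busy table so it interferes less with concurrent queries.
ALTER TABLE account_transaction
  SET (autovacuum_vacuum_cost_delay = 20);

-- Or globally, in postgresql.conf:
--   autovacuum_vacuum_cost_delay = 20ms
--   autovacuum_vacuum_cost_limit = 200
```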

-Kevin

Re: Memory usage - indexes

From
Greg Smith
Date:
Tobias Brox wrote:
> 1) Are there any good ways to verify my hypothesis?

You can easily confirm whether the contents of the PostgreSQL buffer
cache are what you think they are by installing pg_buffercache.  My
paper and sample queries at
http://www.pgcon.org/2010/schedule/events/218.en.html go over that.

You can also integrate that with a look at the OS level information by
using pgfincore:  http://www.pgcon.org/2010/schedule/events/261.en.html

I've found that if shared_buffers is set to a largish size, you can find
out enough information from looking at it to have a decent idea of
what's going on without going to that depth.  But it's available if you
want it.
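
For instance, a quick look at which relations occupy the most space in
shared_buffers (a sketch using the pg_buffercache view; the join against
the current database's relfilenodes is the usual idiom):

```sql
-- Top 10 relations by number of shared_buffers pages they occupy
-- (requires the pg_buffercache contrib module to be installed).
SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
 WHERE b.reldatabase = (SELECT oid FROM pg_database
                         WHERE datname = current_database())
 GROUP BY c.relname
 ORDER BY buffers DESC
 LIMIT 10;
```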


> 2) I've seen it discussed earlier on this list ... shared_buffers vs
> OS caches.  Some claim that it has very little effect to adjust the
> size of the shared buffers.  Anyway, isn't there a risk that memory is
> wasted because important data is stored both in the OS cache and the
> shared buffers?
The risk of overlap is overrated.  What's much more likely to actually
happen is that you'll have good data in shared_buffers, then run
something that completely destroys the OS cache (multiple seq scans just
below the "ring buffer" threshold, multiple large index scans, raging
VACUUM work).  Having copies of the most important pieces that stay in
shared_buffers despite the OS cache being demolished is much more
important to preserving decent performance than the concern about double
buffering database and OS contents--that only happens on trivial
workloads where there's not constant churn on the OS cache throwing
pages out like crazy.

I have easily measurable improvements on client systems increasing
shared_buffers into the 4GB - 8GB range.  Popular indexes move into
there, stay there, and only get written out at checkpoint time rather
than all the time.  However, if you write heavily enough that much of
this space gets dirty fast, you may not be able to go that high
before checkpoint issues start to make such sizes impractical.

> What would happen if using almost all the available
> memory for shared buffers?  Or turn it down to a bare minimum and let
> the OS do almost all the cache handling?
>

The useful upper limit normally works out to be somewhere between 4GB
and 1/2 of RAM.  Using minimal values works for some people,
particularly on Windows, but you can measure that doing so generates far
more disk I/O activity than using a moderate sized cache by
instrumenting pg_stat_bgwriter, the way I describe in my talk.
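
The instrumentation itself is just periodic sampling, along these lines;
diff successive samples to see where the writes are coming from:

```sql
-- Snapshot the background writer / checkpoint counters.
SELECT now() AS sample_time,
       checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,  -- buffers written at checkpoint time
       buffers_clean,       -- buffers written by the background writer
       buffers_backend      -- buffers written directly by backends
  FROM pg_stat_bgwriter;
```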

> 3) We're discussing dropping some overlapping indexes ... i.e. dropping
> one of two indexes looking like this:
>
> some_table(a)
> some_table(a,b)
>
> Would the query "select * from some_table where a=?" run slower if we
> drop the first index?  Significantly?
>

Yes, it would run slower, because now it has to sort through blocks in a
larger index in order to find anything.  How significant that is depends
on the relative size of the indexes.  To give a simple example, if (a)
is 1GB, while (a,b) is 2GB, you can expect dropping (a) to halve the
speed of index lookups.  Fatter indexes just take longer to navigate
through.

> (In our situation I found that the number of distinct b's for each a
> is low, and that the usage stats on the second index are quite low
> compared with the first one, so I think we'll drop the second index.)
>

You are thinking correctly here now.  If the addition of b to the index
isn't buying you significant increases in selectivity, just get rid of
it and work only with the index on a instead.

> 4) We're discussing dropping other indexes.  Does it make sense at all
> as long as we're not experiencing problems with inserts/updates?  I
> suppose that if an index isn't used it will remain on disk and won't
> affect the memory usage ... but what if the index is rarely used?
> Wouldn't it be better to do a seqscan on a table that is frequently
> accessed and mostly in memory than to consult an index that is stored
> on the disk?
>

Don't speculate; measure exactly how much each index is actually being
used and evaluate them on a case-by-case basis.  If they're not being
used, they're just adding overhead in many ways, and you should drop
them.

There are a bunch of "find useless index" scripts floating around the
web (I think I swiped ideas from Robert Treat and Josh Berkus to build
mine); here's the one I use now:

SELECT
  schemaname as nspname,
  relname,
  indexrelname,
  idx_scan,
  pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM
  pg_stat_user_indexes i
  JOIN pg_index USING (indexrelid)
WHERE
  indisunique IS false
ORDER BY idx_scan,pg_relation_size(i.indexrelid) DESC;

Anything that bubbles to the top of that list, you probably want to get
rid of.  Note that this ignores UNIQUE indexes, which you can't drop
anyway, but are being used to answer queries.  You might choose to
include them anyway but just flag them in the output if the goal is to
see how often they are used.

P.S. You seem busy re-inventing pgstatspack this week:
http://pgfoundry.org/projects/pgstatspack/ does all of this "take a
snapshot of the stats and store it in the database for future analysis"
work for you.  Working on that instead of continuing to hack individual
storage/retrieval scripts for each statistics counter set would be a
better contribution to the PostgreSQL community.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Memory usage - indexes

From
Greg Smith
Date:
Tobias Brox wrote:
> We do have some bloat problems as well - every now and then we decide
> to shut down the operation, use pg_dump to dump the entire database to
> an SQL file, and restore it.  The benefits are dramatic: the space
> requirement goes down a lot, and often some of our performance problems
> go away after such an operation.
>

You can do the same thing with far less trouble if you just CLUSTER the
table.  It takes a lock while it runs so there's still downtime needed,
but it's far faster than a dump/reload and safer too.
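
For example (the index name is hypothetical):

```sql
-- Rewrites the table in the order of the given index, compacting both
-- the table and its indexes; takes an exclusive lock while it runs.
CLUSTER account_transaction USING account_transaction_customer_id_idx;

-- On 8.3 and earlier the older syntax is:
--   CLUSTER account_transaction_customer_id_idx ON account_transaction;

ANALYZE account_transaction;  -- refresh statistics afterwards
```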

> If it were up to me, we would have autovacuum turned on.  We've had
> one bad experience when the autovacuumer decided to start vacuuming
> one of the biggest tables at the worst possible moment - and someone
> figured autovacuum was a bad idea.  I think we would probably still
> need regular vacuums to avoid that happening, but with autovacuum on,
> maybe we could have managed with regular vacuums only once a week or
> so.
>

The answer to "we once saw autovacuum go mad and cause us problems" is
never the knee-jerk "disable autovacuum"; it's usually "change
autovacuum so it runs far more often but with lower intensity".
Sometimes it's "keep autovacuum on but proactively hit the biggest
tables with manual vacuums at slow times" too.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Memory usage - indexes

From
Brad Nicholson
Date:
  On 10-09-24 01:41 PM, Tobias Brox wrote:
> What do you mean, that you could run regular vacuum less frequently,
> or that the regular vacuum would go faster?

It means that vacuums ran less frequently.  With cron triggered vacuums,
we estimated when tables needed to be vacuumed, and vacuumed them
accordingly.  Because of unpredictable shifts in activity, we scheduled
the vacuums to happen more often than needed.

With autovacuum, we vacuum some of our large tables far less
frequently.  We have a few large tables that used to get vacuumed every
other day that now get vacuumed once or twice a month.

The vacuums themselves take longer now, as we use the vacuum cost delay
to control the I/O.  That wasn't an option for us when we did manual
vacuums, as that was on 8.1, when vacuums were still treated as
long-running transactions.  Stretching a vacuum out to a few hours prior
to 8.2 would bloat other tables.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



Re: Memory usage - indexes

From
Bob Lunney
Date:
Tobias,

First off, what version of PostgreSQL are you running?  If you have 8.4,
nightly vacuuming shouldn't be necessary with properly tuned autovacuum
jobs.

The pertinent difference between pg_stat_user_indexes and
pg_statio_user_indexes is the latter shows the number of blocks read
from disk or found in the cache.  You're correct, unused indexes will
remain on disk, but indexes that don't completely fit into memory must
be read from disk for each index scan, and that hurts performance.  (In
fact, it will suddenly drop like a rock.  BTDT.)  By making smaller
equivalent indexes on partitioned data the indexes for individual
partitions are more likely to stay in memory, which is particularly
important when multiple passes are made over the index by a query.
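
A bare-bones sketch of the partitioning idea (8.x inheritance-style; the
names and the date column are hypothetical):

```sql
-- Each month goes into its own child table with its own, much smaller,
-- indexes; the parent table holds no data itself.
CREATE TABLE account_transaction_2010_09 (
    CHECK (created >= DATE '2010-09-01' AND created < DATE '2010-10-01')
) INHERITS (account_transaction);

CREATE INDEX account_transaction_2010_09_cust_idx
    ON account_transaction_2010_09 (customer_id, trans_type);

-- With constraint_exclusion enabled, a query with a predicate on
-- "created" touches (and caches) only the relevant partitions.
```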

You are correct on all the points you make concerning indexes, but
point 4 is the one I'm referring to.  You discovered this independently
yourself, according to your anecdote about the overlapping indexes.

Bob Lunney


--- On Fri, 9/24/10, Tobias Brox <tobixen@gmail.com> wrote:

> From: Tobias Brox <tobixen@gmail.com>
> Subject: Re: [PERFORM] Memory usage - indexes
> To: "Bob Lunney" <bob_lunney@yahoo.com>
> Cc: pgsql-performance@postgresql.org
> Date: Friday, September 24, 2010, 12:46 PM
> On 24 September 2010 18:23, Bob Lunney <bob_lunney@yahoo.com> wrote:
> [...]




Re: Memory usage - indexes

From
Brad Nicholson
Date:
On 10-09-24 03:06 PM, Bob Lunney wrote:
> The pertinent difference between pg_stat_user_indexes and pg_statio_user_indexes is the latter shows the number of
> blocks read from disk or found in the cache.

I have a minor, but very important correction involving this point.  The
pg_statio tables show you what blocks are found in the Postgres buffer
cache, and what ones are not.

For the ones that are not, those blocks may come from the OS filesystem
cache, a battery-backed cache, or the actual disk.  There is a big
difference in performance based on where you are actually getting those
blocks from (and you can't get this info from Postgres).

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
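[Editorial note: the distinction matters when reading hit ratios. A sketch of the kind of per-index query involved, using the pg_statio_user_indexes counters; a low ratio here only means the block was not in shared_buffers -- it may still have been served from the OS cache:]

```sql
-- Per-index buffer-cache hit ratio; "misses" may still be OS-cache hits.
SELECT indexrelname,
       idx_blks_hit,
       idx_blks_read,
       round(idx_blks_hit::numeric
             / nullif(idx_blks_hit + idx_blks_read, 0), 3) AS hit_ratio
  FROM pg_statio_user_indexes
 ORDER BY idx_blks_read DESC
 LIMIT 20;
```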



Re: Memory usage - indexes

From
Tobias Brox
Date:
On 24 September 2010 21:06, Bob Lunney <bob_lunney@yahoo.com> wrote:
> First off, what version of PostgreSQL are you running?  If you have 8.4, nightly vacuuming shouldn't be necessary
> with properly tuned autovacuum jobs.

8.3.  We'll upgrade to 9.0 during the December holidays fwiw.  But
point taken, I will continue to push for autovacuum to be turned on.

Anyway, I think the nightly vacuuming does have some merit.  For some
of the queries, most of the daytime we're quite sensitive to latency.
Well, I guess the proper solution to that is to tune the autovacuum
configuration so it acts less aggressively at the times of the day
where we need low latency...

> You're correct, unused indexes will
> remain on disk, but indexes that don't completely fit into memory must be
> read from disk for each index scan, and that hurts performance.  (In fact, it
> will suddenly drop like  a rock.  BTDT.)

Sounds quite a lot like our problems nowadays - as well as previous
time when I found that overlapping index that could be dropped.

> By making smaller equivalent indexes on partitioned data the indexes for
> individual partitions are more likely to stay in memory, which is particularly
> important when multiple passes are made over the index by a query.

I was looking a bit into table partitioning some years ago, but didn't
really find any nice way to partition our tables.  One solution would
probably be to partition by creation date and set up one partition for
each year, but it seems like a butt ugly solution, and I believe it
would only help if the select statement spans a date range on the
creation time.

> You are correct on all the points you make concerning indexes, but point 4
> is the one I'm referring to.  You discovered this independently yourself,
> according to your anecdote about the overlapping indexes.

Yes, but that was the heavily used index ... my belief is that the
_unused_ index, or infrequently used index wouldn't cause such memory
problems.  (Then again, I suppose it would be faster to scan a
non-optimal index that is in memory than an optimal index that is on
disk?)   Well, if both you and Greg Smith recommends to drop those
indexes, I suppose we probably should do that ... ;-)

Re: Memory usage - indexes

From
Tobias Brox
Date:
On 24 September 2010 00:12, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> Re index size, you could try indexes like:
>
> some_table(a)
> some_table(b)
>
> which may occupy less space, and the optimizer can bitmap and/or them to
> work like the compound index some_table(a,b).

Hm ... never considered that ... but is it cost effective on large
indexes?  I guess I should do some testing ...

Re: Memory usage - indexes

From
Tobias Brox
Date:
On 24 September 2010 21:24, Brad Nicholson <bnichols@ca.afilias.info> wrote:
>> The pertinent difference between pg_stat_user_indexes and
>> pg_statio_user_indexes is the latter shows the number of blocks read from
>> disk or found in the cache.
>
> I have a minor, but very important correction involving this point.  The
> pg_statio tables show you what blocks are found in the Postgres buffer
> cache, and what ones are not.

Right.  Then, studying how the pg_statio table develops over time
would probably give a hint on my first question in my original post
... how to check the hypothesis that we're running out of memory.
That said, I've sent an email to our sysadmin asking him to consider
the pg_buffercache module suggested by Greg Smith.

Increasing the shared_buffers at the cost of OS caches would then have
one "selling point" ... better possibilities for monitoring the memory
usage.
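[Editorial note: once pg_buffercache is installed (a contrib module on 8.3), a summary like the following shows which relations currently occupy shared_buffers -- a sketch assuming the default 8 kB block size:]

```sql
-- Top consumers of shared_buffers in the current database.
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
 WHERE b.reldatabase = (SELECT oid FROM pg_database
                         WHERE datname = current_database())
 GROUP BY c.relname
 ORDER BY count(*) DESC
 LIMIT 20;
```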

Re: Memory usage - indexes

From
Tobias Brox
Date:
Thanks for spending your time on this ... amidst all the useful
feedback I've received, I'd rate yours as the most useful post.

>> 1) Are there any good ways to verify my hypothesis?
>
> You can easily confirm whether the contents of the PostgreSQL buffer cache
> are what you think they are by installing pg_buffercache.  My paper and
> samples at http://www.pgcon.org/2010/schedule/events/218.en.html go
> over that.

I've asked the sysadmin to consider installing it.  From what I
understood from other posts, the pg_statio_user_indexes and
pg_statio_user_tables would also indicate to what extent data is found
in shared buffers and not.  Monitoring it over time could possibly
help us predict the "tipping points" before they happen.  Though
still, if most of the caching takes place on the OS level, one
wouldn't learn that much from studying the shared buffers usage ...
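[Editorial note: a minimal way to monitor the trend over time, sketched here with a made-up table name; a cron job can append a snapshot every few minutes and the counter deltas can be diffed later:]

```sql
-- One-time setup: a snapshot table (the name is just an example).
CREATE TABLE statio_index_snap AS
  SELECT now() AS snap_time, * FROM pg_statio_user_indexes;

-- Periodically (e.g. from cron): append another snapshot.
INSERT INTO statio_index_snap
  SELECT now(), * FROM pg_statio_user_indexes;
```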

> You can also integrate that with a look at the OS level information by using
> pgfincore:  http://www.pgcon.org/2010/schedule/events/261.en.html

... ah, right ... :-)

> I've found that if shared_buffers is set to a largish size, you can find out
> enough information from look at it to have a decent idea what's going on
> without going to that depth.  But it's available if you want it.

I haven't studied it in detail yet, but the information value in itself
may be a "selling point" for increasing the buffer size.

> I have easily measurable improvements on client systems increasing
> shared_buffers into the 4GB - 8GB range.  Popular indexes move into there,
> stay there, and only get written out at checkpoint time rather than all the
> time.

Ours is at 12 GB, out of 70 GB total RAM.

>  However, if you write heavily enough that much of this space gets
> dirty fast, you may not be able to go that high before checkpoint issues
> start to make such sizes impractical.

I think we did have some issues at some point ... we do have some
applications that are very sensitive towards latency.  Though, I think
the problem was eventually solved.  I think I somehow managed to
deliver the message that it was not a good idea to store
keep-alive-messages sent every second from multiple clients into the
main production database, and that it was an equally bad idea to
disconnect the clients after a three seconds timeout :-)  Anyway,
today we have mostly issues with read access, not write access.

> Using minimal values works for some people, particularly on
> Windows,

Huh ... does it mean Windows have better OS cache handling than Linux?
 To me it sounds insane to run a database under a buggy GUI ... but I
suppose I should keep that to myself :-)

> Yes, it would run slower, because now it has to sort through blocks in a
> larger index in order to find anything.  How significant that is depends on
> the relative size of the indexes.  To give a simple example, if (a) is 1GB,
> while (a,b) is 2GB, you can expect dropping (a) to halve the speed of index
> lookups.  Fatter indexes just take longer to navigate through.

Linear relationship between the time it takes to do index lookups vs
the fatness of the index?  That's not what I guessed in the first
place ... but I suppose you're right.

> P.S. You seem busy re-inventing pgstatspack this week:
>  http://pgfoundry.org/projects/pgstatspack/ does all of this "take a
> snapshot of the stats and store it in the database for future analysis" work
> for you.  Working on that instead of continuing to hack individual
> storage/retrieve scripts for each statistics counter set would be a better
> contribution to the PostgreSQL community.

Sometimes it takes more work to integrate work already done by others
than to reimplement the logic ... but anyway, I will have a look
before I make more snapshot tables ;-)

Re: Memory usage - indexes

From
Greg Smith
Date:
Tobias Brox wrote:
>> I have easily measurable improvements on client systems increasing
>> shared_buffers into the 4GB - 8GB range.  Popular indexes move into there,
>> stay there, and only get written out at checkpoint time rather than all the
>> time.
>
> Ours is at 12 GB, out of 70 GB total RAM.

Get a snapshot of what's in there using pg_buffercache.  And then reduce that to at or under 8GB.  Everyone I've seen test in this area says that performance starts to drop badly with shared_buffers greater than somewhere between 8GB and 10GB, so 12GB is well into the area where it's degraded already.


> Huh ... does it mean Windows have better OS cache handling than Linux?
> To me it sounds insane to run a database under a buggy GUI ... but I
> suppose I should keep that to myself :-)

No, Windows has slow shared memory issues when used the way PostgreSQL uses it, so you push at the OS cache instead as the next best thing.


> Linear relationship between the time it takes to do index lookups vs
> the fatness of the index?  That's not what I guessed in the first
> place ... but I suppose you're right.

If you're scanning 10% of a 10GB index, you can bet that's going to take longer to do than scanning 10% of a 5GB index.  So unless the bigger index is significantly adding to how selective the query is--so that you are, say, only scanning 2% of the 10GB index because indexing on two columns allowed you to remove many candidate rows--you might as well use a slimmer one instead.

Overindexed tables containing more columns than are actually selective is a very popular source of PostgreSQL slowdowns.  It's easy to say "oh, I look this data up using columns a,b,c, so lets put an index on a,b,c".  But if an index on a alone is 1% selective, that's probably wrong;  just index it instead, so that you have one lean, easy to maintain index there that's more likely to be in RAM at all times.  Let the CPU chew on filtering out which of those 1% matches also match the (b,c) criteria instead.

Obviously rough guidance here--you need to simulate to know for sure.  Ever drop an index in a transaction block just to see how a query plan changes if it's not there anymore, then roll back so it never really went away?  Great fun for this sort of experiment, try it sometime.
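[Editorial note: the experiment looks roughly like this, with the index and query names taken from elsewhere in this thread. Note that DROP INDEX holds an exclusive lock on the table until the ROLLBACK, so on a busy production server keep the transaction brief:]

```sql
BEGIN;
DROP INDEX account_transaction_on_type;  -- gone only inside this transaction
EXPLAIN ANALYZE
  SELECT * FROM acc_trans
   WHERE customer_id = 67368 AND trans_type = 8
   ORDER BY created DESC LIMIT 20;
ROLLBACK;                                -- the index never really went away
```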

> Sometimes it takes more work to integrate work already done by others
> than to reimplement the logic ... but anyway, I will have a look
> before I make more snapshot tables ;-)

You will be surprised at how exactly you are reimplementing that particular project.

-- 
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book

Re: Memory usage - indexes

From
Tobias Brox
Date:
On 25 September 2010 00:00, Greg Smith <greg@2ndquadrant.com> wrote:
> Overindexed tables containing more columns than are actually selective is a
> very popular source of PostgreSQL slowdowns.  It's easy to say "oh, I look
> this data up using columns a,b,c, so lets put an index on a,b,c".  But if an
> index on a alone is 1% selective, that's probably wrong;  just index it
> instead, so that you have one lean, easy to maintain index there that's more
> likely to be in RAM at all times.  Let the CPU chew on filtering out which
> of those 1% matches also match the (b,c) criteria instead.

Hm ... yes, we have quite a lot of those indexes.  Some of them we
can't live without.  Digging 1% out of a fat 100M-row table (i.e. 1M
rows) when one really just needs 20 rows is just too costly.  Well, I
guess we should try to have a serious walk-through to see what indexes
really are needed.  After all, that really seems to be our main
problem nowadays - some frequently used indexes don't fit very
snugly into memory.

> Ever drop an index in a transaction block just to see how a query plan
> changes if it's not there anymore, then rollback so it never really went away?
> Great fun for this sort of experiment, try it sometime.

Yes, I was playing a bit with it a long time ago ... but it seems a bit
risky to do this in the production environment ... wouldn't want
inserts to get stuck due to locks.  There is also the problem that we
don't really have an overview of which queries would be affected if we
drop an index.  The best thing we can do is to drop an index and
monitor the stats on seq scans, new slow queries popping up, etc.
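[Editorial note: that kind of monitoring can start from the cumulative stats views; a sketch:]

```sql
-- Tables suddenly accumulating sequential scans after an index is
-- dropped show up near the top here.
SELECT relname, seq_scan, seq_tup_read, idx_scan
  FROM pg_stat_user_tables
 ORDER BY seq_tup_read DESC
 LIMIT 20;
```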

Re: Memory usage - indexes

From
Tobias Brox
Date:
I just got this crazy, stupid or maybe genius idea :-)

One thing that I've learned in this thread is that fat indexes (i.e.
an index on some_table(a,b,c,d,e,f)) are to be avoided as much as
possible.

One of our biggest indexes looks like this:

acc_trans(customer_id, trans_type, created)

For the very most of the time an index like this would do:

acc_trans(customer_id, created)

But then there are those few troublesome customers with tens of
thousands of transactions who interactively inspect transaction
listings through the web.  Sometimes the query "give me my 20 most
recent transactions of trans_type 6" gets stuck - maybe the customer
has no transactions of trans type 6, and all the transactions need to
be scanned through.  Since this is done interactively and through our
front-end web page, we want all queries to be lightning fast.

Now, my idea is to drop that fat index and replace it with partial
("conditional") indexes for a dozen heavy users - like these:

  acc_trans(trans_type, created) where customer_id=224885;
  acc_trans(trans_type, created) where customer_id=643112;
  acc_trans(trans_type, created) where customer_id=15;

or maybe like this:

  acc_trans(customer_id, trans_type, created) where customer_id in ( ... );

Any comments?

My sysadmin is worried that it would be a too big hit on performance
when doing inserts.  It may also cause more overhead when planning the
queries.  Is that significant?  Is this idea genius or stupid or just
somewhere in between?
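[Editorial note: for reference, the partial-index variant sketched above would be created like this (the index name is invented for illustration). One caveat: the planner only considers such an index when the query's WHERE clause provably implies the index predicate, e.g. a literal customer_id=224885, so parameterized queries may not use it:]

```sql
CREATE INDEX acc_trans_cust_224885_idx
    ON acc_trans (trans_type, created)
 WHERE customer_id = 224885;
```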

Re: Memory usage - indexes

From
Mark Kirkwood
Date:
On 29/09/10 19:41, Tobias Brox wrote:
> I just got this crazy, stupid or maybe genius idea :-)
>
>
> Now, my idea is to drop that fat index and replace it with conditional
> indexes for a dozen of heavy users - like those:
>
>    acc_trans(trans_type, created) where customer_id=224885;
>    acc_trans(trans_type, created) where customer_id=643112;
>    acc_trans(trans_type, created) where customer_id=15;
>
> or maybe like this:
>
>    acc_trans(customer_id, trans_type, created) where customer_id in ( ... );
>
> Any comments?
>
> My sysadmin is worried that it would be a too big hit on performance
> when doing inserts.  It may also cause more overhead when planning the
> queries.  Is that significant?  Is this idea genius or stupid or just
> somewhere in between?
>
>

Yeah, I think the idea of trying to have a few smaller indexes for the
'hot' customers is a good one. However I am wondering if it is worth
testing whether a bitmap scan/merge of smaller single-column indexes
is actually more efficient - i.e:

acc_trans(trans_type);
acc_trans(created);
acc_trans(customer_id);

It may mean that you have to scrutinize your effective_cache_size and
work_mem parameters, but it could possibly be simpler and more flexible.

regards

Mark




Re: Memory usage - indexes

From
Tobias Brox
Date:
On 29 September 2010 10:03, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> Yeah, I think the idea of trying to have a few smaller indexes for the 'hot'
> customers is a good idea. However I am wondering if just using single column
> indexes and seeing if the bitmap scan/merge of smaller indexes is actually
> more efficient is worth testing - i.e:
>
> acc_trans(trans_type);
> acc_trans(created);
> acc_trans(customer_id);

My gut feeling tells me that it's not a good idea - consider that we
want to pull out 20 rows from a 60M-row table.  If I'm not mistaken,
bitmapping requires operations on bitmaps covering the whole table -
60M bits is still 7.5 megabytes.  Well, I suppose that nowadays it's
relatively fast to bitmap 7.5 MB of memory, but probably some orders
of magnitude more than the few milliseconds it takes to pick out the
20 rows directly from the specialized index.

Well, why rely on gut feelings when things can be measured?  I
didn't take those figures from the production database server though,
but at least it gives a hint on what to expect.

First, using the three-key index for "select * from acc_trans where
customer_id=? and trans_type=? order by created desc limit 20".  I
chose one of the users with most transactions, and I tested with the
most popular transaction type as well as one transaction type where he
has just a handful of transactions.  Both took significantly less than
1 ms to run.  Then I deleted all indexes and created the three
suggested indexes.  Using the popular transaction type, it took 123
ms.  Well, that's 500 times as much time, but still acceptable.  Here
is the query plan:

=> explain analyze select * from acc_trans where customer_id=67368
and trans_type=8 order by created desc limit 20;

      QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1486.23..1486.28 rows=20 width=200) (actual
time=123.685..123.687 rows=3 loops=1)
   ->  Sort  (cost=1486.23..1486.34 rows=43 width=200) (actual
time=123.684..123.685 rows=3 loops=1)
         Sort Key: created
         Sort Method:  quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on acc_trans  (cost=1313.90..1485.08
rows=43 width=200) (actual time=121.350..123.669 rows=3 loops=1)
               Recheck Cond: ((trans_type = 8) AND (customer_id = 67368))
               ->  BitmapAnd  (cost=1313.90..1313.90 rows=43 width=0)
(actual time=120.342..120.342 rows=0 loops=1)
                     ->  Bitmap Index Scan on
account_transaction_on_type  (cost=0.00..256.31 rows=13614 width=0)
(actual time=12.200..12.200 rows=43209 loops=1)
                           Index Cond: (trans_type = 8)
                     ->  Bitmap Index Scan on
account_transaction_on_user  (cost=0.00..1057.31 rows=56947 width=0)
(actual time=104.578..104.578 rows=59133 loops=1)
                           Index Cond: (users_id = 67368)
 Total runtime: 123.752 ms
(12 rows)

With the most popular trans type it chose another plan and it took
more than 3s (totally unacceptable):

=> explain analyze select * from acc_trans where customer_id=67368
and trans_type=6 order by created desc limit 20;

               QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..44537.82 rows=20 width=200) (actual
time=1746.288..3204.029 rows=20 loops=1)
   ->  Index Scan Backward using account_transaction_on_created on
acc_trans  (cost=0.00..55402817.90 rows=24879 width=200) (actual
time=1746.285..3204.021 rows=20 loops=1)
         Filter: ((customer_id = 67368) AND (trans_type = 6))
 Total runtime: 3204.079 ms
(4 rows)

Although this customer has several tens of thousands of transactions,
dropping the three-key index and using an index on (users_id, created)
is clearly a better option than running out of memory:

=> explain analyze select * from acc_trans where customer_id=67368 and
trans_type=8 order by created desc limit 20;

     QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..98524.88 rows=20 width=200) (actual
time=0.669..197.012 rows=3 loops=1)
   ->  Index Scan Backward using account_transaction_by_user_ts on
acc_trans  (cost=0.00..211828.49 rows=43 width=200) (actual
time=0.668..197.006 rows=3 loops=1)
         Index Cond: (customer_id = 67368)
         Filter: (trans_type = 8)
 Total runtime: 197.066 ms
(5 rows)

0.2s sounds acceptable, but this may be just a small part of building
the web page, so it adds up ... and (though I didn't check how
profitable this customer is) this is probably exactly the kind of
customer we wouldn't want to annoy with several seconds of page load
time.

Re: Memory usage - indexes

From
Mark Kirkwood
Date:
On 30/09/10 01:09, Tobias Brox wrote:
> With the most popular trans type it chose another plan and it took
> more than 3s (totally unacceptable):
>
>

Try tweaking effective_cache_size up a bit and see what happens - I've
found these bitmap plans to be sensitive to it sometimes.
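[Editorial note: effective_cache_size can be changed per session, so the experiment is cheap; a sketch using the query from earlier in the thread, with an example value only:]

```sql
SET effective_cache_size = '48GB';   -- example value, tune to your RAM
EXPLAIN ANALYZE
  SELECT * FROM acc_trans
   WHERE customer_id = 67368 AND trans_type = 6
   ORDER BY created DESC LIMIT 20;
RESET effective_cache_size;
```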

regards

Mark