Re: [HACKERS] More detail on settings for pgavd? - Mailing list pgsql-performance

From Chester Kustarz
Subject Re: [HACKERS] More detail on settings for pgavd?
Date
Msg-id Pine.BSO.4.44.0311201311200.19584-100000@detroit.arbor.net
Whole thread Raw
In response to Re: [HACKERS] More detail on settings for pgavd?  (Josh Berkus <josh@agliodbs.com>)
Responses Re: [HACKERS] More detail on settings for pgavd?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On Thu, 20 Nov 2003, Josh Berkus wrote:
> Additionally, you are not thinking of this in terms of an overall database
> maintanence strategy.   Lazy Vacuum needs to stay below the threshold of the
> Free Space Map (max_fsm_pages) to prevent creeping bloat from setting in to
> your databases.   With proper configuration of pg_avd, vacuum_mem and FSM
> values, it should be possible to never run a VACUUM FULL again, and as of 7.4
> never run an REINDEX again either.

is there any command you can run to see how much of the FSM is filled? is
there any way to tell which tables are filling it?

> Analyze is needed only as often as the *aggregate distribution* of data in the
> tables changes.   Depending on the application, this could be frequently, but
> far more often (in my experience running multiple databases for several
> clients) the data distribution of very large tables changes very slowly over
> time.

analyze does 2 things for me:
1. gets reasonable aggregate statistics
2. generates STATISTICS # of bins for the most frequent hitters

(2) is very important for me. my values typically seem to have power-law
like distributions. i need enough bins to reach a "cross-over" point where
the last bin is frequent enough to make an index scan useful. also,
i want enough bins so that the planner can choose index a or b for:
    select * from foo where a=n and b=m;

the selectivity of either index depends not only on the average selectivity
of index a or index b, but on n and m as well. for example, 1M row table:

value    % of rows
v1    23
v2    12
v3    4.5
v4    4
v5    3.5
...

you can see that picking an index for =v1 would be poor. picking the
20th most common value would be 0.5% selective. much better. of course
this breaks down for more complex operators, but = is fairly common.

> So if you're going to have a seperate ANALYZE schedule at all, it should be
> slightly less frequent than VACUUM for large tables.   Either that, or drop
> the idea, and simplify pg_avd by running VACUUM ANALYZE all the time instead
> of having 2 seperate schedules.

i have some tables which are insert only. i do not want to vacuum them
because there are never any dead tuples in them and the vacuum grows the
indexes. plus it is very expensive (they tables grow rather large.) after they
expire i drop the whole table to make room for a newer one (making sort
of a rolling log with many large tables.)

i need to analyze them every so often so that the planner knows that
there is 1 row, 100 rows, 100k rows, 1M. the funny thing is
that because i never vacuum the tables, the relpages on the index never
grows. don't know if this affects anything (this is on 7.2.3).

vacuum is to reclaim dead tuples. this means it depends on update and
delete. analyze depends on data values/distribution. this means it depends on
insert, update, and delete. thus the dependencies are slightly different
between the 2 operations, an so you can come up with use-cases that
justify running either more frequently.

i am not sure how failed transactions fit into this though, not that i think
anybody ever has very many. maybe big rollbacks during testing?



pgsql-performance by date:

Previous
From: Josh Berkus
Date:
Subject: Re: More detail on settings for pgavd?
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] More detail on settings for pgavd?