Re: ANALYZE to be ignored by VACUUM - Mailing list pgsql-hackers
From | ITAGAKI Takahiro |
---|---|
Subject | Re: ANALYZE to be ignored by VACUUM |
Date | |
Msg-id | 20080219160731.6831.52131E4D@oss.ntt.co.jp Whole thread Raw |
In response to | Re: ANALYZE to be ignored by VACUUM (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: ANALYZE to be ignored by VACUUM
Re: ANALYZE to be ignored by VACUUM |
List | pgsql-hackers |
Tom Lane <tgl@sss.pgh.pa.us> wrote: > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes: > > In my workload, ANALYZE takes long time (1min at statistics_target = 10, > > and 5min at 100), but the updated table needs to be vacuumed every 30 seconds > > because seqscans run on the table repeatedly. > > There is something *seriously* wrong with that. If vacuum can complete > in under 30 seconds, how can analyze take a minute? (I'm also wondering > whether you'll still need such frantic vacuuming with HOT...) There are two tables here: [S] A small table, that is frequently updated and seqscan-ed [L] A large table, that takes a longtime to be analyzed The table [S] should be vacuumed every 30 seconds, because dead tuples affects the performance of seqscan seriously. HOT and autovacuum are very useful here *unless* long transactions begins. Analyzing [L] takes 1-5 minutes and both HOT and vacuums don't work during it. I want to use statistics_target = 100 at heart for more accurate statistics, but I'm using 10 instead because of avoiding long transactions by analyze. Basically, the above is based on avoiding needless long transactions. Aside from ANALYZE, pg_start_backup() is also a long transactional command. It takes checkpoint_timeout * checkpoint_completion_target (2.5- min.) at worst. Users could avoid their own long transactions, but have no choice but to use those provided maintenance commands. > > So, I'm targeting only ANALZYE for now by changing ANALYZE to be ignored > > by VACUUM. > > I think we need to understand what the real problem is with your test > case. This proposal seems very messy/ugly to me, and I'm unconvinced > that it solves anything. I think there are some direct or indirect solutions: 1. VACUUM removes recently dead tuples under some circumstances. For example, tuples updated twice after a long transactionbegins. The oldest tuple can be seen by the old long transaction and the newest can be seen new transactions.However, the intermediate tuple is invisible all transactions. 2. ANALYZE don't disturb vacuuming of other tables. (my first proposal) We know ANALYZE don't touch other tables duringsampling phases. We can treat analyzing transactions as same as PROC_IN_VACUUM xacts. The same can be said for pg_start_backup;non-transactinal starting backup command might be better. 3. Recover density of tuples; i.e, auto-CLUSTER. If the performance recovers after long transactions, the problem willnot be so serious. It would be better that autovacuum invokes CLUSTER if required and we could run CLUSTER concurrently. 4. ANALYZE finishes in a short time. It is ok that VACUUM takes a long time because it is not a transaction, but ANALYZEshould not. It requres cleverer statistics algorithm. Sampling factor 10 is not enough for pg_stats.n_distinct. Weseems to estimate n_distinct too low for clustered (ordered) tables. There might be a matter of research in calculationof n_distinct. Also, this cannot resolve the problem in pg_start_backup. 1 or 3 might be more universal approaches, but I think 2 will be an independent improvement from them. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
pgsql-hackers by date: