Re: ANALYZE command progress checker - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: ANALYZE command progress checker |
Date | |
Msg-id | CAD21AoAMcB4Pu-2Q-sO-5k29OsXri=uK7F9aoASbsieS4Tkcpw@mail.gmail.com Whole thread Raw |
In response to | [HACKERS] ANALYZE command progress checker (vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp>) |
Responses |
Re: ANALYZE command progress checker
|
List | pgsql-hackers |
On Tue, Apr 4, 2017 at 2:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/03/30 17:39, Masahiko Sawada wrote: >> On Wed, Mar 29, 2017 at 5:38 PM, vinayak wrote: >>>>> I have updated the patch. >> >> I reviewed v7 patch. >> >> When I ran ANALYZE command to the table having 5 millions rows with 3 >> child tables, the progress report I got shows the strange result. The >> following result was output after sampled all target rows from child >> tables (i.g, after finished acquire_inherited_sample_rows). I think >> the progress information should report 100% complete at this point. >> > > [...] > >> >> I guess the cause of this is that you always count the number of >> sampled rows in acquire_sample_rows staring from 0 for each child >> table. num_rows_sampled is reset whenever starting collecting sample >> rows. >> Also, even if table has child table, the total number of sampling row >> is not changed. I think that main differences between analyzing on >> normal table and on partitioned table is where we fetch the sample row >> from. So I'm not sure if adding "computing inherited statistics" phase >> is desirable. If you want to report progress of collecting sample rows >> on child table I think that you might want to add the information >> showing which child table is being analyzed. > > Two kinds of statistics are collected if the table is a inheritance parent. > > First kind considers the table by itself and calculates regular > single-table statistics using rows sampled from the only available heap > (by calling do_analyze_rel() with inh=false). > > The second kind are called inheritance statistics, which consider rows > sampled from the whole inheritance hierarchy (by calling do_analyze_rel() > with inh=true). Note that we are still collecting statistics for the > parent table, so we not really "analyzing" child tables; we just collect > sample rows from them to calculate whole-tree statistics for the root > parent table. Agreed. > It might still be possible to show the child table being > sampled though. > >> -- >> pg_stat_progress_analyze.num_target_sample_rows is set while ignoring >> the number of rows the target table has actually. So If the table has >> rows less than target number of rows, the num_rows_sampled never reach >> to num_target_sample_rows. >> >> -- >> @@ -1180,6 +1213,9 @@ acquire_sample_rows(Relation onerel, int elevel, >> } >> >> samplerows += 1; >> + >> + /* Report number of rows sampled so far */ >> + >> pgstat_progress_update_param(PROGRESS_ANALYZE_NUM_ROWS_SAMPLED, >> numrows); >> } >> } >> >> I think that it's wrong to use numrows variable to report the progress >> of the number of rows sampled. As the following comment mentioned, we >> first save row into rows[] and increment numrows until numrows < >> targrows. And then we could replace stored sample row with new sampled >> row. That is, after collecting "numrows" rows, analyze can still take >> a long time to get and replace the sample row. To computing progress >> of collecting sample row, IMO reporting the number of blocks we >> scanned so far is better than number of sample rows. Thought? > > We can report progress in terms of individual blocks only inside > acquire_sample_rows(), which seems undesirable when one thinks that we > will be resetting the target for every child table. We should have a > global target that considers all child tables in the inheritance > hierarchy, which maybe is possible if we count them beforehand in > acquire_inheritance_sample_rows(). But why not use target sample rows, > which remains the same for both when we're collecting sample rows from one > table and from the whole inheritance hierarchy. We can keep the count of > already collected rows in a struct that is used across calls for all the > child tables and increment upward from that count when we start collecting > from a new child table. An another option I came up with is that support new pgstat progress function, say pgstat_progress_incr_param, which increments index'th member in st_progress_param[]. That way we just need to report a delta using that function. > >>> /* >>> * The first targrows sample rows are simply copied into the >>> * reservoir. Then we start replacing tuples in the sample >>> * until we reach the end of the relation. This algorithm is >>> * from Jeff Vitter's paper (see full citation below). It >>> * works by repeatedly computing the number of tuples to skip >>> * before selecting a tuple, which replaces a randomly chosen >>> * element of the reservoir (current set of tuples). At all >>> * times the reservoir is a true random sample of the tuples >>> * we've passed over so far, so when we fall off the end of >>> * the relation we're done. >>> */ > > It seems that we could use samplerows instead of numrows to count the > progress (if we choose to count progress in terms of sample rows collected). > I guess it's hard to count progress in terms of sample rows collected even if we use samplerows instead, because samplerows can be incremented independently of the target number of sampling rows. The samplerows can be incremented up to the total number of rows of relation. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
pgsql-hackers by date: