Re: ANALYZE command progress checker - Mailing list pgsql-hackers
From | vinayak |
---|---|
Subject | Re: ANALYZE command progress checker |
Date | |
Msg-id | 95cee56c-da62-13bb-1158-71a41f9ea559@lab.ntt.co.jp Whole thread Raw |
In response to | [HACKERS] ANALYZE command progress checker (vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp>) |
List | pgsql-hackers |
On 2017/03/30 17:39, Masahiko Sawada wrote: > On Wed, Mar 29, 2017 at 5:38 PM, vinayak > <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote: >> On 2017/03/25 4:30, Robert Haas wrote: >>> On Fri, Mar 24, 2017 at 3:41 AM, vinayak >>> <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote: >>>> I have updated the patch. >>> You can't change the definition of AcquireSampleRowsFunc without >>> updating the documentation in fdwhandler.sgml, but I think I don't >>> immediately understand why that's a thing we want to do at all if >>> neither file_fdw nor postgres_fdw are going to use the additional >>> argument. It seems to be entirely pointless because the new value >>> being passed down is always zero and even if the callee were to update >>> it, do_analyze_rel() would just ignore the change on return. Am I >>> missing something here? Also, the whitespace-only changes to fdwapi.h >>> related to AcquireSampleRowsFunc going to get undone by pgindent, so >>> even if there's some reason to change that you should leave the lines >>> that don't need changing untouched. > I reviewed v7 patch. Thank you for reviewing the patch. > > When I ran ANALYZE command to the table having 5 millions rows with 3 > child tables, the progress report I got shows the strange result. The > following result was output after sampled all target rows from child > tables (i.g, after finished acquire_inherited_sample_rows). I think > the progress information should report 100% complete at this point. > > #= select * from pg_stat_progress_analyze ; > pid | datid | datname | relid | phase | > num_target_sample_rows | num_rows_sampled > -------+-------+----------+-------+----------------------------------+------------------------+------------------ > 81719 | 13215 | postgres | 16440 | collecting inherited sample rows | > 3000000 | 1800000 > (1 row) > > I guess the cause of this is that you always count the number of > sampled rows in acquire_sample_rows staring from 0 for each child > table. num_rows_sampled is reset whenever starting collecting sample > rows. > Also, even if table has child table, the total number of sampling row > is not changed. I think that main differences between analyzing on > normal table and on partitioned table is where we fetch the sample row > from. So I'm not sure if adding "computing inherited statistics" phase > is desirable. If you want to report progress of collecting sample rows > on child table I think that you might want to add the information > showing which child table is being analyzed. Yes. I think showing child table information would be good to user/DBA. > -- > pg_stat_progress_analyze.num_target_sample_rows is set while ignoring > the number of rows the target table has actually. So If the table has > rows less than target number of rows, the num_rows_sampled never reach > to num_target_sample_rows. > > -- > @@ -1180,6 +1213,9 @@ acquire_sample_rows(Relation onerel, int elevel, > } > > samplerows += 1; > + > + /* Report number of rows sampled so far */ > + > pgstat_progress_update_param(PROGRESS_ANALYZE_NUM_ROWS_SAMPLED, > numrows); > } > } > > I think that it's wrong to use numrows variable to report the progress > of the number of rows sampled. As the following comment mentioned, we > first save row into rows[] and increment numrows until numrows < > targrows. And then we could replace stored sample row with new sampled > row. That is, after collecting "numrows" rows, analyze can still take > a long time to get and replace the sample row. To computing progress > of collecting sample row, IMO reporting the number of blocks we > scanned so far is better than number of sample rows. Thought? > >> /* >> * The first targrows sample rows are simply copied into the >> * reservoir. Then we start replacing tuples in the sample >> * until we reach the end of the relation. This algorithm is >> * from Jeff Vitter's paper (see full citation below). It >> * works by repeatedly computing the number of tuples to skip >> * before selecting a tuple, which replaces a randomly chosen >> * element of the reservoir (current set of tuples). At all >> * times the reservoir is a true random sample of the tuples >> * we've passed over so far, so when we fall off the end of >> * the relation we're done. >> */ I think we can either report number of blocks scanned so far or number of sample rows. But I agree with you that reporting the number of blocks scanned so far would be better than reporting number of sample rows. >> I Understood that we could not change the definition of >> AcquireSampleRowsFunc without updating the documentation in fdwhandler.sgml. >> In the last patch I have modified the definition of AcquireSampleRowsFunc to >> handle the inheritance case. >> If the table being analyzed has one or more children, ANALYZE will gather >> statistics twice: >> once on the rows of parent table only and second on the rows of the parent >> table with all of its children. >> So while reporting the progress the value of number of target sample rows >> and number of rows sampled is mismatched. >> For single relation it works fine. >> >> In the attached patch I have not change the definition of >> AcquireSampleRowsFunc. >> I have updated inheritance case in the the documentation. >> >>> + /* Reset rows processed to zero for the next column */ >>> + >>> pgstat_progress_update_param(PROGRESS_ANALYZE_NUM_ROWS_PROCESSED, >>> 0); >>> >>> This seems like a bad design. If you see a value other than zero in >>> this field, you still won't know how much progress analyze has made, >>> because you don't know which column you're processing. I also feel >>> like you're not paying enough attention to a key point here that I >>> made in my last review, which is that you need to look at where >>> ANALYZE actually spends most of the time. If the process of computing >>> the per-row statistics consumes significant CPU time, then maybe >>> there's some point in trying to instrument it, but does it? Unless >>> you know the answer to that question, you don't know whether there's >>> even a point to trying to instrument this. >>> >> Understood. The computing statistics phase takes long time so I am looking >> at the code. > Yes, ANALYZE spends most of time on computing statistics phase. I > measured the duration time for each phases on my environment. I set up > the table by pgbench -i -s 100 (total 10 million rows), and filled the > filler column of pgbench_accounts table with random string. And I set > default_statistics_target to 1000, and ran analyze on that table. The > analyze scanned all blocks of target table and collected 3000000 > sample rows. The durations of each phase are, > > * Sampling : 7 sec > * Computing statistics : 28 sec > * aid column : 1 sec > * bid column : 1 sec > * abalance column : 1 sec > * filler column : 25 sec > > I'm not sure if we can calculate the meaningful progress information > in computing statistics function such as compute_scalar_stats, > compute_trivial_stats. But I think that at least showing which > statistics of column is being computed would be good information for > user. +1. I think this kind of progress information would be helpful for user. Regards, Vinayak Pokale NTT Open Source Software Center
pgsql-hackers by date: