Thread: pg_stat_progress_create_index vs. parallel index builds
Hi, While experimenting with parallel index builds, I've noticed a somewhat strange behavior of pg_stat_progress_create_index when a btree index is built with parallel workers - some of the phases seem to be missing. In serial (no parallelism) mode, the progress is roughly this (it's always the first/last timestamp of each phase): | command | phase -------------+--------------+---------------------------------------- 12:56:01 AM | CREATE INDEX | building index: scanning table ... 01:06:22 AM | CREATE INDEX | building index: scanning table 01:06:23 AM | CREATE INDEX | building index: sorting live tuples ... 01:13:10 AM | CREATE INDEX | building index: sorting live tuples 01:13:11 AM | CREATE INDEX | building index: loading tuples in tree ... 01:24:02 AM | CREATE INDEX | building index: loading tuples in tree So it goes through three phases: 1) scanning tuples 2) sorting live tuples 3) loading tuples in tree But with parallel build index build, it changes to: | command | phase -------------+--------------+---------------------------------------- 11:40:48 AM | CREATE INDEX | building index: scanning table ... 11:47:24 AM | CREATE INDEX | building index: scanning table (scan complete) 11:56:22 AM | CREATE INDEX | building index: scanning table 11:56:23 AM | CREATE INDEX | building index: loading tuples in tree ... 12:05:33 PM | CREATE INDEX | building index: loading tuples in tree That is, the "sorting live tuples" phase disappeared, and instead it seems to be counted in the "scanning table" one, as if there was an update of the phase missing. I've only tried this on master, but I assume it behaves like this in the older releases too. I wonder if this is intentional - it sure is a bit misleading. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 2 Jun 2021 at 13:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > While experimenting with parallel index builds, I've noticed a somewhat > strange behavior of pg_stat_progress_create_index when a btree index is > built with parallel workers - some of the phases seem to be missing. > > In serial (no parallelism) mode, the progress is roughly this (it's > always the first/last timestamp of each phase): > > | command | phase > -------------+--------------+---------------------------------------- > 12:56:01 AM | CREATE INDEX | building index: scanning table > ... > 01:06:22 AM | CREATE INDEX | building index: scanning table > 01:06:23 AM | CREATE INDEX | building index: sorting live tuples > ... > 01:13:10 AM | CREATE INDEX | building index: sorting live tuples > 01:13:11 AM | CREATE INDEX | building index: loading tuples in tree > ... > 01:24:02 AM | CREATE INDEX | building index: loading tuples in tree > > So it goes through three phases: > > 1) scanning tuples > 2) sorting live tuples > 3) loading tuples in tree > > But with parallel build index build, it changes to: > > | command | phase > -------------+--------------+---------------------------------------- > 11:40:48 AM | CREATE INDEX | building index: scanning table > ... > 11:47:24 AM | CREATE INDEX | building index: scanning table (scan > complete) > 11:56:22 AM | CREATE INDEX | building index: scanning table > 11:56:23 AM | CREATE INDEX | building index: loading tuples in tree > ... > 12:05:33 PM | CREATE INDEX | building index: loading tuples in tree > > That is, the "sorting live tuples" phase disappeared, and instead it > seems to be counted in the "scanning table" one, as if there was an > update of the phase missing. > I've only tried this on master, but I assume it behaves like this in the > older releases too. I wonder if this is intentional - it sure is a bit > misleading. This was a suprise to me as well. According to documentation in sortsupport.h (line 125-129) the parallel workers produce pre-sorted segments during the scanning phase, which are subsequently merged by the leader. This might mean that the 'sorting' phase is already finished during the 'scanning' phase by waiting for the parallel workers; I haven't looked further if this is the case and whether it could be changed to also produce the sorting metrics, but seeing as it is part of the parallel workers API of tuplesort, I think fixing it in current releases is going to be difficult. With regards, Matthias van de Meent
On 6/2/21 3:03 PM, Matthias van de Meent wrote: > On Wed, 2 Jun 2021 at 13:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: >> >> Hi, >> >> While experimenting with parallel index builds, I've noticed a somewhat >> strange behavior of pg_stat_progress_create_index when a btree index is >> built with parallel workers - some of the phases seem to be missing. >> >> In serial (no parallelism) mode, the progress is roughly this (it's >> always the first/last timestamp of each phase): >> >> | command | phase >> -------------+--------------+---------------------------------------- >> 12:56:01 AM | CREATE INDEX | building index: scanning table >> ... >> 01:06:22 AM | CREATE INDEX | building index: scanning table >> 01:06:23 AM | CREATE INDEX | building index: sorting live tuples >> ... >> 01:13:10 AM | CREATE INDEX | building index: sorting live tuples >> 01:13:11 AM | CREATE INDEX | building index: loading tuples in tree >> ... >> 01:24:02 AM | CREATE INDEX | building index: loading tuples in tree >> >> So it goes through three phases: >> >> 1) scanning tuples >> 2) sorting live tuples >> 3) loading tuples in tree >> >> But with parallel build index build, it changes to: >> >> | command | phase >> -------------+--------------+---------------------------------------- >> 11:40:48 AM | CREATE INDEX | building index: scanning table >> ... >> 11:47:24 AM | CREATE INDEX | building index: scanning table (scan >> complete) >> 11:56:22 AM | CREATE INDEX | building index: scanning table >> 11:56:23 AM | CREATE INDEX | building index: loading tuples in tree >> ... >> 12:05:33 PM | CREATE INDEX | building index: loading tuples in tree >> >> That is, the "sorting live tuples" phase disappeared, and instead it >> seems to be counted in the "scanning table" one, as if there was an >> update of the phase missing. > >> I've only tried this on master, but I assume it behaves like this in the >> older releases too. I wonder if this is intentional - it sure is a bit >> misleading. > > This was a suprise to me as well. According to documentation in > sortsupport.h (line 125-129) the parallel workers produce pre-sorted > segments during the scanning phase, which are subsequently merged by > the leader. This might mean that the 'sorting' phase is already > finished during the 'scanning' phase by waiting for the parallel > workers; I haven't looked further if this is the case and whether it > could be changed to also produce the sorting metrics, but seeing as it > is part of the parallel workers API of tuplesort, I think fixing it in > current releases is going to be difficult. > Maybe. Perhaps it's more complicated to decide when to switch between phases with parallel workers. Still, the table scan is done after ~8 minutes (based on blocks_total vs. blocks_done), yet we keep that phase for another ~9 minutes. It seems this is where the workers do the sort, so "sorting live tuples" seems like a more natural phase for this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 2 Jun 2021 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 6/2/21 3:03 PM, Matthias van de Meent wrote: > > On Wed, 2 Jun 2021 at 13:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > >> > >> Hi, > >> > >> While experimenting with parallel index builds, I've noticed a somewhat > >> strange behavior of pg_stat_progress_create_index when a btree index is > >> built with parallel workers - some of the phases seem to be missing. > >> > >> In serial (no parallelism) mode, the progress is roughly this (it's > >> always the first/last timestamp of each phase): > >> > >> | command | phase > >> -------------+--------------+---------------------------------------- > >> 12:56:01 AM | CREATE INDEX | building index: scanning table > >> ... > >> 01:06:22 AM | CREATE INDEX | building index: scanning table > >> 01:06:23 AM | CREATE INDEX | building index: sorting live tuples > >> ... > >> 01:13:10 AM | CREATE INDEX | building index: sorting live tuples > >> 01:13:11 AM | CREATE INDEX | building index: loading tuples in tree > >> ... > >> 01:24:02 AM | CREATE INDEX | building index: loading tuples in tree > >> > >> So it goes through three phases: > >> > >> 1) scanning tuples > >> 2) sorting live tuples > >> 3) loading tuples in tree > >> > >> But with parallel build index build, it changes to: > >> > >> | command | phase > >> -------------+--------------+---------------------------------------- > >> 11:40:48 AM | CREATE INDEX | building index: scanning table > >> ... > >> 11:47:24 AM | CREATE INDEX | building index: scanning table (scan > >> complete) > >> 11:56:22 AM | CREATE INDEX | building index: scanning table > >> 11:56:23 AM | CREATE INDEX | building index: loading tuples in tree > >> ... > >> 12:05:33 PM | CREATE INDEX | building index: loading tuples in tree > >> > >> That is, the "sorting live tuples" phase disappeared, and instead it > >> seems to be counted in the "scanning table" one, as if there was an > >> update of the phase missing. > > > >> I've only tried this on master, but I assume it behaves like this in the > >> older releases too. I wonder if this is intentional - it sure is a bit > >> misleading. > > > > This was a suprise to me as well. According to documentation in > > sortsupport.h (line 125-129) the parallel workers produce pre-sorted > > segments during the scanning phase, which are subsequently merged by > > the leader. This might mean that the 'sorting' phase is already > > finished during the 'scanning' phase by waiting for the parallel > > workers; I haven't looked further if this is the case and whether it > > could be changed to also produce the sorting metrics, but seeing as it > > is part of the parallel workers API of tuplesort, I think fixing it in > > current releases is going to be difficult. > > > > Maybe. Perhaps it's more complicated to decide when to switch between > phases with parallel workers. Still, the table scan is done after ~8 > minutes (based on blocks_total vs. blocks_done), yet we keep that phase > for another ~9 minutes. It seems this is where the workers do the sort, > so "sorting live tuples" seems like a more natural phase for this. After looking at it a bit more, it seems like a solution was actually easier than I'd expected. PFA a prototype (unvalidated, but check-world -ed) patch that would add these subphases of progress reporting, which can be backpatched down to 12. Do note that this is a partial fix, as it only fixes it when the leader participates; but I don't think that limitation is too much of a problem because only on builds which explicitly define the non-standard DISABLE_LEADER_PARTICIPATION this will happen, and in such cases the progress reporting for the loading phase will fail as well. With regards, Matthias van de Meent
Attachment
On 6/2/21 4:54 PM, Matthias van de Meent wrote: > On Wed, 2 Jun 2021 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: >> >> ... >> > > After looking at it a bit more, it seems like a solution was actually > easier than I'd expected. PFA a prototype (unvalidated, but > check-world -ed) patch that would add these subphases of progress > reporting, which can be backpatched down to 12. > Nice. I gave it a try on the database I'm experimenting with, and it seems to be working fine. Please add it to the next CF. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 2 Jun 2021 at 17:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 6/2/21 4:54 PM, Matthias van de Meent wrote: > > On Wed, 2 Jun 2021 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > >> > >> ... > >> > > > > After looking at it a bit more, it seems like a solution was actually > > easier than I'd expected. PFA a prototype (unvalidated, but > > check-world -ed) patch that would add these subphases of progress > > reporting, which can be backpatched down to 12. > > > > Nice. I gave it a try on the database I'm experimenting with, and it > seems to be working fine. Please add it to the next CF. Thanks, cf available here: https://commitfest.postgresql.org/33/3149/ With regards, Matthias van de Meent
On 2021-Jun-02, Tomas Vondra wrote: > Hi, > > While experimenting with parallel index builds, I've noticed a somewhat > strange behavior of pg_stat_progress_create_index when a btree index is > built with parallel workers - some of the phases seem to be missing. Hmm, that's odd. I distinctly recall testing the behavior with parallel workers, and it is mentioned by Rahila in the original thread, and I think we tried to ensure that it was sane. I am surprised to learn that there's such a large gap. I'll go have a deeper look at the provided patch and try to get it backpatched. I think it would be valuable to have some kind of test mode where the progress reporting APIs would make some noise (perhaps with a bespoke GUC option) so that we can test things in some automated manner ... -- Álvaro Herrera 39°49'30"S 73°17'W "In fact, the basic problem with Perl 5's subroutines is that they're not crufty enough, so the cruft leaks out into user-defined code instead, by the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6)
On 6/2/21 6:38 PM, Alvaro Herrera wrote: > On 2021-Jun-02, Tomas Vondra wrote: > >> Hi, >> >> While experimenting with parallel index builds, I've noticed a somewhat >> strange behavior of pg_stat_progress_create_index when a btree index is >> built with parallel workers - some of the phases seem to be missing. > > Hmm, that's odd. I distinctly recall testing the behavior with parallel > workers, and it is mentioned by Rahila in the original thread, and I > think we tried to ensure that it was sane. I am surprised to learn that > there's such a large gap. > Yeah, I quickly skimmed [1] which I think is the thread you're referring to, and there is some discussion about parallel workers. I haven't read it in detail, though. [1] https://www.postgresql.org/message-id/20181220220022.mg63bhk26zdpvmcj%40alvherre.pgsql > I'll go have a deeper look at the provided patch and try to get it > backpatched. > > I think it would be valuable to have some kind of test mode where the > progress reporting APIs would make some noise (perhaps with a bespoke > GUC option) so that we can test things in some automated manner ... > True, but how would that GUC work? Would it add something into the system view, or just log something? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2021-Jun-02, Tomas Vondra wrote: > On 6/2/21 6:38 PM, Alvaro Herrera wrote: > > Hmm, that's odd. I distinctly recall testing the behavior with parallel > > workers, and it is mentioned by Rahila in the original thread, and I > > think we tried to ensure that it was sane. I am surprised to learn that > > there's such a large gap. > > Yeah, I quickly skimmed [1] which I think is the thread you're referring > to, and there is some discussion about parallel workers. I haven't read > it in detail, though. > > [1] > https://www.postgresql.org/message-id/20181220220022.mg63bhk26zdpvmcj%40alvherre.pgsql Well, it is quite possible that we found *some* problems with parallel workers but not all of them :-) > > I think it would be valuable to have some kind of test mode where the > > progress reporting APIs would make some noise (perhaps with a bespoke > > GUC option) so that we can test things in some automated manner ... > > True, but how would that GUC work? Would it add something into the > system view, or just log something? With the GUC turned on, emit some sort of message (maybe at INFO level) whenever some subset of the progress parameters changes. This makes it easy to compare the progress of any command with the expected set of messages. However, it's not very clear which parameters are observed for changes (you can't do it for all params, because you'd get one for each block in some cases, and that's unworkable). The way have #defined the parameters makes it difficult to annotate parameters with flag bits; we could have something #ifdef USE_ASSERT_CHECKING #define PROGRESS_LOG_CHANGES 0x70000000 #else #define PROGRESS_LOG_CHANGES 0x0 #endif #define PROGRESS_CLUSTER_PHASE (1 | PROGRESS_LOG_CHANGES) and the progress-reporting knows to mask-out the LOG_CHANGES bit before storing the value in memory, but also knows to emit the log output if that's enabled and the LOG_CHANGES bit is present. (The assertion flag would be tested at compile time to avoid a performance hit in production builds.) -- Álvaro Herrera Valdivia, Chile "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth)
On Thu, Jun 3, 2021 at 1:49 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Wed, 2 Jun 2021 at 17:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > Nice. I gave it a try on the database I'm experimenting with, and it > > seems to be working fine. Please add it to the next CF. > > Thanks, cf available here: https://commitfest.postgresql.org/33/3149/ > The patch looks OK to me. It seems apparent that the lines added by the patch are missing from the current source in the parallel case. I tested with and without the patch, using the latest PG14 source as of today, and can confirm that without the patch applied, the "sorting live tuples" phase is not reported in the parallel-case, but with the patch applied it then does get reported in that case. I also confirmed that, as you said, the patch only addresses the usual case where the parallel leader participates in the parallel operation. What is slightly puzzling to me (and perhaps digging deeper will reveal it) is why this "sorting live tuples" phase seems so short in the serial case compared to the parallel case? For example, in my test I created an index on a column of a table having 10 million records, and it took about 40 seconds, during which the "sorting live tuples" phase seemed to take about 8 seconds. Yet for the serial case, index creation took about 75 seconds, during which the "sorting live tuples" phase seemed to take about 1 second. Regards, Greg Nancarrow Fujitsu Australia
On Fri, Jun 4, 2021 at 5:25 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > What is slightly puzzling to me (and perhaps digging deeper will > reveal it) is why this "sorting live tuples" phase seems so short in > the serial case compared to the parallel case? > For example, in my test I created an index on a column of a table > having 10 million records, and it took about 40 seconds, during which > the "sorting live tuples" phase seemed to take about 8 seconds. Yet > for the serial case, index creation took about 75 seconds, during > which the "sorting live tuples" phase seemed to take about 1 second. > Seems to be because in the serial case, the sort occurs after the scan is complete (obviously) but in the parallel case, the scan and sort are combined, so (after patch application) a portion of the then reported "sorting live tuples" phase is actually "scanning table". Regards, Greg Nancarrow Fujitsu Australia
On 2021-Jun-04, Greg Nancarrow wrote: > I tested with and without the patch, using the latest PG14 source as > of today, and can confirm that without the patch applied, the "sorting > live tuples" phase is not reported in the parallel-case, but with the > patch applied it then does get reported in that case. I also confirmed > that, as you said, the patch only addresses the usual case where the > parallel leader participates in the parallel operation. > What is slightly puzzling to me (and perhaps digging deeper will > reveal it) is why this "sorting live tuples" phase seems so short in > the serial case compared to the parallel case? > For example, in my test I created an index on a column of a table > having 10 million records, and it took about 40 seconds, during which > the "sorting live tuples" phase seemed to take about 8 seconds. Yet > for the serial case, index creation took about 75 seconds, during > which the "sorting live tuples" phase seemed to take about 1 second. I think the reason is that scanning the table is not just scanning the table -- it is also feeding tuples to tuplesort, which internally is already sorting them as it receives them. So by the time you're done scanning the relation, some (large) fraction of the sorting work is already done, which is why the "sorting" phase is so short. Tracing sort is not easy. we discussed this earlier; see https://postgr.es/m/20181218210159.xtkltzm7flrwsm55@alvherre.pgsql for example. -- Álvaro Herrera Valdivia, Chile Maybe there's lots of data loss but the records of data loss are also lost. (Lincoln Yeoh)
On 2021-Jun-04, Greg Nancarrow wrote: > On Thu, Jun 3, 2021 at 1:49 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > > > On Wed, 2 Jun 2021 at 17:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > > > Nice. I gave it a try on the database I'm experimenting with, and it > > > seems to be working fine. Please add it to the next CF. > > > > Thanks, cf available here: https://commitfest.postgresql.org/33/3149/ > > The patch looks OK to me. It seems apparent that the lines added by > the patch are missing from the current source in the parallel case. > > I tested with and without the patch, using the latest PG14 source as > of today, and can confirm that without the patch applied, the "sorting > live tuples" phase is not reported in the parallel-case, but with the > patch applied it then does get reported in that case. I also confirmed > that, as you said, the patch only addresses the usual case where the > parallel leader participates in the parallel operation. So, with Matthias' patch applied and some instrumentation to log (some) parameter updates, this is what I get on building an index in parallel. The "subphase" is parameter 10: 2021-06-09 17:04:30.692 -04 19194 WARNING: updating param 0 to 1 2021-06-09 17:04:30.692 -04 19194 WARNING: updating param 6 to 0 2021-06-09 17:04:30.692 -04 19194 WARNING: updating param 8 to 403 2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 9 to 2 2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 10 to 1 2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 11 to 0 2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 15 to 0 2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 10 to 2 2021-06-09 17:04:30.696 -04 19194 WARNING: updating param 15 to 486726 2021-06-09 17:04:37.418 -04 19194 WARNING: updating param 10 to 3 <-- this one is new 2021-06-09 17:04:42.215 -04 19194 WARNING: updating param 11 to 110000000 2021-06-09 17:04:42.215 -04 19194 WARNING: updating param 15 to 0 2021-06-09 17:04:42.215 -04 19194 WARNING: updating param 10 to 3 2021-06-09 17:04:42.237 -04 19194 WARNING: updating param 10 to 5 The thing to note is that we set subphase to 3 twice. The first of those is added by the patch to _bt_parallel_scan_and_sort. The second is in _bt_leafbuild, just before setting the subphase to LEAF_LOAD. So the change is that we set the subphase to "sorting live tuples" five seconds ahead of what we were doing previously. Seems ok. (We could alternatively skip the progress update call in _bt_leafbuild; but those calls are so cheap that adding a conditional jump is almost as expensive.) (The other potential problem might be to pointlessly invoke the progress update calls when in a worker. But that's already covered because only the leader passes progress=true to _bt_parallel_scan_and_sort.) I'll push now. -- Álvaro Herrera Valdivia, Chile