Thread: [patch] CLUSTER blocks scanned progress reporting

[patch] CLUSTER blocks scanned progress reporting

From
Matthias van de Meent
Date:
Hi,

The pg_stat_progress_cluster view can report incorrect
heap_blks_scanned values when synchronize_seqscans is enabled, because
it allows the sequential heap scan to not start at block 0. This can
result in wraparounds in the heap_blks_scanned column when the table
scan wraps around, and starting the next phase with heap_blks_scanned
!= heap_blks_total. This issue was introduced with the
pg_stat_progress_cluster view.

The attached patch fixes the issue by accounting for a non-0
heapScan->rs_startblock and calculating the correct number with a
non-0 heapScan->rs_startblock in mind.

The issue is reproducible starting from PG12 by stopping a
CLUSTER-command while it is sequential-scanning the table (seqscan
enforceable through enable_indexscan = off), and then re-starting the
seqscanning CLUSTER command (without other load/seq-scans on the
table).


Any thoughts?

Matthias van de Meent

Attachment

Re: [patch] CLUSTER blocks scanned progress reporting

From
Fujii Masao
Date:

On 2020/11/21 2:32, Matthias van de Meent wrote:
> Hi,
> 
> The pg_stat_progress_cluster view can report incorrect
> heap_blks_scanned values when synchronize_seqscans is enabled, because
> it allows the sequential heap scan to not start at block 0. This can
> result in wraparounds in the heap_blks_scanned column when the table
> scan wraps around, and starting the next phase with heap_blks_scanned
> != heap_blks_total. This issue was introduced with the
> pg_stat_progress_cluster view.

Good catch! I agree that this is a bug.

> 
> The attached patch fixes the issue by accounting for a non-0
> heapScan->rs_startblock and calculating the correct number with a
> non-0 heapScan->rs_startblock in mind.

Thanks for the patch! It basically looks good to me.

It's a bit waste of cycles to calculate and update the number of scanned
blocks every cycles. So I'm inclined to change the code as follows.
Thought?

+    BlockNumber    prev_cblock = InvalidBlockNumber;
<snip>
+            if (prev_cblock != heapScan->rs_cblock)
+            {
+                pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+                                             (heapScan->rs_cblock +
+                                              heapScan->rs_nblocks -
+                                              heapScan->rs_startblock
+                                                 ) % heapScan->rs_nblocks + 1);
+                prev_cblock = heapScan->rs_cblock;
+            }

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [patch] CLUSTER blocks scanned progress reporting

From
Matthias van de Meent
Date:
On Tue, 24 Nov 2020 at 15:05, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2020/11/21 2:32, Matthias van de Meent wrote:
> > Hi,
> >
> > The pg_stat_progress_cluster view can report incorrect
> > heap_blks_scanned values when synchronize_seqscans is enabled, because
> > it allows the sequential heap scan to not start at block 0. This can
> > result in wraparounds in the heap_blks_scanned column when the table
> > scan wraps around, and starting the next phase with heap_blks_scanned
> > != heap_blks_total. This issue was introduced with the
> > pg_stat_progress_cluster view.
>
> Good catch! I agree that this is a bug.
>
> >
> > The attached patch fixes the issue by accounting for a non-0
> > heapScan->rs_startblock and calculating the correct number with a
> > non-0 heapScan->rs_startblock in mind.
>
> Thanks for the patch! It basically looks good to me.

Thanks for the feedback!

> It's a bit waste of cycles to calculate and update the number of scanned
> blocks every cycles. So I'm inclined to change the code as follows.
> Thought?
>
> +       BlockNumber     prev_cblock = InvalidBlockNumber;
> <snip>
> +                       if (prev_cblock != heapScan->rs_cblock)
> +                       {
> +                               pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
> +                                                                                        (heapScan->rs_cblock +
> +                                                                                         heapScan->rs_nblocks -
> +                                                                                         heapScan->rs_startblock
> +                                                                                                ) %
heapScan->rs_nblocks+ 1);
 
> +                               prev_cblock = heapScan->rs_cblock;
> +                       }

That seems quite reasonable.

I noticed that with my proposed patch it is still possible to go to
the next phase while heap_blks_scanned != heap_blks_total. This can
happen when the final heap pages contain only dead tuples, so no tuple
is returned from the last heap page(s) of the scan. As the
heapScan->rs_cblock is set to InvalidBlockNumber when the scan is
finished (see heapam.c#1060-1072), I think it would be correct to set
heap_blks_scanned to heapScan->rs_nblocks at the end of the scan
instead.

Please find attached a patch applying the suggested changes.

Matthias van de Meent

Attachment

Re: [patch] CLUSTER blocks scanned progress reporting

From
Fujii Masao
Date:

On 2020/11/25 0:25, Matthias van de Meent wrote:
> On Tue, 24 Nov 2020 at 15:05, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> On 2020/11/21 2:32, Matthias van de Meent wrote:
>>> Hi,
>>>
>>> The pg_stat_progress_cluster view can report incorrect
>>> heap_blks_scanned values when synchronize_seqscans is enabled, because
>>> it allows the sequential heap scan to not start at block 0. This can
>>> result in wraparounds in the heap_blks_scanned column when the table
>>> scan wraps around, and starting the next phase with heap_blks_scanned
>>> != heap_blks_total. This issue was introduced with the
>>> pg_stat_progress_cluster view.
>>
>> Good catch! I agree that this is a bug.
>>
>>>
>>> The attached patch fixes the issue by accounting for a non-0
>>> heapScan->rs_startblock and calculating the correct number with a
>>> non-0 heapScan->rs_startblock in mind.
>>
>> Thanks for the patch! It basically looks good to me.
> 
> Thanks for the feedback!
> 
>> It's a bit waste of cycles to calculate and update the number of scanned
>> blocks every cycles. So I'm inclined to change the code as follows.
>> Thought?
>>
>> +       BlockNumber     prev_cblock = InvalidBlockNumber;
>> <snip>
>> +                       if (prev_cblock != heapScan->rs_cblock)
>> +                       {
>> +                               pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
>> +                                                                                        (heapScan->rs_cblock +
>> +                                                                                         heapScan->rs_nblocks -
>> +                                                                                         heapScan->rs_startblock
>> +                                                                                                ) %
heapScan->rs_nblocks+ 1);
 
>> +                               prev_cblock = heapScan->rs_cblock;
>> +                       }
> 
> That seems quite reasonable.
> 
> I noticed that with my proposed patch it is still possible to go to
> the next phase while heap_blks_scanned != heap_blks_total. This can
> happen when the final heap pages contain only dead tuples, so no tuple
> is returned from the last heap page(s) of the scan. As the
> heapScan->rs_cblock is set to InvalidBlockNumber when the scan is
> finished (see heapam.c#1060-1072), I think it would be correct to set
> heap_blks_scanned to heapScan->rs_nblocks at the end of the scan
> instead.

Thanks for updating the patch!

Please let me check my understanding about this. I was thinking that even
when the last page contains only dead tuples, table_scan_getnextslot()
returns the last page (because SnapshotAny is used) and heap_blks_scanned
is incremented properly. And then, heapam_relation_copy_for_cluster()
handles all the dead tuples in that page. No?

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [patch] CLUSTER blocks scanned progress reporting

From
Matthias van de Meent
Date:
On Wed, 25 Nov 2020 at 10:35, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2020/11/25 0:25, Matthias van de Meent wrote:
> >
> > I noticed that with my proposed patch it is still possible to go to
> > the next phase while heap_blks_scanned != heap_blks_total. This can
> > happen when the final heap pages contain only dead tuples, so no tuple
> > is returned from the last heap page(s) of the scan. As the
> > heapScan->rs_cblock is set to InvalidBlockNumber when the scan is
> > finished (see heapam.c#1060-1072), I think it would be correct to set
> > heap_blks_scanned to heapScan->rs_nblocks at the end of the scan
> > instead.
>
> Thanks for updating the patch!
>
> Please let me check my understanding about this. I was thinking that even
> when the last page contains only dead tuples, table_scan_getnextslot()
> returns the last page (because SnapshotAny is used) and heap_blks_scanned
> is incremented properly. And then, heapam_relation_copy_for_cluster()
> handles all the dead tuples in that page. No?

Yes, my description was incorrect.

The potential for a discrepancy exists for seemingly empty pages. I
thought that 'filled with only dead tuples' would be correct for
'seemingly empty', but SnapshotAny indeed returns all tuples on a
page. But pages can still be empty with SnapshotAny, through being
emptied by vacuum, so the last pages scanned can still be empty pages.
Then, there would be no tuple returned at the last pages of the scan,
and subsequently the CLUSTER/VACUUM FULL would start the next phase
without reporting on the last pages that were scanned and had no
tuples in them (such that heap_blks_scanned != heap_blks_total).

Vacuum truncates empty blocks from the end of the relation, and this
prevents empty blocks at the end of CLUSTER for the default case of
table scans starting at 0; but because the table scan might not start
at block 0, we can have an empty page at the end of the table scan due
to the last page of the scan not being the last page of the table.

Matthias van de Meent



Re: [patch] CLUSTER blocks scanned progress reporting

From
Fujii Masao
Date:

On 2020/11/25 20:52, Matthias van de Meent wrote:
> On Wed, 25 Nov 2020 at 10:35, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> On 2020/11/25 0:25, Matthias van de Meent wrote:
>>>
>>> I noticed that with my proposed patch it is still possible to go to
>>> the next phase while heap_blks_scanned != heap_blks_total. This can
>>> happen when the final heap pages contain only dead tuples, so no tuple
>>> is returned from the last heap page(s) of the scan. As the
>>> heapScan->rs_cblock is set to InvalidBlockNumber when the scan is
>>> finished (see heapam.c#1060-1072), I think it would be correct to set
>>> heap_blks_scanned to heapScan->rs_nblocks at the end of the scan
>>> instead.
>>
>> Thanks for updating the patch!
>>
>> Please let me check my understanding about this. I was thinking that even
>> when the last page contains only dead tuples, table_scan_getnextslot()
>> returns the last page (because SnapshotAny is used) and heap_blks_scanned
>> is incremented properly. And then, heapam_relation_copy_for_cluster()
>> handles all the dead tuples in that page. No?
> 
> Yes, my description was incorrect.
> 
> The potential for a discrepancy exists for seemingly empty pages. I
> thought that 'filled with only dead tuples' would be correct for
> 'seemingly empty', but SnapshotAny indeed returns all tuples on a
> page. But pages can still be empty with SnapshotAny, through being
> emptied by vacuum, so the last pages scanned can still be empty pages.
> Then, there would be no tuple returned at the last pages of the scan,
> and subsequently the CLUSTER/VACUUM FULL would start the next phase
> without reporting on the last pages that were scanned and had no
> tuples in them (such that heap_blks_scanned != heap_blks_total).
> 
> Vacuum truncates empty blocks from the end of the relation, and this
> prevents empty blocks at the end of CLUSTER for the default case of
> table scans starting at 0; but because the table scan might not start
> at block 0, we can have an empty page at the end of the table scan due
> to the last page of the scan not being the last page of the table.

Thanks for explaining that! I understood that.
That situation can happen easily also by using VACUUM (TRUNCATE off) command.

+                                * A heap scan need not return tuples for the last page it has
+                                * scanned. To ensure that heap_blks_scanned is equivalent to
+                                * total_heap_blks after the table scan phase, this parameter
+                                * is manually updated to the correct value when the table scan
+                                * finishes.

So it's better to update this comment a bit? For example,

     If the scanned last pages are empty, it's possible to go to the next phase
     while heap_blks_scanned != heap_blks_total. To ensure that they are
     equivalet after the table scan phase, this parameter is manually updated
     to the correct value when the table scan finishes.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [patch] CLUSTER blocks scanned progress reporting

From
Matthias van de Meent
Date:
On Thu, 26 Nov 2020 at 00:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> +                                * A heap scan need not return tuples for the last page it has
> +                                * scanned. To ensure that heap_blks_scanned is equivalent to
> +                                * total_heap_blks after the table scan phase, this parameter
> +                                * is manually updated to the correct value when the table scan
> +                                * finishes.
>
> So it's better to update this comment a bit? For example,
>
>      If the scanned last pages are empty, it's possible to go to the next phase
>      while heap_blks_scanned != heap_blks_total. To ensure that they are
>      equivalet after the table scan phase, this parameter is manually updated
>      to the correct value when the table scan finishes.
>

PFA a patch with updated message and comment. I've reworded the
messages to specifically mention empty pages and the need for setting
heap_blks_scanned = total_heap_blks explicitly.

Feel free to update the specifics, other than that I think this is a
complete fix for the issues at hand.

Regards,

Matthias van de Meent

Attachment

Re: [patch] CLUSTER blocks scanned progress reporting

From
Fujii Masao
Date:

On 2020/11/27 1:45, Matthias van de Meent wrote:
> On Thu, 26 Nov 2020 at 00:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> +                                * A heap scan need not return tuples for the last page it has
>> +                                * scanned. To ensure that heap_blks_scanned is equivalent to
>> +                                * total_heap_blks after the table scan phase, this parameter
>> +                                * is manually updated to the correct value when the table scan
>> +                                * finishes.
>>
>> So it's better to update this comment a bit? For example,
>>
>>       If the scanned last pages are empty, it's possible to go to the next phase
>>       while heap_blks_scanned != heap_blks_total. To ensure that they are
>>       equivalet after the table scan phase, this parameter is manually updated
>>       to the correct value when the table scan finishes.
>>
> 
> PFA a patch with updated message and comment. I've reworded the
> messages to specifically mention empty pages and the need for setting
> heap_blks_scanned = total_heap_blks explicitly.

Thanks for updating the patch! It looks good to me.
Barring any objection, I will commit this patch (also back-patch to v12).

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [patch] CLUSTER blocks scanned progress reporting

From
Fujii Masao
Date:

On 2020/11/27 1:51, Fujii Masao wrote:
> 
> 
> On 2020/11/27 1:45, Matthias van de Meent wrote:
>> On Thu, 26 Nov 2020 at 00:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>
>>> +                                * A heap scan need not return tuples for the last page it has
>>> +                                * scanned. To ensure that heap_blks_scanned is equivalent to
>>> +                                * total_heap_blks after the table scan phase, this parameter
>>> +                                * is manually updated to the correct value when the table scan
>>> +                                * finishes.
>>>
>>> So it's better to update this comment a bit? For example,
>>>
>>>       If the scanned last pages are empty, it's possible to go to the next phase
>>>       while heap_blks_scanned != heap_blks_total. To ensure that they are
>>>       equivalet after the table scan phase, this parameter is manually updated
>>>       to the correct value when the table scan finishes.
>>>
>>
>> PFA a patch with updated message and comment. I've reworded the
>> messages to specifically mention empty pages and the need for setting
>> heap_blks_scanned = total_heap_blks explicitly.
> 
> Thanks for updating the patch! It looks good to me.
> Barring any objection, I will commit this patch (also back-patch to v12).

Pushed. Thanks!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION