Thread: Re: Enhance file_fdw to report processed and skipped tuples in COPY progress
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress
From
Masahiko Sawada
Date:
Hi, On Thu, Oct 3, 2024 at 2:23 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > Hi, > > Currently, file_fdw updates several columns in the pg_stat_progress_copy view, > like relid and bytes_processed, but it doesn't track tuples_processed or > tuples_skipped. Monitoring these would be particularly useful when handling > large data sets via file_fdw, as it helps track the progress of scan. > > The attached patch updates file_fdw to add support for reporting > the number of tuples processed and skipped (due to on_error = 'ignore') > in the pg_stat_progress_copy view. What are your thoughts? While the patch works fine and looks good to me, in the first place, it seems to me that the fact that file_fdw uses the COPY progress itself doesn't work properly. For example, unlike COPY command, queries could have multiple scans on one or more flie_fdw foreign tables when joining tables. I found the discussion for that[1]: there was a proposal of disabling COPY progress for file_fdw but the votes are split. I think it would be better to consider if we really want to support COPY progress for file_fdw before supporting more progress information. [1] https://www.postgresql.org/message-id/flat/20230119054703.GB13860%40telsasoft.com Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2024/10/04 2:12, Masahiko Sawada wrote: > Hi, > > On Thu, Oct 3, 2024 at 2:23 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> >> Hi, >> >> Currently, file_fdw updates several columns in the pg_stat_progress_copy view, >> like relid and bytes_processed, but it doesn't track tuples_processed or >> tuples_skipped. Monitoring these would be particularly useful when handling >> large data sets via file_fdw, as it helps track the progress of scan. >> >> The attached patch updates file_fdw to add support for reporting >> the number of tuples processed and skipped (due to on_error = 'ignore') >> in the pg_stat_progress_copy view. What are your thoughts? > > While the patch works fine and looks good to me, in the first place, > it seems to me that the fact that file_fdw uses the COPY progress > itself doesn't work properly. For example, unlike COPY command, > queries could have multiple scans on one or more flie_fdw foreign > tables when joining tables. I found the discussion for that[1]: there > was a proposal of disabling COPY progress for file_fdw but the votes > are split. I think it would be better to consider if we really want to > support COPY progress for file_fdw before supporting more progress > information. Yes, you're right. We need to address how to handle multiple commands that trigger progress reporting when executed concurrently, at first. The current progress reporting mechanism assumes only one command triggering progress is running at a time, as each backend has just one memory area for progress reporting. If multiple commands run simultaneously, the progress data would be incorrect. As you mentioned, this could happen when querying multiple file_fdw foreign tables, where multiple COPY commands could execute concurrently. However, this issue already exists without the proposed patch. Since file_fdw already reports progress partially, querying multiple file_fdw tables can lead to inaccurate or confusing progress reports. You can even observe this when analyzing a file_fdw table and also when copying to the table with a trigger that executes progress-reporting commands. So, I don’t think this issue should block the proposed patch. In fact, progress reporting is already flawed in these scenarios, regardless of whether the patch is applied. On the other hand, in many cases where a single file_fdw table is scanned, COPY progress reporting works correctly for file_fdw and is useful. Therefore, I believe it's still worth improving file_fdw’s progress reporting. To prevent misleading reports when multiple commands are run concurrently, just idea, we could consider displaying NULL columns in the progress report if this situation is detected, as a separate patch. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Fri, 11 Oct 2024 10:53:10 +0900 Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > On 2024/10/04 2:12, Masahiko Sawada wrote: > > Hi, > > > > On Thu, Oct 3, 2024 at 2:23 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> > >> Hi, > >> > >> Currently, file_fdw updates several columns in the pg_stat_progress_copy view, > >> like relid and bytes_processed, but it doesn't track tuples_processed or > >> tuples_skipped. Monitoring these would be particularly useful when handling > >> large data sets via file_fdw, as it helps track the progress of scan. > >> > >> The attached patch updates file_fdw to add support for reporting > >> the number of tuples processed and skipped (due to on_error = 'ignore') > >> in the pg_stat_progress_copy view. What are your thoughts? > > > > While the patch works fine and looks good to me, in the first place, > > it seems to me that the fact that file_fdw uses the COPY progress > > itself doesn't work properly. For example, unlike COPY command, > > queries could have multiple scans on one or more flie_fdw foreign > > tables when joining tables. I found the discussion for that[1]: there > > was a proposal of disabling COPY progress for file_fdw but the votes > > are split. I think it would be better to consider if we really want to > > support COPY progress for file_fdw before supporting more progress > > information. > > Yes, you're right. We need to address how to handle multiple commands > that trigger progress reporting when executed concurrently, at first. > > The current progress reporting mechanism assumes only one command > triggering progress is running at a time, as each backend has just > one memory area for progress reporting. If multiple commands run simultaneously, > the progress data would be incorrect. As you mentioned, this could happen > when querying multiple file_fdw foreign tables, where multiple COPY commands > could execute concurrently. > > However, this issue already exists without the proposed patch. > Since file_fdw already reports progress partially, querying multiple > file_fdw tables can lead to inaccurate or confusing progress reports. > You can even observe this when analyzing a file_fdw table and also > when copying to the table with a trigger that executes progress-reporting > commands. > > So, I don’t think this issue should block the proposed patch. > In fact, progress reporting is already flawed in these scenarios, > regardless of whether the patch is applied. > > On the other hand, in many cases where a single file_fdw table is scanned, > COPY progress reporting works correctly for file_fdw and is useful. > Therefore, I believe it's still worth improving file_fdw’s progress reporting. I think reporting tuples_processed and tuples_skipped columns additionally in file_fdw is reasonable, since it already reports bytes_processed and bytes_total. By the way, in the documentation of fild_fdw, it is not explicitly described that file_fdw uses COPY internally, although I can find several wordings like "as COPY". To prevent users to face unexpected experiences, how about explaining explicitly that file_fdw uses COPY and updates pg_stat_progress_copy? > To prevent misleading reports when multiple commands are run concurrently, > just idea, we could consider displaying NULL columns in the progress report > if this situation is detected, as a separate patch. Or, could we add additional field to pg_stat_progress_copy to show how much commands are running COPY? It is also just idea. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
Re: Enhance file_fdw to report processed and skipped tuples in COPY progress
From
Kirill Reshke
Date:
On Fri, 11 Oct 2024 at 06:53, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > However, this issue already exists without the proposed patch. > Since file_fdw already reports progress partially, querying multiple > file_fdw tables can lead to inaccurate or confusing progress reports. > You can even observe this when analyzing a file_fdw table and also > when copying to the table with a trigger that executes progress-reporting > commands. > > So, I don’t think this issue should block the proposed patch. > In fact, progress reporting is already flawed in these scenarios, > regardless of whether the patch is applied. Im +1 on this. To include or not to include these new fields, and to fix multiple-run-bug is two separate topics IMO. Also please notice review comments in[0]. [0] https://www.postgresql.org/message-id/20241011153645.a348de1576a3f57092c68355%40sraoss.co.jp Moved to the next CF. -- Best regards, Kirill Reshke